5. Data Structures of the Pandas Library
Pandas Library
The most widely used Python library for data analysis is pandas. Pandas provides fast, easy-to-use, and versatile tools for handling tabular data, above all the Series and DataFrame data structures. Pandas is built on top of NumPy, so the principles of working with NumPy arrays, such as element-wise arithmetic without for-loops, also apply in pandas.
The biggest difference between the libraries is pandas' usability with tabular, possibly heterogeneous data, while NumPy is best suited for handling numerical homogeneous data.
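As a small illustration of the two points above, the following sketch applies the same loop-free arithmetic to a NumPy array and to a pandas Series, and then builds a small table with mixed column types. The variable names and the two-row table are made up for this example.

```python
import numpy as np
import pandas as pd

# NumPy: element-wise arithmetic on a homogeneous numeric array, no for-loop
arr = np.array([1.0, 2.0, 3.0])
arr_doubled = arr * 2

# pandas: the same expression works on a Series, and the labels are kept
ser = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
ser_doubled = ser * 2

# pandas: one table can hold columns of different types (text and integers)
table = pd.DataFrame({"city": ["Helsinki", "Kuopio"],
                      "population": [645000, 118000]})

print(arr_doubled)
print(ser_doubled)
print(table.dtypes)  # each column has its own dtype
```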
Importing the Pandas Library
The pandas library is brought into the code with the import pandas command. The established practice is to use the alias pd:
import pandas as pd
After this, pandas functions can be called in the code using the abbreviation pd.
Pandas - Series
A Series is a one-dimensional data structure similar to a list. In addition to the elements, it has an index, i.e., a label attached to each element.
The simplest Series can be made from a regular list (or NumPy array):
import pandas as pd
import numpy as np
# From a list
ser1 = pd.Series([1, 5, 3.4, 6, 8])
print(ser1)
# From a Numpy array
ser2 = pd.Series(np.array([1, 5, 6, 8]))
print(ser2)
0 1.0
1 5.0
2 3.4
3 6.0
4 8.0
dtype: float64
0 1
1 5
2 6
3 8
dtype: int32
In the output, the first column is the index. Since the index was not specified when defining the Series, pandas created it from the numbers 0, 1, 2, etc.
The values of a Series are available as a NumPy array through the values attribute, and the index through the index attribute:
ser1 = pd.Series([1, 5, 3.4, 6, 8])
print(ser1.values)
print(ser1.index)
[1. 5. 3.4 6. 8. ]
RangeIndex(start=0, stop=5, step=1)
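Note that values and index are read without parentheses. The sketch below repeats the lookup and compares it with the to_numpy() method, which returns the same data; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 5, 3.4, 6, 8])

# values and index are attributes, so no parentheses are used
raw_values = ser.values
labels = ser.index

# to_numpy() is a method and returns the data as a NumPy array as well
as_array = ser.to_numpy()

print(type(raw_values))
print(list(labels))
print(len(ser), ser.dtype)
```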
When creating a Series, an index can also be defined:
ser1 = pd.Series([1, 5, 3.4, 6, 8], index=[2, 4 ,6 ,8, 10])
print(ser1)
print('\n------------------\n')
ser2 = pd.Series(["a", "b", "dd", 6, "joo"], index=['a', 'b' ,'c' ,'d', 'e'])
print(ser2)
print('\n------------------\n')
2 1.0
4 5.0
6 3.4
8 6.0
10 8.0
dtype: float64
------------------
a a
b b
c dd
d 6
e joo
dtype: object
------------------
You can also create a pandas Series from a Python dictionary (dict), in which case the dictionary keys become the index, in the same order as they appear in the dictionary:
cities = {'Rovaniemi': 62000, 'Jyväskylä': 140000, 'Helsinki': 645000 , 'Kuopio': 118000} #Dictionary
ser1 = pd.Series(cities)
print(ser1)
Rovaniemi 62000
Jyväskylä 140000
Helsinki 645000
Kuopio 118000
dtype: int64
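If an alphabetical or value-based order is wanted instead, the Series can be sorted afterwards. A minimal sketch using sort_index and sort_values, both of which return a new Series and leave the original untouched:

```python
import pandas as pd

cities = {'Rovaniemi': 62000, 'Jyväskylä': 140000,
          'Helsinki': 645000, 'Kuopio': 118000}
ser = pd.Series(cities)

# Sort by index label (alphabetical)
by_name = ser.sort_index()

# Sort by value (smallest population first)
by_size = ser.sort_values()

print(by_name)
print(by_size)
```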
You can also specify the index separately. In the example below, the index contains Oulu, which is not a key in the population dictionary, so its value becomes NaN (Not a Number), which is pandas' way of representing a missing value. Because NaN is a floating-point value, the whole Series gets the dtype float64. And since Rovaniemi is not in the given index, it is left out of the Series.
population = {'Rovaniemi': 62000, 'Jyväskylä': 140000, 'Helsinki': 645000 , 'Kuopio': 118000} #Dictionary
cities = ['Helsinki', 'Kuopio', 'Oulu', 'Jyväskylä'] # List
ser1 = pd.Series(population, index=cities)
print(ser1)
Helsinki 645000.0
Kuopio 118000.0
Oulu NaN
Jyväskylä 140000.0
dtype: float64
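Missing values like the one for Oulu can be located and removed with the isna and dropna methods. The following is a short sketch on the same data; the variable names missing and complete are chosen for this example.

```python
import pandas as pd

population = {'Rovaniemi': 62000, 'Jyväskylä': 140000,
              'Helsinki': 645000, 'Kuopio': 118000}
cities = ['Helsinki', 'Kuopio', 'Oulu', 'Jyväskylä']
ser = pd.Series(population, index=cities)

# Boolean Series: True where the value is missing
missing = ser.isna()

# New Series without the missing values
complete = ser.dropna()

print(missing)
print(complete)
print(missing.sum())  # number of missing values
```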
You can use the index label to select values from the Series:
ser2 = pd.Series(np.linspace(0, 10 ,5), index=['a', 'b' ,'c' ,'d', 'e'])
print(ser2)
print('\n------------------\n')
print(ser2['c'])
print('\n------------------\n')
print(ser2[['c', 'a', 'd']])
print('\n------------------\n')
# Setting values
ser2[['d', 'a']] = 100
print(ser2)
a 0.0
b 2.5
c 5.0
d 7.5
e 10.0
dtype: float64
------------------
5.0
------------------
c 5.0
a 0.0
d 7.5
dtype: float64
------------------
a 100.0
b 2.5
c 5.0
d 100.0
e 10.0
dtype: float64
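Besides plain square brackets, pandas offers the loc indexer for selecting by label and the iloc indexer for selecting by position, and a boolean condition can be used as a filter. A minimal sketch on the same kind of Series:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.linspace(0, 10, 5), index=['a', 'b', 'c', 'd', 'e'])

by_label = ser.loc['c']         # select by index label
by_position = ser.iloc[2]       # select by integer position (third element)
label_slice = ser.loc['b':'d']  # a label slice includes the end label 'd'
large = ser[ser > 4]            # keep only the values greater than 4

print(by_label, by_position)
print(label_slice)
print(large)
```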
Arithmetic operations align Series by index label. A label that is found in only one of the Series gets the value NaN in the result:
population = pd.Series({'Rovaniemi': 62000, 'Jyväskylä': 140000, 'Helsinki': 645000 , 'Kuopio': 118000})
areas = pd.Series({'Oulu': 3050, 'Helsinki': 215, 'Jyväskylä': 1466, 'Kuopio': 4326})
population_density = population/areas
print(population_density)
Helsinki 3000.000000
Jyväskylä 95.497954
Kuopio 27.276930
Oulu NaN
Rovaniemi NaN
dtype: float64
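If only the cities present in both Series are of interest, the NaN rows can be dropped afterwards, or both Series can be restricted to their shared labels before dividing. The sketch below shows both approaches; the names common, shared, and density_shared are introduced only for this example.

```python
import pandas as pd

population = pd.Series({'Rovaniemi': 62000, 'Jyväskylä': 140000,
                        'Helsinki': 645000, 'Kuopio': 118000})
areas = pd.Series({'Oulu': 3050, 'Helsinki': 215,
                   'Jyväskylä': 1466, 'Kuopio': 4326})

# Option 1: divide first, then drop the labels that had no match
density = population / areas
common = density.dropna()

# Option 2: pick the shared labels first, then divide
shared = population.index.intersection(areas.index)
density_shared = population.loc[shared] / areas.loc[shared]

print(common)
print(density_shared)
```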
Both the Series and its index have a name attribute:
population = pd.Series({'Rovaniemi': 62000, 'Jyväskylä': 140000, 'Helsinki': 645000 , 'Kuopio': 118000})
population.name = "Population"
population.index.name = "City"
print(population)
City
Rovaniemi 62000
Jyväskylä 140000
Helsinki 645000
Kuopio 118000
Name: Population, dtype: int64
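The name can also be given when the Series is created, and it carries over when the Series is turned into a table. The sketch below uses the name parameter of the constructor together with the rename and to_frame methods; the name "Inhabitants" is made up for the example.

```python
import pandas as pd

# The Series name can be set directly in the constructor
population = pd.Series({'Rovaniemi': 62000, 'Jyväskylä': 140000,
                        'Helsinki': 645000, 'Kuopio': 118000},
                       name="Population")
population.index.name = "City"

# rename with a single value returns a new Series with another name
renamed = population.rename("Inhabitants")

# to_frame makes a one-column DataFrame; the Series name becomes the column name
frame = population.to_frame()

print(renamed.name)
print(frame)
```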
Pandas - DataFrame
The most useful data structure for data analysis is the DataFrame, which is a rectangular data table. A DataFrame consists of rows and columns, and the columns can be thought of as Series with the same index (row index, row headers). The names of the columns (column headers) form the "column index".
You can think of the DataFrame data structure as an Excel spreadsheet with rows and columns.
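The relationship between a DataFrame and its columns can be checked directly. In this sketch, a small two-row table (values taken from the city data used in this section) is built, and one column is picked out to show that it is a Series sharing the row index of the table.

```python
import pandas as pd

df = pd.DataFrame({'population': [62000, 140000],
                   'area': [8016, 1466]},
                  index=['Rovaniemi', 'Jyväskylä'])

# Selecting one column by its name gives a Series
col = df['population']

print(type(col))
print(col.index.equals(df.index))  # the column uses the row index of the table
print(list(df.columns))            # the column names form the column index
```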
DataFrames can be created from dictionaries, lists, NumPy arrays, or Series.
# DataFrame from a nested list (a NumPy 2D array works the same way)
matrix = [[1, 2, 3],
[10, 20, 30],
[5, 10, 15]]
df = pd.DataFrame(matrix)
df
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 2 | 3 |
| 1 | 10 | 20 | 30 |
| 2 | 5 | 10 | 15 |
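The same table can be built from an actual NumPy array, and the row and column labels can be given at creation time instead of the default 0, 1, 2. In this sketch the labels 'x', 'y', 'z' and 'a', 'b', 'c' are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

matrix = np.array([[1, 2, 3],
                   [10, 20, 30],
                   [5, 10, 15]])

# index sets the row labels, columns sets the column labels
df = pd.DataFrame(matrix, index=['x', 'y', 'z'], columns=['a', 'b', 'c'])

print(df)
print(df.shape)           # (number of rows, number of columns)
print(df.loc['y', 'b'])   # value on row 'y', column 'b'
```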
# DataFrame from Series
series1 = pd.Series((1, 2, 3))
series2 = pd.Series((10, 20, 30))
series3 = pd.Series((5, 10, 15))
# In a dictionary of Series, each Series becomes one column and its key the column name
# (a plain list or tuple of Series would instead put each Series on its own row)
df = pd.DataFrame({'column1': series1, 'column2': series2, 'column3': series3})
df
| | column1 | column2 | column3 |
|---|---|---|---|
| 0 | 1 | 10 | 5 |
| 1 | 2 | 20 | 10 |
| 2 | 3 | 30 | 15 |
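How the Series are passed to the constructor decides whether they end up as columns or as rows. The following sketch contrasts a dictionary of Series with a list of Series; the column names 'first' and 'second' are made up for the example.

```python
import pandas as pd

series1 = pd.Series((1, 2, 3))
series2 = pd.Series((10, 20, 30))

# Dictionary of Series: each Series becomes a column, the key is the column name
as_columns = pd.DataFrame({'first': series1, 'second': series2})

# List of Series: each Series becomes a row
as_rows = pd.DataFrame([series1, series2])

print(as_columns.shape)  # 3 rows, 2 columns
print(as_rows.shape)     # 2 rows, 3 columns

# Transposing with .T swaps rows and columns
print(as_rows.T)
```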
data = {'name': ['Rovaniemi', 'Jyväskylä', 'Helsinki', 'Kuopio'],
'population': [62000, 140000, 645000 , 118000],
'area': [8016, 1466, 215 , 4326],
'province': ['Lapland', 'Central Finland', 'Uusimaa', 'North Savo']}
df = pd.DataFrame(data)
print(df) # index was created automatically
name population area province
0 Rovaniemi 62000 8016 Lapland
1 Jyväskylä 140000 1466 Central Finland
2 Helsinki 645000 215 Uusimaa
3 Kuopio 118000 4326 North Savo
# Jupyter displays DataFrame as a nice table without the print command
df
| | name | population | area | province |
|---|---|---|---|---|
| 0 | Rovaniemi | 62000 | 8016 | Lapland |
| 1 | Jyväskylä | 140000 | 1466 | Central Finland |
| 2 | Helsinki | 645000 | 215 | Uusimaa |
| 3 | Kuopio | 118000 | 4326 | North Savo |
data = {'name': ['Rovaniemi', 'Jyväskylä', 'Helsinki', 'Kuopio'],
'population': [62000, 140000, 645000 , 118000],
'area': [8016, 1466, 215 , 4326],
'province': ['Lapland', 'Central Finland', 'Uusimaa', 'North Savo']}
df = pd.DataFrame(data, columns=['name', 'population', 'area', 'unemployment'])
df # The columns parameter specified during creation determines the columns
| | name | population | area | unemployment |
|---|---|---|---|---|
| 0 | Rovaniemi | 62000 | 8016 | NaN |
| 1 | Jyväskylä | 140000 | 1466 | NaN |
| 2 | Helsinki | 645000 | 215 | NaN |
| 3 | Kuopio | 118000 | 4326 | NaN |
data = { 'population': [62000, 140000, 645000 , 118000],
'area': [8016, 1466, 215 , 4326],
'province': ['Lapland', 'Central Finland', 'Uusimaa', 'North Savo']}
df = pd.DataFrame(data, index=['Rovaniemi', 'Jyväskylä', 'Helsinki', 'Kuopio'])
df # The index parameter specified during creation determines the row labels
| | population | area | province |
|---|---|---|---|
| Rovaniemi | 62000 | 8016 | Lapland |
| Jyväskylä | 140000 | 1466 | Central Finland |
| Helsinki | 645000 | 215 | Uusimaa |
| Kuopio | 118000 | 4326 | North Savo |