5. Data Structures of the Pandas Library

Pandas Library

The most widely used library for data analysis in Python is pandas. Pandas provides fast, easy-to-use, and versatile tools for handling tabular data, most notably the Series and DataFrame data structures. Pandas is built on top of NumPy, so the principles of handling NumPy arrays also apply in pandas, such as element-wise arithmetic operations without for-loops.

The biggest difference between the libraries is pandas' usability with tabular, possibly heterogeneous data, while NumPy is best suited for handling numerical homogeneous data.
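The two points above can be illustrated with a minimal sketch. The variable names and the tax factor below are made up for illustration only:

```python
import numpy as np
import pandas as pd

# Element-wise arithmetic works without an explicit loop, as in NumPy
prices = pd.Series([10.0, 20.0, 30.0])
with_tax = prices * 1.25

# A NumPy array holds a single dtype for all of its elements
arr = np.array([1, 2, 3])

# A DataFrame may hold a different dtype in each column
mixed = pd.DataFrame({"city": ["Helsinki", "Kuopio"],
                      "population": [645000, 118000]})

print(with_tax)
print(arr.dtype)
print(mixed.dtypes)
```

The last print lists one dtype per column, which is what makes a DataFrame suitable for heterogeneous tables.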

Pandas documentation pages

Importing the Pandas Library

The pandas library is made available in the code with the import pandas statement. The established convention is to use the alias pd:

import pandas as pd

After this, pandas functions can be called in the code using the abbreviation pd.

Pandas - Series

A Series is a one-dimensional data structure similar to a list. In addition to its elements, it has an index, a set of labels that identify the elements.

The simplest Series can be made from a regular list (or NumPy array):

import pandas as pd
import numpy as np


# From a list
ser1 = pd.Series([1, 5, 3.4, 6, 8])
print(ser1)


# From a NumPy array
ser2 = pd.Series(np.array([1, 5, 6, 8]))
print(ser2)

0    1.0
1    5.0
2    3.4
3    6.0
4    8.0
dtype: float64
0    1
1    5
2    6
3    8
dtype: int32

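In the output, the first column is the index. Since no index was specified when the Series was defined, pandas created a default one from the integers 0, 1, 2, and so on. The integer dtype of the second Series (int32 here) depends on the platform and NumPy version; int64 is also common.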

The values of the Series can be obtained as an array with the values attribute and the index with the index attribute (both are attributes, so they are used without parentheses):

ser1 = pd.Series([1, 5, 3.4, 6, 8])
print(ser1.values)
print(ser1.index)

[1.  5.  3.4 6.  8. ]
RangeIndex(start=0, stop=5, step=1)
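As a small complement, the sketch below uses the to_numpy() method, which returns the values as a NumPy array much like values does, and converts the index to a plain Python list. The variable names are illustrative:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 5, 3.4, 6, 8])

values = ser.to_numpy()      # NumPy array of the values
labels = ser.index.tolist()  # index labels as a plain Python list

print(type(values))
print(labels)
```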

When creating a Series, an index can also be defined:

ser1 = pd.Series([1, 5, 3.4, 6, 8], index=[2, 4, 6, 8, 10])
print(ser1)


print('\n------------------\n')


ser2 = pd.Series(["a", "b", "dd", 6, "joo"], index=['a', 'b', 'c', 'd', 'e'])
print(ser2)


print('\n------------------\n')

2     1.0
4     5.0
6     3.4
8     6.0
10    8.0
dtype: float64


------------------

a      a
b      b
c     dd
d      6
e    joo
dtype: object

------------------

You can also create a pandas Series from a Python dictionary (dict), in which case the dictionary keys become the index. The keys keep the order in which they appear in the dictionary, as the output below shows:

cities = {'Rovaniemi': 62000, 'Jyväskylä': 140000, 'Helsinki': 645000, 'Kuopio': 118000}  # Dictionary


ser1 = pd.Series(cities)


print(ser1)

Rovaniemi     62000
Jyväskylä    140000
Helsinki     645000
Kuopio       118000
dtype: int64
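If a sorted order is wanted, it can be requested explicitly. The following is a minimal sketch using the sort_index() and sort_values() methods of Series; the variable names are illustrative:

```python
import pandas as pd

cities = {'Rovaniemi': 62000, 'Jyväskylä': 140000,
          'Helsinki': 645000, 'Kuopio': 118000}
ser = pd.Series(cities)

by_name = ser.sort_index()                        # sorted by index label
by_population = ser.sort_values(ascending=False)  # sorted by value, largest first

print(by_name)
print(by_population)
```

Both methods return a new Series and leave the original unchanged.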

You can also specify the index separately. In the example below, the index contains Oulu, which is not a key in the population dictionary, so its value becomes NaN (Not a Number), pandas' way of representing a missing value. Because NaN is a floating-point value, the dtype of the Series changes to float64. Rovaniemi, in turn, is in the dictionary but not in the given index, so it is left out of the Series.

population = {'Rovaniemi': 62000, 'Jyväskylä': 140000, 'Helsinki': 645000, 'Kuopio': 118000}  # Dictionary
cities = ['Helsinki', 'Kuopio', 'Oulu', 'Jyväskylä']  # List


ser1 = pd.Series(population, index=cities)


print(ser1)

Helsinki     645000.0
Kuopio       118000.0
Oulu              NaN
Jyväskylä    140000.0
dtype: float64
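Missing values can be detected and handled with the isna(), dropna(), and fillna() methods. The sketch below repeats the example above; the fill value 0 is only a placeholder chosen for illustration, not a recommendation:

```python
import pandas as pd

population = {'Rovaniemi': 62000, 'Jyväskylä': 140000,
              'Helsinki': 645000, 'Kuopio': 118000}
ser = pd.Series(population, index=['Helsinki', 'Kuopio', 'Oulu', 'Jyväskylä'])

missing = ser.isna()    # boolean Series, True where the value is NaN
known = ser.dropna()    # copy without the missing entries
filled = ser.fillna(0)  # copy with NaN replaced by 0

print(missing)
print(known)
print(filled)
```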

You can use the index label to select values from the Series:

ser2 = pd.Series(np.linspace(0, 10, 5), index=['a', 'b', 'c', 'd', 'e'])
print(ser2)


print('\n------------------\n')


print(ser2['c']) 


print('\n------------------\n')


print(ser2[['c', 'a', 'd']])


print('\n------------------\n')


# Setting values
ser2[['d', 'a']] = 100
print(ser2)

a     0.0
b     2.5
c     5.0
d     7.5
e    10.0
dtype: float64


------------------


5.0


------------------


c    5.0
a    0.0
d    7.5
dtype: float64


------------------


a    100.0
b      2.5
c      5.0
d    100.0
e     10.0
dtype: float64
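Besides plain square brackets, a Series offers the loc accessor for label-based selection and the iloc accessor for position-based selection, and it accepts a boolean mask. A minimal sketch with illustrative variable names:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.linspace(0, 10, 5), index=['a', 'b', 'c', 'd', 'e'])

by_label = ser.loc['c']          # label-based selection
by_position = ser.iloc[0]        # position-based selection (first element)
label_slice = ser.loc['b':'d']   # label slices include the end label
large = ser[ser > 4]             # boolean mask keeps values greater than 4

print(by_label, by_position)
print(label_slice)
print(large)
```

Note that a label slice includes its end label, unlike a normal Python slice by position.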

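Arithmetic operations between two Series align the values by index label. A label that appears in only one of the Series gets the value NaN in the result: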

population = pd.Series({'Rovaniemi': 62000, 'Jyväskylä': 140000, 'Helsinki': 645000, 'Kuopio': 118000})
areas = pd.Series({'Oulu': 3050, 'Helsinki': 215, 'Jyväskylä': 1466, 'Kuopio': 4326})


population_density = population/areas


print(population_density)

Helsinki     3000.000000
Jyväskylä      95.497954
Kuopio         27.276930
Oulu                 NaN
Rovaniemi            NaN
dtype: float64
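If only the cities present in both Series are of interest, the NaN rows can be dropped after the division. The sketch below also shows one way to check which labels the two Series share, using the intersection() method of the index; the variable names are illustrative:

```python
import pandas as pd

population = pd.Series({'Rovaniemi': 62000, 'Jyväskylä': 140000,
                        'Helsinki': 645000, 'Kuopio': 118000})
areas = pd.Series({'Oulu': 3050, 'Helsinki': 215,
                   'Jyväskylä': 1466, 'Kuopio': 4326})

# Division aligns by label; dropna() removes the unmatched cities
density = (population / areas).dropna()

# Labels found in both Series
shared = population.index.intersection(areas.index)

print(density)
print(list(shared))
```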

Both the Series and its index have a name attribute:

population = pd.Series({'Rovaniemi': 62000, 'Jyväskylä': 140000, 'Helsinki': 645000, 'Kuopio': 118000})
population.name = "Population"
population.index.name = "City"
print(population)

City
Rovaniemi     62000
Jyväskylä    140000
Helsinki     645000
Kuopio       118000
Name: Population, dtype: int64
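One place where these names become useful is when the Series is turned into a table. As a minimal sketch, reset_index() moves the index into an ordinary column, and the two names end up as column headers:

```python
import pandas as pd

population = pd.Series({'Rovaniemi': 62000, 'Jyväskylä': 140000,
                        'Helsinki': 645000, 'Kuopio': 118000})
population.name = "Population"
population.index.name = "City"

# The index name and the Series name become the column names
table = population.reset_index()

print(table)
```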

Pandas - DataFrame

The most useful data structure for data analysis is the DataFrame, which is a rectangular data table. A DataFrame consists of rows and columns, and the columns can be thought of as Series with the same index (row index, row headers). The names of the columns (column headers) form the "column index".

You can think of the DataFrame data structure as an Excel spreadsheet with rows and columns.

DataFrames can be created from dictionaries, lists, NumPy arrays, or Series.

# DataFrame from a nested list (a NumPy 2D array works the same way)
matrix = [[1, 2, 3],
          [10, 20, 30],
          [5, 10, 15]]


df = pd.DataFrame(matrix)


df

	0	1	2
0	1	2	3
1	10	20	30
2	5	10	15
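The automatically created row and column labels can be replaced by passing the index and columns parameters. The sketch below does this with an actual NumPy 2D array; the labels 'x', 'y', 'z' and 'a', 'b', 'c' are arbitrary names chosen for illustration:

```python
import numpy as np
import pandas as pd

matrix = np.array([[1, 2, 3],
                   [10, 20, 30],
                   [5, 10, 15]])

df = pd.DataFrame(matrix,
                  index=['x', 'y', 'z'],    # row labels
                  columns=['a', 'b', 'c'])  # column labels

print(df)
print(df.loc['y', 'b'])  # value at row 'y', column 'b'
```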

# DataFrame from Series
series1 = pd.Series((1, 2, 3))
series2 = pd.Series((10, 20, 30))
series3 = pd.Series((5, 10, 15))


# Passing the Series in a dictionary makes each Series an individual column;
# the dictionary keys become the column names
df = pd.DataFrame({'column1': series1, 'column2': series2, 'column3': series3})


df

	column1	column2	column3
0	1	10	5
1	2	20	10
2	3	30	15
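How the Series are passed decides whether they become rows or columns. As a minimal sketch: a list of Series given to pd.DataFrame produces one row per Series, while pd.concat with axis=1 places them side by side as columns. The variable names are illustrative:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([10, 20, 30])
s3 = pd.Series([5, 10, 15])

as_rows = pd.DataFrame([s1, s2, s3])          # each Series becomes a row
as_columns = pd.concat([s1, s2, s3], axis=1)  # each Series becomes a column

print(as_rows)
print(as_columns)
```

The two results are transposes of each other.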

# DataFrame from a dictionary of lists: keys become column names
data = {'name': ['Rovaniemi', 'Jyväskylä', 'Helsinki', 'Kuopio'],
        'population': [62000, 140000, 645000, 118000],
        'area': [8016, 1466, 215, 4326],
        'province': ['Lapland', 'Central Finland', 'Uusimaa', 'North Savo']}

df = pd.DataFrame(data)

print(df)  # index was created automatically

        name  population  area         province
0  Rovaniemi       62000  8016          Lapland
1  Jyväskylä      140000  1466  Central Finland
2   Helsinki      645000   215          Uusimaa
3     Kuopio      118000  4326       North Savo

# Jupyter displays DataFrame as a nice table without the print command
df

	name	population	area	province
0	Rovaniemi	62000	8016	Lapland
1	Jyväskylä	140000	1466	Central Finland
2	Helsinki	645000	215	Uusimaa
3	Kuopio	118000	4326	North Savo

data = {'name': ['Rovaniemi', 'Jyväskylä', 'Helsinki', 'Kuopio'],
        'population': [62000, 140000, 645000, 118000],
        'area': [8016, 1466, 215, 4326],
        'province': ['Lapland', 'Central Finland', 'Uusimaa', 'North Savo']}


df = pd.DataFrame(data, columns=['name', 'population', 'area', 'unemployment'])


# The columns parameter selects and orders the columns. 'province' is left out,
# and 'unemployment', which is not in data, is filled with NaN.
df

	name	population	area	unemployment
0	Rovaniemi	62000	8016	NaN
1	Jyväskylä	140000	1466	NaN
2	Helsinki	645000	215	NaN
3	Kuopio	118000	4326	NaN

data = {'population': [62000, 140000, 645000, 118000],
        'area': [8016, 1466, 215, 4326],
        'province': ['Lapland', 'Central Finland', 'Uusimaa', 'North Savo']}


df = pd.DataFrame(data, index=['Rovaniemi', 'Jyväskylä', 'Helsinki', 'Kuopio'])

df  # index given at creation

	population	area	province
Rovaniemi	62000	8016	Lapland
Jyväskylä	140000	1466	Central Finland
Helsinki	645000	215	Uusimaa
Kuopio	118000	4326	North Savo
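Since each column of a DataFrame is a Series that shares the row index, a single column can be picked out with square brackets, and columns can be combined with element-wise arithmetic. The sketch below illustrates this; the density column is a hypothetical addition for illustration:

```python
import pandas as pd

data = {'population': [62000, 140000, 645000, 118000],
        'area': [8016, 1466, 215, 4326],
        'province': ['Lapland', 'Central Finland', 'Uusimaa', 'North Savo']}
df = pd.DataFrame(data, index=['Rovaniemi', 'Jyväskylä', 'Helsinki', 'Kuopio'])

# A single column is a Series with the same index as the DataFrame
pop = df['population']
print(type(pop))
print(pop.index.equals(df.index))

# Element-wise arithmetic between columns, stored as a new column
df['density'] = df['population'] / df['area']
print(df)
```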