7. Handling a DataFrame - part 1
Once data has been imported into a DataFrame, one must know how to handle it. Handling data in a DataFrame becomes faster the more you do it. This section covers the basics related to handling a DataFrame.
A DataFrame can thus be thought of as a table with rows and columns. Now one must be able to examine, modify, delete, etc., these.
A column can be "selected" as a Series in two ways: df['column']
or df.column
.
The name of the Series becomes the title of the column.
import pandas as pd
data = { 'population': [62000, 140000, 645000 , 118000],
'area': [8016, 1466, 215 , 4326],
'province': ['Lapland', 'Central Finland', 'Uusimaa', 'North Savo']}
df = pd.DataFrame(data, index=['Rovaniemi', 'Jyväskylä', 'Helsinki', 'Kuopio'])
print(df)
print('\n----------------\n')
ser1 = df['area']
print(ser1) # index given at creation
print('\n----------------\n')
ser2 = df.population
print(ser2)
population area province
Rovaniemi 62000 8016 Lapland
Jyväskylä 140000 1466 Central Finland
Helsinki 645000 215 Uusimaa
Kuopio 118000 4326 North Savo
----------------
Rovaniemi 8016
Jyväskylä 1466
Helsinki 215
Kuopio 4326
Name: area, dtype: int64
----------------
Rovaniemi 62000
Jyväskylä 140000
Helsinki 645000
Kuopio 118000
Name: population, dtype: int64
A row can be selected with the loc
attribute (more on this below):
print(df.loc['Rovaniemi'])
population 62000
area 8016
province Lapland
Name: Rovaniemi, dtype: object
Columns can be modified or added (if the column name is not found) with an assignment operation and removed with the del
keyword:
data = { 'population': [62000, 140000, 645000 , 118000],
'area': [8016, 1466, 215 , 4326],
'province': ['Lapland', 'Central Finland', 'Uusimaa', 'North Savo']}
df = pd.DataFrame(data, index=['Rovaniemi', 'Jyväskylä', 'Helsinki', 'Kuopio'])
import numpy as np
# Adding a new column
print("Adding a new column")
df['newcolumn'] = 15
print(df)
print('\n--------------------\n')
# Adding values to an existing column with a numpy array
df['newcolumn'] = np.arange(0,4) # np.arange(0,4) gives a NumPy array [0, 1, 2, 3]
print(df)
```markdown
print('\n--------------------\n')
# New column with a Boolean type value
print("New Boolean column")
df['large'] = df['population'] > 100000 # gives a Boolean-type Series with values depending on whether the population is >100000
print(df)
print('\n--------------------\n')
# Deleting a column
print("Deleting a column")
del df['newcolumn']
print(df)
print('\n--------------------\n')
# Defining the teams
teams = pd.Series({'Helsinki': 'HIFK', 'Jyväskylä': 'Jyp', 'Kuopio': 'KalPa'})
df['team'] = teams
print(df)
Adding a new column
population area region newcolumn
Rovaniemi 62000 8016 Lapland 15
Jyväskylä 140000 1466 Central Finland 15
Helsinki 645000 215 Uusimaa 15
Kuopio 118000 4326 North Savo 15
--------------------
population area region newcolumn
Rovaniemi 62000 8016 Lapland 0
Jyväskylä 140000 1466 Central Finland 1
Helsinki 645000 215 Uusimaa 2
Kuopio 118000 4326 North Savo 3
--------------------
New Boolean column
population area region newcolumn large
Rovaniemi 62000 8016 Lapland 0 False
Jyväskylä 140000 1466 Central Finland 1 True
Helsinki 645000 215 Uusimaa 2 True
Kuopio 118000 4326 North Savo 3 True
--------------------
Deleting a column
population area region large
Rovaniemi 62000 8016 Lapland False
Jyväskylä 140000 1466 Central Finland True
Helsinki 645000 215 Uusimaa True
Kuopio 118000 4326 North Savo True
--------------------
population area region large team
Rovaniemi 62000 8016 Lapland False NaN
Jyväskylä 140000 1466 Central Finland True Jyp
Helsinki 645000 215 Uusimaa True HIFK
Kuopio 118000 4326 North Savo True KalPa
A DataFrame can also be created from nested dictionaries. In this case, the outer keys become column headers and the inner keys become row headers (index). Rows and columns can be swapped with the T method (transpose).
```python
df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1, 2016: 3, 2017: 10 , 2018: 1, 2019: 2, 2020: 1},
'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}
})
print(df)
print('\n---------\n')
df2 = df.T
print(df2)
Kärpät Sport Jukurit
2014 1 NaN NaN
2015 1 14.0 NaN
2016 3 10.0 NaN
2017 10 14.0 11.0
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
---------
2014 2015 2016 2017 2018 2019 2020
Kärpät 1.0 1.0 3.0 10.0 1.0 2.0 1.0
Sport NaN 14.0 10.0 14.0 15.0 11.0 15.0
Jukurit NaN NaN NaN 11.0 13.0 13.0 14.0
Reindexing
The reindex
method can be used to select rows from a DataFrame (or Series) according to a given new index.
df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1, 2016: 3, 2017: 10 , 2018: 1, 2019: 2, 2020: 1},
'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}
})
print(df)
print('\n---------\n')
df2 = df.reindex([2020, 2019, 2018])
print(df2)
Kärpät Sport Jukurit
2014 1 NaN NaN
2015 1 14.0 NaN
2016 3 10.0 NaN
2017 10 14.0 11.0
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
---------
Kärpät Sport Jukurit
2020 1 15.0 14.0
2019 2 11.0 13.0
2018 1 15.0 13.0
With the columns
parameter, you can reindex according to the columns:
df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1, 2016: 3, 2017: 10 , 2018: 1, 2019: 2, 2020: 1},
'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}
})
print(df)
print('\n---------\n')
df2 = df.reindex(columns = ['Tappara', 'Jukurit', 'Kärpät'])
print(df2)
Kärpät Sport Jukurit
2014 1 NaN NaN
2015 1 14.0 NaN
2016 3 10.0 NaN
2017 10 14.0 11.0
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
---------
Tappara Jukurit Kärpät
2014 NaN NaN 1
2015 NaN NaN 1
2016 NaN NaN 3
2017 NaN 11.0 10
2018 NaN 13.0 1
2019 NaN 13.0 2
2020 NaN 14.0 1
Removing rows or columns (drop)
The drop
method can be used to remove rows (default) or columns with the parameter axis=1
:
df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1, 2016: 3, 2017: 10 , 2018: 1, 2019: 2, 2020: 1},
'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}
})
print(df)
Kärpät Sport Jukurit
2014 1 NaN NaN
2015 1 14.0 NaN
2016 3 10.0 NaN
2017 10 14.0 11.0
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
print('\n---------\n')
# Removing row 2015
print("Removing row 2015")
print(df.drop(2015))
print('\n---------\n')
print("Original")
print(df)
print('\n---------\n')
# Removing rows 2015 and 2016
print("Removing rows 2015 and 2016")
print(df.drop([2015, 2016]))
print('\n---------\n')
# Removing column Jukurit
print("Removing column Jukurit")
print(df.drop('Jukurit', axis=1))
print("Original")
print(df)
print('\n---------\n')
---------
Remove row 2015
Kärpät Sport Jukurit
2014 1 NaN NaN
2016 3 10.0 NaN
2017 10 14.0 11.0
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
---------
Original
Kärpät Sport Jukurit
2014 1 NaN NaN
2015 1 14.0 NaN
2016 3 10.0 NaN
2017 10 14.0 11.0
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
---------
Remove rows 2015 and 2016
Kärpät Sport Jukurit
2014 1 NaN NaN
2017 10 14.0 11.0
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
---------
Remove column Jukurit
Kärpät Sport
2014 1 NaN
2015 1 14.0
2016 3 10.0
2017 10 14.0
2018 1 15.0
2019 2 11.0
2020 1 15.0
Original
Kärpät Sport Jukurit
2014 1 NaN NaN
2015 1 14.0 NaN
2016 3 10.0 NaN
2017 10 14.0 11.0
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
---------
The `drop` method thus returns a new DataFrame (or Series) from which rows have been removed. With the parameter `inplace=True`, the removal can be done on the original DataFrame.
<a id="15"></a>
### Selection and filtering
In Series, selection works in the same way as in NumPy arrays, but you can also use the Series index for selection.
The last example shows that when using the index, slicing `['s':'d']` also includes the element corresponding to the end index, unlike when using "positions" or row numbers `[2:4]`
```python
ser1 = pd.Series(np.arange(0,12,2), index=['p','s','v','d','j','f'])
print(ser1)
print('---------')
print(ser1['s']) # index s
print('---------')
print(ser1[2]) # row number 2
print('---------')
print(ser1[2:4]) # rows 2 and 3, 4 not included
print('---------')
print(ser1[['j', 'v']]) # from index j and v
print('---------')
print(ser1[[0,5]]) # row numbers 0 and 5
---------
print(ser1['s':'d']) # d also included
p 0
s 2
v 4
d 6
j 8
f 10
dtype: int32
---------
2
---------
4
---------
v 4
d 6
dtype: int32
---------
j 8
v 4
dtype: int32
---------
p 0
f 10
dtype: int32
---------
s 2
v 4
d 6
dtype: int32
In a DataFrame, the df[column]
syntax typically gives a column or columns:
df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1, 2016: 3, 2017: 10 , 2018: 1, 2019: 2, 2020: 1},
'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}
})
print(df)
print('\n---------\n')
print(df['Kärpät'])
print('\n---------\n')
print(df[['Kärpät', 'Sport']])
Kärpät Sport Jukurit
2014 1 NaN NaN
2015 1 14.0 NaN
2016 3 10.0 NaN
2017 10 14.0 11.0
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
---------
2014 1
2015 1
2016 3
2017 10
2018 1
2019 2
2020 1
Name: Kärpät, dtype: int64
---------
Kärpät Sport
2014 1 NaN
2015 1 14.0
2016 3 10.0
2017 10 14.0
2018 1 15.0
2019 2 11.0
2020 1 15.0
However, there are some special cases:
* df[2]
does not work if there is no 2 in the column headers
* df[1:3]
actually gives rows 1 and 2 (the end row 3 is not included)
* df[df['Kärpät']<4]
gives the rows that meet the condition
df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1, 2016: 3, 2017: 10 , 2018: 1, 2019: 2, 2020: 1},
'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}
})
print(df)
print('\n---------\n')
print(df[:2])
---------
```python
print(df[df['Kärpät']<4])
print(df['Kärpät']<4) # this gives a True/False Series, which filters the rows by index in df[df['Kärpät']<4]
Kärpät Sport Jukurit
2014 1 NaN NaN
2015 1 14.0 NaN
2016 3 10.0 NaN
2017 10 14.0 11.0
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
---------
Kärpät Sport Jukurit
2014 1 NaN NaN
2015 1 14.0 NaN
---------
Kärpät Sport Jukurit
2014 1 NaN NaN
2015 1 14.0 NaN
2016 3 10.0 NaN
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
---------
2014 True
2015 True
2016 True
2017 False
2018 True
2019 True
2020 True
Name: Kärpät, dtype: bool
loc and iloc operators
Selection by row headers can be done with the loc
operator, which offers a NumPy-style notation df[row,column]
.
loc uses row and column headers, while iloc uses row and column numbers (1st row = 0, etc).
df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1, 2016: 3, 2017: 10 , 2018: 1, 2019: 2, 2020: 1},
'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}
})
print(df)
print("\ndf.loc[2015, 'Kärpät']")
print(df.loc[2015, 'Kärpät'])
print("\ndf.loc[[2015, 2016], ['Kärpät', 'Sport']]")
print(df.loc[[2015, 2016], ['Kärpät', 'Sport']])
print("\ndf.iloc[[1,2], 2]")
print(df.iloc[[1,2], 2])
print("\ndf.iloc[[1,2], 2] = 100")
df.iloc[[1,2], 2] = 100
print(df)
Kärpät Sport Jukurit
2014 1 NaN NaN
2015 1 14.0 NaN
2016 3 10.0 NaN
2017 10 14.0 11.0
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
python
df.loc[2015, 'Kärpät']
1
df.loc[[2015, 2016], ['Kärpät', 'Sport']]
Kärpät Sport
2015 1 14.0
2016 3 10.0
df.iloc[[1,2], 2]
2015 NaN
2016 NaN
Name: Jukurit, dtype: float64
df.iloc[[1,2], 2] = 100
Kärpät Sport Jukurit
2014 1 NaN NaN
2015 1 14.0 100.0
2016 3 10.0 100.0
2017 10 14.0 11.0
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
Slicing also works with loc and iloc operators. Note that when slicing with labels (loc), the end row/column is included, but with row/column numbers (iloc), it is not.
A plain
:
means all rows/columns.
You can also omit the column part, then only rows are examined. If you want to examine only columns, you can mark the row part with :
python
df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1, 2016: 3, 2017: 10 , 2018: 1, 2019: 2, 2020: 1},
'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}
})
print(df)
print('\n---------\n')
print(df.loc[:2015, 'Kärpät']) # The row for 2015 is included.
print('\n---------\n')
print(df.loc[2016:2019, ['Kärpät', 'Sport']])
df2 = df.loc[2016:2019, ['Kärpät', 'Sport']]
print('\n---------\n')
print(df.iloc[0:3, :]) # The 4th row (row no. 3) is not included
print('\n---------\n')
Kärpät Sport Jukurit
2014 1 NaN NaN 2015 1 14.0 NaN 2016 3 10.0 NaN 2017 10 14.0 11.0 2018 1 15.0 13.0 2019 2 11.0 13.0 2020 1 15.0 14.0
2014 1 2015 1 Name: Kärpät, dtype: int64
Kärpät Sport
2016 3 10.0 2017 10 14.0 2018 1 15.0 2019 2 11.0
Kärpät Sport Jukurit
2014 1 NaN NaN 2015 1 14.0 NaN 2016 3 10.0 NaN
A Boolean array can also be used for selection:
```python
df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1, 2016: 3, 2017: 10 , 2018: 1, 2019: 2, 2020: 1},
'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}
})
print(df)
print('\n---------\n')
# all rows where Kärpät < 4. All columns
print(df[df['Kärpät']<4])
print('\n---------\n')
# all rows where Jyp < 4. Columns Kärpät and Jukurit
print(df.loc[df['Kärpät']<4, ['Kärpät', 'Jukurit']])
print('\n---------\n')
# all columns where the 2017 rank is > 10. All rows.
print(df.loc[ : , df.loc[2017]>10])
print('\n---------\n')
# rows where Sport>10, columns where 2017>10
print(df.loc[ df['Sport']>10 , df.loc[2017]>10])
print('\n---------\n')
# the last 2 columns, rows where Kärpät<4
print(df.iloc[:,-2:][df['Kärpät']<4])
Kärpät Sport Jukurit
2014 1 NaN NaN
2015 1 14.0 NaN
2016 3 10.0 NaN
2017 10 14.0 11.0
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
---------
Kärpät Sport Jukurit
2014 1 NaN NaN
2015 1 14.0 NaN
2016 3 10.0 NaN
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
---------
Kärpät Jukurit
2014 1 NaN
2015 1 NaN
2016 3 NaN
2018 1 13.0
2019 2 13.0
2020 1 14.0
---------
Sport Jukurit
2014 NaN NaN
2015 14.0 NaN
2016 10.0 NaN
2017 14.0 11.0
2018 15.0 13.0
2019 11.0 13.0
2020 15.0 14.0
---------
Sport Jukurit
2015 14.0 NaN
2017 14.0 11.0
2018 15.0 13.0
2019 11.0 13.0
2020 15.0 14.0
---------
Sport Jukurit 2014 NaN NaN 2015 14.0 NaN 2016 10.0 NaN 2018 15.0 13.0 2019 11.0 13.0 2020 15.0 14.0
Individual elements can also be accessed with at
and iat
operators by row and column headers (at) or row and column numbers (iat).
Sorting
DataFrames and Series can be sorted using the sort_values
and sort_index
functions.
With the parameter ascending = False
, sorting is done in descending order.
With the parameter axis = 1
, sorting is done according to row headers.
df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1, 2016: 3, 2017: 10 , 2018: 1, 2019: 2, 2020: 1},
'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}
})
print(df)
print('\n Sort_index (Descending) \n')
# Descending
df2 = df.sort_index(ascending = False)
print(df2)
print('\n Sort_values (Descending) \n')
#Descending
df3 = df.sort_values('Kärpät', ascending = False)
print(df3)
print('\n Sort_values2 (Ascending) \n')
#Ascending, by default ascending = True
df.sort_values(['Kärpät', 'Sport'], inplace = True)
print(df)
print('\n Sort_values3 (Ascending)\n')
df4 = df.sort_values('Jukurit')
print(df4[:2]) # two smallest Jukurit rankings
Kärpät Sport Jukurit
2014 1 NaN NaN
2015 1 14.0 NaN
2016 3 10.0 NaN
2017 10 14.0 11.0
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
Sort_index (Descending)
Kärpät Sport Jukurit
2020 1 15.0 14.0
2019 2 11.0 13.0
2018 1 15.0 13.0
2017 10 14.0 11.0
2016 3 10.0 NaN
2015 1 14.0 NaN
2014 1 NaN NaN
Sort_values (Descending)
Kärpät Sport Jukurit
2017 10 14.0 11.0
2016 3 10.0 NaN
2019 2 11.0 13.0
2014 1 NaN NaN
2015 1 14.0 NaN
2018 1 15.0 13.0
2020 1 15.0 14.0
Sort_values2 (Ascending)
Kärpät Sport Jukurit
2015 1 14.0 NaN
2018 1 15.0 13.0
2020 1 15.0 14.0
2014 1 NaN NaN
2019 2 11.0 13.0
2016 3 10.0 NaN
2017 10 14.0 11.0
Sort_values3 (Ascending)
Kärpät Sport Jukurit
2017 10 14.0 11.0
2018 1 15.0 13.0
```python