9. Data Quality and Key Figures
Handling Missing Values
Often, the data read is incomplete: there may be missing values, duplicates, etc.
Pandas generally handles missing values well, for example, functions like mean, standard deviation, etc., ignore missing values. However, this can distort the dataset and cause incorrect interpretations when using data analytics and machine learning algorithms. Therefore, in all projects based on data, attention must be paid to data quality.
Rows (or columns) containing missing values are usually wanted to be removed from the DataFrame during the data preprocessing stage. Missing values can be listed using the isnull
method (conversely, the notnull()
method returns True where a value is not missing)
For removal, DataFrame has a method dropna()
which by default removes all rows where even one value is NaN.
With the parameter how='all'
, only those rows where all values are missing are removed,
with the parameter thresh
you can define a "threshold" for how many values must be missing from a row for it to be removed. And with the parameter axis=1
, columns are examined/removed instead of rows
import pandas as pd
df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1, 2016: 3, 2017: 10 , 2018: 1, 2019: 2, 2020: 1},
'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}
})
print("\nDF")
print(df)
print("\n--------------\n")
print(df[df['Jukurit'].isnull()]) # filter using the Boolean Series provided by df['Jukurit'].isnull()
print('\n------------------\n')
df2 = df.dropna(thresh=2) # remove rows with at least 2 missing values
print(df2)
DF
Kärpät Sport Jukurit
2014 1 NaN NaN
2015 1 14.0 NaN
2016 3 10.0 NaN
2017 10 14.0 11.0
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
--------------
Kärpät Sport Jukurit
2014 1 NaN NaN
2015 1 14.0 NaN
2016 3 10.0 NaN
------------------
Kärpät Sport Jukurit 2015 1 14.0 NaN 2016 3 10.0 NaN 2017 10 14.0 11.0 2018 1 15.0 13.0 2019 2 11.0 13.0 2020 1 15.0 14.0
Gaps or missing values can also be filled with desired values, and this is done using the fillna
method.
df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1, 2016: 3, 2017: 10 , 2018: 1, 2019: 2, 2020: 1},
'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}
})
print(df)
print('\n------------------\n')
df2 = df.fillna('Mestis')
print(df2)
Kärpät Sport Jukurit
2014 1 NaN NaN
2015 1 14.0 NaN
2016 3 10.0 NaN
2017 10 14.0 11.0
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
------------------
Kärpät Sport Jukurit
2014 1 Mestis Mestis
2015 1 14.0 Mestis
2016 3 10.0 Mestis
2017 10 14.0 11.0
2018 1 15.0 13.0
2019 2 11.0 13.0
2020 1 15.0 14.0
The fillna
method's method='ffill'
parameter causes pandas to fill missing values with the previous valid value in the column, and method='bfill'
with the next valid value.
Removing Duplicates
The drop_duplicates
method removes rows that contain exactly the same values. By default, it leaves the first row, with the parameter keep='last'
it leaves the last one.
Data Reshaping
New columns have already been calculated in the DataFrame using previous columns:
df = pd.read_csv('Datasets/weather_2018.csv', sep=',', decimal='.')
df