9. Data Quality and Key Figures

Handling Missing Values

Often, the data read is incomplete: there may be missing values, duplicates, etc.

Pandas generally handles missing values well, for example, functions like mean, standard deviation, etc., ignore missing values. However, this can distort the dataset and cause incorrect interpretations when using data analytics and machine learning algorithms. Therefore, in all projects based on data, attention must be paid to data quality.

Rows (or columns) containing missing values are usually wanted to be removed from the DataFrame during the data preprocessing stage. Missing values can be listed using the isnull method (conversely, the notnull() method returns True where a value is not missing)

For removal, DataFrame has a method dropna() which by default removes all rows where even one value is NaN.
With the parameter how='all', only those rows where all values are missing are removed,
with the parameter thresh you can define a "threshold" for how many values must be missing from a row for it to be removed. And with the parameter axis=1, columns are examined/removed instead of rows

import pandas as pd 


df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1,  2016: 3,  2017: 10 , 2018: 1, 2019: 2, 2020: 1},
                   'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
                    'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}   
                  })


print("\nDF")
print(df)


print("\n--------------\n")


print(df[df['Jukurit'].isnull()])  # filter using the Boolean Series provided by df['Jukurit'].isnull()


print('\n------------------\n')


df2 = df.dropna(thresh=2)  # remove rows with at least 2 missing values


print(df2)

DF
      Kärpät  Sport  Jukurit
2014       1    NaN      NaN
2015       1   14.0      NaN
2016       3   10.0      NaN
2017      10   14.0     11.0
2018       1   15.0     13.0
2019       2   11.0     13.0
2020       1   15.0     14.0


--------------


      Kärpät  Sport  Jukurit
2014       1    NaN      NaN
2015       1   14.0      NaN
2016       3   10.0      NaN


------------------

Kärpät Sport Jukurit 2015 1 14.0 NaN 2016 3 10.0 NaN 2017 10 14.0 11.0 2018 1 15.0 13.0 2019 2 11.0 13.0 2020 1 15.0 14.0

Gaps or missing values can also be filled with desired values, and this is done using the fillna method.

df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1,  2016: 3,  2017: 10 , 2018: 1, 2019: 2, 2020: 1},
                   'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
                    'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}   
                  })


print(df)


print('\n------------------\n')


df2 = df.fillna('Mestis')


print(df2)

      Kärpät  Sport  Jukurit
2014       1    NaN      NaN
2015       1   14.0      NaN
2016       3   10.0      NaN
2017      10   14.0     11.0
2018       1   15.0     13.0
2019       2   11.0     13.0
2020       1   15.0     14.0


------------------


      Kärpät   Sport Jukurit
2014       1  Mestis  Mestis
2015       1    14.0  Mestis
2016       3    10.0  Mestis
2017      10    14.0    11.0
2018       1    15.0    13.0
2019       2    11.0    13.0
2020       1    15.0    14.0

The fillna method's method='ffill' parameter causes pandas to fill missing values with the previous valid value in the column, and method='bfill' with the next valid value.

Removing Duplicates

The drop_duplicates method removes rows that contain exactly the same values. By default, it leaves the first row, with the parameter keep='last' it leaves the last one.

Data Reshaping

New columns have already been calculated in the DataFrame using previous columns:

df = pd.read_csv('Datasets/weather_2018.csv', sep=',', decimal='.')


df