Skip to content

7. Handling a DataFrame - part 1

Once data has been imported into a DataFrame, one must know how to handle it. Handling data in a DataFrame becomes faster the more you do it. This section covers the basics related to handling a DataFrame.

A DataFrame can thus be thought of as a table with rows and columns. Now one must be able to examine, modify, delete, etc., these.

A column can be "selected" as a Series in two ways: df['column'] or df.column.
The name of the Series becomes the title of the column.

import pandas as pd 


data = { 'population': [62000, 140000, 645000 , 118000],
         'area': [8016, 1466, 215 , 4326],
         'province': ['Lapland', 'Central Finland', 'Uusimaa', 'North Savo']}


df = pd.DataFrame(data, index=['Rovaniemi', 'Jyväskylä', 'Helsinki', 'Kuopio'])


print(df)


print('\n----------------\n')


ser1 = df['area']
print(ser1)  # index given at creation


print('\n----------------\n')


ser2 = df.population
print(ser2)
           population  area      province
Rovaniemi      62000   8016       Lapland
Jyväskylä     140000   1466  Central Finland
Helsinki      645000    215        Uusimaa
Kuopio        118000   4326     North Savo


----------------


Rovaniemi    8016
Jyväskylä    1466
Helsinki      215
Kuopio       4326
Name: area, dtype: int64


----------------


Rovaniemi     62000
Jyväskylä    140000
Helsinki     645000
Kuopio       118000
Name: population, dtype: int64

A row can be selected with the loc attribute (more on this below):

print(df.loc['Rovaniemi'])
population     62000
area           8016
province      Lapland
Name: Rovaniemi, dtype: object

Columns can be modified or added (if the column name is not found) with an assignment operation and removed with the del keyword:

data = { 'population': [62000, 140000, 645000 , 118000],
         'area': [8016, 1466, 215 , 4326],
         'province': ['Lapland', 'Central Finland', 'Uusimaa', 'North Savo']}


df = pd.DataFrame(data, index=['Rovaniemi', 'Jyväskylä', 'Helsinki', 'Kuopio'])
import numpy as np


# Adding a new column
print("Adding a new column")
df['newcolumn'] = 15
print(df)


print('\n--------------------\n')


# Adding values to an existing column with a numpy array
df['newcolumn'] = np.arange(0,4)  # np.arange(0,4) gives a NumPy array [0, 1, 2, 3]
print(df)

```markdown
print('\n--------------------\n')


# New column with a Boolean type value
print("New Boolean column")
df['large'] = df['population'] > 100000 # gives a Boolean-type Series with values depending on whether the population is >100000
print(df)


print('\n--------------------\n')


# Deleting a column
print("Deleting a column")
del df['newcolumn']
print(df)


print('\n--------------------\n')
# Defining the teams
teams = pd.Series({'Helsinki': 'HIFK', 'Jyväskylä': 'Jyp', 'Kuopio': 'KalPa'})


df['team'] = teams              
print(df)
Adding a new column
           population  area          region  newcolumn
Rovaniemi       62000   8016           Lapland        15
Jyväskylä      140000   1466     Central Finland        15
Helsinki       645000    215         Uusimaa        15
Kuopio         118000   4326  North Savo        15


--------------------


           population  area          region  newcolumn
Rovaniemi       62000   8016           Lapland         0
Jyväskylä      140000   1466     Central Finland         1
Helsinki       645000    215         Uusimaa         2
Kuopio         118000   4326  North Savo         3


--------------------


New Boolean column
           population  area          region  newcolumn  large
Rovaniemi       62000   8016           Lapland         0  False
Jyväskylä      140000   1466     Central Finland         1   True
Helsinki       645000    215         Uusimaa         2   True
Kuopio         118000   4326  North Savo         3   True


--------------------


Deleting a column
           population  area          region  large
Rovaniemi       62000   8016           Lapland  False
Jyväskylä      140000   1466     Central Finland   True
Helsinki       645000    215         Uusimaa   True
Kuopio         118000   4326  North Savo   True


--------------------


           population  area          region  large team
Rovaniemi       62000   8016           Lapland  False   NaN
Jyväskylä      140000   1466     Central Finland   True   Jyp
Helsinki       645000    215         Uusimaa   True  HIFK
Kuopio         118000   4326  North Savo   True KalPa

A DataFrame can also be created from nested dictionaries. In this case, the outer keys become column headers and the inner keys become row headers (index). Rows and columns can be swapped with the T method (transpose).

```python
df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1,  2016: 3,  2017: 10 , 2018: 1, 2019: 2, 2020: 1},
                   'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
                    'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}   
                  })


print(df)


print('\n---------\n')


df2 = df.T


print(df2)

      Kärpät  Sport  Jukurit
2014       1    NaN      NaN
2015       1   14.0      NaN
2016       3   10.0      NaN
2017      10   14.0     11.0
2018       1   15.0     13.0
2019       2   11.0     13.0
2020       1   15.0     14.0


---------


         2014  2015  2016  2017  2018  2019  2020
Kärpät    1.0   1.0   3.0  10.0   1.0   2.0   1.0
Sport     NaN  14.0  10.0  14.0  15.0  11.0  15.0
Jukurit   NaN   NaN   NaN  11.0  13.0  13.0  14.0

Reindexing

The reindex method can be used to select rows from a DataFrame (or Series) according to a given new index.

df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1,  2016: 3,  2017: 10 , 2018: 1, 2019: 2, 2020: 1},
                   'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
                    'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}   
                  })
print(df)


print('\n---------\n')


df2 = df.reindex([2020, 2019, 2018])


print(df2)
      Kärpät  Sport  Jukurit
2014       1    NaN      NaN
2015       1   14.0      NaN
2016       3   10.0      NaN
2017      10   14.0     11.0
2018       1   15.0     13.0
2019       2   11.0     13.0
2020       1   15.0     14.0


---------


      Kärpät  Sport  Jukurit
2020       1   15.0     14.0
2019       2   11.0     13.0
2018       1   15.0     13.0

With the columns parameter, you can reindex according to the columns:

df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1,  2016: 3,  2017: 10 , 2018: 1, 2019: 2, 2020: 1},
                   'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
                    'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}   
                  })
print(df)


print('\n---------\n')


df2 = df.reindex(columns = ['Tappara', 'Jukurit', 'Kärpät'])


print(df2)
      Kärpät  Sport  Jukurit
2014       1    NaN      NaN
2015       1   14.0      NaN
2016       3   10.0      NaN
2017      10   14.0     11.0
2018       1   15.0     13.0
2019       2   11.0     13.0
2020       1   15.0     14.0


---------


      Tappara  Jukurit  Kärpät
2014      NaN      NaN       1
2015      NaN      NaN       1
2016      NaN      NaN       3
2017      NaN     11.0      10
2018      NaN     13.0       1
2019      NaN     13.0       2
2020      NaN     14.0       1

Removing rows or columns (drop)

The drop method can be used to remove rows (default) or columns with the parameter axis=1:

df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1,  2016: 3,  2017: 10 , 2018: 1, 2019: 2, 2020: 1},
                   'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
                    'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}   
                  })
print(df)
      Kärpät  Sport  Jukurit
2014       1    NaN      NaN
2015       1   14.0      NaN
2016       3   10.0      NaN
2017      10   14.0     11.0
2018       1   15.0     13.0
2019       2   11.0     13.0
2020       1   15.0     14.0
print('\n---------\n')


# Removing row 2015
print("Removing row 2015")
print(df.drop(2015))


print('\n---------\n')


print("Original")
print(df) 


print('\n---------\n')


# Removing rows 2015 and 2016
print("Removing rows 2015 and 2016")
print(df.drop([2015, 2016]))


print('\n---------\n')


# Removing column Jukurit
print("Removing column Jukurit")
print(df.drop('Jukurit', axis=1))


print("Original")
print(df) 


print('\n---------\n')
---------
    Remove row 2015
          Kärpät  Sport  Jukurit
    2014       1    NaN      NaN
    2016       3   10.0      NaN
    2017      10   14.0     11.0
    2018       1   15.0     13.0
    2019       2   11.0     13.0
    2020       1   15.0     14.0


    ---------


    Original
          Kärpät  Sport  Jukurit
    2014       1    NaN      NaN
    2015       1   14.0      NaN
    2016       3   10.0      NaN
    2017      10   14.0     11.0
    2018       1   15.0     13.0
    2019       2   11.0     13.0
    2020       1   15.0     14.0


    ---------


    Remove rows 2015 and 2016
          Kärpät  Sport  Jukurit
    2014       1    NaN      NaN
    2017      10   14.0     11.0
    2018       1   15.0     13.0
    2019       2   11.0     13.0
    2020       1   15.0     14.0


    ---------


    Remove column Jukurit
          Kärpät  Sport
    2014       1    NaN
    2015       1   14.0
    2016       3   10.0
    2017      10   14.0
    2018       1   15.0
    2019       2   11.0
    2020       1   15.0
    Original
          Kärpät  Sport  Jukurit
    2014       1    NaN      NaN
    2015       1   14.0      NaN
    2016       3   10.0      NaN
    2017      10   14.0     11.0
    2018       1   15.0     13.0
    2019       2   11.0     13.0
    2020       1   15.0     14.0


    ---------






The `drop` method thus returns a new DataFrame (or Series) from which rows have been removed. With the parameter `inplace=True`, the removal can be done on the original DataFrame.


<a id="15"></a>
### Selection and filtering
In Series, selection works in the same way as in NumPy arrays, but you can also use the Series index for selection. 
The last example shows that when using the index, slicing `['s':'d']` also includes the element corresponding to the end index, unlike when using "positions" or row numbers `[2:4]`




```python
ser1 = pd.Series(np.arange(0,12,2), index=['p','s','v','d','j','f'])
print(ser1)


print('---------')


print(ser1['s']) # index s


print('---------')


print(ser1[2]) # row number 2


print('---------')


print(ser1[2:4])  # rows 2 and 3, 4 not included


print('---------')


print(ser1[['j', 'v']]) # from index j and v


print('---------')


print(ser1[[0,5]]) # row numbers 0 and 5
---------
print(ser1['s':'d'])  # d also included


    p     0
    s     2
    v     4
    d     6
    j     8
    f    10
    dtype: int32
    ---------
    2
    ---------
    4
    ---------
    v    4
    d    6
    dtype: int32
    ---------
    j    8
    v    4
    dtype: int32
    ---------
    p     0
    f    10
    dtype: int32
    ---------
    s    2
    v    4
    d    6
    dtype: int32

In a DataFrame, the df[column] syntax typically gives a column or columns:

df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1,  2016: 3,  2017: 10 , 2018: 1, 2019: 2, 2020: 1},
                   'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
                    'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}   
                  })

print(df)


print('\n---------\n')


print(df['Kärpät'])


print('\n---------\n')


print(df[['Kärpät', 'Sport']])
      Kärpät  Sport  Jukurit
2014       1    NaN      NaN
2015       1   14.0      NaN
2016       3   10.0      NaN
2017      10   14.0     11.0
2018       1   15.0     13.0
2019       2   11.0     13.0
2020       1   15.0     14.0


---------

2014     1
2015     1
2016     3
2017    10
2018     1
2019     2
2020     1
Name: Kärpät, dtype: int64


---------

      Kärpät  Sport
2014       1    NaN
2015       1   14.0
2016       3   10.0
2017      10   14.0
2018       1   15.0
2019       2   11.0
2020       1   15.0

However, there are some special cases: * df[2] does not work if there is no 2 in the column headers * df[1:3] actually gives rows 1 and 2 (the end row 3 is not included) * df[df['Kärpät']<4] gives the rows that meet the condition

df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1,  2016: 3,  2017: 10 , 2018: 1, 2019: 2, 2020: 1},
                   'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
                    'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}   
                  })
print(df)


print('\n---------\n')


print(df[:2])
---------

```python

print(df[df['Kärpät']<4]) 

print(df['Kärpät']<4) # this gives a True/False Series, which filters the rows by index in df[df['Kärpät']<4]
      Kärpät  Sport  Jukurit
2014       1    NaN      NaN
2015       1   14.0      NaN
2016       3   10.0      NaN
2017      10   14.0     11.0
2018       1   15.0     13.0
2019       2   11.0     13.0
2020       1   15.0     14.0


---------

      Kärpät  Sport  Jukurit
2014       1    NaN      NaN
2015       1   14.0      NaN


---------

      Kärpät  Sport  Jukurit
2014       1    NaN      NaN
2015       1   14.0      NaN
2016       3   10.0      NaN
2018       1   15.0     13.0
2019       2   11.0     13.0
2020       1   15.0     14.0


---------

2014     True
2015     True
2016     True
2017    False
2018     True
2019     True
2020     True
Name: Kärpät, dtype: bool

loc and iloc operators

Selection by row headers can be done with the loc operator, which offers a NumPy-style notation df[row,column].
loc uses row and column headers, while iloc uses row and column numbers (1st row = 0, etc).

df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1,  2016: 3,  2017: 10 , 2018: 1, 2019: 2, 2020: 1},
                   'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
                    'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}   
                  })


print(df)


print("\ndf.loc[2015, 'Kärpät']")


print(df.loc[2015, 'Kärpät'])




print("\ndf.loc[[2015, 2016], ['Kärpät', 'Sport']]")
print(df.loc[[2015, 2016], ['Kärpät', 'Sport']])


print("\ndf.iloc[[1,2], 2]")
print(df.iloc[[1,2], 2])


print("\ndf.iloc[[1,2], 2] = 100")
df.iloc[[1,2], 2] = 100


print(df)
      Kärpät  Sport  Jukurit
2014       1    NaN      NaN
2015       1   14.0      NaN
2016       3   10.0      NaN
2017      10   14.0     11.0
2018       1   15.0     13.0
2019       2   11.0     13.0
2020       1   15.0     14.0

python df.loc[2015, 'Kärpät'] 1 df.loc[[2015, 2016], ['Kärpät', 'Sport']] Kärpät Sport 2015 1 14.0 2016 3 10.0 df.iloc[[1,2], 2] 2015 NaN 2016 NaN Name: Jukurit, dtype: float64 df.iloc[[1,2], 2] = 100 Kärpät Sport Jukurit 2014 1 NaN NaN 2015 1 14.0 100.0 2016 3 10.0 100.0 2017 10 14.0 11.0 2018 1 15.0 13.0 2019 2 11.0 13.0 2020 1 15.0 14.0 Slicing also works with loc and iloc operators. Note that when slicing with labels (loc), the end row/column is included, but with row/column numbers (iloc), it is not. A plain : means all rows/columns. You can also omit the column part, then only rows are examined. If you want to examine only columns, you can mark the row part with : python df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1, 2016: 3, 2017: 10 , 2018: 1, 2019: 2, 2020: 1}, 'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 }, 'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14} }) print(df) print('\n---------\n') print(df.loc[:2015, 'Kärpät']) # The row for 2015 is included. print('\n---------\n') print(df.loc[2016:2019, ['Kärpät', 'Sport']]) df2 = df.loc[2016:2019, ['Kärpät', 'Sport']] print('\n---------\n') print(df.iloc[0:3, :]) # The 4th row (row no. 3) is not included print('\n---------\n')

  Kärpät  Sport  Jukurit

2014 1 NaN NaN 2015 1 14.0 NaN 2016 3 10.0 NaN 2017 10 14.0 11.0 2018 1 15.0 13.0 2019 2 11.0 13.0 2020 1 15.0 14.0


2014 1 2015 1 Name: Kärpät, dtype: int64


  Kärpät  Sport

2016 3 10.0 2017 10 14.0 2018 1 15.0 2019 2 11.0


  Kärpät  Sport  Jukurit

2014 1 NaN NaN 2015 1 14.0 NaN 2016 3 10.0 NaN


A Boolean array can also be used for selection:




```python
df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1,  2016: 3,  2017: 10 , 2018: 1, 2019: 2, 2020: 1},
                   'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
                    'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}   
                  })


print(df)


print('\n---------\n')


# all rows where Kärpät < 4. All columns
print(df[df['Kärpät']<4]) 


print('\n---------\n')


# all rows where Jyp < 4. Columns Kärpät and Jukurit
print(df.loc[df['Kärpät']<4, ['Kärpät', 'Jukurit']]) 


print('\n---------\n')


# all columns where the 2017 rank is > 10. All rows.
print(df.loc[ : , df.loc[2017]>10]) 


print('\n---------\n')


# rows where Sport>10, columns where 2017>10
print(df.loc[ df['Sport']>10 , df.loc[2017]>10]) 


print('\n---------\n')


# the last 2 columns, rows where Kärpät<4 
print(df.iloc[:,-2:][df['Kärpät']<4]) 
      Kärpät  Sport  Jukurit
2014       1    NaN      NaN
2015       1   14.0      NaN
2016       3   10.0      NaN
2017      10   14.0     11.0
2018       1   15.0     13.0
2019       2   11.0     13.0
2020       1   15.0     14.0


---------


      Kärpät  Sport  Jukurit
2014       1    NaN      NaN
2015       1   14.0      NaN
2016       3   10.0      NaN
2018       1   15.0     13.0
2019       2   11.0     13.0
2020       1   15.0     14.0


---------


      Kärpät  Jukurit
2014       1      NaN
2015       1      NaN
2016       3      NaN
2018       1     13.0
2019       2     13.0
2020       1     14.0


---------


      Sport  Jukurit
2014    NaN      NaN
2015   14.0      NaN
2016   10.0      NaN
2017   14.0     11.0
2018   15.0     13.0
2019   11.0     13.0
2020   15.0     14.0


---------


      Sport  Jukurit
2015   14.0      NaN
2017   14.0     11.0
2018   15.0     13.0
2019   11.0     13.0
2020   15.0     14.0


---------

Sport Jukurit 2014 NaN NaN 2015 14.0 NaN 2016 10.0 NaN 2018 15.0 13.0 2019 11.0 13.0 2020 15.0 14.0

Individual elements can also be accessed with at and iat operators by row and column headers (at) or row and column numbers (iat).

Sorting

DataFrames and Series can be sorted using the sort_values and sort_index functions.
With the parameter ascending = False, sorting is done in descending order.
With the parameter axis = 1, sorting is done according to row headers.

df = pd.DataFrame({'Kärpät': {2014: 1, 2015: 1,  2016: 3,  2017: 10 , 2018: 1, 2019: 2, 2020: 1},
                   'Sport' : {2015: 14, 2016: 10, 2017: 14, 2018: 15, 2019: 11, 2020: 15 },
                    'Jukurit' : {2017: 11, 2018: 13, 2019: 13, 2020: 14}   
                  })


print(df)


print('\n Sort_index (Descending) \n')
# Descending
df2 = df.sort_index(ascending = False)


print(df2)


print('\n Sort_values (Descending) \n')
#Descending
df3 = df.sort_values('Kärpät', ascending = False)


print(df3)


print('\n Sort_values2 (Ascending) \n')
#Ascending, by default ascending = True
df.sort_values(['Kärpät', 'Sport'], inplace = True)


print(df)


print('\n Sort_values3 (Ascending)\n')


df4 = df.sort_values('Jukurit')


print(df4[:2]) # two smallest Jukurit rankings
      Kärpät  Sport  Jukurit
2014       1    NaN      NaN
2015       1   14.0      NaN
2016       3   10.0      NaN
2017      10   14.0     11.0
2018       1   15.0     13.0
2019       2   11.0     13.0
2020       1   15.0     14.0


 Sort_index (Descending)


      Kärpät  Sport  Jukurit
2020       1   15.0     14.0
2019       2   11.0     13.0
2018       1   15.0     13.0
2017      10   14.0     11.0
2016       3   10.0      NaN
2015       1   14.0      NaN
2014       1    NaN      NaN


 Sort_values (Descending)


      Kärpät  Sport  Jukurit
2017      10   14.0     11.0
2016       3   10.0      NaN
2019       2   11.0     13.0
2014       1    NaN      NaN
2015       1   14.0      NaN
2018       1   15.0     13.0
2020       1   15.0     14.0


 Sort_values2 (Ascending)
          Kärpät  Sport  Jukurit
    2015       1   14.0      NaN
    2018       1   15.0     13.0
    2020       1   15.0     14.0
    2014       1    NaN      NaN
    2019       2   11.0     13.0
    2016       3   10.0      NaN
    2017      10   14.0     11.0


     Sort_values3 (Ascending)


          Kärpät  Sport  Jukurit
    2017      10   14.0     11.0
    2018       1   15.0     13.0






```python