8. Handling DataFrame - Part 2

Combining DataFrames

Often data is spread across several different files and even in different formats, so combining the data into one DataFrame (or Series) may need to be done before starting the analysis.

pandas.merge

The pandas.merge() function combines DataFrames based on a key in a similar way to the join operation in relational databases.

If the function is not told which field to combine on, merge uses overlapping (same) column names as the merge field. However, it is smarter to specify the field used for merging with the on parameter.

import pandas as pd


df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})


df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})


print(df1)


print('\n------------------\n')


print(df2)


print('\n------------------\n')


df3 = pd.merge(df1, df2)


# better: df3 = pd.merge(df1, df2, on = 'key')


print(df3)

  key  data1
0     b      0
1     b      1
2     a      2
3     c      3
4     a      4
5     a      5
6     b      6


------------------


  key  data2
0     a      0
1     b      1
2     d      2


------------------


  key  data1  data2
0     b      0      1
1     b      1      1
2     b      6      1
3     a      2      0
4     a      4      0
5     a      5      0

If the names of the columns used for merging differ in the DataFrames, they can be specified separately with the left_on and right_on parameters:

df1 = pd.DataFrame({'key1': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})


df2 = pd.DataFrame({'key2': ['a', 'b', 'd'],
                    'data2': range(3)})


print(df1)


print('\n------------------\n')


print(df2)


print('\n------------------\n')


df3 = pd.merge(df1, df2, left_on = 'key1', right_on = 'key2')


print(df3)

  key1  data1
0      b      0
1      b      1
2      a      2
3      c      3
4      a      4
5      a      5
6      b      6


------------------


  key2  data2
0      a      0
1      b      1
2      d      2


------------------

key1 data1 key2 data2 0 b 0 b 1 1 b 1 b 1 2 b 6 b 1 3 a 2 a 0 4 a 4 a 0 5 a 5 a 0

In the previous examples, keys c and d did not come into the combination at all, because they were not present in both DataFrames. By default, merge performs an "inner"-type join, where the result includes only those keys that are in both DataFrames.

And because df1 had multiple rows with keys a and b, the combination also included multiple of these.

The type of join can be specified with the how parameter, the options are:

inner keys that are in both
left keys that are in the left (first) DataFrame
right keys that are in the right (second) DataFrame
outer keys that are in either of the DataFrames

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})


df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})


print(df1)


print('\n------------------\n')


print(df2)


print('\n------------------\n')


df3 = pd.merge(df1, df2, on = 'key', how = 'outer')


print(df3)

  key  data1
0     b      0
1     b      1
2     a      2
3     c      3
4     a      4
5     a      5
6     b      6


------------------


  key  data2
0     a      0
1     b      1
2     d      2


------------------


  key  data1  data2
0     b    0.0    1.0
1     b    1.0    1.0
2     b    6.0    1.0
3     a    2.0    0.0
4     a    4.0    0.0
5     a    5.0    0.0
6     c    3.0    NaN
7     d    NaN    2.0

If both have several rows with the same key, merge will create all possible "combinations" of these, i.e., if the 1st DataFrame has, for example, 4 rows with key a and the second has 3, the combination will have 4 x 3 = 12 rows with key a.

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'a', 'a', 'a', 'b'],
                    'data1': range(7)})


df2 = pd.DataFrame({'key': ['a', 'a', 'b'],
                    'data2': range(3)})


print(df1)


print('\n------------------\n')


print(df2)


print('\n------------------\n')


df3 = pd.merge(df1, df2, on = 'key')


print(df3)

key data1 0 b 0 1 b 1 2 a 2 3 a 3 4 a 4 5 a 5 6 b 6

------------------


  key  data2
0     a      0
1     a      1
2     b      2


------------------


   key  data1  data2
0      b      0      2
1      b      1      2
2      b      6      2
3      a      2      0
4      a      2      1
5      a      3      0
6      a      3      1
7      a      4      0
8      a      4      1
9      a      5      0
10     a      5      1

What if the DataFrames to be merged have columns with the same name? pandas adds a suffix (by default _x and _y) to the column names. You can define your own suffixes with the parameter suffixes=('_1','_2').

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})


df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data1': range(3)})


print(df1)


print('\n------------------\n')


print(df2)


print('\n------------------\n')


df3 = pd.merge(df1, df2, on = 'key')


print(df3)


print('\n------------------\n')


df4 = pd.merge(df1, df2, on = 'key', suffixes=('_1','_2'))


print(df4)

  key  data1
0     b      0
1     b      1
2     a      2
3     c      3
4     a      4
5     a      5
6     b      6


------------------


  key  data1
0     a      0
1     b      1
2     d      2


------------------


  key  data1_x  data1_y
0     b        0        1
1     b        1        1
2     b        6        1
3     a        2        0
4     a        4        0
5     a        5        0


------------------


  key  data1_1  data1_2
0     b        0        1
1     b        1        1
2     b        6        1
3     a        2        0
4     a        4        0
5     a        5        0

Merging can also be done using multiple fields:

df1 = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                    'key2': ['one', 'two', 'one'],
                    'lval': [1, 2, 3]})




df2 =  right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                             'key2': ['one', 'one', 'one', 'two'],
                             'rval': [4, 5, 6, 7]})


print(df1)


print('\n-------------\n')


print(df2)


print('\n-------------\n')


pd.merge(df1, df2, on=['key1', 'key2'], how='outer')

  key1 key2  lval
0  foo  one     1
1  foo  two     2
2  bar  one     3


-------------


  key1 key2  rval
0  foo  one     4
1  foo  one     5
2  bar  one     6
3  bar  two     7


-------------

	key1	key2	lval	rval
0	foo	one	1.0	4.0
1	foo	one	1.0	5.0
2	foo	two	2.0	NaN
3	bar	one	3.0	6.0
4	bar	two	NaN	7.0

Merging is often done based on the index (row header) instead of a data field. For this, there is the parameter left_index=True, which indicates that in the first DataFrame, the merging field is the index. Similarly, right_index=True.

join

Merging based on the index can also be done with the DataFrame's join method. It defaults to a left-type merge, but the how parameter can be used to specify the type (left, right, outer, inner).

If there are columns with the same name in the DataFrame, join will not succeed without the lsuffix and/or rsuffix parameters.

The join method can also be used to merge several DataFrames at once df.join([df1, df2, df3])

import pandas as pd
technologies = {
    'Courses':["Python","R","Matlab","C++"],
    'Credits' :[4,5,3,6],
    'Duration':['40days','50days','30days','60days'],
              }
index_labels=['r1','r2','r3','r4']
df1 = pd.DataFrame(technologies,index=index_labels)


print("DF1\n",df1)


technologies2 = {
    'Courses':["Python","Java","Python","Go"],
    'Price':[2000,2500,1200,3000]
              }
index_labels2=['r1','r6','r3','r5']
df2 = pd.DataFrame(technologies2,index=index_labels2)


print("\nDF2\n",df2)

DF1 Courses Credits Duration r1 Python 4 40 days r2 R 5 50 days r3 Matlab 3 30 days r4 C++ 6 60 days

DF2
    Courses  Price
r1  Python   2000
r6    Java   2500
r3  Python   1200
r5      Go   3000

# pandas join 
df3=df1.join(df2, lsuffix="_left", rsuffix="_right")
print(df3)

   Courses_left  Credits Duration Courses_right   Price
r1       Python        4   40 days        Python  2000.0
r2            R        5   50 days           NaN     NaN
r3       Matlab        3   30 days        Python  1200.0
r4          C++        6   60 days           NaN     NaN

# pandas Inner join DataFrames
df3=df1.join(df2, lsuffix="_left", rsuffix="_right", how='inner')
print(df3)


# Outputs
#   Courses_left    Fee Duration Courses_right  Discount
#r1        Spark  20000   30 days         Spark      2000
#r3       Python  22000   35 days        Python      1200

   Courses_left  Credits Duration Courses_right  Price
r1       Python        4   40 days        Python   2000
r3       Matlab        3   30 days        Python   1200

pandas.concat

DataFrames (and Series) can also be combined ("chained") using the concat function. By default, concat chains the rows, but with the parameter axis=1, the columns are chained.

The keys parameter can be used to define titles for the original rows/columns, thus obtaining a multi-level index for either rows or columns.

import numpy as np
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'], columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'd'], columns=['three', 'two'])


print('\n-------DF1---------\n')
print(df1)


print('\n-------DF2---------\n')


print(df2)


print('\n------------------\n')


print(pd.concat([df1, df2]))


print('\n------------------\n')


print(pd.concat([df1, df2], axis=1))

-------DF1---------


   one  two
a    0    1
b    2    3
c    4    5


-------DF2---------


   three  two
a      5    6
d      7    8


------------------


   one  two  three
a  0.0    1    NaN
b  2.0    3    NaN
c  4.0    5    NaN
a  NaN    6    5.0
d  NaN    8    7.0


------------------


   one  two  three  two
a  0.0  1.0    5.0  6.0
b  2.0  3.0    NaN  NaN
c  4.0  5.0    NaN  NaN
d  NaN  NaN    7.0  8.0

combine_first

The combine_first method of a DataFrame can be used to "fill in the gaps", meaning that NaN values in a DataFrame are replaced with values found in another DataFrame.

df1 = pd.DataFrame({'a':[4,6,7,np.nan, 5], 'b':[5, np.nan, 87,2,np.nan]})
df2 = pd.DataFrame({'a':[1,np.nan,2,6, 5, 5], 'b':[5, 1, 45,2,7, 8], 'c':[5, np.nan, 87,2,0, 9]})


print('\n-------DF1---------\n')
print(df1)


print('\n-------DF2---------\n')


print(df2)


print('\n------------------\n')


print(df1.combine_first(df2))

-------DF1---------


     a     b
0  4.0   5.0
1  6.0   NaN
2  7.0  87.0
3  NaN   2.0
4  5.0   NaN


-------DF2---------


     a   b     c
0  1.0   5   5.0
1  NaN   1   NaN
2  2.0  45  87.0
3  6.0   2   2.0
4  5.0   7   0.0
5  5.0   8   9.0


------------------


     a     b     c
0  4.0   5.0   5.0
1  6.0   1.0   NaN
2  7.0  87.0  87.0
3  6.0   2.0   2.0
4  5.0   7.0   0.0
5  5.0   8.0   9.0

Data compilation

groupby

The groupby method can be used to group data according to a given field. The principle is referred to as split-apply-combine, which means first dividing the data into groups, then performing some operation on the groups, and finally combining the results:

split-apply-combine

In the Titanic example, the number of survivors was calculated by passenger class (Pclass):

df = pd.read_csv('Datasets/titanic.csv', index_col=0)


df

<style>
    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Survived</th>
      <th>Pclass</th>
      <th>Name</th>
      <th>Sex</th>
      <th>Age</th>
      <th>SibSp</th>
      <th>Parch</th>
      <th>Ticket</th>
      <th>Fare</th>
      <th>Cabin</th>
      <th>Embarked</th>
    </tr>
    <tr>
      <th>PassengerId</th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>1</th>
      <td>0</td>
      <td>3</td>
      <td>Braund, Mr. Owen Harris</td>
      <td>male</td>
      <td>22.0</td>
      <td>1</td>
      <td>0</td>
      <td>A/5 21171</td>
      <td>7.2500</td>
      <td>NaN</td>
      <td>S</td>
    </tr>
    <tr>
      <th>2</th>
      <td>1</td>
      <td>1</td>
      <td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td>
      <td>female</td>
      <td>38.0</td>
      <td>1</td>
      <td>0</td>
      <td>PC 17599</td>
      <td>71.2833</td>
      <td>C85</td>
      <td>C</td>
    </tr>
    <tr>
      <th>3</th>
      <td>1</td>
      <td>3</td>
      <td>Heikkinen, Miss. Laina</td>
      <td>female</td>
      <td>26.0</td>
      <td>0</td>
      <td>0</td>
      <td>STON/O2. 3101282</td>
      <td>7.9250</td>
      <td>NaN</td>
      <td>S</td>
    </tr>
    <tr>
      <th>4</th>
      <td>1</td>
      <td>1</td>
      <td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td>
      <td>female</td>
      <td>35.0</td>
      <td>1</td>
      <td>0</td>
      <td>113803</td>
      <td>53.1000</td>
      <td>C123</td>
      <td>S</td>
    </tr>
    <tr>
      <th>5</th>
      <td>0</td>
      <td>3</td>
      <td>Allen, Mr. William Henry</td>
      <td>male</td>
      <td>35.0</td>
      <td>0</td>
      <td>0</td>
      <td>373450</td>
      <td>8.0500</td>
      <td>NaN</td>
      <td>S</td>
    </tr>
    <tr>
      <th>...</th>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
      <td>...</td>
    </tr>
    <tr>
      <th>887</th>
      <td>0</td>
      <td>2</td>
      <td>Montvila, Rev. Juozas</td>
      <td>male</td>
      <td>27.0</td>
      <td>0</td>
      <td>0</td>
      <td>211536</td>
      <td>13.0000</td>
      <td>NaN</td>
      <td>S</td>
    </tr>
    <tr>
      <th>888</th>
      <td>1</td>
      <td>1</td>
      <td>Graham, Miss. Margaret Edith</td>
      <td>female</td>
      <td>19.0</td>
      <td>0</td>
      <td>0</td>
      <td>112053</td>
      <td>30.0000</td>
      <td>B42</td>
      <td>S</td>
    </tr>
    <tr>
      <th>889</th>
      <td>0</td>
      <td>3</td>
      <td>Johnston, Miss. Catherine Helen "Carrie"</td>
      <td>female</td>
      <td>NaN</td>
      <td>1</td>
      <td>2</td>
      <td>W./C. 6607</td>
      <td>23.4500</td>
      <td>NaN</td>
      <td>S</td>
    </tr>
    <tr>
      <th>890</th>
      <td>1</td>
      <td>1</td>
      <td>Behr, Mr. Karl Howell</td>
      <td>male</td>
      <td>26.0</td>
      <td>0</td>
      <td>0</td>
      <td>111369</td>
      <td>30.0000</td>
      <td>C148</td>
      <td>C</td>
    </tr>
    <tr>
      <th>891</th>
      <td>0</td>
      <td>3</td>
      <td>Dooley, Mr. Patrick</td>
      <td>male</td>
      <td>32.0</td>
      <td>0</td>
      <td>0</td>
      <td>370376</td>
      <td>7.7500</td>
      <td>NaN</td>
      <td>Q</td>
    </tr>
  </tbody>
</table>
<p>891 rows × 11 columns</p>
</div>

print(df['Survived'].groupby(df['Pclass']).value_counts())

Pclass  Survived
1       1           136
        0            80
2       0            97
        1            87
3       0           372
        1           119
Name: Survived, dtype: int64

Above, the df['Survived'] Series was grouped by Pclass, even though the df['Survived'] Series does not contain Pclass information. This is why the grouping criterion given was the df['Pclass'] Series, which provides the information in the grouping about which index belongs to which group, and because the df['Survived'] Series contains the same indices, the counts can be calculated by group.

The same could be done "the other way around", that is, by grouping the entire DataFrame and taking only the Survived column from the grouping. In this case, there is no need to give df['Pclass'] as the grouping criterion, it is enough to give just the column name 'Pclass' (since the column is found in the df).

print(df.groupby('Pclass')['Survived'].value_counts())

Pclass  Survived
1       1           136
        0            80
2       0            97
        1            87
3       0           372
        1           119
Name: Survived, dtype: int64

The object returned by groupby can also be iterated with a for loop, in which case a tuple in the form of name, data is obtained:

for name, data in df['Survived'].groupby(df['Pclass']):
      print("CLASS ",name)
      print(data.value_counts())

CLASS  1
1    136
0     80
Name: Survived, dtype: int64
CLASS  2
0    97
1    87
Name: Survived, dtype: int64
CLASS  3
0    372
1    119
Name: Survived, dtype: int64

The GroupBy object produced by grouping has a few optimized methods for calculating summary statistics:

count number of non-NA values
sum sum
mean average
median median
std, var standard deviation, variance
min, max minimum, maximum
prod product
first, last first, last non-NA value

	key1	key2	lval	rval
0	foo	one	1.0	4.0
1	foo	one	1.0	5.0
2	foo	two	2.0	NaN
3	bar	one	3.0	6.0
4	bar	two	NaN	7.0

	key1	key2	lval	rval
0	foo	one	1.0	4.0
1	foo	one	1.0	5.0
2	foo	two	2.0	NaN
3	bar	one	3.0	6.0
4	bar	two	NaN	7.0

	key1	key2	lval	rval
0	foo	one	1.0	4.0
1	foo	one	1.0	5.0
2	foo	two	2.0	NaN
3	bar	one	3.0	6.0
4	bar	two	NaN	7.0