• Martin Thoma
  • Home
  • Categories
  • Tags
  • Archives
  • Support me

Pandas: Merge, join, concatenate

Contents

  • Pandas: Merge, join, concatenate
    • Merge
    • Join
    • Concatenate
    • See also

I always get confused about the different methods how to join Pandas dataframes. So here are some examples that should make it crystal clear.

First, we create two dataframes:

#!/usr/bin/env python

import pandas as pd

countries = ["Germany", "France", "Indonesia"]
inhabitants = [82.5 * 10 ** 6, 66.9 * 10 ** 6, 255.5 * 10 ** 6]
capitals = ["Berlin", "Paris", "Jakarta"]

df1 = pd.DataFrame(
    {"country": countries, "inhabitant": inhabitants, "capital": capitals}
)
df1 = df1[["country", "capital", "inhabitant"]]
print(df1)

countries = ["Germany", "Italy", "Spain", "Austria"]
capitals = ["Berlin", "Rome", "Madrid", "Vienna"]
hdis = [0.926, 0.897, 0.844, 0.893]
df2 = pd.DataFrame({"country": countries, "capital": capitals, "HDI": hdis})
df2 = df2[["country", "capital", "HDI"]]
print(df2)

So df1 is:

     country  capital   inhabitant
0    Germany   Berlin   82500000.0
1     France    Paris   66900000.0
2  Indonesia  Jakarta  255500000.0

and df2 is

   country capital    HDI
0  Germany  Berlin  0.926
1    Italy    Rome  0.897
2    Spain  Madrid  0.844
3  Austria  Vienna  0.893

Merge

The pandas docs tell you that it has the syntax:

>>> df1.merge(df2, on='country', how='inner')
   country capital_x  inhabitant capital_y    HDI
0  Germany    Berlin  82500000.0    Berlin  0.926

Please note that the column capital was duplicated as it is in both tables

Then you can do all of the usual operations we know from SQL JOINs, e.g. a LEFT JOIN:

>>> df1.merge(df2, on='country', how='left')
     country capital_x   inhabitant capital_y    HDI
0    Germany    Berlin   82500000.0    Berlin  0.926
1     France     Paris   66900000.0       NaN    NaN
2  Indonesia   Jakarta  255500000.0       NaN    NaN

A RIGHT JOIN:

>>> df1.merge(df2, on='country', how='right')
   country capital_x  inhabitant capital_y    HDI
0  Germany    Berlin  82500000.0    Berlin  0.926
1    Italy       NaN         NaN      Rome  0.897
2    Spain       NaN         NaN    Madrid  0.844
3  Austria       NaN         NaN    Vienna  0.893

OUTER JOIN:

>>> df1.merge(df2, on='country', how='outer')
     country capital_x   inhabitant capital_y    HDI
0    Germany    Berlin   82500000.0    Berlin  0.926
1     France     Paris   66900000.0       NaN    NaN
2  Indonesia   Jakarta  255500000.0       NaN    NaN
3      Italy       NaN          NaN      Rome  0.897
4      Spain       NaN          NaN    Madrid  0.844
5    Austria       NaN          NaN    Vienna  0.893

Join

Join is just a convenience method, which uses merge and should be used if you want to merge on the index:

The related DataFrame.join method, uses merge internally for the index-on-index and index-on-column(s) joins, but joins on indexes by default rather than trying to join on common columns (the default behavior for merge). If you are joining on index, you may wish to use DataFrame.join to save yourself some typing.

The pandas join operation states:

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

Join columns with other DataFrame either on index or on a key column. Efficiently Join multiple DataFrame objects by index at once by passing a list.

Having a look at the following example:

>>> df1.join(df2, on='country', how='outer', lsuffix='_df1')
  country_df1 capital_df1   inhabitant country capital    HDI
0     Germany      Berlin   82500000.0     NaN     NaN    NaN
1      France       Paris   66900000.0     NaN     NaN    NaN
2   Indonesia     Jakarta  255500000.0     NaN     NaN    NaN
2         NaN         NaN          NaN       0  Berlin  0.926
2         NaN         NaN          NaN       1    Rome  0.897
2         NaN         NaN          NaN       2  Madrid  0.844
2         NaN         NaN          NaN       3  Vienna  0.893

I would say join and merge look extremely similar. You can notice differences in the function signature when you look at the help, but the difference in the output is more subtile. It's the index: For merge, you still have the typical index where each element is unique. For join, if you merge on a column, you don't have that anymore.

Concatenate

Pandas Documentation and the example

>>> pd.concat([df1, df2])
     HDI  capital    country   inhabitant
0    NaN   Berlin    Germany   82500000.0
1    NaN    Paris     France   66900000.0
2    NaN  Jakarta  Indonesia  255500000.0
0  0.926   Berlin    Germany          NaN
1  0.897     Rome      Italy          NaN
2  0.844   Madrid      Spain          NaN
3  0.893   Vienna    Austria          NaN

See also

  • Merge, join, and concatenate
  • What is the difference between join and merge in Pandas?

Published

Mär 14, 2018
by Martin Thoma

Category

Code

Tags

  • Pandas 3

Contact

  • Martin Thoma - A blog about Code, the Web and Cyberculture
  • E-mail subscription
  • RSS-Feed
  • Privacy/Datenschutzerklärung
  • Impressum
  • Powered by Pelican. Theme: Elegant by Talha Mansoor