pandas: Merge DataFrame with merge(), join() (INNER, OUTER JOIN)

Posted: | Tags: Python, pandas

The pandas.merge() function and the merge() method of pandas.DataFrame are used to merge multiple pandas.DataFrame objects based on columns or indexes.

If you want to merge based on the index, you can also use the join() method of pandas.DataFrame.

The pandas.concat() function can be used to concatenate pandas.DataFrame objects either vertically or horizontally.

The sample code in this article uses pandas version 2.0.3. The following two pandas.DataFrame objects are used as examples.

import pandas as pd

print(pd.__version__)
# 2.0.3

df_ab = pd.DataFrame({'a': ['a_1', 'a_2', 'a_3'], 'b': ['b_1', 'b_2', 'b_3']})
df_ac = pd.DataFrame({'a': ['a_1', 'a_2', 'a_4'], 'c': ['c_1', 'c_2', 'c_4']})

print(df_ab)
#      a    b
# 0  a_1  b_1
# 1  a_2  b_2
# 2  a_3  b_3

print(df_ac)
#      a    c
# 0  a_1  c_1
# 1  a_2  c_2
# 2  a_4  c_4

Basic usage of pandas.merge() and pandas.DataFrame.merge()

The pandas.merge() function requires the two DataFrame objects as its first left and second right arguments.

print(pd.merge(df_ab, df_ac))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2

For the merge() method, call the method on the DataFrame that corresponds to left, and specify the DataFrame that corresponds to right as an argument.

print(df_ab.merge(df_ac))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2

Both methods return a new, merged DataFrame.

The arguments explained below are common to both the pandas.merge() function and the merge() method.

The column to be keyed: on, left_on, right_on

By default, the merge operation uses columns with the same names in the two DataFrame objects as the key.

To explicitly specify the column name to be used as the key, use the on argument.

print(pd.merge(df_ab, df_ac, on='a'))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2

The left_on and right_on arguments allow you to specify the column names to be used as keys in each DataFrame separately.

df_ac_ = df_ac.rename(columns={'a': 'a_'})
print(df_ac_)
#     a_    c
# 0  a_1  c_1
# 1  a_2  c_2
# 2  a_4  c_4

print(pd.merge(df_ab, df_ac_, left_on='a', right_on='a_'))
#      a    b   a_    c
# 0  a_1  b_1  a_1  c_1
# 1  a_2  b_2  a_2  c_2

In this case, both columns remain. You can delete the unused column with the drop() method.

print(pd.merge(df_ab, df_ac_, left_on='a', right_on='a_').drop(columns='a_'))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2

To use multiple columns as keys, you can provide a list of column names to the on, left_on, and right_on arguments. More on this later.

The merging method: how

Specify the merging method with the how argument, which defaults to 'inner'.

Elements without data are represented as NaN. For handling NaN, refer to the following articles.

INNER JOIN: how='inner'

Only rows with keys common to both left and right are retained. This is the default setting.

print(pd.merge(df_ab, df_ac, on='a', how='inner'))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2

LEFT JOIN: how='left'

Join based on left. Rows without corresponding keys in right are still included. This is also known as a "LEFT OUTER JOIN".

print(pd.merge(df_ab, df_ac, on='a', how='left'))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2
# 2  a_3  b_3  NaN

RIGHT JOIN: how='right'

Join based on right. Rows without corresponding keys in left are still included. This is also known as a "RIGHT OUTER JOIN".

print(pd.merge(df_ab, df_ac, on='a', how='right'))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2
# 2  a_4  NaN  c_4

OUTER JOIN: how='outer'

All rows from left and right remain. This is also known as "FULL OUTER JOIN".

print(pd.merge(df_ab, df_ac, on='a', how='outer'))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2
# 2  a_3  b_3  NaN
# 3  a_4  NaN  c_4

CROSS JOIN: how='cross'

This creates all possible combinations of left and right. By default, the suffixes _x and _y are added to overlapping column names. The suffixes argument can be used to specify these, which will be described later.

print(pd.merge(df_ab, df_ac, how='cross'))
#    a_x    b  a_y    c
# 0  a_1  b_1  a_1  c_1
# 1  a_1  b_1  a_2  c_2
# 2  a_1  b_1  a_4  c_4
# 3  a_2  b_2  a_1  c_1
# 4  a_2  b_2  a_2  c_2
# 5  a_2  b_2  a_4  c_4
# 6  a_3  b_3  a_1  c_1
# 7  a_3  b_3  a_2  c_2
# 8  a_3  b_3  a_4  c_4

You cannot specify the on, left_on, and right_on arguments when how='cross'.

# print(pd.merge(df_ab, df_ac, on='a', how='cross'))
# MergeError: Can not pass on, right_on, left_on or set right_index=True or left_index=True

Add an information column: indicator

Setting the indicator argument to True adds a column containing information about the original data. By default, a column named _merge is added, classifying each row as both, left_only, or right_only.

print(pd.merge(df_ab, df_ac, on='a', how='inner', indicator=True))
#      a    b    c _merge
# 0  a_1  b_1  c_1   both
# 1  a_2  b_2  c_2   both

print(pd.merge(df_ab, df_ac, on='a', how='outer', indicator=True))
#      a    b    c      _merge
# 0  a_1  b_1  c_1        both
# 1  a_2  b_2  c_2        both
# 2  a_3  b_3  NaN   left_only
# 3  a_4  NaN  c_4  right_only

To name the column differently from _merge, specify the desired name as a string in the indicator argument.

print(pd.merge(df_ab, df_ac, on='a', how='outer', indicator='indicator'))
#      a    b    c   indicator
# 0  a_1  b_1  c_1        both
# 1  a_2  b_2  c_2        both
# 2  a_3  b_3  NaN   left_only
# 3  a_4  NaN  c_4  right_only

Suffixes for overlapping column names: suffixes

If column names other than the key column overlap in left and right, suffixes _x and _y are added by default.

df_ac_b = df_ac.rename(columns={'c': 'b'})
print(df_ac_b)
#      a    b
# 0  a_1  c_1
# 1  a_2  c_2
# 2  a_4  c_4

print(pd.merge(df_ab, df_ac_b, on='a'))
#      a  b_x  b_y
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2

Specify specific suffixes by passing a list or tuple of two elements like [suffix_for_left, suffix_for_right] in the suffixes argument.

print(pd.merge(df_ab, df_ac_b, on='a', suffixes=['_left', '_right']))
#      a b_left b_right
# 0  a_1    b_1     c_1
# 1  a_2    b_2     c_2

Use multiple columns as keys

The examples so far have only one column as a key, but you can use multiple columns as keys.

Consider following DataFrame objects with added new columns.

df_abx = df_ab.assign(x=['x_2', 'x_2', 'x_3'])
df_acx = df_ac.assign(x=['x_1', 'x_2', 'x_2'])

print(df_abx)
#      a    b    x
# 0  a_1  b_1  x_2
# 1  a_2  b_2  x_2
# 2  a_3  b_3  x_3

print(df_acx)
#      a    c    x
# 0  a_1  c_1  x_1
# 1  a_2  c_2  x_2
# 2  a_4  c_4  x_2

By default, if multiple columns share the same name, all these columns are treated as keys. To specify explicitly, pass a list of column names to the on argument.

print(pd.merge(df_abx, df_acx))
#      a    b    x    c
# 0  a_2  b_2  x_2  c_2

print(pd.merge(df_abx, df_acx, on=['a', 'x']))
#      a    b    x    c
# 0  a_2  b_2  x_2  c_2

Even if multiple columns have the same name, you can use only one of the columns as a key. Suffixes are added to the overlapping column names. As described above, you can specify any suffix with the suffixes argument.

print(pd.merge(df_abx, df_acx, on='a'))
#      a    b  x_x    c  x_y
# 0  a_1  b_1  x_2  c_1  x_1
# 1  a_2  b_2  x_2  c_2  x_2

If you want to use columns with different names as keys, provide lists of column names to the left_on and right_on arguments respectively.

df_acx_ = df_acx.rename(columns={'x': 'x_'})
print(df_acx_)
#      a    c   x_
# 0  a_1  c_1  x_1
# 1  a_2  c_2  x_2
# 2  a_4  c_4  x_2

print(pd.merge(df_abx, df_acx_, left_on=['a', 'x'], right_on=['a', 'x_']))
#      a    b    x    c   x_
# 0  a_2  b_2  x_2  c_2  x_2

The usage of the how argument is the same as when using a single column as a key.

print(pd.merge(df_abx, df_acx, on=['a', 'x'], how='inner'))
#      a    b    x    c
# 0  a_2  b_2  x_2  c_2

print(pd.merge(df_abx, df_acx, on=['a', 'x'], how='left'))
#      a    b    x    c
# 0  a_1  b_1  x_2  NaN
# 1  a_2  b_2  x_2  c_2
# 2  a_3  b_3  x_3  NaN

print(pd.merge(df_abx, df_acx, on=['a', 'x'], how='right'))
#      a    b    x    c
# 0  a_1  NaN  x_1  c_1
# 1  a_2  b_2  x_2  c_2
# 2  a_4  NaN  x_2  c_4

print(pd.merge(df_abx, df_acx, on=['a', 'x'], how='outer'))
#      a    b    x    c
# 0  a_1  b_1  x_2  NaN
# 1  a_2  b_2  x_2  c_2
# 2  a_3  b_3  x_3  NaN
# 3  a_1  NaN  x_1  c_1
# 4  a_4  NaN  x_2  c_4

print(pd.merge(df_abx, df_acx, how='cross'))
#    a_x    b  x_x  a_y    c  x_y
# 0  a_1  b_1  x_2  a_1  c_1  x_1
# 1  a_1  b_1  x_2  a_2  c_2  x_2
# 2  a_1  b_1  x_2  a_4  c_4  x_2
# 3  a_2  b_2  x_2  a_1  c_1  x_1
# 4  a_2  b_2  x_2  a_2  c_2  x_2
# 5  a_2  b_2  x_2  a_4  c_4  x_2
# 6  a_3  b_3  x_3  a_1  c_1  x_1
# 7  a_3  b_3  x_3  a_2  c_2  x_2
# 8  a_3  b_3  x_3  a_4  c_4  x_2

Use the index as a key: left_index, right_index

To use the index as a key, set the left_index and right_index arguments to True.

df_ac_i = df_ac.set_index('a')
print(df_ac_i)
#        c
# a       
# a_1  c_1
# a_2  c_2
# a_4  c_4

print(pd.merge(df_ab, df_ac_i, left_on='a', right_index=True))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2

You can also set both the left_index and right_index arguments to True.

df_ab_i = df_ab.set_index('a')
print(df_ab_i)
#        b
# a       
# a_1  b_1
# a_2  b_2
# a_3  b_3

print(pd.merge(df_ab_i, df_ac_i, left_index=True, right_index=True))
#        b    c
# a            
# a_1  b_1  c_1
# a_2  b_2  c_2

To use the index as a key, you can also use the join() method, which is described next.

Basic usage of pandas.DataFrame.join()

You can also use the join() method of pandas.DataFrame to merge based on the index. Note that there is no pandas.join() function. The usage is generally the same as the merge() method.

Consider following DataFrame objects with a specified index.

print(df_ab_i)
#        b
# a       
# a_1  b_1
# a_2  b_2
# a_3  b_3

print(df_ac_i)
#        c
# a       
# a_1  c_1
# a_2  c_2
# a_4  c_4

how

The join() method merges DataFrame objects based on their index and defaults to a left join (how='left').

print(df_ab_i.join(df_ac_i))
#        b    c
# a            
# a_1  b_1  c_1
# a_2  b_2  c_2
# a_3  b_3  NaN

print(df_ab_i.join(df_ac_i, how='inner'))
#        b    c
# a            
# a_1  b_1  c_1
# a_2  b_2  c_2

print(df_ab_i.join(df_ac_i, how='left'))
#        b    c
# a            
# a_1  b_1  c_1
# a_2  b_2  c_2
# a_3  b_3  NaN

print(df_ab_i.join(df_ac_i, how='right'))
#        b    c
# a            
# a_1  b_1  c_1
# a_2  b_2  c_2
# a_4  NaN  c_4

print(df_ab_i.join(df_ac_i, how='outer'))
#        b    c
# a            
# a_1  b_1  c_1
# a_2  b_2  c_2
# a_3  b_3  NaN
# a_4  NaN  c_4

print(df_ab_i.join(df_ac_i, how='cross'))
#      b    c
# 0  b_1  c_1
# 1  b_1  c_2
# 2  b_1  c_4
# 3  b_2  c_1
# 4  b_2  c_2
# 5  b_2  c_4
# 6  b_3  c_1
# 7  b_3  c_2
# 8  b_3  c_4

on

You can specify the column to be used as the key in the calling DataFrame using the on argument. The DataFrame specified as an argument always uses the index as the key.

print(df_ab)
#      a    b
# 0  a_1  b_1
# 1  a_2  b_2
# 2  a_3  b_3

print(df_ab.join(df_ac_i, on='a'))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2
# 2  a_3  b_3  NaN

lsuffix and rsuffix

If there are overlapping column names in the result, it will cause an error by default. You need to specify the lsuffix and rsuffix arguments.

df_ab_i2 = df_ac_i.rename(columns={'c': 'b'})
print(df_ab_i2)
#        b
# a       
# a_1  c_1
# a_2  c_2
# a_4  c_4

# print(df_ab_i.join(df_ab_i2))
# ValueError: columns overlap but no suffix specified: Index(['b'], dtype='object')

print(df_ab_i.join(df_ab_i2, lsuffix='_left', rsuffix='_right'))
#     b_left b_right
# a                 
# a_1    b_1     c_1
# a_2    b_2     c_2
# a_3    b_3     NaN

Specify multiple pandas.DataFrame in a list

You can pass a list of DataFrame objects as the first argument to the join() method.

df_ad_i = pd.DataFrame({'a': ['a_1', 'a_4', 'a_5'], 'd': ['d_1', 'd_4', 'd_5']}).set_index('a')
print(df_ad_i)
#        d
# a       
# a_1  d_1
# a_4  d_4
# a_5  d_5

print(df_ab_i.join([df_ac_i, df_ad_i]))
#        b    c    d
# a                 
# a_1  b_1  c_1  d_1
# a_2  b_2  c_2  NaN
# a_3  b_3  NaN  NaN

print(df_ac_i.join([df_ad_i, df_ab_i]))
#        c    d    b
# a                 
# a_1  c_1  d_1  b_1
# a_2  c_2  NaN  b_2
# a_4  c_4  d_4  NaN

Related Categories

Related Articles