pandas: Merge DataFrame with merge(), join() (INNER, OUTER JOIN)

Posted: 2023-08-01 | Tags: Python, pandas

The pandas.merge() function and the merge() method of pandas.DataFrame are used to merge multiple pandas.DataFrame objects based on columns or indexes.

If you want to merge based on the index, you can also use the join() method of pandas.DataFrame.

pandas.DataFrame.join — pandas 2.0.3 documentation

Contents

Basic usage of pandas.merge() and pandas.DataFrame.merge()
The column to be keyed: on, left_on, right_on
The merging method: how
Add an information column: indicator
Suffixes for overlapping column names: suffixes
Use multiple columns as keys
Use the index as a key: left_index, right_index
Basic usage of pandas.DataFrame.join()

The pandas.concat() function can be used to concatenate pandas.DataFrame objects either vertically or horizontally.

pandas: Concat multiple DataFrame/Series with concat()

The sample code in this article uses pandas version 2.0.3. The following two pandas.DataFrame objects are used as examples.

import pandas as pd

print(pd.__version__)
# 2.0.3

df_ab = pd.DataFrame({'a': ['a_1', 'a_2', 'a_3'], 'b': ['b_1', 'b_2', 'b_3']})
df_ac = pd.DataFrame({'a': ['a_1', 'a_2', 'a_4'], 'c': ['c_1', 'c_2', 'c_4']})

print(df_ab)
#      a    b
# 0  a_1  b_1
# 1  a_2  b_2
# 2  a_3  b_3

print(df_ac)
#      a    c
# 0  a_1  c_1
# 1  a_2  c_2
# 2  a_4  c_4

source: pandas_merge_join.py

Basic usage of `pandas.merge()` and `pandas.DataFrame.merge()`

The pandas.merge() function requires the two DataFrame objects as its first left and second right arguments.

print(pd.merge(df_ab, df_ac))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2

source: pandas_merge_join.py

For the merge() method, call the method on the DataFrame that corresponds to left, and specify the DataFrame that corresponds to right as an argument.

print(df_ab.merge(df_ac))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2

source: pandas_merge_join.py

Both methods return a new, merged DataFrame.

The arguments explained below are common to both the pandas.merge() function and the merge() method.

The column to be keyed: `on`, `left_on`, `right_on`

By default, the merge operation uses columns with the same names in the two DataFrame objects as the key.

To explicitly specify the column name to be used as the key, use the on argument.

print(pd.merge(df_ab, df_ac, on='a'))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2

source: pandas_merge_join.py

The left_on and right_on arguments allow you to specify the column names to be used as keys in each DataFrame separately.

df_ac_ = df_ac.rename(columns={'a': 'a_'})
print(df_ac_)
#     a_    c
# 0  a_1  c_1
# 1  a_2  c_2
# 2  a_4  c_4

print(pd.merge(df_ab, df_ac_, left_on='a', right_on='a_'))
#      a    b   a_    c
# 0  a_1  b_1  a_1  c_1
# 1  a_2  b_2  a_2  c_2

source: pandas_merge_join.py

In this case, both columns remain. You can delete the unused column with the drop() method.

pandas: Delete rows, columns from DataFrame with drop()

print(pd.merge(df_ab, df_ac_, left_on='a', right_on='a_').drop(columns='a_'))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2

source: pandas_merge_join.py

To use multiple columns as keys, you can provide a list of column names to the on, left_on, and right_on arguments. More on this later.

The merging method: `how`

Specify the merging method with the how argument, which defaults to 'inner'.

Elements without data are represented as NaN. For handling NaN, refer to the following articles.

INNER JOIN: `how='inner'`

Only rows with keys common to both left and right are retained. This is the default setting.

print(pd.merge(df_ab, df_ac, on='a', how='inner'))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2

source: pandas_merge_join.py

LEFT JOIN: `how='left'`

Join based on left. Rows without corresponding keys in right are still included. This is also known as a "LEFT OUTER JOIN".

print(pd.merge(df_ab, df_ac, on='a', how='left'))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2
# 2  a_3  b_3  NaN

source: pandas_merge_join.py

RIGHT JOIN: `how='right'`

Join based on right. Rows without corresponding keys in left are still included. This is also known as a "RIGHT OUTER JOIN".

print(pd.merge(df_ab, df_ac, on='a', how='right'))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2
# 2  a_4  NaN  c_4

source: pandas_merge_join.py

OUTER JOIN: `how='outer'`

All rows from left and right remain. This is also known as "FULL OUTER JOIN".

print(pd.merge(df_ab, df_ac, on='a', how='outer'))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2
# 2  a_3  b_3  NaN
# 3  a_4  NaN  c_4

source: pandas_merge_join.py

CROSS JOIN: `how='cross'`

This creates all possible combinations of left and right. By default, the suffixes _x and _y are added to overlapping column names. The suffixes argument can be used to specify these, which will be described later.

print(pd.merge(df_ab, df_ac, how='cross'))
#    a_x    b  a_y    c
# 0  a_1  b_1  a_1  c_1
# 1  a_1  b_1  a_2  c_2
# 2  a_1  b_1  a_4  c_4
# 3  a_2  b_2  a_1  c_1
# 4  a_2  b_2  a_2  c_2
# 5  a_2  b_2  a_4  c_4
# 6  a_3  b_3  a_1  c_1
# 7  a_3  b_3  a_2  c_2
# 8  a_3  b_3  a_4  c_4

source: pandas_merge_join.py

You cannot specify the on, left_on, and right_on arguments when how='cross'.

# print(pd.merge(df_ab, df_ac, on='a', how='cross'))
# MergeError: Can not pass on, right_on, left_on or set right_index=True or left_index=True

source: pandas_merge_join.py

Add an information column: `indicator`

Setting the indicator argument to True adds a column containing information about the original data. By default, a column named _merge is added, classifying each row as both, left_only, or right_only.

print(pd.merge(df_ab, df_ac, on='a', how='inner', indicator=True))
#      a    b    c _merge
# 0  a_1  b_1  c_1   both
# 1  a_2  b_2  c_2   both

print(pd.merge(df_ab, df_ac, on='a', how='outer', indicator=True))
#      a    b    c      _merge
# 0  a_1  b_1  c_1        both
# 1  a_2  b_2  c_2        both
# 2  a_3  b_3  NaN   left_only
# 3  a_4  NaN  c_4  right_only

source: pandas_merge_join.py

To name the column differently from _merge, specify the desired name as a string in the indicator argument.

print(pd.merge(df_ab, df_ac, on='a', how='outer', indicator='indicator'))
#      a    b    c   indicator
# 0  a_1  b_1  c_1        both
# 1  a_2  b_2  c_2        both
# 2  a_3  b_3  NaN   left_only
# 3  a_4  NaN  c_4  right_only

source: pandas_merge_join.py

Suffixes for overlapping column names: `suffixes`

If column names other than the key column overlap in left and right, suffixes _x and _y are added by default.

df_ac_b = df_ac.rename(columns={'c': 'b'})
print(df_ac_b)
#      a    b
# 0  a_1  c_1
# 1  a_2  c_2
# 2  a_4  c_4

print(pd.merge(df_ab, df_ac_b, on='a'))
#      a  b_x  b_y
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2

source: pandas_merge_join.py

Specify specific suffixes by passing a list or tuple of two elements like [suffix_for_left, suffix_for_right] in the suffixes argument.

print(pd.merge(df_ab, df_ac_b, on='a', suffixes=['_left', '_right']))
#      a b_left b_right
# 0  a_1    b_1     c_1
# 1  a_2    b_2     c_2

source: pandas_merge_join.py

Use multiple columns as keys

The examples so far have only one column as a key, but you can use multiple columns as keys.

Consider following DataFrame objects with added new columns.

pandas: Add rows/columns to DataFrame with assign(), insert()

df_abx = df_ab.assign(x=['x_2', 'x_2', 'x_3'])
df_acx = df_ac.assign(x=['x_1', 'x_2', 'x_2'])

print(df_abx)
#      a    b    x
# 0  a_1  b_1  x_2
# 1  a_2  b_2  x_2
# 2  a_3  b_3  x_3

print(df_acx)
#      a    c    x
# 0  a_1  c_1  x_1
# 1  a_2  c_2  x_2
# 2  a_4  c_4  x_2

source: pandas_merge_join.py

By default, if multiple columns share the same name, all these columns are treated as keys. To specify explicitly, pass a list of column names to the on argument.

print(pd.merge(df_abx, df_acx))
#      a    b    x    c
# 0  a_2  b_2  x_2  c_2

print(pd.merge(df_abx, df_acx, on=['a', 'x']))
#      a    b    x    c
# 0  a_2  b_2  x_2  c_2

source: pandas_merge_join.py

Even if multiple columns have the same name, you can use only one of the columns as a key. Suffixes are added to the overlapping column names. As described above, you can specify any suffix with the suffixes argument.

print(pd.merge(df_abx, df_acx, on='a'))
#      a    b  x_x    c  x_y
# 0  a_1  b_1  x_2  c_1  x_1
# 1  a_2  b_2  x_2  c_2  x_2

source: pandas_merge_join.py

If you want to use columns with different names as keys, provide lists of column names to the left_on and right_on arguments respectively.

df_acx_ = df_acx.rename(columns={'x': 'x_'})
print(df_acx_)
#      a    c   x_
# 0  a_1  c_1  x_1
# 1  a_2  c_2  x_2
# 2  a_4  c_4  x_2

print(pd.merge(df_abx, df_acx_, left_on=['a', 'x'], right_on=['a', 'x_']))
#      a    b    x    c   x_
# 0  a_2  b_2  x_2  c_2  x_2

source: pandas_merge_join.py

The usage of the how argument is the same as when using a single column as a key.

print(pd.merge(df_abx, df_acx, on=['a', 'x'], how='inner'))
#      a    b    x    c
# 0  a_2  b_2  x_2  c_2

print(pd.merge(df_abx, df_acx, on=['a', 'x'], how='left'))
#      a    b    x    c
# 0  a_1  b_1  x_2  NaN
# 1  a_2  b_2  x_2  c_2
# 2  a_3  b_3  x_3  NaN

print(pd.merge(df_abx, df_acx, on=['a', 'x'], how='right'))
#      a    b    x    c
# 0  a_1  NaN  x_1  c_1
# 1  a_2  b_2  x_2  c_2
# 2  a_4  NaN  x_2  c_4

print(pd.merge(df_abx, df_acx, on=['a', 'x'], how='outer'))
#      a    b    x    c
# 0  a_1  b_1  x_2  NaN
# 1  a_2  b_2  x_2  c_2
# 2  a_3  b_3  x_3  NaN
# 3  a_1  NaN  x_1  c_1
# 4  a_4  NaN  x_2  c_4

print(pd.merge(df_abx, df_acx, how='cross'))
#    a_x    b  x_x  a_y    c  x_y
# 0  a_1  b_1  x_2  a_1  c_1  x_1
# 1  a_1  b_1  x_2  a_2  c_2  x_2
# 2  a_1  b_1  x_2  a_4  c_4  x_2
# 3  a_2  b_2  x_2  a_1  c_1  x_1
# 4  a_2  b_2  x_2  a_2  c_2  x_2
# 5  a_2  b_2  x_2  a_4  c_4  x_2
# 6  a_3  b_3  x_3  a_1  c_1  x_1
# 7  a_3  b_3  x_3  a_2  c_2  x_2
# 8  a_3  b_3  x_3  a_4  c_4  x_2

source: pandas_merge_join.py

Use the index as a key: `left_index`, `right_index`

To use the index as a key, set the left_index and right_index arguments to True.

df_ac_i = df_ac.set_index('a')
print(df_ac_i)
#        c
# a       
# a_1  c_1
# a_2  c_2
# a_4  c_4

print(pd.merge(df_ab, df_ac_i, left_on='a', right_index=True))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2

source: pandas_merge_join.py

You can also set both the left_index and right_index arguments to True.

df_ab_i = df_ab.set_index('a')
print(df_ab_i)
#        b
# a       
# a_1  b_1
# a_2  b_2
# a_3  b_3

print(pd.merge(df_ab_i, df_ac_i, left_index=True, right_index=True))
#        b    c
# a            
# a_1  b_1  c_1
# a_2  b_2  c_2

source: pandas_merge_join.py

To use the index as a key, you can also use the join() method, which is described next.

Basic usage of `pandas.DataFrame.join()`

You can also use the join() method of pandas.DataFrame to merge based on the index. Note that there is no pandas.join() function. The usage is generally the same as the merge() method.

Consider following DataFrame objects with a specified index.

print(df_ab_i)
#        b
# a       
# a_1  b_1
# a_2  b_2
# a_3  b_3

print(df_ac_i)
#        c
# a       
# a_1  c_1
# a_2  c_2
# a_4  c_4

source: pandas_merge_join.py

`how`

The join() method merges DataFrame objects based on their index and defaults to a left join (how='left').

print(df_ab_i.join(df_ac_i))
#        b    c
# a            
# a_1  b_1  c_1
# a_2  b_2  c_2
# a_3  b_3  NaN

print(df_ab_i.join(df_ac_i, how='inner'))
#        b    c
# a            
# a_1  b_1  c_1
# a_2  b_2  c_2

print(df_ab_i.join(df_ac_i, how='left'))
#        b    c
# a            
# a_1  b_1  c_1
# a_2  b_2  c_2
# a_3  b_3  NaN

print(df_ab_i.join(df_ac_i, how='right'))
#        b    c
# a            
# a_1  b_1  c_1
# a_2  b_2  c_2
# a_4  NaN  c_4

print(df_ab_i.join(df_ac_i, how='outer'))
#        b    c
# a            
# a_1  b_1  c_1
# a_2  b_2  c_2
# a_3  b_3  NaN
# a_4  NaN  c_4

print(df_ab_i.join(df_ac_i, how='cross'))
#      b    c
# 0  b_1  c_1
# 1  b_1  c_2
# 2  b_1  c_4
# 3  b_2  c_1
# 4  b_2  c_2
# 5  b_2  c_4
# 6  b_3  c_1
# 7  b_3  c_2
# 8  b_3  c_4

source: pandas_merge_join.py

`on`

You can specify the column to be used as the key in the calling DataFrame using the on argument. The DataFrame specified as an argument always uses the index as the key.

print(df_ab)
#      a    b
# 0  a_1  b_1
# 1  a_2  b_2
# 2  a_3  b_3

print(df_ab.join(df_ac_i, on='a'))
#      a    b    c
# 0  a_1  b_1  c_1
# 1  a_2  b_2  c_2
# 2  a_3  b_3  NaN

source: pandas_merge_join.py

`lsuffix` and `rsuffix`

If there are overlapping column names in the result, it will cause an error by default. You need to specify the lsuffix and rsuffix arguments.

df_ab_i2 = df_ac_i.rename(columns={'c': 'b'})
print(df_ab_i2)
#        b
# a       
# a_1  c_1
# a_2  c_2
# a_4  c_4

# print(df_ab_i.join(df_ab_i2))
# ValueError: columns overlap but no suffix specified: Index(['b'], dtype='object')

print(df_ab_i.join(df_ab_i2, lsuffix='_left', rsuffix='_right'))
#     b_left b_right
# a                 
# a_1    b_1     c_1
# a_2    b_2     c_2
# a_3    b_3     NaN

source: pandas_merge_join.py

Specify multiple `pandas.DataFrame` in a list

You can pass a list of DataFrame objects as the first argument to the join() method.

df_ad_i = pd.DataFrame({'a': ['a_1', 'a_4', 'a_5'], 'd': ['d_1', 'd_4', 'd_5']}).set_index('a')
print(df_ad_i)
#        d
# a       
# a_1  d_1
# a_4  d_4
# a_5  d_5

print(df_ab_i.join([df_ac_i, df_ad_i]))
#        b    c    d
# a                 
# a_1  b_1  c_1  d_1
# a_2  b_2  c_2  NaN
# a_3  b_3  NaN  NaN

print(df_ac_i.join([df_ad_i, df_ab_i]))
#        c    d    b
# a                 
# a_1  c_1  d_1  b_1
# a_2  c_2  NaN  b_2
# a_4  c_4  d_4  NaN

source: pandas_merge_join.py

pandas: Merge DataFrame with merge(), join() (INNER, OUTER JOIN)

Basic usage of `pandas.merge()` and `pandas.DataFrame.merge()`

The column to be keyed: `on`, `left_on`, `right_on`

The merging method: `how`

INNER JOIN: `how='inner'`

LEFT JOIN: `how='left'`

RIGHT JOIN: `how='right'`

OUTER JOIN: `how='outer'`

CROSS JOIN: `how='cross'`

Add an information column: `indicator`

Suffixes for overlapping column names: `suffixes`

Use multiple columns as keys

Use the index as a key: `left_index`, `right_index`

Basic usage of `pandas.DataFrame.join()`

`how`

`on`

`lsuffix` and `rsuffix`

Specify multiple `pandas.DataFrame` in a list

Related Categories

Related Articles

pandas: Merge DataFrame with merge(), join() (INNER, OUTER JOIN)

Basic usage of pandas.merge() and pandas.DataFrame.merge()

The column to be keyed: on, left_on, right_on

The merging method: how

INNER JOIN: how='inner'

LEFT JOIN: how='left'

RIGHT JOIN: how='right'

OUTER JOIN: how='outer'

CROSS JOIN: how='cross'

Add an information column: indicator

Suffixes for overlapping column names: suffixes

Use multiple columns as keys

Use the index as a key: left_index, right_index

Basic usage of pandas.DataFrame.join()

how

on

lsuffix and rsuffix

Specify multiple pandas.DataFrame in a list

Related Categories

Related Articles

Basic usage of `pandas.merge()` and `pandas.DataFrame.merge()`

The column to be keyed: `on`, `left_on`, `right_on`

The merging method: `how`

INNER JOIN: `how='inner'`

LEFT JOIN: `how='left'`

RIGHT JOIN: `how='right'`

OUTER JOIN: `how='outer'`

CROSS JOIN: `how='cross'`

Add an information column: `indicator`

Suffixes for overlapping column names: `suffixes`

Use the index as a key: `left_index`, `right_index`

Basic usage of `pandas.DataFrame.join()`

`how`

`on`

`lsuffix` and `rsuffix`

Specify multiple `pandas.DataFrame` in a list