pandas: Merge DataFrame with merge(), join() (INNER, OUTER JOIN)
The pandas.merge()
function and the merge()
method of pandas.DataFrame
are used to merge multiple pandas.DataFrame
objects based on columns or indexes.
If you want to merge based on the index, you can also use the join()
method of pandas.DataFrame
.
- Basic usage of
pandas.merge()
andpandas.DataFrame.merge()
- The column to be keyed:
on
,left_on
,right_on
- The merging method:
how
- Add an information column:
indicator
- Suffixes for overlapping column names:
suffixes
- Use multiple columns as keys
- Use the index as a key:
left_index
,right_index
- Basic usage of
pandas.DataFrame.join()
The pandas.concat()
function can be used to concatenate pandas.DataFrame
objects either vertically or horizontally.
The sample code in this article uses pandas version 2.0.3
. The following two pandas.DataFrame
objects are used as examples.
import pandas as pd
print(pd.__version__)
# 2.0.3
df_ab = pd.DataFrame({'a': ['a_1', 'a_2', 'a_3'], 'b': ['b_1', 'b_2', 'b_3']})
df_ac = pd.DataFrame({'a': ['a_1', 'a_2', 'a_4'], 'c': ['c_1', 'c_2', 'c_4']})
print(df_ab)
# a b
# 0 a_1 b_1
# 1 a_2 b_2
# 2 a_3 b_3
print(df_ac)
# a c
# 0 a_1 c_1
# 1 a_2 c_2
# 2 a_4 c_4
Basic usage of pandas.merge()
and pandas.DataFrame.merge()
The pandas.merge()
function requires the two DataFrame
objects as its first left
and second right
arguments.
print(pd.merge(df_ab, df_ac))
# a b c
# 0 a_1 b_1 c_1
# 1 a_2 b_2 c_2
For the merge()
method, call the method on the DataFrame
that corresponds to left
, and specify the DataFrame
that corresponds to right
as an argument.
print(df_ab.merge(df_ac))
# a b c
# 0 a_1 b_1 c_1
# 1 a_2 b_2 c_2
Both methods return a new, merged DataFrame
.
The arguments explained below are common to both the pandas.merge()
function and the merge()
method.
The column to be keyed: on
, left_on
, right_on
By default, the merge operation uses columns with the same names in the two DataFrame
objects as the key.
To explicitly specify the column name to be used as the key, use the on
argument.
print(pd.merge(df_ab, df_ac, on='a'))
# a b c
# 0 a_1 b_1 c_1
# 1 a_2 b_2 c_2
The left_on
and right_on
arguments allow you to specify the column names to be used as keys in each DataFrame
separately.
df_ac_ = df_ac.rename(columns={'a': 'a_'})
print(df_ac_)
# a_ c
# 0 a_1 c_1
# 1 a_2 c_2
# 2 a_4 c_4
print(pd.merge(df_ab, df_ac_, left_on='a', right_on='a_'))
# a b a_ c
# 0 a_1 b_1 a_1 c_1
# 1 a_2 b_2 a_2 c_2
In this case, both columns remain. You can delete the unused column with the drop()
method.
print(pd.merge(df_ab, df_ac_, left_on='a', right_on='a_').drop(columns='a_'))
# a b c
# 0 a_1 b_1 c_1
# 1 a_2 b_2 c_2
To use multiple columns as keys, you can provide a list of column names to the on
, left_on
, and right_on
arguments. More on this later.
The merging method: how
Specify the merging method with the how
argument, which defaults to 'inner'
.
Elements without data are represented as NaN
. For handling NaN
, refer to the following articles.
- pandas: Remove missing values (NaN) with dropna()
- pandas: Replace missing values (NaN) with fillna()
INNER JOIN: how='inner'
Only rows with keys common to both left
and right
are retained. This is the default setting.
print(pd.merge(df_ab, df_ac, on='a', how='inner'))
# a b c
# 0 a_1 b_1 c_1
# 1 a_2 b_2 c_2
LEFT JOIN: how='left'
Join based on left
. Rows without corresponding keys in right
are still included. This is also known as a "LEFT OUTER JOIN".
print(pd.merge(df_ab, df_ac, on='a', how='left'))
# a b c
# 0 a_1 b_1 c_1
# 1 a_2 b_2 c_2
# 2 a_3 b_3 NaN
RIGHT JOIN: how='right'
Join based on right
. Rows without corresponding keys in left
are still included. This is also known as a "RIGHT OUTER JOIN".
print(pd.merge(df_ab, df_ac, on='a', how='right'))
# a b c
# 0 a_1 b_1 c_1
# 1 a_2 b_2 c_2
# 2 a_4 NaN c_4
OUTER JOIN: how='outer'
All rows from left
and right
remain. This is also known as "FULL OUTER JOIN".
print(pd.merge(df_ab, df_ac, on='a', how='outer'))
# a b c
# 0 a_1 b_1 c_1
# 1 a_2 b_2 c_2
# 2 a_3 b_3 NaN
# 3 a_4 NaN c_4
CROSS JOIN: how='cross'
This creates all possible combinations of left
and right
. By default, the suffixes _x
and _y
are added to overlapping column names. The suffixes
argument can be used to specify these, which will be described later.
print(pd.merge(df_ab, df_ac, how='cross'))
# a_x b a_y c
# 0 a_1 b_1 a_1 c_1
# 1 a_1 b_1 a_2 c_2
# 2 a_1 b_1 a_4 c_4
# 3 a_2 b_2 a_1 c_1
# 4 a_2 b_2 a_2 c_2
# 5 a_2 b_2 a_4 c_4
# 6 a_3 b_3 a_1 c_1
# 7 a_3 b_3 a_2 c_2
# 8 a_3 b_3 a_4 c_4
You cannot specify the on
, left_on
, and right_on
arguments when how='cross'
.
# print(pd.merge(df_ab, df_ac, on='a', how='cross'))
# MergeError: Can not pass on, right_on, left_on or set right_index=True or left_index=True
Add an information column: indicator
Setting the indicator
argument to True
adds a column containing information about the original data. By default, a column named _merge
is added, classifying each row as both
, left_only
, or right_only
.
print(pd.merge(df_ab, df_ac, on='a', how='inner', indicator=True))
# a b c _merge
# 0 a_1 b_1 c_1 both
# 1 a_2 b_2 c_2 both
print(pd.merge(df_ab, df_ac, on='a', how='outer', indicator=True))
# a b c _merge
# 0 a_1 b_1 c_1 both
# 1 a_2 b_2 c_2 both
# 2 a_3 b_3 NaN left_only
# 3 a_4 NaN c_4 right_only
To name the column differently from _merge
, specify the desired name as a string in the indicator
argument.
print(pd.merge(df_ab, df_ac, on='a', how='outer', indicator='indicator'))
# a b c indicator
# 0 a_1 b_1 c_1 both
# 1 a_2 b_2 c_2 both
# 2 a_3 b_3 NaN left_only
# 3 a_4 NaN c_4 right_only
Suffixes for overlapping column names: suffixes
If column names other than the key column overlap in left
and right
, suffixes _x
and _y
are added by default.
df_ac_b = df_ac.rename(columns={'c': 'b'})
print(df_ac_b)
# a b
# 0 a_1 c_1
# 1 a_2 c_2
# 2 a_4 c_4
print(pd.merge(df_ab, df_ac_b, on='a'))
# a b_x b_y
# 0 a_1 b_1 c_1
# 1 a_2 b_2 c_2
Specify specific suffixes by passing a list or tuple of two elements like [suffix_for_left, suffix_for_right]
in the suffixes
argument.
print(pd.merge(df_ab, df_ac_b, on='a', suffixes=['_left', '_right']))
# a b_left b_right
# 0 a_1 b_1 c_1
# 1 a_2 b_2 c_2
Use multiple columns as keys
The examples so far have only one column as a key, but you can use multiple columns as keys.
Consider following DataFrame
objects with added new columns.
df_abx = df_ab.assign(x=['x_2', 'x_2', 'x_3'])
df_acx = df_ac.assign(x=['x_1', 'x_2', 'x_2'])
print(df_abx)
# a b x
# 0 a_1 b_1 x_2
# 1 a_2 b_2 x_2
# 2 a_3 b_3 x_3
print(df_acx)
# a c x
# 0 a_1 c_1 x_1
# 1 a_2 c_2 x_2
# 2 a_4 c_4 x_2
By default, if multiple columns share the same name, all these columns are treated as keys. To specify explicitly, pass a list of column names to the on
argument.
print(pd.merge(df_abx, df_acx))
# a b x c
# 0 a_2 b_2 x_2 c_2
print(pd.merge(df_abx, df_acx, on=['a', 'x']))
# a b x c
# 0 a_2 b_2 x_2 c_2
Even if multiple columns have the same name, you can use only one of the columns as a key. Suffixes are added to the overlapping column names. As described above, you can specify any suffix with the suffixes
argument.
print(pd.merge(df_abx, df_acx, on='a'))
# a b x_x c x_y
# 0 a_1 b_1 x_2 c_1 x_1
# 1 a_2 b_2 x_2 c_2 x_2
If you want to use columns with different names as keys, provide lists of column names to the left_on
and right_on
arguments respectively.
df_acx_ = df_acx.rename(columns={'x': 'x_'})
print(df_acx_)
# a c x_
# 0 a_1 c_1 x_1
# 1 a_2 c_2 x_2
# 2 a_4 c_4 x_2
print(pd.merge(df_abx, df_acx_, left_on=['a', 'x'], right_on=['a', 'x_']))
# a b x c x_
# 0 a_2 b_2 x_2 c_2 x_2
The usage of the how
argument is the same as when using a single column as a key.
print(pd.merge(df_abx, df_acx, on=['a', 'x'], how='inner'))
# a b x c
# 0 a_2 b_2 x_2 c_2
print(pd.merge(df_abx, df_acx, on=['a', 'x'], how='left'))
# a b x c
# 0 a_1 b_1 x_2 NaN
# 1 a_2 b_2 x_2 c_2
# 2 a_3 b_3 x_3 NaN
print(pd.merge(df_abx, df_acx, on=['a', 'x'], how='right'))
# a b x c
# 0 a_1 NaN x_1 c_1
# 1 a_2 b_2 x_2 c_2
# 2 a_4 NaN x_2 c_4
print(pd.merge(df_abx, df_acx, on=['a', 'x'], how='outer'))
# a b x c
# 0 a_1 b_1 x_2 NaN
# 1 a_2 b_2 x_2 c_2
# 2 a_3 b_3 x_3 NaN
# 3 a_1 NaN x_1 c_1
# 4 a_4 NaN x_2 c_4
print(pd.merge(df_abx, df_acx, how='cross'))
# a_x b x_x a_y c x_y
# 0 a_1 b_1 x_2 a_1 c_1 x_1
# 1 a_1 b_1 x_2 a_2 c_2 x_2
# 2 a_1 b_1 x_2 a_4 c_4 x_2
# 3 a_2 b_2 x_2 a_1 c_1 x_1
# 4 a_2 b_2 x_2 a_2 c_2 x_2
# 5 a_2 b_2 x_2 a_4 c_4 x_2
# 6 a_3 b_3 x_3 a_1 c_1 x_1
# 7 a_3 b_3 x_3 a_2 c_2 x_2
# 8 a_3 b_3 x_3 a_4 c_4 x_2
Use the index as a key: left_index
, right_index
To use the index as a key, set the left_index
and right_index
arguments to True
.
df_ac_i = df_ac.set_index('a')
print(df_ac_i)
# c
# a
# a_1 c_1
# a_2 c_2
# a_4 c_4
print(pd.merge(df_ab, df_ac_i, left_on='a', right_index=True))
# a b c
# 0 a_1 b_1 c_1
# 1 a_2 b_2 c_2
You can also set both the left_index
and right_index
arguments to True
.
df_ab_i = df_ab.set_index('a')
print(df_ab_i)
# b
# a
# a_1 b_1
# a_2 b_2
# a_3 b_3
print(pd.merge(df_ab_i, df_ac_i, left_index=True, right_index=True))
# b c
# a
# a_1 b_1 c_1
# a_2 b_2 c_2
To use the index as a key, you can also use the join()
method, which is described next.
Basic usage of pandas.DataFrame.join()
You can also use the join()
method of pandas.DataFrame
to merge based on the index. Note that there is no pandas.join()
function. The usage is generally the same as the merge()
method.
Consider following DataFrame
objects with a specified index.
print(df_ab_i)
# b
# a
# a_1 b_1
# a_2 b_2
# a_3 b_3
print(df_ac_i)
# c
# a
# a_1 c_1
# a_2 c_2
# a_4 c_4
how
The join()
method merges DataFrame
objects based on their index and defaults to a left join (how='left'
).
print(df_ab_i.join(df_ac_i))
# b c
# a
# a_1 b_1 c_1
# a_2 b_2 c_2
# a_3 b_3 NaN
print(df_ab_i.join(df_ac_i, how='inner'))
# b c
# a
# a_1 b_1 c_1
# a_2 b_2 c_2
print(df_ab_i.join(df_ac_i, how='left'))
# b c
# a
# a_1 b_1 c_1
# a_2 b_2 c_2
# a_3 b_3 NaN
print(df_ab_i.join(df_ac_i, how='right'))
# b c
# a
# a_1 b_1 c_1
# a_2 b_2 c_2
# a_4 NaN c_4
print(df_ab_i.join(df_ac_i, how='outer'))
# b c
# a
# a_1 b_1 c_1
# a_2 b_2 c_2
# a_3 b_3 NaN
# a_4 NaN c_4
print(df_ab_i.join(df_ac_i, how='cross'))
# b c
# 0 b_1 c_1
# 1 b_1 c_2
# 2 b_1 c_4
# 3 b_2 c_1
# 4 b_2 c_2
# 5 b_2 c_4
# 6 b_3 c_1
# 7 b_3 c_2
# 8 b_3 c_4
on
You can specify the column to be used as the key in the calling DataFrame
using the on
argument. The DataFrame
specified as an argument always uses the index as the key.
print(df_ab)
# a b
# 0 a_1 b_1
# 1 a_2 b_2
# 2 a_3 b_3
print(df_ab.join(df_ac_i, on='a'))
# a b c
# 0 a_1 b_1 c_1
# 1 a_2 b_2 c_2
# 2 a_3 b_3 NaN
lsuffix
and rsuffix
If there are overlapping column names in the result, it will cause an error by default. You need to specify the lsuffix
and rsuffix
arguments.
df_ab_i2 = df_ac_i.rename(columns={'c': 'b'})
print(df_ab_i2)
# b
# a
# a_1 c_1
# a_2 c_2
# a_4 c_4
# print(df_ab_i.join(df_ab_i2))
# ValueError: columns overlap but no suffix specified: Index(['b'], dtype='object')
print(df_ab_i.join(df_ab_i2, lsuffix='_left', rsuffix='_right'))
# b_left b_right
# a
# a_1 b_1 c_1
# a_2 b_2 c_2
# a_3 b_3 NaN
Specify multiple pandas.DataFrame
in a list
You can pass a list of DataFrame
objects as the first argument to the join()
method.
df_ad_i = pd.DataFrame({'a': ['a_1', 'a_4', 'a_5'], 'd': ['d_1', 'd_4', 'd_5']}).set_index('a')
print(df_ad_i)
# d
# a
# a_1 d_1
# a_4 d_4
# a_5 d_5
print(df_ab_i.join([df_ac_i, df_ad_i]))
# b c d
# a
# a_1 b_1 c_1 d_1
# a_2 b_2 c_2 NaN
# a_3 b_3 NaN NaN
print(df_ac_i.join([df_ad_i, df_ab_i]))
# c d b
# a
# a_1 c_1 d_1 b_1
# a_2 c_2 NaN b_2
# a_4 c_4 d_4 NaN