pandas: Find, count, drop duplicates (duplicated, drop_duplicates)
In pandas, the duplicated() method is used to find, extract, and count duplicate rows in a DataFrame, while drop_duplicates() is used to remove these duplicates. This article also briefly explains the groupby() method, which aggregates values based on duplicates.
The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.
import pandas as pd
print(pd.__version__)
# 2.1.4
df = pd.read_csv('data/src/sample_pandas_normal.csv')
df.loc[6] = ['Dave', 68, 'TX', 70]
print(df)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
# 6 Dave 68 TX 70
The following examples use DataFrame, but Series also supports the duplicated() and drop_duplicates() methods in the same manner.
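For example, the same calls work on the state column extracted as a Series (a minimal illustration using the sample DataFrame above):
print(df['state'].duplicated())
# 0 False
# 1 False
# 2 True
# 3 False
# 4 True
# 5 True
# 6 True
# Name: state, dtype: bool

print(df['state'].drop_duplicates())
# 0 NY
# 1 CA
# 3 TX
# Name: state, dtype: object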
Find, extract, and count duplicate rows: duplicated()
Use the duplicated() method to find, extract, and count duplicate rows in a DataFrame, or duplicate elements in a Series.
- pandas.DataFrame.duplicated — pandas 2.1.4 documentation
- pandas.Series.duplicated — pandas 2.1.4 documentation
Basic usage
duplicated() returns a Boolean Series that marks duplicate rows as True. By default, rows are considered duplicates if all column values are equal.
print(df.duplicated())
# 0 False
# 1 False
# 2 False
# 3 False
# 4 False
# 5 False
# 6 True
# dtype: bool
You can extract duplicates from the DataFrame using the resulting Series as a Boolean mask.
print(df[df.duplicated()])
# name age state point
# 6 Dave 68 TX 70
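Because rows 3 and 6 are duplicates of each other, you can also extract every occurrence (not just the later one) by passing keep=False, which is covered in the next section:
print(df[df.duplicated(keep=False)])
# name age state point
# 3 Dave 68 TX 70
# 6 Dave 68 TX 70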
Choose duplicates to keep: keep
By default (keep='first'), duplicated() marks all duplicate rows as True, except for the first occurrence, which is marked as False. Conversely, setting keep='last' marks all duplicates as True except for the last occurrence.
print(df.duplicated())
# 0 False
# 1 False
# 2 False
# 3 False
# 4 False
# 5 False
# 6 True
# dtype: bool
print(df.duplicated(keep='last'))
# 0 False
# 1 False
# 2 False
# 3 True
# 4 False
# 5 False
# 6 False
# dtype: bool
Setting keep=False marks all duplicates as True.
print(df.duplicated(keep=False))
# 0 False
# 1 False
# 2 False
# 3 True
# 4 False
# 5 False
# 6 True
# dtype: bool
Specify columns for duplicate detection: subset
To identify duplicates based on specific columns, use the subset argument. By default, duplicates are identified across all columns.
print(df.duplicated(subset='state'))
# 0 False
# 1 False
# 2 True
# 3 False
# 4 True
# 5 True
# 6 True
# dtype: bool
Multiple columns can also be specified in a list. Rows matching in all these columns are marked as duplicates.
print(df.duplicated(subset=['state', 'point']))
# 0 False
# 1 False
# 2 False
# 3 False
# 4 False
# 5 False
# 6 True
# dtype: bool
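The subset and keep arguments can be combined. Note that in this sample data every value in the state column occurs more than once, so keep=False marks every row as a duplicate:
print(df.duplicated(subset='state', keep=False))
# 0 True
# 1 True
# 2 True
# 3 True
# 4 True
# 5 True
# 6 True
# dtype: bool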
Count duplicate and non-duplicate rows
You can count duplicate rows by counting True in the Series returned by duplicated(). Since True is treated as 1, the count can be obtained with the sum() method.
print(df.duplicated().sum())
# 1
To count False (representing non-duplicate rows), negate with ~ and then use the sum() method.
print(~df.duplicated())
# 0 True
# 1 True
# 2 True
# 3 True
# 4 True
# 5 True
# 6 False
# dtype: bool
print((~df.duplicated()).sum())
# 6
Both duplicate and non-duplicate counts can be obtained using value_counts().
print(df.duplicated().value_counts())
# False 6
# True 1
# Name: count, dtype: int64
The results vary based on the keep argument, so choose the appropriate setting for your needs.
print(df.duplicated(keep=False).value_counts())
# False 5
# True 2
# Name: count, dtype: int64
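The same counting approach works when duplicate detection is restricted to particular columns with subset. For example:
print(df.duplicated(subset='state').sum())
# 4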
Remove duplicate rows: drop_duplicates()
Use the drop_duplicates() method to remove duplicate rows from a DataFrame, or duplicate elements from a Series.
- pandas.DataFrame.drop_duplicates — pandas 2.1.4 documentation
- pandas.Series.drop_duplicates — pandas 2.1.4 documentation
Basic usage
By default, rows are considered duplicates if all column values are equal. The first duplicate row is kept, while the others are removed.
print(df.drop_duplicates())
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
Choose duplicates to keep: keep
The keep argument in drop_duplicates() behaves similarly to its use in duplicated(). The default setting keep='first' retains the first occurrence of each duplicate row, while keep='last' retains the last. Setting keep=False removes all duplicates.
print(df.drop_duplicates(keep='last'))
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
# 6 Dave 68 TX 70
print(df.drop_duplicates(keep=False))
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
Specify columns for duplicate detection: subset
Specify columns for duplicate detection with the subset argument, similar to duplicated().
print(df.drop_duplicates(subset='state'))
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 3 Dave 68 TX 70
print(df.drop_duplicates(subset=['state', 'point']))
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
Reset the index: ignore_index
Setting ignore_index=True resets the index to a 0-based sequence.
print(df.drop_duplicates(subset='state', keep='last'))
# name age state point
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
# 6 Dave 68 TX 70
print(df.drop_duplicates(subset='state', keep='last', ignore_index=True))
# name age state point
# 0 Ellen 24 CA 88
# 1 Frank 30 NY 57
# 2 Dave 68 TX 70
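On older pandas versions without the ignore_index argument, the same result can be obtained by chaining reset_index(drop=True):
print(df.drop_duplicates(subset='state', keep='last').reset_index(drop=True))
# name age state point
# 0 Ellen 24 CA 88
# 1 Frank 30 NY 57
# 2 Dave 68 TX 70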
Modify the original object: inplace
By default, drop_duplicates() returns a new object without changing the original. Setting inplace=True modifies the original object.
df.drop_duplicates(subset='state', keep='last', inplace=True)
print(df)
# name age state point
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
# 6 Dave 68 TX 70
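Note that with inplace=True the method returns None, so do not assign the result back to a variable:
# drop_duplicates() with inplace=True returns None; df itself is modified
print(df.drop_duplicates(inplace=True))
# None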
Aggregate based on duplicates: groupby()
Use groupby() to aggregate values based on duplicates. In the following examples, the average values of the numeric columns (age and point) are calculated for duplicated values in the state column.
df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
print(df.groupby('state').mean(numeric_only=True))
# age point
# state
# CA 28.0 83.333333
# NY 27.0 60.500000
# TX 68.0 70.000000
String concatenation or conversion to lists is also possible.
print(
    df.groupby('state').agg(
        {'name': lambda x: ','.join(x), 'age': 'mean', 'point': 'sum'}
    )
)
# name age point
# state
# CA Bob,Charlie,Ellen 28.0 250
# NY Alice,Frank 27.0 121
# TX Dave 68.0 70
print(df.groupby('state').agg({'name': list, 'age': 'mean', 'point': 'sum'}))
# name age point
# state
# CA [Bob, Charlie, Ellen] 28.0 250
# NY [Alice, Frank] 27.0 121
# TX [Dave] 68.0 70
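If you only need to know how many rows share each value, size() counts the rows in each group:
print(df.groupby('state').size())
# state
# CA 3
# NY 2
# TX 1
# dtype: int64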
In the first agg() example above, string concatenation is performed by applying the join() method within a lambda expression. For more details on groupby(), refer to the pandas documentation.