pandas: Find, count, drop duplicates (duplicated, drop_duplicates)
In pandas, the duplicated() method is used to find, extract, and count duplicate rows in a DataFrame, while drop_duplicates() is used to remove these duplicates.
This article also briefly explains the groupby() method, which aggregates values based on duplicates.
The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.
import pandas as pd
print(pd.__version__)
# 2.1.4
df = pd.read_csv('data/src/sample_pandas_normal.csv')
df.loc[6] = ['Dave', 68, 'TX', 70]
print(df)
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57
# 6     Dave   68    TX     70
The following examples use DataFrame, but Series also supports the duplicated() and drop_duplicates() methods in the same manner.
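For example, applying both methods to the name column of the sample DataFrame gives the following (a minimal sketch; any Series works the same way).
s = df['name']
print(s.duplicated())
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# 5    False
# 6     True
# dtype: bool
print(s.drop_duplicates())
# 0      Alice
# 1        Bob
# 2    Charlie
# 3       Dave
# 4      Ellen
# 5      Frank
# Name: name, dtype: object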
Find, extract, and count duplicate rows: duplicated()
Use the duplicated() method to find, extract, and count duplicate rows in a DataFrame, or duplicate elements in a Series.
- pandas.DataFrame.duplicated — pandas 2.1.4 documentation
- pandas.Series.duplicated — pandas 2.1.4 documentation
Basic usage
duplicated() returns a Boolean Series that marks duplicate rows as True. By default, rows are considered duplicates if all column values are equal.
print(df.duplicated())
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# 5    False
# 6     True
# dtype: bool
You can extract duplicates from the DataFrame using the resulting Series.
print(df[df.duplicated()])
#    name  age state  point
# 6  Dave   68    TX     70
Choose duplicates to keep: keep
By default (keep='first'), duplicated() marks all duplicate rows as True except for the first occurrence, which is marked as False. Conversely, setting keep='last' marks all duplicates as True except for the last occurrence.
print(df.duplicated())
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# 5    False
# 6     True
# dtype: bool
print(df.duplicated(keep='last'))
# 0    False
# 1    False
# 2    False
# 3     True
# 4    False
# 5    False
# 6    False
# dtype: bool
Setting keep=False marks all duplicates as True.
print(df.duplicated(keep=False))
# 0    False
# 1    False
# 2    False
# 3     True
# 4    False
# 5    False
# 6     True
# dtype: bool
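Because keep=False marks every occurrence, it can be combined with Boolean indexing to extract all rows that have a duplicate, including the first occurrence. A short sketch using the sample DataFrame:
print(df[df.duplicated(keep=False)])
#    name  age state  point
# 3  Dave   68    TX     70
# 6  Dave   68    TX     70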
Specify columns for duplicate detection: subset
To identify duplicates based on specific columns, use the subset argument. By default, duplicates are identified across all columns.
print(df.duplicated(subset='state'))
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# 5     True
# 6     True
# dtype: bool
Multiple columns can also be specified as a list. Rows are treated as duplicates only when the values in all of the specified columns match.
print(df.duplicated(subset=['state', 'point']))
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# 5    False
# 6     True
# dtype: bool
Count duplicate and non-duplicate rows
You can count duplicate rows by counting the True values in the Series returned by duplicated(). Since True is treated as 1 and False as 0, the sum() method returns the number of True values.
print(df.duplicated().sum())
# 1
To count False (i.e., non-duplicate rows), negate the Series with ~ and then apply sum().
print(~df.duplicated())
# 0     True
# 1     True
# 2     True
# 3     True
# 4     True
# 5     True
# 6    False
# dtype: bool
print((~df.duplicated()).sum())
# 6
Both duplicate and non-duplicate counts can be obtained using value_counts().
print(df.duplicated().value_counts())
# False    6
# True     1
# Name: count, dtype: int64
The results vary based on the keep argument, so choose the appropriate setting for your needs.
print(df.duplicated(keep=False).value_counts())
# False    5
# True     2
# Name: count, dtype: int64
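To see how many times each individual value occurs, rather than how many rows are flagged as duplicates, you can apply value_counts() to a column directly. A brief sketch with the state column:
print(df['state'].value_counts())
# state
# CA    3
# NY    2
# TX    2
# Name: count, dtype: int64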
Remove duplicate rows: drop_duplicates()
Use the drop_duplicates() method to remove duplicate rows from a DataFrame, or duplicate elements from a Series.
- pandas.DataFrame.drop_duplicates — pandas 2.1.4 documentation
- pandas.Series.drop_duplicates — pandas 2.1.4 documentation
Basic usage
By default, rows are considered duplicates if all column values are equal. The first duplicate row is kept, while the others are removed.
print(df.drop_duplicates())
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57
Choose duplicates to keep: keep
The keep argument in drop_duplicates() behaves similarly to its use in duplicated().
The default setting keep='first' retains the first occurrence of each duplicate row, while keep='last' retains the last. Setting keep=False removes all duplicates.
print(df.drop_duplicates(keep='last'))
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57
# 6     Dave   68    TX     70
print(df.drop_duplicates(keep=False))
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57
Specify columns for duplicate detection: subset
Specify columns for duplicate detection with the subset argument, similar to duplicated().
print(df.drop_duplicates(subset='state'))
#     name  age state  point
# 0  Alice   24    NY     64
# 1    Bob   42    CA     92
# 3   Dave   68    TX     70
print(df.drop_duplicates(subset=['state', 'point']))
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57
Reset the index: ignore_index
Setting ignore_index=True resets the index to a 0-based sequence.
print(df.drop_duplicates(subset='state', keep='last'))
#     name  age state  point
# 4  Ellen   24    CA     88
# 5  Frank   30    NY     57
# 6   Dave   68    TX     70
print(df.drop_duplicates(subset='state', keep='last', ignore_index=True))
#     name  age state  point
# 0  Ellen   24    CA     88
# 1  Frank   30    NY     57
# 2   Dave   68    TX     70
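If ignore_index is unavailable (it was added to drop_duplicates() in pandas 1.0), chaining reset_index(drop=True) gives the same result; a sketch for comparison:
print(df.drop_duplicates(subset='state', keep='last').reset_index(drop=True))
#     name  age state  point
# 0  Ellen   24    CA     88
# 1  Frank   30    NY     57
# 2   Dave   68    TX     70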
Modify the original object: inplace
By default, drop_duplicates() returns a new object without changing the original. Setting inplace=True modifies the original object.
df.drop_duplicates(subset='state', keep='last', inplace=True)
print(df)
#     name  age state  point
# 4  Ellen   24    CA     88
# 5  Frank   30    NY     57
# 6   Dave   68    TX     70
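Note that with inplace=True the method returns None. Reassigning the returned DataFrame to the same variable achieves the same effect and is generally preferred; a minimal sketch that rebuilds the sample DataFrame first:
df = pd.read_csv('data/src/sample_pandas_normal.csv')
df.loc[6] = ['Dave', 68, 'TX', 70]

# Equivalent to the inplace example above, but returns a new object
df = df.drop_duplicates(subset='state', keep='last')
print(df)
#     name  age state  point
# 4  Ellen   24    CA     88
# 5  Frank   30    NY     57
# 6   Dave   68    TX     70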
Aggregate based on duplicates: groupby()
Use groupby() to aggregate values based on duplicates.
In the following example, the average values of the numeric columns (age and point) are calculated for rows that share the same value in the state column.
df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57
print(df.groupby('state').mean(numeric_only=True))
#         age      point
# state                 
# CA     28.0  83.333333
# NY     27.0  60.500000
# TX     68.0  70.000000
String concatenation or conversion to lists is also possible. Strings can be concatenated by applying the join() method through a lambda expression.
print(
    df.groupby('state').agg(
        {'name': lambda x: ','.join(x), 'age': 'mean', 'point': 'sum'}
    )
)
#                    name   age  point
# state                               
# CA    Bob,Charlie,Ellen  28.0    250
# NY          Alice,Frank  27.0    121
# TX                 Dave  68.0     70
print(df.groupby('state').agg({'name': list, 'age': 'mean', 'point': 'sum'}))
#                        name   age  point
# state                                   
# CA    [Bob, Charlie, Ellen]  28.0    250
# NY           [Alice, Frank]  27.0    121
# TX                   [Dave]  68.0     70
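The number of rows in each group can also be counted with size(), which corresponds to counting how often each state value occurs; a brief sketch (note that this DataFrame, re-read from the CSV, contains only one TX row):
print(df.groupby('state').size())
# state
# CA    3
# NY    2
# TX    1
# dtype: int64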
For more details on groupby(), refer to the following article.