pandas: Find, count, drop duplicates (duplicated, drop_duplicates)


In pandas, the duplicated() method is used to find, extract, and count duplicate rows in a DataFrame, while drop_duplicates() is used to remove these duplicates.

This article also briefly explains the groupby() method, which aggregates values based on duplicates.

The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.

import pandas as pd

print(pd.__version__)
# 2.1.4

df = pd.read_csv('data/src/sample_pandas_normal.csv')
df.loc[6] = ['Dave', 68, 'TX', 70]
print(df)
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57
# 6     Dave   68    TX     70

The following examples use DataFrame, but Series also supports the duplicated() and drop_duplicates() methods in the same manner.
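
For example, applying them to the state column extracted as a Series:

s = df['state']
print(s.duplicated())
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# 5     True
# 6     True
# dtype: bool

print(s.drop_duplicates())
# 0    NY
# 1    CA
# 3    TX
# Name: state, dtype: object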

Find, extract, and count duplicate rows: duplicated()

Use the duplicated() method to find, extract, and count duplicate rows in a DataFrame, or duplicate elements in a Series.

Basic usage

duplicated() returns a Boolean Series that marks duplicate rows as True. By default, rows are considered duplicates if all column values are equal.

print(df.duplicated())
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# 5    False
# 6     True
# dtype: bool

You can extract duplicates from the DataFrame using the resulting Series.

print(df[df.duplicated()])
#    name  age state  point
# 6  Dave   68    TX     70

Choose duplicates to keep: keep

By default (keep='first'), duplicated() marks all duplicate rows as True except for the first occurrence, which is marked as False. Conversely, keep='last' marks all duplicates as True except for the last occurrence.

print(df.duplicated())
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# 5    False
# 6     True
# dtype: bool

print(df.duplicated(keep='last'))
# 0    False
# 1    False
# 2    False
# 3     True
# 4    False
# 5    False
# 6    False
# dtype: bool

Setting keep=False marks all duplicates as True.

print(df.duplicated(keep=False))
# 0    False
# 1    False
# 2    False
# 3     True
# 4    False
# 5    False
# 6     True
# dtype: bool
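
Combined with Boolean indexing, keep=False provides an easy way to extract every occurrence of a duplicated row:

print(df[df.duplicated(keep=False)])
#    name  age state  point
# 3  Dave   68    TX     70
# 6  Dave   68    TX     70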

Specify columns for duplicate detection: subset

To identify duplicates based on specific columns, use the subset argument. By default, duplicates are identified across all columns.

print(df.duplicated(subset='state'))
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# 5     True
# 6     True
# dtype: bool

Multiple columns can also be specified as a list. Rows whose values match in all the specified columns are treated as duplicates.

print(df.duplicated(subset=['state', 'point']))
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# 5    False
# 6     True
# dtype: bool
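
As with the default settings, the resulting Series can be used for Boolean indexing. For example, to extract the rows whose state value already appeared in an earlier row:

print(df[df.duplicated(subset='state')])
#       name  age state  point
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57
# 6     Dave   68    TX     70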

Count duplicate and non-duplicate rows

You can count duplicate rows by counting the True values in the Series returned by duplicated(). Since True is counted as 1, use the sum() method.

print(df.duplicated().sum())
# 1

To count False (representing non-duplicate rows), negate with ~ and then use the sum() method.

print(~df.duplicated())
# 0     True
# 1     True
# 2     True
# 3     True
# 4     True
# 5     True
# 6    False
# dtype: bool

print((~df.duplicated()).sum())
# 6

Both duplicate and non-duplicate counts can be obtained using value_counts().

print(df.duplicated().value_counts())
# False    6
# True     1
# Name: count, dtype: int64

The results vary based on the keep argument, so choose the appropriate setting for your needs.

print(df.duplicated(keep=False).value_counts())
# False    5
# True     2
# Name: count, dtype: int64
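
Counting also works together with the subset argument. For example, the number of rows whose state value duplicates an earlier row:

print(df.duplicated(subset='state').sum())
# 4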

Remove duplicate rows: drop_duplicates()

Use the drop_duplicates() method to remove duplicate rows from a DataFrame, or duplicate elements from a Series.

Basic usage

By default, rows are considered duplicates if all column values are equal. The first duplicate row is kept, while the others are removed.

print(df.drop_duplicates())
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

Choose duplicates to keep: keep

The keep argument in drop_duplicates() behaves similarly to its use in duplicated().

The default setting keep='first' retains the first occurrence of each duplicate row, while keep='last' retains the last. Setting keep=False removes all duplicates.

print(df.drop_duplicates(keep='last'))
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57
# 6     Dave   68    TX     70

print(df.drop_duplicates(keep=False))
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

Specify columns for duplicate detection: subset

Specify columns for duplicate detection with the subset argument, similar to duplicated().

print(df.drop_duplicates(subset='state'))
#     name  age state  point
# 0  Alice   24    NY     64
# 1    Bob   42    CA     92
# 3   Dave   68    TX     70

print(df.drop_duplicates(subset=['state', 'point']))
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

Reset the index: ignore_index

Setting ignore_index=True resets the index to a 0-based sequence.

print(df.drop_duplicates(subset='state', keep='last'))
#     name  age state  point
# 4  Ellen   24    CA     88
# 5  Frank   30    NY     57
# 6   Dave   68    TX     70

print(df.drop_duplicates(subset='state', keep='last', ignore_index=True))
#     name  age state  point
# 0  Ellen   24    CA     88
# 1  Frank   30    NY     57
# 2   Dave   68    TX     70
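
Note that ignore_index was added in pandas 1.0; on older versions, or as an alternative, chaining reset_index(drop=True) produces the same result:

print(df.drop_duplicates(subset='state', keep='last').reset_index(drop=True))
#     name  age state  point
# 0  Ellen   24    CA     88
# 1  Frank   30    NY     57
# 2   Dave   68    TX     70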

Modify the original object: inplace

By default, drop_duplicates() returns a new object without changing the original. Setting inplace=True modifies the original object.

df.drop_duplicates(subset='state', keep='last', inplace=True)
print(df)
#     name  age state  point
# 4  Ellen   24    CA     88
# 5  Frank   30    NY     57
# 6   Dave   68    TX     70
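
Keep in mind that inplace operations are often discouraged in modern pandas code; reassigning the return value is an equivalent, non-mutating pattern. Rebuilding the sample DataFrame here first:

df = pd.read_csv('data/src/sample_pandas_normal.csv')
df.loc[6] = ['Dave', 68, 'TX', 70]

df = df.drop_duplicates(subset='state', keep='last')
print(df)
#     name  age state  point
# 4  Ellen   24    CA     88
# 5  Frank   30    NY     57
# 6   Dave   68    TX     70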

Aggregate based on duplicates: groupby()

Use groupby() to aggregate values based on duplicates.

In the following examples, the mean of the numeric columns (age and point) is calculated for each group of rows sharing the same state value.

df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

print(df.groupby('state').mean(numeric_only=True))
#         age      point
# state                 
# CA     28.0  83.333333
# NY     27.0  60.500000
# TX     68.0  70.000000
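
groupby() can also be combined with size() to simply count how many rows share each state value:

print(df.groupby('state').size())
# state
# CA    3
# NY    2
# TX    1
# dtype: int64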

String concatenation or conversion to lists is also possible. Strings can be concatenated by applying the join() method in a lambda expression.

print(
    df.groupby('state').agg(
        {'name': lambda x: ','.join(x), 'age': 'mean', 'point': 'sum'}
    )
)
#                     name   age  point
# state                                
# CA     Bob,Charlie,Ellen  28.0    250
# NY           Alice,Frank  27.0    121
# TX                  Dave  68.0     70

print(df.groupby('state').agg({'name': list, 'age': 'mean', 'point': 'sum'}))
#                         name   age  point
# state                                    
# CA     [Bob, Charlie, Ellen]  28.0    250
# NY            [Alice, Frank]  27.0    121
# TX                    [Dave]  68.0     70

