pandas: Find, count, drop duplicates (duplicated, drop_duplicates)


In pandas, the duplicated() method is used to find, extract, and count duplicate rows in a DataFrame, while drop_duplicates() is used to remove these duplicates.

This article also briefly explains the groupby() method, which aggregates values based on duplicates.

The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.

import pandas as pd

print(pd.__version__)
# 2.1.4

df = pd.read_csv('data/src/sample_pandas_normal.csv')
df.loc[6] = ['Dave', 68, 'TX', 70]
print(df)
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57
# 6     Dave   68    TX     70

The following examples use DataFrame, but Series also supports the duplicated() and drop_duplicates() methods in the same manner.
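
For example, applying them to the state column extracted as a Series:

s = df['state']
print(s.duplicated())
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# 5     True
# 6     True
# dtype: bool

print(s.drop_duplicates())
# 0    NY
# 1    CA
# 3    TX
# Name: state, dtype: object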

Find, extract, and count duplicate rows: duplicated()

Use the duplicated() method to find, extract, and count duplicate rows in a DataFrame, or duplicate elements in a Series.

Basic usage

duplicated() returns a Boolean Series that marks duplicate rows as True. By default, rows are considered duplicates if all column values are equal.

print(df.duplicated())
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# 5    False
# 6     True
# dtype: bool

You can extract duplicates from the DataFrame using the resulting Series.

print(df[df.duplicated()])
#    name  age state  point
# 6  Dave   68    TX     70

Choose duplicates to keep: keep

By default (keep='first'), duplicated() marks all duplicate rows as True except for the first occurrence, which is marked as False. Conversely, keep='last' marks all duplicates as True except for the last occurrence.

print(df.duplicated())
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# 5    False
# 6     True
# dtype: bool

print(df.duplicated(keep='last'))
# 0    False
# 1    False
# 2    False
# 3     True
# 4    False
# 5    False
# 6    False
# dtype: bool

Setting keep=False marks all duplicates as True.

print(df.duplicated(keep=False))
# 0    False
# 1    False
# 2    False
# 3     True
# 4    False
# 5    False
# 6     True
# dtype: bool
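
Combined with Boolean indexing, keep=False provides an easy way to extract every occurrence of a duplicated row:

print(df[df.duplicated(keep=False)])
#    name  age state  point
# 3  Dave   68    TX     70
# 6  Dave   68    TX     70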

Specify columns for duplicate detection: subset

To identify duplicates based on specific columns, use the subset argument. By default, duplicates are identified across all columns.

print(df.duplicated(subset='state'))
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# 5     True
# 6     True
# dtype: bool

Multiple columns can also be specified as a list. Rows whose values match in all the specified columns are treated as duplicates.

print(df.duplicated(subset=['state', 'point']))
# 0    False
# 1    False
# 2    False
# 3    False
# 4    False
# 5    False
# 6     True
# dtype: bool
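
As with the default settings, the resulting Series can be used for Boolean indexing. For example, to extract the rows whose state value already appeared in an earlier row:

print(df[df.duplicated(subset='state')])
#       name  age state  point
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57
# 6     Dave   68    TX     70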

Count duplicate and non-duplicate rows

You can count duplicate rows by counting the True values in the Series returned by duplicated(). Since True is counted as 1, use the sum() method.

print(df.duplicated().sum())
# 1

To count False (representing non-duplicate rows), negate with ~ and then use the sum() method.

print(~df.duplicated())
# 0     True
# 1     True
# 2     True
# 3     True
# 4     True
# 5     True
# 6    False
# dtype: bool

print((~df.duplicated()).sum())
# 6

Both duplicate and non-duplicate counts can be obtained using value_counts().

print(df.duplicated().value_counts())
# False    6
# True     1
# Name: count, dtype: int64

The results vary based on the keep argument, so choose the appropriate setting for your needs.

print(df.duplicated(keep=False).value_counts())
# False    5
# True     2
# Name: count, dtype: int64
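
Counting also works together with the subset argument. For example, the number of rows whose state value duplicates an earlier row:

print(df.duplicated(subset='state').sum())
# 4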

Remove duplicate rows: drop_duplicates()

Use the drop_duplicates() method to remove duplicate rows from a DataFrame, or duplicate elements from a Series.

Basic usage

By default, rows are considered duplicates if all column values are equal. The first duplicate row is kept, while the others are removed.

print(df.drop_duplicates())
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

Choose duplicates to keep: keep

The keep argument in drop_duplicates() behaves similarly to its use in duplicated().

The default setting keep='first' retains the first occurrence of each duplicate row, while keep='last' retains the last. Setting keep=False removes all duplicates.

print(df.drop_duplicates(keep='last'))
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57
# 6     Dave   68    TX     70

print(df.drop_duplicates(keep=False))
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

Specify columns for duplicate detection: subset

Specify columns for duplicate detection with the subset argument, similar to duplicated().

print(df.drop_duplicates(subset='state'))
#     name  age state  point
# 0  Alice   24    NY     64
# 1    Bob   42    CA     92
# 3   Dave   68    TX     70

print(df.drop_duplicates(subset=['state', 'point']))
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

Reset the index: ignore_index

Setting ignore_index=True resets the index to a 0-based sequence.

print(df.drop_duplicates(subset='state', keep='last'))
#     name  age state  point
# 4  Ellen   24    CA     88
# 5  Frank   30    NY     57
# 6   Dave   68    TX     70

print(df.drop_duplicates(subset='state', keep='last', ignore_index=True))
#     name  age state  point
# 0  Ellen   24    CA     88
# 1  Frank   30    NY     57
# 2   Dave   68    TX     70
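
Note that ignore_index was added in pandas 1.0; on older versions, or as an alternative, chaining reset_index(drop=True) produces the same result:

print(df.drop_duplicates(subset='state', keep='last').reset_index(drop=True))
#     name  age state  point
# 0  Ellen   24    CA     88
# 1  Frank   30    NY     57
# 2   Dave   68    TX     70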

Modify the original object: inplace

By default, drop_duplicates() returns a new object without changing the original. Setting inplace=True modifies the original object.

df.drop_duplicates(subset='state', keep='last', inplace=True)
print(df)
#     name  age state  point
# 4  Ellen   24    CA     88
# 5  Frank   30    NY     57
# 6   Dave   68    TX     70
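
Keep in mind that inplace operations are often discouraged in modern pandas code; reassigning the return value is an equivalent, non-mutating pattern. Rebuilding the sample DataFrame here first:

df = pd.read_csv('data/src/sample_pandas_normal.csv')
df.loc[6] = ['Dave', 68, 'TX', 70]

df = df.drop_duplicates(subset='state', keep='last')
print(df)
#     name  age state  point
# 4  Ellen   24    CA     88
# 5  Frank   30    NY     57
# 6   Dave   68    TX     70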

Aggregate based on duplicates: groupby()

Use groupby() to aggregate values based on duplicates.

In the following examples, the mean of the numeric columns (age and point) is calculated for each group of rows sharing the same state value.

df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

print(df.groupby('state').mean(numeric_only=True))
#         age      point
# state                 
# CA     28.0  83.333333
# NY     27.0  60.500000
# TX     68.0  70.000000
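
groupby() can also be combined with size() to simply count how many rows share each state value:

print(df.groupby('state').size())
# state
# CA    3
# NY    2
# TX    1
# dtype: int64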

String concatenation or conversion to lists is also possible. Strings can be concatenated by applying the join() method in a lambda expression.

print(
    df.groupby('state').agg(
        {'name': lambda x: ','.join(x), 'age': 'mean', 'point': 'sum'}
    )
)
#                     name   age  point
# state                                
# CA     Bob,Charlie,Ellen  28.0    250
# NY           Alice,Frank  27.0    121
# TX                  Dave  68.0     70

print(df.groupby('state').agg({'name': list, 'age': 'mean', 'point': 'sum'}))
#                         name   age  point
# state                                    
# CA     [Bob, Charlie, Ellen]  28.0    250
# NY            [Alice, Frank]  27.0    121
# TX                    [Dave]  68.0     70

