pandas: Remove NaN (missing values) with dropna()

Modified: 2023-08-02 | Tags: Python, pandas

You can remove NaN from pandas.DataFrame and pandas.Series with the dropna() method.

Contents

Remove rows/columns where all elements are NaN: how='all'
Remove rows/columns that contain at least one NaN: how='any' (default)
Remove rows/columns according to the number of non-missing values: thresh
Remove based on specific rows/columns: subset
Update the original object: inplace
For pandas.Series

While this article primarily deals with NaN (Not a Number), it's important to note that in pandas, None is also treated as a missing value.

Missing values in pandas (nan, None, pd.NA)

See the following article on extracting, replacing, and counting missing values.

The sample code in this article uses pandas version 2.0.3. As an example, read a CSV file with missing values.

sample_pandas_normal_nan.csv

import pandas as pd

print(pd.__version__)
# 2.0.3

df = pd.read_csv('data/src/sample_pandas_normal_nan.csv')
print(df)
#       name   age state  point  other
# 0    Alice  24.0    NY    NaN    NaN
# 1      NaN   NaN   NaN    NaN    NaN
# 2  Charlie   NaN    CA    NaN    NaN
# 3     Dave  68.0    TX   70.0    NaN
# 4    Ellen   NaN    CA   88.0    NaN
# 5    Frank  30.0   NaN    NaN    NaN

source: pandas_nan_dropna.py

Remove rows/columns where all elements are `NaN`: `how='all'`

By setting how='all', rows where all elements are NaN are removed.

print(df.dropna(how='all'))
#       name   age state  point  other
# 0    Alice  24.0    NY    NaN    NaN
# 2  Charlie   NaN    CA    NaN    NaN
# 3     Dave  68.0    TX   70.0    NaN
# 4    Ellen   NaN    CA   88.0    NaN
# 5    Frank  30.0   NaN    NaN    NaN

source: pandas_nan_dropna.py

If axis is set to 1 or 'columns', columns where all elements are NaN are removed.

print(df.dropna(how='all', axis=1))
#       name   age state  point
# 0    Alice  24.0    NY    NaN
# 1      NaN   NaN   NaN    NaN
# 2  Charlie   NaN    CA    NaN
# 3     Dave  68.0    TX   70.0
# 4    Ellen   NaN    CA   88.0
# 5    Frank  30.0   NaN    NaN

source: pandas_nan_dropna.py

Note that if axis is set to 0 or 'index', rows are removed. Since the default value of axis is 0, rows are removed if omitted, as shown in the first example.

In former versions, both rows and columns were removed with axis=[0, 1], but since version 1.0.0, axis can no longer be specified with a list or tuple. If you want to remove both rows and columns, you can repeatedly apply dropna().

# print(df.dropna(how='all', axis=[0, 1]))
# TypeError: supplying multiple axes to axis is no longer supported.

print(df.dropna(how='all').dropna(how='all', axis=1))
#       name   age state  point
# 0    Alice  24.0    NY    NaN
# 2  Charlie   NaN    CA    NaN
# 3     Dave  68.0    TX   70.0
# 4    Ellen   NaN    CA   88.0
# 5    Frank  30.0   NaN    NaN

source: pandas_nan_dropna.py

Remove rows/columns that contain at least one `NaN`: `how='any'` (default)

To use as an example, remove rows and columns where all values are NaN.

df2 = df.dropna(how='all').dropna(how='all', axis=1)
print(df2)
#       name   age state  point
# 0    Alice  24.0    NY    NaN
# 2  Charlie   NaN    CA    NaN
# 3     Dave  68.0    TX   70.0
# 4    Ellen   NaN    CA   88.0
# 5    Frank  30.0   NaN    NaN

source: pandas_nan_dropna.py

By setting how='any', rows that contain at least one NaN are removed. Since the default value of how is 'any', the result is the same even if omitted.

print(df2.dropna(how='any'))
#    name   age state  point
# 3  Dave  68.0    TX   70.0

print(df2.dropna())
#    name   age state  point
# 3  Dave  68.0    TX   70.0

source: pandas_nan_dropna.py

If axis is set to 1 or 'columns', columns that contain at least one NaN are removed.

print(df2.dropna(axis=1))
#       name
# 0    Alice
# 2  Charlie
# 3     Dave
# 4    Ellen
# 5    Frank

source: pandas_nan_dropna.py

Remove rows/columns according to the number of non-missing values: `thresh`

With the thresh argument, you can remove rows and columns according to the number of non-missing values.

For example, if thresh=3, the rows that contain more than three non-missing values remain, and the other rows are removed.

print(df.dropna(thresh=3))
#     name   age state  point  other
# 0  Alice  24.0    NY    NaN    NaN
# 3   Dave  68.0    TX   70.0    NaN
# 4  Ellen   NaN    CA   88.0    NaN

source: pandas_nan_dropna.py

If axis is set to 1 or 'columns', columns are removed.

print(df.dropna(thresh=3, axis=1))
#       name   age state
# 0    Alice  24.0    NY
# 1      NaN   NaN   NaN
# 2  Charlie   NaN    CA
# 3     Dave  68.0    TX
# 4    Ellen   NaN    CA
# 5    Frank  30.0   NaN

source: pandas_nan_dropna.py

Remove based on specific rows/columns: `subset`

If you want to remove based on specific rows and columns, specify a list of rows/columns labels (names) to the subset argument of dropna(). Even if you want to set only one label, you need to specify it as a list, like subset=['name'].

Since the default is how='any' and axis=0, rows with NaN in the columns specified by subset are removed.

print(df.dropna(subset=['age']))
#     name   age state  point  other
# 0  Alice  24.0    NY    NaN    NaN
# 3   Dave  68.0    TX   70.0    NaN
# 5  Frank  30.0   NaN    NaN    NaN

print(df.dropna(subset=['age', 'state']))
#     name   age state  point  other
# 0  Alice  24.0    NY    NaN    NaN
# 3   Dave  68.0    TX   70.0    NaN

source: pandas_nan_dropna.py

If how is set to 'all', rows with NaN in all specified columns are removed.

print(df.dropna(subset=['age', 'state'], how='all'))
#       name   age state  point  other
# 0    Alice  24.0    NY    NaN    NaN
# 2  Charlie   NaN    CA    NaN    NaN
# 3     Dave  68.0    TX   70.0    NaN
# 4    Ellen   NaN    CA   88.0    NaN
# 5    Frank  30.0   NaN    NaN    NaN

source: pandas_nan_dropna.py

If axis is set to 1 or 'columns', columns are removed.

print(df.dropna(subset=[0, 4], axis=1))
#       name state
# 0    Alice    NY
# 1      NaN   NaN
# 2  Charlie    CA
# 3     Dave    TX
# 4    Ellen    CA
# 5    Frank   NaN

print(df.dropna(subset=[0, 4], axis=1, how='all'))
#       name   age state  point
# 0    Alice  24.0    NY    NaN
# 1      NaN   NaN   NaN    NaN
# 2  Charlie   NaN    CA    NaN
# 3     Dave  68.0    TX   70.0
# 4    Ellen   NaN    CA   88.0
# 5    Frank  30.0   NaN    NaN

source: pandas_nan_dropna.py

An error is raised if a non-existent row or column name is specified. An error is also raised if you set axis=1 but specify column names or set axis=0 (default) but specify row names.

# print(df.dropna(subset=['age', 'state', 'xxx']))
# KeyError: ['xxx']

# print(df.dropna(subset=['age', 'state'], axis=1))
# KeyError: ['age', 'state']

source: pandas_nan_dropna.py

Update the original object: `inplace`

As shown in the examples above, by default, a new object is returned, and the original object is not changed, but if inplace=True, the original object itself is updated.

df.dropna(subset=['age'], inplace=True)
print(df)
#     name   age state  point  other
# 0  Alice  24.0    NY    NaN    NaN
# 3   Dave  68.0    TX   70.0    NaN
# 5  Frank  30.0   NaN    NaN    NaN

source: pandas_nan_dropna.py

For `pandas.Series`

The only valid argument for dropna() of pandas.Series is inplace. Since it is one-dimensional data, the elements with NaN are simply removed.

s = pd.read_csv('data/src/sample_pandas_normal_nan.csv')['age']
print(s)
# 0    24.0
# 1     NaN
# 2     NaN
# 3    68.0
# 4     NaN
# 5    30.0
# Name: age, dtype: float64

print(s.dropna())
# 0    24.0
# 3    68.0
# 5    30.0
# Name: age, dtype: float64

s.dropna(inplace=True)
print(s)
# 0    24.0
# 3    68.0
# 5    30.0
# Name: age, dtype: float64

source: pandas_nan_dropna.py

pandas: Remove NaN (missing values) with dropna()

Remove rows/columns where all elements are `NaN`: `how='all'`

Remove rows/columns that contain at least one `NaN`: `how='any'` (default)

Remove rows/columns according to the number of non-missing values: `thresh`

Remove based on specific rows/columns: `subset`

Update the original object: `inplace`

For `pandas.Series`

Related Categories

Related Articles

pandas: Remove NaN (missing values) with dropna()

Remove rows/columns where all elements are NaN: how='all'

Remove rows/columns that contain at least one NaN: how='any' (default)

Remove rows/columns according to the number of non-missing values: thresh

Remove based on specific rows/columns: subset

Update the original object: inplace

For pandas.Series

Related Categories

Related Articles

Remove rows/columns where all elements are `NaN`: `how='all'`

Remove rows/columns that contain at least one `NaN`: `how='any'` (default)

Remove rows/columns according to the number of non-missing values: `thresh`

Remove based on specific rows/columns: `subset`

Update the original object: `inplace`

For `pandas.Series`