pandas: Remove NaN (missing values) with dropna()

Modified: | Tags: Python, pandas

You can remove NaN from pandas.DataFrame and pandas.Series with the dropna() method.

While this article primarily deals with NaN (Not a Number), it's important to note that in pandas, None is also treated as a missing value.

See the following article on extracting, replacing, and counting missing values.

The sample code in this article uses pandas version 2.0.3. As an example, read a CSV file with missing values.

import pandas as pd

print(pd.__version__)
# 2.0.3

df = pd.read_csv('data/src/sample_pandas_normal_nan.csv')
print(df)
#       name   age state  point  other
# 0    Alice  24.0    NY    NaN    NaN
# 1      NaN   NaN   NaN    NaN    NaN
# 2  Charlie   NaN    CA    NaN    NaN
# 3     Dave  68.0    TX   70.0    NaN
# 4    Ellen   NaN    CA   88.0    NaN
# 5    Frank  30.0   NaN    NaN    NaN

Remove rows/columns where all elements are NaN: how='all'

By setting how='all', rows where all elements are NaN are removed.

print(df.dropna(how='all'))
#       name   age state  point  other
# 0    Alice  24.0    NY    NaN    NaN
# 2  Charlie   NaN    CA    NaN    NaN
# 3     Dave  68.0    TX   70.0    NaN
# 4    Ellen   NaN    CA   88.0    NaN
# 5    Frank  30.0   NaN    NaN    NaN

If axis is set to 1 or 'columns', columns where all elements are NaN are removed.

print(df.dropna(how='all', axis=1))
#       name   age state  point
# 0    Alice  24.0    NY    NaN
# 1      NaN   NaN   NaN    NaN
# 2  Charlie   NaN    CA    NaN
# 3     Dave  68.0    TX   70.0
# 4    Ellen   NaN    CA   88.0
# 5    Frank  30.0   NaN    NaN

Note that if axis is set to 0 or 'index', rows are removed. Since the default value of axis is 0, rows are removed if omitted, as shown in the first example.

In former versions, both rows and columns were removed with axis=[0, 1], but since version 1.0.0, axis can no longer be specified with a list or tuple. If you want to remove both rows and columns, you can repeatedly apply dropna().

# print(df.dropna(how='all', axis=[0, 1]))
# TypeError: supplying multiple axes to axis is no longer supported.

print(df.dropna(how='all').dropna(how='all', axis=1))
#       name   age state  point
# 0    Alice  24.0    NY    NaN
# 2  Charlie   NaN    CA    NaN
# 3     Dave  68.0    TX   70.0
# 4    Ellen   NaN    CA   88.0
# 5    Frank  30.0   NaN    NaN

Remove rows/columns that contain at least one NaN: how='any' (default)

To use as an example, remove rows and columns where all values are NaN.

df2 = df.dropna(how='all').dropna(how='all', axis=1)
print(df2)
#       name   age state  point
# 0    Alice  24.0    NY    NaN
# 2  Charlie   NaN    CA    NaN
# 3     Dave  68.0    TX   70.0
# 4    Ellen   NaN    CA   88.0
# 5    Frank  30.0   NaN    NaN

By setting how='any', rows that contain at least one NaN are removed. Since the default value of how is 'any', the result is the same even if omitted.

print(df2.dropna(how='any'))
#    name   age state  point
# 3  Dave  68.0    TX   70.0

print(df2.dropna())
#    name   age state  point
# 3  Dave  68.0    TX   70.0

If axis is set to 1 or 'columns', columns that contain at least one NaN are removed.

print(df2.dropna(axis=1))
#       name
# 0    Alice
# 2  Charlie
# 3     Dave
# 4    Ellen
# 5    Frank

Remove rows/columns according to the number of non-missing values: thresh

With the thresh argument, you can remove rows and columns according to the number of non-missing values.

For example, if thresh=3, the rows that contain more than three non-missing values remain, and the other rows are removed.

print(df.dropna(thresh=3))
#     name   age state  point  other
# 0  Alice  24.0    NY    NaN    NaN
# 3   Dave  68.0    TX   70.0    NaN
# 4  Ellen   NaN    CA   88.0    NaN

If axis is set to 1 or 'columns', columns are removed.

print(df.dropna(thresh=3, axis=1))
#       name   age state
# 0    Alice  24.0    NY
# 1      NaN   NaN   NaN
# 2  Charlie   NaN    CA
# 3     Dave  68.0    TX
# 4    Ellen   NaN    CA
# 5    Frank  30.0   NaN

Remove based on specific rows/columns: subset

If you want to remove based on specific rows and columns, specify a list of rows/columns labels (names) to the subset argument of dropna(). Even if you want to set only one label, you need to specify it as a list, like subset=['name'].

Since the default is how='any' and axis=0, rows with NaN in the columns specified by subset are removed.

print(df.dropna(subset=['age']))
#     name   age state  point  other
# 0  Alice  24.0    NY    NaN    NaN
# 3   Dave  68.0    TX   70.0    NaN
# 5  Frank  30.0   NaN    NaN    NaN

print(df.dropna(subset=['age', 'state']))
#     name   age state  point  other
# 0  Alice  24.0    NY    NaN    NaN
# 3   Dave  68.0    TX   70.0    NaN

If how is set to 'all', rows with NaN in all specified columns are removed.

print(df.dropna(subset=['age', 'state'], how='all'))
#       name   age state  point  other
# 0    Alice  24.0    NY    NaN    NaN
# 2  Charlie   NaN    CA    NaN    NaN
# 3     Dave  68.0    TX   70.0    NaN
# 4    Ellen   NaN    CA   88.0    NaN
# 5    Frank  30.0   NaN    NaN    NaN

If axis is set to 1 or 'columns', columns are removed.

print(df.dropna(subset=[0, 4], axis=1))
#       name state
# 0    Alice    NY
# 1      NaN   NaN
# 2  Charlie    CA
# 3     Dave    TX
# 4    Ellen    CA
# 5    Frank   NaN

print(df.dropna(subset=[0, 4], axis=1, how='all'))
#       name   age state  point
# 0    Alice  24.0    NY    NaN
# 1      NaN   NaN   NaN    NaN
# 2  Charlie   NaN    CA    NaN
# 3     Dave  68.0    TX   70.0
# 4    Ellen   NaN    CA   88.0
# 5    Frank  30.0   NaN    NaN

An error is raised if a non-existent row or column name is specified. An error is also raised if you set axis=1 but specify column names or set axis=0 (default) but specify row names.

# print(df.dropna(subset=['age', 'state', 'xxx']))
# KeyError: ['xxx']

# print(df.dropna(subset=['age', 'state'], axis=1))
# KeyError: ['age', 'state']

Update the original object: inplace

As shown in the examples above, by default, a new object is returned, and the original object is not changed, but if inplace=True, the original object itself is updated.

df.dropna(subset=['age'], inplace=True)
print(df)
#     name   age state  point  other
# 0  Alice  24.0    NY    NaN    NaN
# 3   Dave  68.0    TX   70.0    NaN
# 5  Frank  30.0   NaN    NaN    NaN

For pandas.Series

The only valid argument for dropna() of pandas.Series is inplace. Since it is one-dimensional data, the elements with NaN are simply removed.

s = pd.read_csv('data/src/sample_pandas_normal_nan.csv')['age']
print(s)
# 0    24.0
# 1     NaN
# 2     NaN
# 3    68.0
# 4     NaN
# 5    30.0
# Name: age, dtype: float64

print(s.dropna())
# 0    24.0
# 3    68.0
# 5    30.0
# Name: age, dtype: float64

s.dropna(inplace=True)
print(s)
# 0    24.0
# 3    68.0
# 5    30.0
# Name: age, dtype: float64

Related Categories

Related Articles