pandas: Remove NaN (missing values) with dropna()
You can remove NaN
from pandas.DataFrame
and pandas.Series
with the dropna()
method.
- pandas.DataFrame.dropna — pandas 2.0.3 documentation
- pandas.Series.dropna — pandas 2.0.3 documentation
- Remove rows/columns where all elements are
NaN
:how='all'
- Remove rows/columns that contain at least one
NaN
:how='any'
(default) - Remove rows/columns according to the number of non-missing values:
thresh
- Remove based on specific rows/columns:
subset
- Update the original object:
inplace
- For
pandas.Series
While this article primarily deals with NaN
(Not a Number), it's important to note that in pandas, None
is also treated as a missing value.
See the following article on extracting, replacing, and counting missing values.
- pandas: Find rows/columns with NaN (missing values)
- pandas: Replace NaN (missing values) with fillna()
- pandas: Detect and count NaN (missing values) with isnull(), isna()
The sample code in this article uses pandas version 2.0.3
. As an example, read a CSV file with missing values.
import pandas as pd
print(pd.__version__)
# 2.0.3
df = pd.read_csv('data/src/sample_pandas_normal_nan.csv')
print(df)
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 NaN NaN NaN NaN NaN
# 2 Charlie NaN CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen NaN CA 88.0 NaN
# 5 Frank 30.0 NaN NaN NaN
Remove rows/columns where all elements are NaN
: how='all'
By setting how='all'
, rows where all elements are NaN
are removed.
print(df.dropna(how='all'))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 2 Charlie NaN CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen NaN CA 88.0 NaN
# 5 Frank 30.0 NaN NaN NaN
If axis
is set to 1
or 'columns'
, columns where all elements are NaN
are removed.
print(df.dropna(how='all', axis=1))
# name age state point
# 0 Alice 24.0 NY NaN
# 1 NaN NaN NaN NaN
# 2 Charlie NaN CA NaN
# 3 Dave 68.0 TX 70.0
# 4 Ellen NaN CA 88.0
# 5 Frank 30.0 NaN NaN
Note that if axis
is set to 0
or 'index'
, rows are removed. Since the default value of axis
is 0
, rows are removed if omitted, as shown in the first example.
In former versions, both rows and columns were removed with axis=[0, 1]
, but since version 1.0.0
, axis
can no longer be specified with a list or tuple. If you want to remove both rows and columns, you can repeatedly apply dropna()
.
# print(df.dropna(how='all', axis=[0, 1]))
# TypeError: supplying multiple axes to axis is no longer supported.
print(df.dropna(how='all').dropna(how='all', axis=1))
# name age state point
# 0 Alice 24.0 NY NaN
# 2 Charlie NaN CA NaN
# 3 Dave 68.0 TX 70.0
# 4 Ellen NaN CA 88.0
# 5 Frank 30.0 NaN NaN
Remove rows/columns that contain at least one NaN
: how='any'
(default)
To use as an example, remove rows and columns where all values are NaN
.
df2 = df.dropna(how='all').dropna(how='all', axis=1)
print(df2)
# name age state point
# 0 Alice 24.0 NY NaN
# 2 Charlie NaN CA NaN
# 3 Dave 68.0 TX 70.0
# 4 Ellen NaN CA 88.0
# 5 Frank 30.0 NaN NaN
By setting how='any'
, rows that contain at least one NaN
are removed. Since the default value of how
is 'any'
, the result is the same even if omitted.
print(df2.dropna(how='any'))
# name age state point
# 3 Dave 68.0 TX 70.0
print(df2.dropna())
# name age state point
# 3 Dave 68.0 TX 70.0
If axis
is set to 1
or 'columns'
, columns that contain at least one NaN
are removed.
print(df2.dropna(axis=1))
# name
# 0 Alice
# 2 Charlie
# 3 Dave
# 4 Ellen
# 5 Frank
Remove rows/columns according to the number of non-missing values: thresh
With the thresh
argument, you can remove rows and columns according to the number of non-missing values.
For example, if thresh=3
, the rows that contain more than three non-missing values remain, and the other rows are removed.
print(df.dropna(thresh=3))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen NaN CA 88.0 NaN
If axis
is set to 1
or 'columns'
, columns are removed.
print(df.dropna(thresh=3, axis=1))
# name age state
# 0 Alice 24.0 NY
# 1 NaN NaN NaN
# 2 Charlie NaN CA
# 3 Dave 68.0 TX
# 4 Ellen NaN CA
# 5 Frank 30.0 NaN
Remove based on specific rows/columns: subset
If you want to remove based on specific rows and columns, specify a list of rows/columns labels (names) to the subset
argument of dropna()
. Even if you want to set only one label, you need to specify it as a list, like subset=['name']
.
Since the default is how='any'
and axis=0
, rows with NaN
in the columns specified by subset
are removed.
print(df.dropna(subset=['age']))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 5 Frank 30.0 NaN NaN NaN
print(df.dropna(subset=['age', 'state']))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
If how
is set to 'all'
, rows with NaN
in all specified columns are removed.
print(df.dropna(subset=['age', 'state'], how='all'))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 2 Charlie NaN CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen NaN CA 88.0 NaN
# 5 Frank 30.0 NaN NaN NaN
If axis
is set to 1
or 'columns'
, columns are removed.
print(df.dropna(subset=[0, 4], axis=1))
# name state
# 0 Alice NY
# 1 NaN NaN
# 2 Charlie CA
# 3 Dave TX
# 4 Ellen CA
# 5 Frank NaN
print(df.dropna(subset=[0, 4], axis=1, how='all'))
# name age state point
# 0 Alice 24.0 NY NaN
# 1 NaN NaN NaN NaN
# 2 Charlie NaN CA NaN
# 3 Dave 68.0 TX 70.0
# 4 Ellen NaN CA 88.0
# 5 Frank 30.0 NaN NaN
An error is raised if a non-existent row or column name is specified. An error is also raised if you set axis=1
but specify column names or set axis=0
(default) but specify row names.
# print(df.dropna(subset=['age', 'state', 'xxx']))
# KeyError: ['xxx']
# print(df.dropna(subset=['age', 'state'], axis=1))
# KeyError: ['age', 'state']
Update the original object: inplace
As shown in the examples above, by default, a new object is returned, and the original object is not changed, but if inplace=True
, the original object itself is updated.
df.dropna(subset=['age'], inplace=True)
print(df)
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 5 Frank 30.0 NaN NaN NaN
For pandas.Series
The only valid argument for dropna()
of pandas.Series
is inplace
. Since it is one-dimensional data, the elements with NaN
are simply removed.
s = pd.read_csv('data/src/sample_pandas_normal_nan.csv')['age']
print(s)
# 0 24.0
# 1 NaN
# 2 NaN
# 3 68.0
# 4 NaN
# 5 30.0
# Name: age, dtype: float64
print(s.dropna())
# 0 24.0
# 3 68.0
# 5 30.0
# Name: age, dtype: float64
s.dropna(inplace=True)
print(s)
# 0 24.0
# 3 68.0
# 5 30.0
# Name: age, dtype: float64