pandas: Remove NaN (missing values) with dropna()
You can remove NaN from pandas.DataFrame and pandas.Series with the dropna() method.
- pandas.DataFrame.dropna — pandas 2.0.3 documentation
- pandas.Series.dropna — pandas 2.0.3 documentation
- Remove rows/columns where all elements are
NaN:how='all' - Remove rows/columns that contain at least one
NaN:how='any'(default) - Remove rows/columns according to the number of non-missing values:
thresh - Remove based on specific rows/columns:
subset - Update the original object:
inplace - For
pandas.Series
While this article primarily deals with NaN (Not a Number), it's important to note that in pandas, None is also treated as a missing value.
See the following article on extracting, replacing, and counting missing values.
- pandas: Find rows/columns with NaN (missing values)
- pandas: Replace NaN (missing values) with fillna()
- pandas: Detect and count NaN (missing values) with isnull(), isna()
The sample code in this article uses pandas version 2.0.3. As an example, read a CSV file with missing values.
import pandas as pd
print(pd.__version__)
# 2.0.3
df = pd.read_csv('data/src/sample_pandas_normal_nan.csv')
print(df)
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 NaN NaN NaN NaN NaN
# 2 Charlie NaN CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen NaN CA 88.0 NaN
# 5 Frank 30.0 NaN NaN NaN
Remove rows/columns where all elements are NaN: how='all'
By setting how='all', rows where all elements are NaN are removed.
print(df.dropna(how='all'))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 2 Charlie NaN CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen NaN CA 88.0 NaN
# 5 Frank 30.0 NaN NaN NaN
If axis is set to 1 or 'columns', columns where all elements are NaN are removed.
print(df.dropna(how='all', axis=1))
# name age state point
# 0 Alice 24.0 NY NaN
# 1 NaN NaN NaN NaN
# 2 Charlie NaN CA NaN
# 3 Dave 68.0 TX 70.0
# 4 Ellen NaN CA 88.0
# 5 Frank 30.0 NaN NaN
Note that if axis is set to 0 or 'index', rows are removed. Since the default value of axis is 0, rows are removed if omitted, as shown in the first example.
In former versions, both rows and columns were removed with axis=[0, 1], but since version 1.0.0, axis can no longer be specified with a list or tuple. If you want to remove both rows and columns, you can repeatedly apply dropna().
# print(df.dropna(how='all', axis=[0, 1]))
# TypeError: supplying multiple axes to axis is no longer supported.
print(df.dropna(how='all').dropna(how='all', axis=1))
# name age state point
# 0 Alice 24.0 NY NaN
# 2 Charlie NaN CA NaN
# 3 Dave 68.0 TX 70.0
# 4 Ellen NaN CA 88.0
# 5 Frank 30.0 NaN NaN
Remove rows/columns that contain at least one NaN: how='any' (default)
To use as an example, remove rows and columns where all values are NaN.
df2 = df.dropna(how='all').dropna(how='all', axis=1)
print(df2)
# name age state point
# 0 Alice 24.0 NY NaN
# 2 Charlie NaN CA NaN
# 3 Dave 68.0 TX 70.0
# 4 Ellen NaN CA 88.0
# 5 Frank 30.0 NaN NaN
By setting how='any', rows that contain at least one NaN are removed. Since the default value of how is 'any', the result is the same even if omitted.
print(df2.dropna(how='any'))
# name age state point
# 3 Dave 68.0 TX 70.0
print(df2.dropna())
# name age state point
# 3 Dave 68.0 TX 70.0
If axis is set to 1 or 'columns', columns that contain at least one NaN are removed.
print(df2.dropna(axis=1))
# name
# 0 Alice
# 2 Charlie
# 3 Dave
# 4 Ellen
# 5 Frank
Remove rows/columns according to the number of non-missing values: thresh
With the thresh argument, you can remove rows and columns according to the number of non-missing values.
For example, if thresh=3, the rows that contain more than three non-missing values remain, and the other rows are removed.
print(df.dropna(thresh=3))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen NaN CA 88.0 NaN
If axis is set to 1 or 'columns', columns are removed.
print(df.dropna(thresh=3, axis=1))
# name age state
# 0 Alice 24.0 NY
# 1 NaN NaN NaN
# 2 Charlie NaN CA
# 3 Dave 68.0 TX
# 4 Ellen NaN CA
# 5 Frank 30.0 NaN
Remove based on specific rows/columns: subset
If you want to remove based on specific rows and columns, specify a list of rows/columns labels (names) to the subset argument of dropna(). Even if you want to set only one label, you need to specify it as a list, like subset=['name'].
Since the default is how='any' and axis=0, rows with NaN in the columns specified by subset are removed.
print(df.dropna(subset=['age']))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 5 Frank 30.0 NaN NaN NaN
print(df.dropna(subset=['age', 'state']))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
If how is set to 'all', rows with NaN in all specified columns are removed.
print(df.dropna(subset=['age', 'state'], how='all'))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 2 Charlie NaN CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen NaN CA 88.0 NaN
# 5 Frank 30.0 NaN NaN NaN
If axis is set to 1 or 'columns', columns are removed.
print(df.dropna(subset=[0, 4], axis=1))
# name state
# 0 Alice NY
# 1 NaN NaN
# 2 Charlie CA
# 3 Dave TX
# 4 Ellen CA
# 5 Frank NaN
print(df.dropna(subset=[0, 4], axis=1, how='all'))
# name age state point
# 0 Alice 24.0 NY NaN
# 1 NaN NaN NaN NaN
# 2 Charlie NaN CA NaN
# 3 Dave 68.0 TX 70.0
# 4 Ellen NaN CA 88.0
# 5 Frank 30.0 NaN NaN
An error is raised if a non-existent row or column name is specified. An error is also raised if you set axis=1 but specify column names or set axis=0 (default) but specify row names.
# print(df.dropna(subset=['age', 'state', 'xxx']))
# KeyError: ['xxx']
# print(df.dropna(subset=['age', 'state'], axis=1))
# KeyError: ['age', 'state']
Update the original object: inplace
As shown in the examples above, by default, a new object is returned, and the original object is not changed, but if inplace=True, the original object itself is updated.
df.dropna(subset=['age'], inplace=True)
print(df)
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 5 Frank 30.0 NaN NaN NaN
For pandas.Series
The only valid argument for dropna() of pandas.Series is inplace. Since it is one-dimensional data, the elements with NaN are simply removed.
s = pd.read_csv('data/src/sample_pandas_normal_nan.csv')['age']
print(s)
# 0 24.0
# 1 NaN
# 2 NaN
# 3 68.0
# 4 NaN
# 5 30.0
# Name: age, dtype: float64
print(s.dropna())
# 0 24.0
# 3 68.0
# 5 30.0
# Name: age, dtype: float64
s.dropna(inplace=True)
print(s)
# 0 24.0
# 3 68.0
# 5 30.0
# Name: age, dtype: float64