pandas: Replace NaN (missing values) with fillna()
In pandas, the fillna() method allows you to replace NaN values in a DataFrame or Series with a specific value.
- pandas.DataFrame.fillna — pandas 2.1.4 documentation
- pandas.Series.fillna — pandas 2.1.4 documentation
While this article primarily deals with NaN (Not a Number), it is important to note that in pandas, None is also treated as a missing value.
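As a minimal sketch of this point (the sample values are made up for illustration and are not part of the example CSV), None in an object Series is detected by isna() and filled by fillna() in the same way as NaN.

```python
import pandas as pd

# Hypothetical sample data: one None and one NaN in an object Series.
s_obj = pd.Series(['a', None, float('nan')], dtype=object)

# Both None and NaN are reported as missing.
print(s_obj.isna().tolist())
# [False, True, True]

# Both are replaced by fillna().
filled_obj = s_obj.fillna('missing')
print(filled_obj.tolist())
# ['a', 'missing', 'missing']
```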
To fill missing values with linear or spline interpolation, use the interpolate() method.
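For comparison with fillna(), here is a small sketch of interpolate() with its default linear method. The Series is a made-up example, not the data used in the rest of this article.

```python
import pandas as pd

# Hypothetical numeric Series with a gap of two missing values.
s_gap = pd.Series([0.0, None, None, 3.0])

# Linear interpolation (the default) fills the gap with evenly spaced values.
s_interp = s_gap.interpolate()
print(s_interp.tolist())
# [0.0, 1.0, 2.0, 3.0]
```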
For extracting, deleting, or counting missing values, refer to the following articles.
- pandas: Find rows/columns with NaN (missing values)
- pandas: Remove NaN (missing values) with dropna()
- pandas: Detect and count NaN (missing values) with isnull(), isna()
The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.
import pandas as pd
print(pd.__version__)
# 2.1.4
df = pd.read_csv('data/src/sample_pandas_normal_nan.csv')
print(df)
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 NaN NaN NaN NaN NaN
# 2 Charlie NaN CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen NaN CA 88.0 NaN
# 5 Frank 30.0 NaN NaN NaN
Replace NaN with a common value
By specifying a scalar value as the first argument (value) of fillna(), all NaN values are replaced with that value.
print(df.fillna(0))
# name age state point other
# 0 Alice 24.0 NY 0.0 0.0
# 1 0 0.0 0 0.0 0.0
# 2 Charlie 0.0 CA 0.0 0.0
# 3 Dave 68.0 TX 70.0 0.0
# 4 Ellen 0.0 CA 88.0 0.0
# 5 Frank 30.0 0 0.0 0.0
Note that numeric columns with NaN are float type. Even if you replace NaN with an integer (int), the data type remains float. Use astype() to convert it to int.
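The conversion described above might look like the following sketch. The Series s_age is a hypothetical stand-in modeled on the age column of the example DataFrame, built inline so that no CSV file is needed.

```python
import pandas as pd

# Hypothetical column modeled on 'age' in the example DataFrame.
s_age = pd.Series([24.0, None, None, 68.0, None, 30.0], name='age')

# Filling with an integer does not change the dtype.
filled = s_age.fillna(0)
print(filled.dtype)
# float64

# Convert only after every NaN is gone; astype(int) raises an error on NaN.
as_int = filled.astype(int)
print(as_int.tolist())
# [24, 0, 0, 68, 0, 30]
```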
Replace NaN with different values for each column
Specify a dictionary (dict), in the form {column_name: value}, as the first argument (value) in fillna() to assign different values to each column.
NaN values in columns not specified in the dictionary remain unchanged. Any keys not matching a column name are ignored.
print(df.fillna({'name': 'XXX', 'age': 20, 'ZZZ': 100}))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 XXX 20.0 NaN NaN NaN
# 2 Charlie 20.0 CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen 20.0 CA 88.0 NaN
# 5 Frank 30.0 NaN NaN NaN
You can also specify a Series. The labels of the Series correspond to the keys of the dict.
s_for_fill = pd.Series(['XXX', 20, 100], index=['name', 'age', 'ZZZ'])
print(s_for_fill)
# name XXX
# age 20
# ZZZ 100
# dtype: object
print(df.fillna(s_for_fill))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 XXX 20.0 NaN NaN NaN
# 2 Charlie 20.0 CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen 20.0 CA 88.0 NaN
# 5 Frank 30.0 NaN NaN NaN
Replace NaN with mean, median, or mode for each column
The mean() method calculates the average of each column, returning a Series.
NaN is excluded from the calculation, but a column whose elements are all NaN yields NaN. Set the numeric_only argument to True to include only numeric columns.
print(df.mean(numeric_only=True))
# age 40.666667
# point 79.000000
# other NaN
# dtype: float64
Specifying this Series as the first argument (value) in fillna() replaces NaN in the corresponding columns with the mean value.
print(df.fillna(df.mean(numeric_only=True)))
# name age state point other
# 0 Alice 24.000000 NY 79.0 NaN
# 1 NaN 40.666667 NaN 79.0 NaN
# 2 Charlie 40.666667 CA 79.0 NaN
# 3 Dave 68.000000 TX 70.0 NaN
# 4 Ellen 40.666667 CA 88.0 NaN
# 5 Frank 30.000000 NaN 79.0 NaN
Similarly, to replace NaN values with the median, use the median() method. For an even number of elements, the median is the average of the two central values.
print(df.fillna(df.median(numeric_only=True)))
# name age state point other
# 0 Alice 24.0 NY 79.0 NaN
# 1 NaN 30.0 NaN 79.0 NaN
# 2 Charlie 30.0 CA 79.0 NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen 30.0 CA 88.0 NaN
# 5 Frank 30.0 NaN 79.0 NaN
The mode can be obtained using the mode() method. Since mode() returns a DataFrame, the first row is obtained as a Series using iloc[0]. mode() can also handle strings.
print(df.fillna(df.mode().iloc[0]))
# name age state point other
# 0 Alice 24.0 NY 70.0 NaN
# 1 Alice 24.0 CA 70.0 NaN
# 2 Charlie 24.0 CA 70.0 NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen 24.0 CA 88.0 NaN
# 5 Frank 30.0 CA 70.0 NaN
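The statistics above can also be mixed per column. The following is only a sketch under assumptions: the DataFrame is rebuilt by hand from the printed example (so dtypes may differ slightly from the CSV load), and the name fill_values is illustrative. It fills numeric columns with the mean and the remaining columns with the mode by merging both into one dict.

```python
import numpy as np
import pandas as pd

# Hand-built copy of the example DataFrame (assumption: matches the CSV).
df = pd.DataFrame({
    'name': ['Alice', np.nan, 'Charlie', 'Dave', 'Ellen', 'Frank'],
    'age': [24, np.nan, np.nan, 68, np.nan, 30],
    'state': ['NY', np.nan, 'CA', 'TX', 'CA', np.nan],
    'point': [np.nan, np.nan, np.nan, 70, 88, np.nan],
    'other': [np.nan] * 6,
})

# Mean for numeric columns; all-NaN columns are dropped from the mapping.
fill_values = df.mean(numeric_only=True).dropna().to_dict()

# Mode (first row) for the non-numeric columns.
fill_values.update(df.select_dtypes(exclude='number').mode().iloc[0].to_dict())

df_filled = df.fillna(fill_values)
print(df_filled)
```

Columns without a usable statistic, such as other, are simply absent from the dict and therefore left as NaN.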
Replace NaN with adjacent values: ffill(), bfill()
To replace NaN with the adjacent valid value, use the ffill() and bfill() methods.
- pandas.DataFrame.ffill — pandas 2.1.4 documentation
- pandas.DataFrame.bfill — pandas 2.1.4 documentation
ffill() replaces NaN with the previous valid value, and bfill() replaces it with the next valid value.
print(df.ffill())
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 Alice 24.0 NY NaN NaN
# 2 Charlie 24.0 CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen 68.0 CA 88.0 NaN
# 5 Frank 30.0 CA 88.0 NaN
print(df.bfill())
# name age state point other
# 0 Alice 24.0 NY 70.0 NaN
# 1 Charlie 68.0 CA 70.0 NaN
# 2 Charlie 68.0 CA 70.0 NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen 30.0 CA 88.0 NaN
# 5 Frank 30.0 NaN NaN NaN
By default, all consecutive NaN values are replaced. The limit argument specifies the maximum number of consecutive NaN values to fill.
print(df.ffill(limit=1))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 Alice 24.0 NY NaN NaN
# 2 Charlie NaN CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen 68.0 CA 88.0 NaN
# 5 Frank 30.0 CA 88.0 NaN
print(df.bfill(limit=1))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 Charlie NaN CA NaN NaN
# 2 Charlie 68.0 CA 70.0 NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen 30.0 CA 88.0 NaN
# 5 Frank 30.0 NaN NaN NaN
Setting the axis argument to 1 or 'columns' replaces NaN with left or right values. ffill() uses the left value, and bfill() uses the right value.
print(df.ffill(axis=1))
# name age state point other
# 0 Alice 24.0 NY NY NY
# 1 NaN NaN NaN NaN NaN
# 2 Charlie Charlie CA CA CA
# 3 Dave 68.0 TX 70.0 70.0
# 4 Ellen Ellen CA 88.0 88.0
# 5 Frank 30.0 30.0 30.0 30.0
print(df.bfill(axis=1))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 NaN NaN NaN NaN NaN
# 2 Charlie CA CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen CA CA 88.0 NaN
# 5 Frank 30.0 NaN NaN NaN
Note that pad() and backfill(), which perform the same operation as ffill() and bfill(), have been deprecated since pandas version 2.0.0.
- pandas.DataFrame.pad — pandas 2.1.4 documentation
- pandas.DataFrame.backfill — pandas 2.1.4 documentation
The method argument in fillna()
The method argument in fillna(), although deprecated since version 2.1.0, allows for the same functionality as ffill() and bfill().
- What’s new in 2.1.0 (Aug 30, 2023) — pandas 2.1.4 documentation
- DEPR: fillna 'method' · Issue #53394 · pandas-dev/pandas
Setting the method argument to 'ffill' or 'pad' replicates the functionality of ffill(), while 'bfill' or 'backfill' yields the same result as bfill().
As of version 2.1.4, it is still usable, but a FutureWarning is issued.
print(df.fillna(method='ffill', limit=1))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 Alice 24.0 NY NaN NaN
# 2 Charlie NaN CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen 68.0 CA 88.0 NaN
# 5 Frank 30.0 CA 88.0 NaN
#
# /var/folders/rf/b7l8_vgj5mdgvghn_326rn_c0000gn/T/ipykernel_50534/2498159999.py:1: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
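When migrating code that still passes method strings around, one option is a small dispatch function. The helper below, fill_by_method, is hypothetical (it is not part of pandas) and only maps the four legacy names described above onto ffill() and bfill().

```python
import pandas as pd

def fill_by_method(obj, method, limit=None):
    """Hypothetical helper: map legacy `method` names to ffill()/bfill()."""
    if method in ('ffill', 'pad'):
        return obj.ffill(limit=limit)
    if method in ('bfill', 'backfill'):
        return obj.bfill(limit=limit)
    raise ValueError(f'unsupported method: {method!r}')

# Made-up Series modeled on the 'age' column of the example DataFrame.
s_demo = pd.Series([24.0, None, None, 68.0, None, 30.0])

print(fill_by_method(s_demo, 'ffill', limit=1).tolist())
# [24.0, 24.0, nan, 68.0, 68.0, 30.0]
```

Because it never passes method to fillna(), this sketch does not trigger the FutureWarning shown above.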
Modify the original object: inplace
By default, fillna(), ffill(), and bfill() return a new object without modifying the original. Setting the inplace argument to True modifies the original object.
df.fillna(0, inplace=True)
print(df)
# name age state point other
# 0 Alice 24.0 NY 0.0 0.0
# 1 0 0.0 0 0.0 0.0
# 2 Charlie 0.0 CA 0.0 0.0
# 3 Dave 68.0 TX 70.0 0.0
# 4 Ellen 0.0 CA 88.0 0.0
# 5 Frank 30.0 0 0.0 0.0
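To make the default behavior concrete, the following sketch uses a small made-up DataFrame (df_demo is an illustrative name) and checks that the original still contains NaN after fillna() returns a new object.

```python
import pandas as pd

# Hypothetical one-column DataFrame with a single missing value.
df_demo = pd.DataFrame({'point': [None, 70.0, 88.0]})

# Without inplace=True, the result is a new object.
df_new = df_demo.fillna(0)

# The original is unchanged and still has one NaN.
print(df_demo['point'].isna().sum())
# 1

print(df_new['point'].tolist())
# [0.0, 70.0, 88.0]
```

Reassigning the result, as in df_demo = df_demo.fillna(0), is a common alternative to inplace=True.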
fillna(), ffill(), and bfill() on pandas.Series
For Series, the fillna() method can be used in a manner similar to its usage in DataFrame. When a dict is specified, its keys are matched against the index labels of the Series.
s = pd.read_csv('data/src/sample_pandas_normal_nan.csv')['age']
print(s)
# 0 24.0
# 1 NaN
# 2 NaN
# 3 68.0
# 4 NaN
# 5 30.0
# Name: age, dtype: float64
print(s.fillna(100))
# 0 24.0
# 1 100.0
# 2 100.0
# 3 68.0
# 4 100.0
# 5 30.0
# Name: age, dtype: float64
print(s.fillna({1: 100, 4: -100}))
# 0 24.0
# 1 100.0
# 2 NaN
# 3 68.0
# 4 -100.0
# 5 30.0
# Name: age, dtype: float64
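Statistics can be used as the fill value for a Series as well. In this sketch, s_age is a hand-built stand-in for the age column above. Since mean() on a Series returns a scalar, it can be passed directly to fillna().

```python
import pandas as pd

# Hand-built stand-in for the 'age' column of the example DataFrame.
s_age = pd.Series([24.0, None, None, 68.0, None, 30.0], name='age')

# Series.mean() skips NaN and returns a scalar.
s_mean_filled = s_age.fillna(s_age.mean())

# Rounded only for display.
print(s_mean_filled.round(2).tolist())
# [24.0, 40.67, 40.67, 68.0, 40.67, 30.0]
```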
ffill() and bfill() are also available.
print(s.ffill(limit=1))
# 0 24.0
# 1 24.0
# 2 NaN
# 3 68.0
# 4 68.0
# 5 30.0
# Name: age, dtype: float64
print(s.bfill(limit=1))
# 0 24.0
# 1 NaN
# 2 68.0
# 3 68.0
# 4 30.0
# 5 30.0
# Name: age, dtype: float64
pad() and backfill() are also present, but have been deprecated since version 2.0.0.
The method argument in fillna() has been deprecated since version 2.1.0. As of version 2.1.4, it is still usable, but a FutureWarning is issued.
print(s.fillna(method='ffill', limit=1))
# 0 24.0
# 1 24.0
# 2 NaN
# 3 68.0
# 4 68.0
# 5 30.0
# Name: age, dtype: float64
#
# /var/folders/rf/b7l8_vgj5mdgvghn_326rn_c0000gn/T/ipykernel_50534/2241812369.py:1: FutureWarning: Series.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.