pandas: Replace NaN (missing values) with fillna()
In pandas, the fillna() method allows you to replace NaN values in a DataFrame or Series with a specific value.
- pandas.DataFrame.fillna — pandas 2.1.4 documentation
- pandas.Series.fillna — pandas 2.1.4 documentation
While this article primarily deals with NaN (Not a Number), it is important to note that in pandas, None is also treated as a missing value.
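As a small illustration of that point, the sketch below builds a Series containing both None and NaN and checks how isna() and fillna() treat them. The variable name s_obj and its values are made up for this example and are not part of the sample data used later.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one valid string, one None, one NaN (object dtype).
s_obj = pd.Series(['a', None, np.nan], dtype=object)

# Both None and NaN are reported as missing.
print(s_obj.isna().tolist())
# [False, True, True]

# fillna() replaces both of them.
print(s_obj.fillna('missing').tolist())
# ['a', 'missing', 'missing']
```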
To fill missing values with linear or spline interpolation, use the interpolate() method.
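For comparison with fillna(), here is a minimal sketch of interpolate() with its default linear method. The Series s_num is a made-up example, not the sample data used in the rest of this article.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric Series with a gap of two missing values.
s_num = pd.Series([1.0, np.nan, np.nan, 4.0])

# The default method is linear interpolation between the valid neighbors.
linear = s_num.interpolate()
print(linear.tolist())
# [1.0, 2.0, 3.0, 4.0]
```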
For extracting, deleting, or counting missing values, refer to the following articles.
- pandas: Find rows/columns with NaN (missing values)
- pandas: Remove NaN (missing values) with dropna()
- pandas: Detect and count NaN (missing values) with isnull(), isna()
The pandas version used in this article is shown below. Note that functionality may vary between versions. The following DataFrame, which contains missing values in every column, is used as an example.
import pandas as pd
print(pd.__version__)
# 2.1.4
df = pd.read_csv('data/src/sample_pandas_normal_nan.csv')
print(df)
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 NaN NaN NaN NaN NaN
# 2 Charlie NaN CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen NaN CA 88.0 NaN
# 5 Frank 30.0 NaN NaN NaN
Replace NaN with a common value
By specifying a scalar value as the first argument (value) in fillna(), all NaN values are replaced with that value, regardless of the column's data type.
print(df.fillna(0))
# name age state point other
# 0 Alice 24.0 NY 0.0 0.0
# 1 0 0.0 0 0.0 0.0
# 2 Charlie 0.0 CA 0.0 0.0
# 3 Dave 68.0 TX 70.0 0.0
# 4 Ellen 0.0 CA 88.0 0.0
# 5 Frank 30.0 0 0.0 0.0
Note that numeric columns containing NaN have the float data type, because NaN itself is a float. Even if you replace NaN with an integer (int), the data type remains float. Use astype() to convert the column to int after filling.
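The conversion might look like the following sketch. The Series s_age is a small stand-in for a numeric column, and 'int64' is chosen explicitly here so that the resulting dtype does not depend on the platform.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with one missing value.
s_age = pd.Series([24.0, np.nan, 68.0])

# Filling with an integer does not change the dtype.
filled = s_age.fillna(0)
print(filled.dtype)
# float64

# Convert explicitly once no NaN remains.
as_int = filled.astype('int64')
print(as_int.tolist())
# [24, 0, 68]
print(as_int.dtype)
# int64
```

Calling astype() before filling would raise an error, since NaN cannot be represented in a NumPy integer column.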
Replace NaN with different values for each column
Specify a dictionary (dict), in the form {column_name: value}, as the first argument (value) in fillna() to assign different values to each column.
NaN values in columns not specified in the dictionary remain unchanged. Any keys not matching a column name are ignored.
print(df.fillna({'name': 'XXX', 'age': 20, 'ZZZ': 100}))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 XXX 20.0 NaN NaN NaN
# 2 Charlie 20.0 CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen 20.0 CA 88.0 NaN
# 5 Frank 30.0 NaN NaN NaN
You can also specify a Series. Its index labels play the same role as the keys of the dictionary.
s_for_fill = pd.Series(['XXX', 20, 100], index=['name', 'age', 'ZZZ'])
print(s_for_fill)
# name XXX
# age 20
# ZZZ 100
# dtype: object
print(df.fillna(s_for_fill))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 XXX 20.0 NaN NaN NaN
# 2 Charlie 20.0 CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen 20.0 CA 88.0 NaN
# 5 Frank 30.0 NaN NaN NaN
Replace NaN with mean, median, or mode for each column
The mean() method calculates the average of each column, returning a Series.
NaN is excluded from the calculation, but a column in which every element is NaN yields NaN. Set the numeric_only argument to True to include only numeric columns.
print(df.mean(numeric_only=True))
# age 40.666667
# point 79.000000
# other NaN
# dtype: float64
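Specifying this Series as the first argument (value) in fillna() replaces NaN in the corresponding columns with the mean value. Columns that are not in the Series (name, state) are left untouched, and the all-NaN column (other) stays NaN because its mean is NaN.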
print(df.fillna(df.mean(numeric_only=True)))
# name age state point other
# 0 Alice 24.000000 NY 79.0 NaN
# 1 NaN 40.666667 NaN 79.0 NaN
# 2 Charlie 40.666667 CA 79.0 NaN
# 3 Dave 68.000000 TX 70.0 NaN
# 4 Ellen 40.666667 CA 88.0 NaN
# 5 Frank 30.000000 NaN 79.0 NaN
Similarly, to replace NaN values with the median, use the median() method. For an even number of elements, the median is the average of the two central values.
print(df.fillna(df.median(numeric_only=True)))
# name age state point other
# 0 Alice 24.0 NY 79.0 NaN
# 1 NaN 30.0 NaN 79.0 NaN
# 2 Charlie 30.0 CA 79.0 NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen 30.0 CA 88.0 NaN
# 5 Frank 30.0 NaN 79.0 NaN
The mode can be obtained using the mode() method. Since a column can have more than one mode, mode() returns a DataFrame; the first row is obtained as a Series using iloc[0]. mode() can also handle strings.
print(df.fillna(df.mode().iloc[0]))
# name age state point other
# 0 Alice 24.0 NY 70.0 NaN
# 1 Alice 24.0 CA 70.0 NaN
# 2 Charlie 24.0 CA 70.0 NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen 24.0 CA 88.0 NaN
# 5 Frank 30.0 CA 70.0 NaN
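To see why mode() returns a DataFrame rather than a Series, consider a column with tied values. The following is a sketch on made-up data (df_tie is not the sample DataFrame above): column a has two modes, so the result has two rows, and iloc[0] picks the first of them.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'a' has a tie between 1.0 and 2.0, 'b' has a single mode.
df_tie = pd.DataFrame({
    'a': [1.0, 1.0, 2.0, 2.0, np.nan],
    'b': ['x', 'x', 'y', None, None],
})

# One row per mode; columns with fewer modes are padded with NaN.
modes = df_tie.mode()
print(len(modes))
# 2

# Use the first row as the fill value for each column.
first_mode = modes.iloc[0]
filled_mode = df_tie.fillna(first_mode)
print(filled_mode['a'].tolist())
# [1.0, 1.0, 2.0, 2.0, 1.0]
print(filled_mode['b'].tolist())
# ['x', 'x', 'y', 'x', 'x']
```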
Replace NaN with adjacent values: ffill(), bfill()
To replace NaN with the adjacent valid value, use the ffill() and bfill() methods.
- pandas.DataFrame.ffill — pandas 2.1.4 documentation
- pandas.DataFrame.bfill — pandas 2.1.4 documentation
ffill() replaces NaN with the previous valid value, and bfill() replaces it with the next valid value.
print(df.ffill())
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 Alice 24.0 NY NaN NaN
# 2 Charlie 24.0 CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen 68.0 CA 88.0 NaN
# 5 Frank 30.0 CA 88.0 NaN
print(df.bfill())
# name age state point other
# 0 Alice 24.0 NY 70.0 NaN
# 1 Charlie 68.0 CA 70.0 NaN
# 2 Charlie 68.0 CA 70.0 NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen 30.0 CA 88.0 NaN
# 5 Frank 30.0 NaN NaN NaN
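As the outputs above show, ffill() cannot fill leading NaN values and bfill() cannot fill trailing ones. One possible way to cover both ends, shown here only as a sketch on a made-up Series (s_gap), is to chain the two methods.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with NaN at the start, in the middle, and at the end.
s_gap = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill first, then backward fill whatever is still missing at the start.
both = s_gap.ffill().bfill()
print(both.tolist())
# [1.0, 1.0, 1.0, 1.0, 4.0, 4.0]
```

Whether this is appropriate depends on the data; for time series, for instance, a backward fill uses values from the future.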
By default, all consecutive NaN values are replaced. The limit argument specifies the maximum number of consecutive NaN values to fill; any beyond that remain NaN.
print(df.ffill(limit=1))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 Alice 24.0 NY NaN NaN
# 2 Charlie NaN CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen 68.0 CA 88.0 NaN
# 5 Frank 30.0 CA 88.0 NaN
print(df.bfill(limit=1))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 Charlie NaN CA NaN NaN
# 2 Charlie 68.0 CA 70.0 NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen 30.0 CA 88.0 NaN
# 5 Frank 30.0 NaN NaN NaN
Setting the axis argument to 1 or 'columns' fills along each row, using values from neighboring columns. ffill() uses the value to the left, and bfill() uses the value to the right.
print(df.ffill(axis=1))
# name age state point other
# 0 Alice 24.0 NY NY NY
# 1 NaN NaN NaN NaN NaN
# 2 Charlie Charlie CA CA CA
# 3 Dave 68.0 TX 70.0 70.0
# 4 Ellen Ellen CA 88.0 88.0
# 5 Frank 30.0 30.0 30.0 30.0
print(df.bfill(axis=1))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 NaN NaN NaN NaN NaN
# 2 Charlie CA CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen CA CA 88.0 NaN
# 5 Frank 30.0 NaN NaN NaN
Note that pad() and backfill(), which perform the same operation as ffill() and bfill(), have been deprecated since pandas version 2.0.0.
- pandas.DataFrame.pad — pandas 2.1.4 documentation
- pandas.DataFrame.backfill — pandas 2.1.4 documentation
The method argument in fillna()
The method argument in fillna() provides the same functionality as ffill() and bfill(), but it has been deprecated since version 2.1.0.
- What’s new in 2.1.0 (Aug 30, 2023) — pandas 2.1.4 documentation
- DEPR: fillna 'method' · Issue #53394 · pandas-dev/pandas
Setting the method argument to 'ffill' or 'pad' replicates the functionality of ffill(), while 'bfill' or 'backfill' yields the same result as bfill().
As of version 2.1.4, it is still usable, but a FutureWarning is issued.
print(df.fillna(method='ffill', limit=1))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 Alice 24.0 NY NaN NaN
# 2 Charlie NaN CA NaN NaN
# 3 Dave 68.0 TX 70.0 NaN
# 4 Ellen 68.0 CA 88.0 NaN
# 5 Frank 30.0 CA 88.0 NaN
#
# FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
Modify the original object: inplace
By default, fillna(), ffill(), and bfill() return a new object without modifying the original. Setting the inplace argument to True modifies the original object.
df.fillna(0, inplace=True)
print(df)
# name age state point other
# 0 Alice 24.0 NY 0.0 0.0
# 1 0 0.0 0 0.0 0.0
# 2 Charlie 0.0 CA 0.0 0.0
# 3 Dave 68.0 TX 70.0 0.0
# 4 Ellen 0.0 CA 88.0 0.0
# 5 Frank 30.0 0 0.0 0.0
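If you would rather not use inplace, reassigning the returned object to the same variable has the same visible effect. A minimal sketch on a made-up DataFrame (df_small):

```python
import numpy as np
import pandas as pd

# Hypothetical single-column DataFrame with one missing value.
df_small = pd.DataFrame({'point': [np.nan, 70.0, 88.0]})

# Without inplace, the original is left unchanged and a new object is returned.
filled_copy = df_small.fillna(0)
print(df_small['point'].isna().sum())
# 1
print(filled_copy['point'].isna().sum())
# 0

# Reassigning the result replaces the original variable.
df_small = df_small.fillna(0)
print(df_small['point'].tolist())
# [0.0, 70.0, 88.0]
```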
fillna(), ffill(), and bfill() on pandas.Series
fillna() works on a Series in much the same way as on a DataFrame. A scalar fills every NaN, while a dictionary fills by index label rather than by column name.
s = pd.read_csv('data/src/sample_pandas_normal_nan.csv')['age']
print(s)
# 0 24.0
# 1 NaN
# 2 NaN
# 3 68.0
# 4 NaN
# 5 30.0
# Name: age, dtype: float64
print(s.fillna(100))
# 0 24.0
# 1 100.0
# 2 100.0
# 3 68.0
# 4 100.0
# 5 30.0
# Name: age, dtype: float64
print(s.fillna({1: 100, 4: -100}))
# 0 24.0
# 1 100.0
# 2 NaN
# 3 68.0
# 4 -100.0
# 5 30.0
# Name: age, dtype: float64
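A statistic of the Series itself can also serve as the scalar. The sketch below rebuilds the age values shown above by hand (instead of reading the CSV file) and fills NaN with their mean; the rounding is only there to keep the printed output short.

```python
import numpy as np
import pandas as pd

# Same values as the 'age' column above, constructed directly.
s_age = pd.Series([24.0, np.nan, np.nan, 68.0, np.nan, 30.0], name='age')

# mean() skips NaN, so this is the mean of 24, 68, and 30.
mean_age = s_age.mean()
filled_mean = s_age.fillna(mean_age)
print(filled_mean.round(2).tolist())
# [24.0, 40.67, 40.67, 68.0, 40.67, 30.0]
```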
ffill() and bfill() are also available.
print(s.ffill(limit=1))
# 0 24.0
# 1 24.0
# 2 NaN
# 3 68.0
# 4 68.0
# 5 30.0
# Name: age, dtype: float64
print(s.bfill(limit=1))
# 0 24.0
# 1 NaN
# 2 68.0
# 3 68.0
# 4 30.0
# 5 30.0
# Name: age, dtype: float64
pad() and backfill() are also present, but have been deprecated since version 2.0.0.
The method argument in fillna() has been deprecated since version 2.1.0. As of version 2.1.4, it is still usable, but a FutureWarning is issued.
print(s.fillna(method='ffill', limit=1))
# 0 24.0
# 1 24.0
# 2 NaN
# 3 68.0
# 4 68.0
# 5 30.0
# Name: age, dtype: float64
#
# FutureWarning: Series.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.