pandas: Replace NaN (missing values) with fillna()

Tags: Python, pandas

In pandas, the fillna() method allows you to replace NaN values in a DataFrame or Series with a specific value.

While this article primarily deals with NaN (Not a Number), it is important to note that in pandas, None is also treated as a missing value.
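As a small illustration of the point about None, the following sketch builds two throwaway Series (s_num and s_str are names made up for this example) and checks how missing values are detected and filled.

```python
import numpy as np
import pandas as pd

# In a numeric Series, None is converted to NaN on construction.
s_num = pd.Series([1.0, None, np.nan])
print(s_num.isna().tolist())
# [False, True, True]

# In an object Series, None still counts as a missing value.
s_str = pd.Series(['a', None, 'c'], dtype=object)
print(s_str.isna().tolist())
# [False, True, False]

# fillna() therefore replaces None as well as NaN.
print(s_str.fillna('missing').tolist())
# ['a', 'missing', 'c']
```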

To fill missing values with linear or spline interpolation, use the interpolate() method.
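For comparison, here is a minimal sketch of interpolate() with its default linear method next to fillna() with a constant. The Series s is invented for illustration and is not part of the sample CSV.

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 40.0])

# Linear interpolation fills the gap with evenly spaced values.
print(s.interpolate().tolist())
# [10.0, 20.0, 30.0, 40.0]

# fillna() with a scalar uses the same constant for every NaN.
print(s.fillna(0).tolist())
# [10.0, 0.0, 0.0, 40.0]
```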

Extracting, deleting, and counting missing values are outside the scope of this article; methods such as isna() and dropna() handle those tasks.

The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.

import pandas as pd

print(pd.__version__)
# 2.1.4

df = pd.read_csv('data/src/sample_pandas_normal_nan.csv')
print(df)
#       name   age state  point  other
# 0    Alice  24.0    NY    NaN    NaN
# 1      NaN   NaN   NaN    NaN    NaN
# 2  Charlie   NaN    CA    NaN    NaN
# 3     Dave  68.0    TX   70.0    NaN
# 4    Ellen   NaN    CA   88.0    NaN
# 5    Frank  30.0   NaN    NaN    NaN

Replace NaN with a common value

By specifying a scalar value as the first argument (value) in fillna(), all NaN values are replaced with that value.

print(df.fillna(0))
#       name   age state  point  other
# 0    Alice  24.0    NY    0.0    0.0
# 1        0   0.0     0    0.0    0.0
# 2  Charlie   0.0    CA    0.0    0.0
# 3     Dave  68.0    TX   70.0    0.0
# 4    Ellen   0.0    CA   88.0    0.0
# 5    Frank  30.0     0    0.0    0.0

Note that numeric columns with NaN are float type. Even if you replace NaN with an integer (int), the data type remains float. Use astype() to convert it to int.
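The following sketch shows the conversion on a small stand-in Series (the values are made up). It assumes astype(int) is called only after the NaN values have been filled, since converting a column that still contains NaN to int raises an error.

```python
import numpy as np
import pandas as pd

s = pd.Series([24.0, np.nan, 68.0])

# The dtype stays float even though the fill value is an int.
filled = s.fillna(0)
print(filled.dtype)
# float64

# Convert explicitly once no NaN remains.
as_int = filled.astype(int)
print(as_int.tolist())
# [24, 0, 68]
```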

Replace NaN with different values for each column

Specify a dictionary (dict), in the form {column_name: value}, as the first argument (value) in fillna() to assign different values to each column.

NaN values in columns not specified in the dictionary remain unchanged. Any keys not matching a column name are ignored.

print(df.fillna({'name': 'XXX', 'age': 20, 'ZZZ': 100}))
#       name   age state  point  other
# 0    Alice  24.0    NY    NaN    NaN
# 1      XXX  20.0   NaN    NaN    NaN
# 2  Charlie  20.0    CA    NaN    NaN
# 3     Dave  68.0    TX   70.0    NaN
# 4    Ellen  20.0    CA   88.0    NaN
# 5    Frank  30.0   NaN    NaN    NaN

You can also specify a Series. Its index labels play the same role as the dictionary keys.

s_for_fill = pd.Series(['XXX', 20, 100], index=['name', 'age', 'ZZZ'])
print(s_for_fill)
# name    XXX
# age      20
# ZZZ     100
# dtype: object

print(df.fillna(s_for_fill))
#       name   age state  point  other
# 0    Alice  24.0    NY    NaN    NaN
# 1      XXX  20.0   NaN    NaN    NaN
# 2  Charlie  20.0    CA    NaN    NaN
# 3     Dave  68.0    TX   70.0    NaN
# 4    Ellen  20.0    CA   88.0    NaN
# 5    Frank  30.0   NaN    NaN    NaN

Replace NaN with mean, median, or mode for each column

The mean() method calculates the average of each column, returning a Series.

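NaN is excluded from the calculation, but a column whose elements are all NaN yields NaN. Set numeric_only=True to restrict the calculation to numeric columns; in pandas 2.0 and later, calling mean() without it on a DataFrame that contains string columns raises a TypeError.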

print(df.mean(numeric_only=True))
# age      40.666667
# point    79.000000
# other          NaN
# dtype: float64

Specifying this Series as the first argument (value) in fillna() replaces NaN in the corresponding columns with the mean value.

print(df.fillna(df.mean(numeric_only=True)))
#       name        age state  point  other
# 0    Alice  24.000000    NY   79.0    NaN
# 1      NaN  40.666667   NaN   79.0    NaN
# 2  Charlie  40.666667    CA   79.0    NaN
# 3     Dave  68.000000    TX   70.0    NaN
# 4    Ellen  40.666667    CA   88.0    NaN
# 5    Frank  30.000000   NaN   79.0    NaN

Similarly, to replace NaN values with the median, use the median() method. For an even number of elements, the median is the average of the two central values.

print(df.fillna(df.median(numeric_only=True)))
#       name   age state  point  other
# 0    Alice  24.0    NY   79.0    NaN
# 1      NaN  30.0   NaN   79.0    NaN
# 2  Charlie  30.0    CA   79.0    NaN
# 3     Dave  68.0    TX   70.0    NaN
# 4    Ellen  30.0    CA   88.0    NaN
# 5    Frank  30.0   NaN   79.0    NaN

The mode can be obtained using the mode() method. Because a column can have more than one mode, mode() returns a DataFrame; the first row is taken as a Series using iloc[0]. mode() can also handle strings.

print(df.fillna(df.mode().iloc[0]))
#       name   age state  point  other
# 0    Alice  24.0    NY   70.0    NaN
# 1    Alice  24.0    CA   70.0    NaN
# 2  Charlie  24.0    CA   70.0    NaN
# 3     Dave  68.0    TX   70.0    NaN
# 4    Ellen  24.0    CA   88.0    NaN
# 5    Frank  30.0    CA   70.0    NaN
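When a column has several equally frequent values, mode() returns more than one row, and iloc[0] picks the first (smallest) one. The sketch below uses a hypothetical DataFrame df_tie with a tie in column x to make this visible.

```python
import numpy as np
import pandas as pd

df_tie = pd.DataFrame({
    'x': [1.0, 1.0, 2.0, 2.0, np.nan],   # 1.0 and 2.0 both appear twice
    'y': [5.0, 5.0, 7.0, np.nan, np.nan],  # 5.0 is the only mode
})

modes = df_tie.mode()
print(modes['x'].tolist())
# [1.0, 2.0]

# iloc[0] keeps only the first mode of each column.
filled = df_tie.fillna(modes.iloc[0])
print(filled['x'].tolist())
# [1.0, 1.0, 2.0, 2.0, 1.0]
print(filled['y'].tolist())
# [5.0, 5.0, 7.0, 5.0, 5.0]
```

If a different tie-breaking rule is needed, the fill values would have to be chosen explicitly, for example with a dictionary.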

Replace NaN with adjacent values: ffill(), bfill()

To replace NaN with the adjacent valid value, use the ffill() and bfill() methods.

ffill() replaces NaN with the previous valid value, and bfill() replaces it with the next valid value.

print(df.ffill())
#       name   age state  point  other
# 0    Alice  24.0    NY    NaN    NaN
# 1    Alice  24.0    NY    NaN    NaN
# 2  Charlie  24.0    CA    NaN    NaN
# 3     Dave  68.0    TX   70.0    NaN
# 4    Ellen  68.0    CA   88.0    NaN
# 5    Frank  30.0    CA   88.0    NaN

print(df.bfill())
#       name   age state  point  other
# 0    Alice  24.0    NY   70.0    NaN
# 1  Charlie  68.0    CA   70.0    NaN
# 2  Charlie  68.0    CA   70.0    NaN
# 3     Dave  68.0    TX   70.0    NaN
# 4    Ellen  30.0    CA   88.0    NaN
# 5    Frank  30.0   NaN    NaN    NaN

By default, every NaN in a run of consecutive NaN values is replaced. The limit argument sets the maximum number of consecutive NaN values to fill; any beyond that remain NaN.

print(df.ffill(limit=1))
#       name   age state  point  other
# 0    Alice  24.0    NY    NaN    NaN
# 1    Alice  24.0    NY    NaN    NaN
# 2  Charlie   NaN    CA    NaN    NaN
# 3     Dave  68.0    TX   70.0    NaN
# 4    Ellen  68.0    CA   88.0    NaN
# 5    Frank  30.0    CA   88.0    NaN

print(df.bfill(limit=1))
#       name   age state  point  other
# 0    Alice  24.0    NY    NaN    NaN
# 1  Charlie   NaN    CA    NaN    NaN
# 2  Charlie  68.0    CA   70.0    NaN
# 3     Dave  68.0    TX   70.0    NaN
# 4    Ellen  30.0    CA   88.0    NaN
# 5    Frank  30.0   NaN    NaN    NaN

Setting the axis argument to 1 or 'columns' replaces NaN with left or right values. ffill() uses the left value, and bfill() uses the right value.

print(df.ffill(axis=1))
#       name      age state point other
# 0    Alice     24.0    NY    NY    NY
# 1      NaN      NaN   NaN   NaN   NaN
# 2  Charlie  Charlie    CA    CA    CA
# 3     Dave     68.0    TX  70.0  70.0
# 4    Ellen    Ellen    CA  88.0  88.0
# 5    Frank     30.0  30.0  30.0  30.0

print(df.bfill(axis=1))
#       name   age state point other
# 0    Alice  24.0    NY   NaN   NaN
# 1      NaN   NaN   NaN   NaN   NaN
# 2  Charlie    CA    CA   NaN   NaN
# 3     Dave  68.0    TX  70.0   NaN
# 4    Ellen    CA    CA  88.0   NaN
# 5    Frank  30.0   NaN   NaN   NaN
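In the mixed-type example above, values cross column boundaries (a name ends up in the age column), so row-wise filling tends to be more meaningful when all columns hold the same kind of data. A minimal sketch with a made-up numeric DataFrame df_num:

```python
import numpy as np
import pandas as pd

df_num = pd.DataFrame({
    'q1': [1.0, np.nan],
    'q2': [np.nan, 5.0],
    'q3': [3.0, np.nan],
})

# ffill(axis=1) copies the value from the left within each row.
print(df_num.ffill(axis=1).values.tolist())
# [[1.0, 1.0, 3.0], [nan, 5.0, 5.0]]

# bfill(axis=1) copies the value from the right within each row.
print(df_num.bfill(axis=1).values.tolist())
# [[1.0, 3.0, 3.0], [5.0, 5.0, nan]]
```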

Note that pad() and backfill(), which perform the same operation as ffill() and bfill(), have been deprecated since pandas version 2.0.0.

The method argument in fillna()

The method argument in fillna(), although deprecated since version 2.1.0, allows for the same functionality as ffill() and bfill().

Setting the method argument to 'ffill' or 'pad' replicates the functionality of ffill(), while 'bfill' or 'backfill' yields the same result as bfill().

As of version 2.1.4, it is still usable, but a FutureWarning is issued.

print(df.fillna(method='ffill', limit=1))
#       name   age state  point  other
# 0    Alice  24.0    NY    NaN    NaN
# 1    Alice  24.0    NY    NaN    NaN
# 2  Charlie   NaN    CA    NaN    NaN
# 3     Dave  68.0    TX   70.0    NaN
# 4    Ellen  68.0    CA   88.0    NaN
# 5    Frank  30.0    CA   88.0    NaN
# 
# /var/folders/rf/b7l8_vgj5mdgvghn_326rn_c0000gn/T/ipykernel_50534/2498159999.py:1: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
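For code that still passes a method name around, one possible migration path is a small wrapper that dispatches to ffill() or bfill(). The function fill_by_method below is a hypothetical helper written for this sketch, not part of pandas.

```python
import numpy as np
import pandas as pd


def fill_by_method(obj, method, limit=None):
    """Hypothetical helper: map a legacy method name to ffill()/bfill()."""
    if method in ('ffill', 'pad'):
        return obj.ffill(limit=limit)
    if method in ('bfill', 'backfill'):
        return obj.bfill(limit=limit)
    raise ValueError(f'unsupported method: {method!r}')


s = pd.Series([24.0, np.nan, np.nan, 68.0, np.nan, 30.0])

# Same result as ffill(limit=1), without passing method= to fillna().
print(fill_by_method(s, 'ffill', limit=1).tolist())
# [24.0, 24.0, nan, 68.0, 68.0, 30.0]
```

Because the wrapper never calls fillna() with the method argument, it avoids the FutureWarning shown above.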

Modify the original object: inplace

By default, fillna(), ffill(), and bfill() return a new object without modifying the original. Setting the inplace argument to True modifies the original object.

df.fillna(0, inplace=True)
print(df)
#       name   age state  point  other
# 0    Alice  24.0    NY    0.0    0.0
# 1        0   0.0     0    0.0    0.0
# 2  Charlie   0.0    CA    0.0    0.0
# 3     Dave  68.0    TX   70.0    0.0
# 4    Ellen   0.0    CA   88.0    0.0
# 5    Frank  30.0     0    0.0    0.0
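The default (non-inplace) behavior can be confirmed with a quick check. The sketch below uses a made-up DataFrame df_demo and also shows reassignment as an alternative to inplace=True.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'age': [24.0, np.nan], 'point': [np.nan, 70.0]})

# Without inplace=True, the original keeps its NaN values.
filled = df_demo.fillna(0)
print(df_demo.isna().sum().sum())
# 2
print(filled.isna().sum().sum())
# 0

# Reassigning the result updates a single column without inplace=True.
df_demo['age'] = df_demo['age'].fillna(0)
print(df_demo['age'].tolist())
# [24.0, 0.0]
```

Reassignment is often the safer choice when filling a selected column, since whether a selection behaves as a view or a copy can depend on the pandas version.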

fillna(), ffill(), and bfill() on pandas.Series

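fillna() works on a Series in much the same way as on a DataFrame. A scalar fills every NaN, while a dictionary maps index labels to fill values, leaving NaN at any label that is not listed.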

s = pd.read_csv('data/src/sample_pandas_normal_nan.csv')['age']
print(s)
# 0    24.0
# 1     NaN
# 2     NaN
# 3    68.0
# 4     NaN
# 5    30.0
# Name: age, dtype: float64

print(s.fillna(100))
# 0     24.0
# 1    100.0
# 2    100.0
# 3     68.0
# 4    100.0
# 5     30.0
# Name: age, dtype: float64

print(s.fillna({1: 100, 4: -100}))
# 0     24.0
# 1    100.0
# 2      NaN
# 3     68.0
# 4   -100.0
# 5     30.0
# Name: age, dtype: float64
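A Series can also be passed as the fill value, in which case it is aligned on index labels just like the dictionary. The following sketch uses made-up string labels to make the alignment easier to see.

```python
import numpy as np
import pandas as pd

s_labeled = pd.Series([24.0, np.nan, np.nan], index=['a', 'b', 'c'])

# Only label 'b' matches a NaN position; 'z' has no counterpart and is ignored.
fill_values = pd.Series({'b': 100.0, 'z': -1.0})

result = s_labeled.fillna(fill_values)
print(result.tolist())
# [24.0, 100.0, nan]
```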

ffill() and bfill() are also available.

print(s.ffill(limit=1))
# 0    24.0
# 1    24.0
# 2     NaN
# 3    68.0
# 4    68.0
# 5    30.0
# Name: age, dtype: float64

print(s.bfill(limit=1))
# 0    24.0
# 1     NaN
# 2    68.0
# 3    68.0
# 4    30.0
# 5    30.0
# Name: age, dtype: float64

pad() and backfill() are also present, but have been deprecated since version 2.0.0.

The method argument in fillna() has been deprecated since version 2.1.0. As of version 2.1.4, it is still usable, but a FutureWarning is issued.

print(s.fillna(method='ffill', limit=1))
# 0    24.0
# 1    24.0
# 2     NaN
# 3    68.0
# 4    68.0
# 5    30.0
# Name: age, dtype: float64
# 
# /var/folders/rf/b7l8_vgj5mdgvghn_326rn_c0000gn/T/ipykernel_50534/2241812369.py:1: FutureWarning: Series.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
