pandas: Interpolate NaN (missing values) with interpolate()

Modified: 2023-08-02 | Tags: Python, pandas

You can interpolate missing values (NaN) in pandas.DataFrame and pandas.Series with the interpolate() method.

Contents

Basic usage of interpolate()
Interpolation method: method
For time-series data
In the case that the data type (dtype) is object

See the following article on removing, replacing, and counting missing values.

The sample code in this article uses pandas version 2.0.3. NumPy is also imported.

import pandas as pd
import numpy as np

print(pd.__version__)
# 2.0.3

source: pandas_interpolate.py

Basic usage of `interpolate()`

The following pandas.DataFrame is used as an example.

df = pd.DataFrame({'col1': [0, np.nan, np.nan, 3, 4],
                   'col2': [np.nan, 1, 2, np.nan, np.nan],
                   'col3': [4, np.nan, np.nan, 7, 10]})
print(df)
#    col1  col2  col3
# 0   0.0   NaN   4.0
# 1   NaN   1.0   NaN
# 2   NaN   2.0   NaN
# 3   3.0   NaN   7.0
# 4   4.0   NaN  10.0

source: pandas_interpolate.py

By default, linear interpolation is performed on each column. The same value is repeated for NaN at the bottom, while NaN at the top remains unchanged.

print(df.interpolate())
#    col1  col2  col3
# 0   0.0   NaN   4.0
# 1   1.0   1.0   5.0
# 2   2.0   2.0   6.0
# 3   3.0   2.0   7.0
# 4   4.0   2.0  10.0

source: pandas_interpolate.py

Row or column: `axis`

If axis=1, interpolation is performed across each row. The same value is repeated for the rightmost NaN, while the leftmost NaN remains unchanged.

print(df.interpolate(axis=1))
#    col1  col2  col3
# 0   0.0   2.0   4.0
# 1   NaN   1.0   1.0
# 2   NaN   2.0   2.0
# 3   3.0   5.0   7.0
# 4   4.0   7.0  10.0

source: pandas_interpolate.py

Maximum number of consecutive `NaN` to fill: `limit`

You can use the limit argument to specify the maximum number of consecutive NaN values to interpolate. By default, this is set to None, meaning all consecutive NaN values will be interpolated.

print(df.interpolate(limit=1))
#    col1  col2  col3
# 0   0.0   NaN   4.0
# 1   1.0   1.0   5.0
# 2   NaN   2.0   NaN
# 3   3.0   2.0   7.0
# 4   4.0   NaN  10.0

source: pandas_interpolate.py

Direction to interpolate: `limit_direction`

You can specify the direction of interpolation with the limit_direction argument, which can be set to 'forward', 'backward', or 'both'.

print(df.interpolate(limit=1, limit_direction='forward'))
#    col1  col2  col3
# 0   0.0   NaN   4.0
# 1   1.0   1.0   5.0
# 2   NaN   2.0   NaN
# 3   3.0   2.0   7.0
# 4   4.0   NaN  10.0

print(df.interpolate(limit=1, limit_direction='backward'))
#    col1  col2  col3
# 0   0.0   1.0   4.0
# 1   NaN   1.0   NaN
# 2   2.0   2.0   6.0
# 3   3.0   NaN   7.0
# 4   4.0   NaN  10.0

print(df.interpolate(limit=1, limit_direction='both'))
#    col1  col2  col3
# 0   0.0   1.0   4.0
# 1   1.0   1.0   5.0
# 2   2.0   2.0   6.0
# 3   3.0   2.0   7.0
# 4   4.0   NaN  10.0

source: pandas_interpolate.py

As mentioned above, by default, the top (or leftmost) NaN values remain unchanged. However, setting limit_direction='both' allows interpolation at both ends.

print(df.interpolate(limit_direction='both'))
#    col1  col2  col3
# 0   0.0   1.0   4.0
# 1   1.0   1.0   5.0
# 2   2.0   2.0   6.0
# 3   3.0   2.0   7.0
# 4   4.0   2.0  10.0

source: pandas_interpolate.py

Interpolate or extrapolate or both: `limit_area`

You can specify the area to be interpolated with the limit_area argument.

'inside': Only interpolation
'outside': Only extrapolation
None (default): Both interpolation and extrapolation

print(df.interpolate(limit_area='inside'))
#    col1  col2  col3
# 0   0.0   NaN   4.0
# 1   1.0   1.0   5.0
# 2   2.0   2.0   6.0
# 3   3.0   NaN   7.0
# 4   4.0   NaN  10.0

print(df.interpolate(limit_area='outside'))
#    col1  col2  col3
# 0   0.0   NaN   4.0
# 1   NaN   1.0   NaN
# 2   NaN   2.0   NaN
# 3   3.0   2.0   7.0
# 4   4.0   2.0  10.0

print(df.interpolate(limit_area='outside', limit_direction='both'))
#    col1  col2  col3
# 0   0.0   1.0   4.0
# 1   NaN   1.0   NaN
# 2   NaN   2.0   NaN
# 3   3.0   2.0   7.0
# 4   4.0   2.0  10.0

source: pandas_interpolate.py

Note that "extrapolation" is used here for convenience. For linear interpolation (default), outer values are merely repetitions of the end values, not truly extrapolated. However, in spline interpolation as discussed below, the outer values are genuinely extrapolated.

Operate inplace: `inplace`

Like many other methods, you can directly modify the original object by setting inplace=True.

df.interpolate(inplace=True)
print(df)
#    col1  col2  col3
# 0   0.0   NaN   4.0
# 1   1.0   1.0   5.0
# 2   2.0   2.0   6.0
# 3   3.0   2.0   7.0
# 4   4.0   2.0  10.0

source: pandas_interpolate.py

Interpolation method: `method`

The interpolation method is specified by the first argument method. The default value is 'linear' (linear interpolation).

Linear interpolation: `linear`, `index`, `values`

With method='linear' (default), the index is ignored, but with method='index' or method='values', it is interpolated using the index value.

s = pd.Series([0, np.nan, np.nan, 3],
              index=[0, 4, 6, 8])
print(s)
# 0    0.0
# 4    NaN
# 6    NaN
# 8    3.0
# dtype: float64

print(s.interpolate())
# 0    0.0
# 4    1.0
# 6    2.0
# 8    3.0
# dtype: float64

print(s.interpolate('index'))
# 0    0.00
# 4    1.50
# 6    2.25
# 8    3.00
# dtype: float64

source: pandas_interpolate.py

If the index column contains strings, method='linear' (default) works fine, but if method is set to 'index' or 'values', an error is raised.

s.index = list('abcd')
print(s)
# a    0.0
# b    NaN
# c    NaN
# d    3.0
# dtype: float64

print(s.interpolate())
# a    0.0
# b    1.0
# c    2.0
# d    3.0
# dtype: float64

# print(s.interpolate('index'))
# TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

source: pandas_interpolate.py

Using existing values: `ffill`, `pad`, `bfill`, `backfill`

NaN values are filled with the previous existing value if method='ffill' or method='pad', or with the next existing value if method='bfill' or method='backfill'.

s = pd.Series([np.nan, 1, np.nan, 2, np.nan])
print(s)
# 0    NaN
# 1    1.0
# 2    NaN
# 3    2.0
# 4    NaN
# dtype: float64

print(s.interpolate('ffill'))
# 0    NaN
# 1    1.0
# 2    1.0
# 3    2.0
# 4    2.0
# dtype: float64

print(s.interpolate('bfill'))
# 0    1.0
# 1    1.0
# 2    2.0
# 3    2.0
# 4    NaN
# dtype: float64

source: pandas_interpolate.py

The limit_direction should be set to 'forward' if method is set to 'ffill' or 'pad', and to 'backward' if method is set to 'bfill' or 'backfill'.

# s.interpolate('ffill', limit_direction='both')
# ValueError: `limit_direction` must be 'forward' for method `ffill`

# s.interpolate('bfill', limit_direction='both')
# ValueError: `limit_direction` must be 'backward' for method `bfill`

source: pandas_interpolate.py

You can do the same with the fillna() method with the method argument.

pandas: Replace NaN (missing values) with fillna()

print(s.fillna(method='ffill'))
# 0    NaN
# 1    1.0
# 2    1.0
# 3    2.0
# 4    2.0
# dtype: float64

print(s.fillna(method='bfill'))
# 0    1.0
# 1    1.0
# 2    2.0
# 3    2.0
# 4    NaN
# dtype: float64

source: pandas_interpolate.py

Spline interpolation: `spline`

If method='spline', spline interpolation is done. You must specify the order argument.

s = pd.Series([0, 10, np.nan, np.nan, 4, np.nan],
              index=[0, 2, 5, 6, 8, 12])
print(s)
# 0      0.0
# 2     10.0
# 5      NaN
# 6      NaN
# 8      4.0
# 12     NaN
# dtype: float64

print(s.interpolate('spline', order=2))
# 0      0.00
# 2     10.00
# 5     13.75
# 6     12.00
# 8      4.00
# 12   -30.00
# dtype: float64

source: pandas_interpolate.py

Spline interpolation always depends on the index, so any changes to the index will consequently alter the results.

s.index = range(6)
print(s)
# 0     0.0
# 1    10.0
# 2     NaN
# 3     NaN
# 4     4.0
# 5     NaN
# dtype: float64

print(s.interpolate('spline', order=2))
# 0     0.0
# 1    10.0
# 2    14.0
# 3    12.0
# 4     4.0
# 5   -10.0
# dtype: float64

source: pandas_interpolate.py

Therefore, spline interpolation requires the index to be numeric. If it is strings, an error is raised.

s.index = list('abcdef')
print(s)
# a     0.0
# b    10.0
# c     NaN
# d     NaN
# e     4.0
# f     NaN
# dtype: float64

# print(s.interpolate('spline', order=2))
# ValueError: Index column must be numeric or datetime type when using spline method other than linear.
# Try setting a numeric or datetime index column before interpolating.

source: pandas_interpolate.py

Others

There are other interpolation methods that can be specified for the method argument: 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'barycentric', 'krogh', 'polynomial', 'piecewise_polynomial', 'from_derivatives', 'pchip', 'akima'.

As mentioned in the official documentation, these are wrappers for SciPy functions, including the spline interpolation ('spline') mentioned above.

In all these methods, the index must be numeric, just like in spline interpolation.

For time-series data

method='time' is provided for time-series data. In the case of method='time', linear interpolation is done according to the date and time of the index column.

df_nan = pd.DataFrame({'value': [1, np.nan, np.nan, np.nan, 31]},
                      index=pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-15', '2018-01-20', '2018-01-31']))

print(df_nan)
#             value
# 2018-01-01    1.0
# 2018-01-02    NaN
# 2018-01-15    NaN
# 2018-01-20    NaN
# 2018-01-31   31.0

print(df_nan.interpolate())
#             value
# 2018-01-01    1.0
# 2018-01-02    8.5
# 2018-01-15   16.0
# 2018-01-20   23.5
# 2018-01-31   31.0

print(df_nan.interpolate('time'))
#             value
# 2018-01-01    1.0
# 2018-01-02    2.0
# 2018-01-15   15.0
# 2018-01-20   20.0
# 2018-01-31   31.0

source: pandas_time_series_interpolation.py

In the case that the data type (`dtype`) is `object`

For example, the data type (dtype) of a column containing a string element is object.

pandas: Cast DataFrame to a specific dtype with astype()

s_object = pd.Series(['A', np.nan, 'C'])
print(s_object)
# 0      A
# 1    NaN
# 2      C
# dtype: object

source: pandas_interpolate.py

The object column is not interpolated by method='linear' (default) or other methods. It can be filled by methods such as ffill, pad, bfill, and backfill, which use the existing values.

print(s_object.interpolate())
# 0      A
# 1    NaN
# 2      C
# dtype: object

print(s_object.interpolate('ffill'))
# 0    A
# 1    A
# 2    C
# dtype: object

source: pandas_interpolate.py

The same is true if the element is a number but the data type is object.

s_object_num = pd.Series([0, np.nan, 2], dtype=object)
print(s_object_num)
# 0      0
# 1    NaN
# 2      2
# dtype: object

print(s_object_num.interpolate())
# 0      0
# 1    NaN
# 2      2
# dtype: object

print(s_object_num.interpolate('ffill'))
# 0    0
# 1    0
# 2    2
# dtype: int64

source: pandas_interpolate.py

If you convert it to float with astype(), you can interpolate. Note that it cannot be converted to int if it contains NaN.

print(s_object_num.astype(float).interpolate())
# 0    0.0
# 1    1.0
# 2    2.0
# dtype: float64

# print(s_object_num.astype(int))
# ValueError: cannot convert float NaN to integer

source: pandas_interpolate.py

pandas: Interpolate NaN (missing values) with interpolate()

Basic usage of `interpolate()`

Row or column: `axis`

Maximum number of consecutive `NaN` to fill: `limit`

Direction to interpolate: `limit_direction`

Interpolate or extrapolate or both: `limit_area`

Operate inplace: `inplace`

Interpolation method: `method`

Linear interpolation: `linear`, `index`, `values`

Using existing values: `ffill`, `pad`, `bfill`, `backfill`

Spline interpolation: `spline`

Others

For time-series data

In the case that the data type (`dtype`) is `object`

Related Categories

Related Articles

pandas: Interpolate NaN (missing values) with interpolate()

Basic usage of interpolate()

Row or column: axis

Maximum number of consecutive NaN to fill: limit

Direction to interpolate: limit_direction

Interpolate or extrapolate or both: limit_area

Operate inplace: inplace

Interpolation method: method

Linear interpolation: linear, index, values

Using existing values: ffill, pad, bfill, backfill

Spline interpolation: spline

Others

For time-series data

In the case that the data type (dtype) is object

Related Categories

Related Articles

Basic usage of `interpolate()`

Row or column: `axis`

Maximum number of consecutive `NaN` to fill: `limit`

Direction to interpolate: `limit_direction`

Interpolate or extrapolate or both: `limit_area`

Operate inplace: `inplace`

Interpolation method: `method`

Linear interpolation: `linear`, `index`, `values`

Using existing values: `ffill`, `pad`, `bfill`, `backfill`

Spline interpolation: `spline`

In the case that the data type (`dtype`) is `object`