pandas: Interpolate NaN (missing values) with interpolate()

Modified: | Tags: Python, pandas

You can interpolate missing values (NaN) in pandas.DataFrame and pandas.Series with the interpolate() method.

See the following article on removing, replacing, and counting missing values.

The sample code in this article uses pandas version 2.0.3. NumPy is also imported.

import pandas as pd
import numpy as np

print(pd.__version__)
# 2.0.3

Basic usage of interpolate()

The following pandas.DataFrame is used as an example.

df = pd.DataFrame({'col1': [0, np.nan, np.nan, 3, 4],
                   'col2': [np.nan, 1, 2, np.nan, np.nan],
                   'col3': [4, np.nan, np.nan, 7, 10]})
print(df)
#    col1  col2  col3
# 0   0.0   NaN   4.0
# 1   NaN   1.0   NaN
# 2   NaN   2.0   NaN
# 3   3.0   NaN   7.0
# 4   4.0   NaN  10.0

By default, linear interpolation is performed on each column. The same value is repeated for NaN at the bottom, while NaN at the top remains unchanged.

print(df.interpolate())
#    col1  col2  col3
# 0   0.0   NaN   4.0
# 1   1.0   1.0   5.0
# 2   2.0   2.0   6.0
# 3   3.0   2.0   7.0
# 4   4.0   2.0  10.0

Row or column: axis

If axis=1, interpolation is performed across each row. The same value is repeated for the rightmost NaN, while the leftmost NaN remains unchanged.

print(df.interpolate(axis=1))
#    col1  col2  col3
# 0   0.0   2.0   4.0
# 1   NaN   1.0   1.0
# 2   NaN   2.0   2.0
# 3   3.0   5.0   7.0
# 4   4.0   7.0  10.0

Maximum number of consecutive NaN to fill: limit

You can use the limit argument to specify the maximum number of consecutive NaN values to interpolate. By default, this is set to None, meaning all consecutive NaN values will be interpolated.

print(df.interpolate(limit=1))
#    col1  col2  col3
# 0   0.0   NaN   4.0
# 1   1.0   1.0   5.0
# 2   NaN   2.0   NaN
# 3   3.0   2.0   7.0
# 4   4.0   NaN  10.0

Direction to interpolate: limit_direction

You can specify the direction of interpolation with the limit_direction argument, which can be set to 'forward', 'backward', or 'both'.

print(df.interpolate(limit=1, limit_direction='forward'))
#    col1  col2  col3
# 0   0.0   NaN   4.0
# 1   1.0   1.0   5.0
# 2   NaN   2.0   NaN
# 3   3.0   2.0   7.0
# 4   4.0   NaN  10.0

print(df.interpolate(limit=1, limit_direction='backward'))
#    col1  col2  col3
# 0   0.0   1.0   4.0
# 1   NaN   1.0   NaN
# 2   2.0   2.0   6.0
# 3   3.0   NaN   7.0
# 4   4.0   NaN  10.0

print(df.interpolate(limit=1, limit_direction='both'))
#    col1  col2  col3
# 0   0.0   1.0   4.0
# 1   1.0   1.0   5.0
# 2   2.0   2.0   6.0
# 3   3.0   2.0   7.0
# 4   4.0   NaN  10.0

As mentioned above, by default, the top (or leftmost) NaN values remain unchanged. However, setting limit_direction='both' allows interpolation at both ends.

print(df.interpolate(limit_direction='both'))
#    col1  col2  col3
# 0   0.0   1.0   4.0
# 1   1.0   1.0   5.0
# 2   2.0   2.0   6.0
# 3   3.0   2.0   7.0
# 4   4.0   2.0  10.0

Interpolate or extrapolate or both: limit_area

You can specify the area to be interpolated with the limit_area argument.

  • 'inside': Only interpolation
  • 'outside': Only extrapolation
  • None (default): Both interpolation and extrapolation
print(df.interpolate(limit_area='inside'))
#    col1  col2  col3
# 0   0.0   NaN   4.0
# 1   1.0   1.0   5.0
# 2   2.0   2.0   6.0
# 3   3.0   NaN   7.0
# 4   4.0   NaN  10.0

print(df.interpolate(limit_area='outside'))
#    col1  col2  col3
# 0   0.0   NaN   4.0
# 1   NaN   1.0   NaN
# 2   NaN   2.0   NaN
# 3   3.0   2.0   7.0
# 4   4.0   2.0  10.0

print(df.interpolate(limit_area='outside', limit_direction='both'))
#    col1  col2  col3
# 0   0.0   1.0   4.0
# 1   NaN   1.0   NaN
# 2   NaN   2.0   NaN
# 3   3.0   2.0   7.0
# 4   4.0   2.0  10.0

Note that "extrapolation" is used here for convenience. For linear interpolation (default), outer values are merely repetitions of the end values, not truly extrapolated. However, in spline interpolation as discussed below, the outer values are genuinely extrapolated.

Operate inplace: inplace

Like many other methods, you can directly modify the original object by setting inplace=True.

df.interpolate(inplace=True)
print(df)
#    col1  col2  col3
# 0   0.0   NaN   4.0
# 1   1.0   1.0   5.0
# 2   2.0   2.0   6.0
# 3   3.0   2.0   7.0
# 4   4.0   2.0  10.0

Interpolation method: method

The interpolation method is specified by the first argument method. The default value is 'linear' (linear interpolation).

Linear interpolation: linear, index, values

With method='linear' (default), the index is ignored, but with method='index' or method='values', it is interpolated using the index value.

s = pd.Series([0, np.nan, np.nan, 3],
              index=[0, 4, 6, 8])
print(s)
# 0    0.0
# 4    NaN
# 6    NaN
# 8    3.0
# dtype: float64

print(s.interpolate())
# 0    0.0
# 4    1.0
# 6    2.0
# 8    3.0
# dtype: float64

print(s.interpolate('index'))
# 0    0.00
# 4    1.50
# 6    2.25
# 8    3.00
# dtype: float64

If the index column contains strings, method='linear' (default) works fine, but if method is set to 'index' or 'values', an error is raised.

s.index = list('abcd')
print(s)
# a    0.0
# b    NaN
# c    NaN
# d    3.0
# dtype: float64

print(s.interpolate())
# a    0.0
# b    1.0
# c    2.0
# d    3.0
# dtype: float64

# print(s.interpolate('index'))
# TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'

Using existing values: ffill, pad, bfill, backfill

NaN values are filled with the previous existing value if method='ffill' or method='pad', or with the next existing value if method='bfill' or method='backfill'.

s = pd.Series([np.nan, 1, np.nan, 2, np.nan])
print(s)
# 0    NaN
# 1    1.0
# 2    NaN
# 3    2.0
# 4    NaN
# dtype: float64

print(s.interpolate('ffill'))
# 0    NaN
# 1    1.0
# 2    1.0
# 3    2.0
# 4    2.0
# dtype: float64

print(s.interpolate('bfill'))
# 0    1.0
# 1    1.0
# 2    2.0
# 3    2.0
# 4    NaN
# dtype: float64

The limit_direction should be set to 'forward' if method is set to 'ffill' or 'pad', and to 'backward' if method is set to 'bfill' or 'backfill'.

# s.interpolate('ffill', limit_direction='both')
# ValueError: `limit_direction` must be 'forward' for method `ffill`

# s.interpolate('bfill', limit_direction='both')
# ValueError: `limit_direction` must be 'backward' for method `bfill`

You can do the same with the fillna() method with the method argument.

print(s.fillna(method='ffill'))
# 0    NaN
# 1    1.0
# 2    1.0
# 3    2.0
# 4    2.0
# dtype: float64

print(s.fillna(method='bfill'))
# 0    1.0
# 1    1.0
# 2    2.0
# 3    2.0
# 4    NaN
# dtype: float64

Spline interpolation: spline

If method='spline', spline interpolation is done. You must specify the order argument.

s = pd.Series([0, 10, np.nan, np.nan, 4, np.nan],
              index=[0, 2, 5, 6, 8, 12])
print(s)
# 0      0.0
# 2     10.0
# 5      NaN
# 6      NaN
# 8      4.0
# 12     NaN
# dtype: float64

print(s.interpolate('spline', order=2))
# 0      0.00
# 2     10.00
# 5     13.75
# 6     12.00
# 8      4.00
# 12   -30.00
# dtype: float64

Spline interpolation always depends on the index, so any changes to the index will consequently alter the results.

s.index = range(6)
print(s)
# 0     0.0
# 1    10.0
# 2     NaN
# 3     NaN
# 4     4.0
# 5     NaN
# dtype: float64

print(s.interpolate('spline', order=2))
# 0     0.0
# 1    10.0
# 2    14.0
# 3    12.0
# 4     4.0
# 5   -10.0
# dtype: float64

Therefore, spline interpolation requires the index to be numeric. If it is strings, an error is raised.

s.index = list('abcdef')
print(s)
# a     0.0
# b    10.0
# c     NaN
# d     NaN
# e     4.0
# f     NaN
# dtype: float64

# print(s.interpolate('spline', order=2))
# ValueError: Index column must be numeric or datetime type when using spline method other than linear.
# Try setting a numeric or datetime index column before interpolating.

Others

There are other interpolation methods that can be specified for the method argument: 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'barycentric', 'krogh', 'polynomial', 'piecewise_polynomial', 'from_derivatives', 'pchip', 'akima'.

As mentioned in the official documentation, these are wrappers for SciPy functions, including the spline interpolation ('spline') mentioned above.

In all these methods, the index must be numeric, just like in spline interpolation.

For time-series data

method='time' is provided for time-series data. In the case of method='time', linear interpolation is done according to the date and time of the index column.

df_nan = pd.DataFrame({'value': [1, np.nan, np.nan, np.nan, 31]},
                      index=pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-15', '2018-01-20', '2018-01-31']))

print(df_nan)
#             value
# 2018-01-01    1.0
# 2018-01-02    NaN
# 2018-01-15    NaN
# 2018-01-20    NaN
# 2018-01-31   31.0

print(df_nan.interpolate())
#             value
# 2018-01-01    1.0
# 2018-01-02    8.5
# 2018-01-15   16.0
# 2018-01-20   23.5
# 2018-01-31   31.0

print(df_nan.interpolate('time'))
#             value
# 2018-01-01    1.0
# 2018-01-02    2.0
# 2018-01-15   15.0
# 2018-01-20   20.0
# 2018-01-31   31.0

In the case that the data type (dtype) is object

For example, the data type (dtype) of a column containing a string element is object.

s_object = pd.Series(['A', np.nan, 'C'])
print(s_object)
# 0      A
# 1    NaN
# 2      C
# dtype: object

The object column is not interpolated by method='linear' (default) or other methods. It can be filled by methods such as ffill, pad, bfill, and backfill, which use the existing values.

print(s_object.interpolate())
# 0      A
# 1    NaN
# 2      C
# dtype: object

print(s_object.interpolate('ffill'))
# 0    A
# 1    A
# 2    C
# dtype: object

The same is true if the element is a number but the data type is object.

s_object_num = pd.Series([0, np.nan, 2], dtype=object)
print(s_object_num)
# 0      0
# 1    NaN
# 2      2
# dtype: object

print(s_object_num.interpolate())
# 0      0
# 1    NaN
# 2      2
# dtype: object

print(s_object_num.interpolate('ffill'))
# 0    0
# 1    0
# 2    2
# dtype: int64

If you convert it to float with astype(), you can interpolate. Note that it cannot be converted to int if it contains NaN.

print(s_object_num.astype(float).interpolate())
# 0    0.0
# 1    1.0
# 2    2.0
# dtype: float64

# print(s_object_num.astype(int))
# ValueError: cannot convert float NaN to integer

Related Categories

Related Articles