pandas: Interpolate NaN (missing values) with interpolate()
You can interpolate missing values (NaN
) in pandas.DataFrame
and pandas.Series
with the interpolate()
method.
- pandas.DataFrame.interpolate — pandas 2.0.3 documentation
- pandas.Series.interpolate — pandas 2.0.3 documentation
See the following article on removing, replacing, and counting missing values.
- pandas: Remove NaN (missing values) with dropna()
- pandas: Replace NaN (missing values) with fillna()
- pandas: Detect and count NaN (missing values) with isnull(), isna()
The sample code in this article uses pandas version 2.0.3
. NumPy is also imported.
import pandas as pd
import numpy as np
print(pd.__version__)
# 2.0.3
Basic usage of interpolate()
The following pandas.DataFrame
is used as an example.
df = pd.DataFrame({'col1': [0, np.nan, np.nan, 3, 4],
'col2': [np.nan, 1, 2, np.nan, np.nan],
'col3': [4, np.nan, np.nan, 7, 10]})
print(df)
# col1 col2 col3
# 0 0.0 NaN 4.0
# 1 NaN 1.0 NaN
# 2 NaN 2.0 NaN
# 3 3.0 NaN 7.0
# 4 4.0 NaN 10.0
By default, linear interpolation is performed on each column. The same value is repeated for NaN
at the bottom, while NaN
at the top remains unchanged.
print(df.interpolate())
# col1 col2 col3
# 0 0.0 NaN 4.0
# 1 1.0 1.0 5.0
# 2 2.0 2.0 6.0
# 3 3.0 2.0 7.0
# 4 4.0 2.0 10.0
Row or column: axis
If axis=1
, interpolation is performed across each row. The same value is repeated for the rightmost NaN
, while the leftmost NaN
remains unchanged.
print(df.interpolate(axis=1))
# col1 col2 col3
# 0 0.0 2.0 4.0
# 1 NaN 1.0 1.0
# 2 NaN 2.0 2.0
# 3 3.0 5.0 7.0
# 4 4.0 7.0 10.0
Maximum number of consecutive NaN
to fill: limit
You can use the limit
argument to specify the maximum number of consecutive NaN
values to interpolate. By default, this is set to None
, meaning all consecutive NaN
values will be interpolated.
print(df.interpolate(limit=1))
# col1 col2 col3
# 0 0.0 NaN 4.0
# 1 1.0 1.0 5.0
# 2 NaN 2.0 NaN
# 3 3.0 2.0 7.0
# 4 4.0 NaN 10.0
Direction to interpolate: limit_direction
You can specify the direction of interpolation with the limit_direction
argument, which can be set to 'forward'
, 'backward'
, or 'both'
.
print(df.interpolate(limit=1, limit_direction='forward'))
# col1 col2 col3
# 0 0.0 NaN 4.0
# 1 1.0 1.0 5.0
# 2 NaN 2.0 NaN
# 3 3.0 2.0 7.0
# 4 4.0 NaN 10.0
print(df.interpolate(limit=1, limit_direction='backward'))
# col1 col2 col3
# 0 0.0 1.0 4.0
# 1 NaN 1.0 NaN
# 2 2.0 2.0 6.0
# 3 3.0 NaN 7.0
# 4 4.0 NaN 10.0
print(df.interpolate(limit=1, limit_direction='both'))
# col1 col2 col3
# 0 0.0 1.0 4.0
# 1 1.0 1.0 5.0
# 2 2.0 2.0 6.0
# 3 3.0 2.0 7.0
# 4 4.0 NaN 10.0
As mentioned above, by default, the top (or leftmost) NaN
values remain unchanged. However, setting limit_direction='both'
allows interpolation at both ends.
print(df.interpolate(limit_direction='both'))
# col1 col2 col3
# 0 0.0 1.0 4.0
# 1 1.0 1.0 5.0
# 2 2.0 2.0 6.0
# 3 3.0 2.0 7.0
# 4 4.0 2.0 10.0
Interpolate or extrapolate or both: limit_area
You can specify the area to be interpolated with the limit_area
argument.
'inside'
: Only interpolation'outside'
: Only extrapolationNone
(default): Both interpolation and extrapolation
print(df.interpolate(limit_area='inside'))
# col1 col2 col3
# 0 0.0 NaN 4.0
# 1 1.0 1.0 5.0
# 2 2.0 2.0 6.0
# 3 3.0 NaN 7.0
# 4 4.0 NaN 10.0
print(df.interpolate(limit_area='outside'))
# col1 col2 col3
# 0 0.0 NaN 4.0
# 1 NaN 1.0 NaN
# 2 NaN 2.0 NaN
# 3 3.0 2.0 7.0
# 4 4.0 2.0 10.0
print(df.interpolate(limit_area='outside', limit_direction='both'))
# col1 col2 col3
# 0 0.0 1.0 4.0
# 1 NaN 1.0 NaN
# 2 NaN 2.0 NaN
# 3 3.0 2.0 7.0
# 4 4.0 2.0 10.0
Note that "extrapolation" is used here for convenience. For linear interpolation (default), outer values are merely repetitions of the end values, not truly extrapolated. However, in spline interpolation as discussed below, the outer values are genuinely extrapolated.
Operate inplace: inplace
Like many other methods, you can directly modify the original object by setting inplace=True
.
df.interpolate(inplace=True)
print(df)
# col1 col2 col3
# 0 0.0 NaN 4.0
# 1 1.0 1.0 5.0
# 2 2.0 2.0 6.0
# 3 3.0 2.0 7.0
# 4 4.0 2.0 10.0
Interpolation method: method
The interpolation method is specified by the first argument method
. The default value is 'linear'
(linear interpolation).
Linear interpolation: linear
, index
, values
With method='linear'
(default), the index is ignored, but with method='index'
or method='values'
, it is interpolated using the index value.
s = pd.Series([0, np.nan, np.nan, 3],
index=[0, 4, 6, 8])
print(s)
# 0 0.0
# 4 NaN
# 6 NaN
# 8 3.0
# dtype: float64
print(s.interpolate())
# 0 0.0
# 4 1.0
# 6 2.0
# 8 3.0
# dtype: float64
print(s.interpolate('index'))
# 0 0.00
# 4 1.50
# 6 2.25
# 8 3.00
# dtype: float64
If the index column contains strings, method='linear'
(default) works fine, but if method
is set to 'index'
or 'values'
, an error is raised.
s.index = list('abcd')
print(s)
# a 0.0
# b NaN
# c NaN
# d 3.0
# dtype: float64
print(s.interpolate())
# a 0.0
# b 1.0
# c 2.0
# d 3.0
# dtype: float64
# print(s.interpolate('index'))
# TypeError: Cannot cast array data from dtype('O') to dtype('float64') according to the rule 'safe'
Using existing values: ffill
, pad
, bfill
, backfill
NaN
values are filled with the previous existing value if method='ffill'
or method='pad'
, or with the next existing value if method='bfill'
or method='backfill'
.
s = pd.Series([np.nan, 1, np.nan, 2, np.nan])
print(s)
# 0 NaN
# 1 1.0
# 2 NaN
# 3 2.0
# 4 NaN
# dtype: float64
print(s.interpolate('ffill'))
# 0 NaN
# 1 1.0
# 2 1.0
# 3 2.0
# 4 2.0
# dtype: float64
print(s.interpolate('bfill'))
# 0 1.0
# 1 1.0
# 2 2.0
# 3 2.0
# 4 NaN
# dtype: float64
The limit_direction
should be set to 'forward'
if method
is set to 'ffill'
or 'pad'
, and to 'backward'
if method
is set to 'bfill'
or 'backfill'
.
# s.interpolate('ffill', limit_direction='both')
# ValueError: `limit_direction` must be 'forward' for method `ffill`
# s.interpolate('bfill', limit_direction='both')
# ValueError: `limit_direction` must be 'backward' for method `bfill`
You can do the same with the fillna()
method with the method
argument.
print(s.fillna(method='ffill'))
# 0 NaN
# 1 1.0
# 2 1.0
# 3 2.0
# 4 2.0
# dtype: float64
print(s.fillna(method='bfill'))
# 0 1.0
# 1 1.0
# 2 2.0
# 3 2.0
# 4 NaN
# dtype: float64
Spline interpolation: spline
If method='spline'
, spline interpolation is done. You must specify the order
argument.
s = pd.Series([0, 10, np.nan, np.nan, 4, np.nan],
index=[0, 2, 5, 6, 8, 12])
print(s)
# 0 0.0
# 2 10.0
# 5 NaN
# 6 NaN
# 8 4.0
# 12 NaN
# dtype: float64
print(s.interpolate('spline', order=2))
# 0 0.00
# 2 10.00
# 5 13.75
# 6 12.00
# 8 4.00
# 12 -30.00
# dtype: float64
Spline interpolation always depends on the index, so any changes to the index will consequently alter the results.
s.index = range(6)
print(s)
# 0 0.0
# 1 10.0
# 2 NaN
# 3 NaN
# 4 4.0
# 5 NaN
# dtype: float64
print(s.interpolate('spline', order=2))
# 0 0.0
# 1 10.0
# 2 14.0
# 3 12.0
# 4 4.0
# 5 -10.0
# dtype: float64
Therefore, spline interpolation requires the index to be numeric. If it is strings, an error is raised.
s.index = list('abcdef')
print(s)
# a 0.0
# b 10.0
# c NaN
# d NaN
# e 4.0
# f NaN
# dtype: float64
# print(s.interpolate('spline', order=2))
# ValueError: Index column must be numeric or datetime type when using spline method other than linear.
# Try setting a numeric or datetime index column before interpolating.
Others
There are other interpolation methods that can be specified for the method
argument: 'nearest'
, 'zero'
, 'slinear'
, 'quadratic'
, 'cubic'
, 'barycentric'
, 'krogh'
, 'polynomial'
, 'piecewise_polynomial'
, 'from_derivatives'
, 'pchip'
, 'akima'
.
As mentioned in the official documentation, these are wrappers for SciPy functions, including the spline interpolation ('spline'
) mentioned above.
- pandas.DataFrame.interpolate — pandas 2.0.3 documentation
- Interpolation (scipy.interpolate) — SciPy v1.11.1 Manual
In all these methods, the index must be numeric, just like in spline interpolation.
For time-series data
method='time'
is provided for time-series data. In the case of method='time'
, linear interpolation is done according to the date and time of the index column.
df_nan = pd.DataFrame({'value': [1, np.nan, np.nan, np.nan, 31]},
index=pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-15', '2018-01-20', '2018-01-31']))
print(df_nan)
# value
# 2018-01-01 1.0
# 2018-01-02 NaN
# 2018-01-15 NaN
# 2018-01-20 NaN
# 2018-01-31 31.0
print(df_nan.interpolate())
# value
# 2018-01-01 1.0
# 2018-01-02 8.5
# 2018-01-15 16.0
# 2018-01-20 23.5
# 2018-01-31 31.0
print(df_nan.interpolate('time'))
# value
# 2018-01-01 1.0
# 2018-01-02 2.0
# 2018-01-15 15.0
# 2018-01-20 20.0
# 2018-01-31 31.0
In the case that the data type (dtype
) is object
For example, the data type (dtype
) of a column containing a string element is object
.
s_object = pd.Series(['A', np.nan, 'C'])
print(s_object)
# 0 A
# 1 NaN
# 2 C
# dtype: object
The object
column is not interpolated by method='linear'
(default) or other methods. It can be filled by methods such as ffill
, pad
, bfill
, and backfill
, which use the existing values.
print(s_object.interpolate())
# 0 A
# 1 NaN
# 2 C
# dtype: object
print(s_object.interpolate('ffill'))
# 0 A
# 1 A
# 2 C
# dtype: object
The same is true if the element is a number but the data type is object
.
s_object_num = pd.Series([0, np.nan, 2], dtype=object)
print(s_object_num)
# 0 0
# 1 NaN
# 2 2
# dtype: object
print(s_object_num.interpolate())
# 0 0
# 1 NaN
# 2 2
# dtype: object
print(s_object_num.interpolate('ffill'))
# 0 0
# 1 0
# 2 2
# dtype: int64
If you convert it to float
with astype()
, you can interpolate. Note that it cannot be converted to int
if it contains NaN
.
print(s_object_num.astype(float).interpolate())
# 0 0.0
# 1 1.0
# 2 2.0
# dtype: float64
# print(s_object_num.astype(int))
# ValueError: cannot convert float NaN to integer