Missing values in pandas (nan, None, pd.NA)
In pandas, a missing value (NA: not available) is mainly represented by nan (not a number). None is also considered a missing value.
The sample code in this article uses pandas version 2.0.3. NumPy and math are also imported.
import math
import numpy as np
import pandas as pd
print(pd.__version__)
# 2.0.3
Missing values caused by reading files, etc.
Reading a CSV file with missing values generates nan. When printed with print(), this missing value is represented as NaN.
df = pd.read_csv('data/src/sample_pandas_normal_nan.csv')[:3]
print(df)
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 1 NaN NaN NaN NaN NaN
# 2 Charlie NaN CA NaN NaN
You can use methods like isnull(), dropna(), and fillna() to detect, remove, and replace missing values.
- pandas: Detect and count NaN (missing values) with isnull(), isna()
- pandas: Remove NaN (missing values) with dropna()
- pandas: Replace NaN (missing values) with fillna()
print(df.isnull())
# name age state point other
# 0 False False False True True
# 1 True True True True True
# 2 False True False True True
print(df.dropna(how='all'))
# name age state point other
# 0 Alice 24.0 NY NaN NaN
# 2 Charlie NaN CA NaN NaN
print(df.fillna(0))
# name age state point other
# 0 Alice 24.0 NY 0.0 0.0
# 1 0 0.0 0 0.0 0.0
# 2 Charlie 0.0 CA 0.0 0.0
In an object column, nan is a Python built-in float; in a floatXX column, it is a NumPy numpy.floatXX value. Both are treated as missing values.
print(df.dtypes)
# name object
# age float64
# state object
# point float64
# other float64
# dtype: object
print(df.at[1, 'name'])
# nan
print(type(df.at[1, 'name']))
# <class 'float'>
print(df.at[1, 'age'])
# nan
print(type(df.at[1, 'age']))
# <class 'numpy.float64'>
In addition to reading a file, nan is used to represent a missing value when an element does not exist in the result of methods like reindex(), merge(), and others.
- pandas: Reorder rows and columns in DataFrame with reindex()
- pandas: Merge DataFrame with merge(), join() (INNER, OUTER JOIN)
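As a minimal sketch of how reindex() and merge() introduce missing values, the following uses small made-up Series and DataFrame objects (the labels, keys, and column names are illustrative assumptions, not data from this article).

```python
import pandas as pd

# reindex(): the label 'd' does not exist in the original index,
# so the corresponding element becomes nan (and the dtype becomes float64).
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s_reindexed = s.reindex(['a', 'b', 'd'])
print(s_reindexed.tolist())
# [10.0, 20.0, nan]

# merge() with how='outer': keys present in only one DataFrame
# get nan in the columns that come from the other DataFrame.
df_left = pd.DataFrame({'key': ['x', 'y'], 'left': [1, 2]})
df_right = pd.DataFrame({'key': ['y', 'z'], 'right': [3, 4]})
df_merged = df_left.merge(df_right, on='key', how='outer')
print(df_merged.isnull().sum().tolist())
# [0, 1, 1]
```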
nan (not a number) is considered a missing value
In Python, you can create nan with float('nan'), math.nan, or np.nan. nan is considered a missing value in pandas.
s_nan = pd.Series([float('nan'), math.nan, np.nan])
print(s_nan)
# 0 NaN
# 1 NaN
# 2 NaN
# dtype: float64
print(s_nan.isnull())
# 0 True
# 1 True
# 2 True
# dtype: bool
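Because nan never compares equal to anything, including itself, a comparison with == cannot be used to find it. The following sketch, using an illustrative Series, contrasts == with isnull(), math.isnan(), and pd.isna().

```python
import math

import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Comparing with == never matches nan.
eq_mask = s == np.nan
print(eq_mask.tolist())
# [False, False, False]

# isnull() detects it as a missing value.
na_mask = s.isnull()
print(na_mask.tolist())
# [False, True, False]

# For a single scalar, math.isnan() or pd.isna() can be used.
print(math.isnan(s.iloc[1]), pd.isna(s.iloc[1]))
# True True
```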
None is also considered a missing value
In pandas, None is also treated as a missing value. None is a built-in constant in Python.
print(None)
# None
print(type(None))
# <class 'NoneType'>
For numeric columns, None is converted to nan when a DataFrame or Series containing None is created, or when None is assigned to an element.
s_none_float = pd.Series([None, 0.1, 0.2])
s_none_float[2] = None
print(s_none_float)
# 0 NaN
# 1 0.1
# 2 NaN
# dtype: float64
print(s_none_float.isnull())
# 0 True
# 1 False
# 2 True
# dtype: bool
Because nan is a floating-point number (float), converting None to nan changes the column's data type (dtype) to float, even if all the other values are integers (int).
s_none_int = pd.Series([None, 1, 2])
print(s_none_int)
# 0 NaN
# 1 1.0
# 2 2.0
# dtype: float64
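One consequence is that such a column cannot be cast straight back to an ordinary integer dtype while it still contains nan. A possible workaround, sketched below, is to fill the missing values first. The fill value 0 is an arbitrary assumption for illustration.

```python
import pandas as pd

s_float = pd.Series([None, 1, 2])  # becomes float64 because of nan

# Casting to int64 fails while nan is present.
try:
    s_float.astype('int64')
    cast_failed = False
except ValueError:
    cast_failed = True
print(cast_failed)
# True

# Filling the missing value first makes the cast possible.
s_int = s_float.fillna(0).astype('int64')
print(s_int.tolist())
# [0, 1, 2]
```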
In an object column, None remains None, but isnull() still detects it as a missing value. Methods such as dropna() and fillna() handle it as well.
s_none_object = pd.Series([None, 'abc', 'xyz'])
print(s_none_object)
# 0 None
# 1 abc
# 2 xyz
# dtype: object
print(s_none_object.isnull())
# 0 True
# 1 False
# 2 False
# dtype: bool
print(s_none_object.fillna(0))
# 0 0
# 1 abc
# 2 xyz
# dtype: object
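As a further sketch, an object column can hold None and nan at the same time, and both are detected and dropped together. The dtype=object argument is specified explicitly here so that the example does not depend on how a particular pandas version infers the dtype of strings.

```python
import numpy as np
import pandas as pd

s_obj = pd.Series([None, np.nan, 'abc'], dtype=object)

# None is stored as-is in the object column.
print(s_obj.iloc[0] is None)
# True

# Both None and nan are detected as missing values.
print(s_obj.isnull().tolist())
# [True, True, False]

# dropna() removes both of them.
print(s_obj.dropna().tolist())
# ['abc']
```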
Strings are not considered missing values
Although they look the same as missing values when printed, the strings 'NaN' and 'None' are not treated as missing values. The empty string '' is not considered a missing value either.
s_str = pd.Series(['NaN', 'None', ''])
print(s_str)
# 0 NaN
# 1 None
# 2
# dtype: object
print(s_str.isnull())
# 0 False
# 1 False
# 2 False
# dtype: bool
If you want to treat certain values as missing, you can use the replace() method to replace them with float('nan'), np.nan, or math.nan.
s_replace = s_str.replace(['NaN', 'None', ''], float('nan'))
print(s_replace)
# 0 NaN
# 1 NaN
# 2 NaN
# dtype: float64
print(s_replace.isnull())
# 0 True
# 1 True
# 2 True
# dtype: bool
Note that functions to read files such as read_csv() consider '', 'NaN', 'null', etc., as missing values by default and replace them with nan.
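The default behavior of read_csv() can be adjusted with its keep_default_na and na_values parameters. The following is a minimal sketch using an in-memory CSV string. The CSV contents and the choice of '-' as an extra missing-value marker are assumptions made for illustration.

```python
import io

import pandas as pd

csv_text = 'name,score\nAlice,NaN\nnull,\nBob,-\n'

# Default: 'NaN', 'null', and the empty field are read as nan.
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.isnull().sum().tolist())
# [1, 2]

# keep_default_na=False: the same fields are kept as plain strings.
df_raw = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(df_raw.isnull().sum().tolist())
# [0, 0]

# na_values: treat '-' as a missing value in addition to the defaults.
df_custom = pd.read_csv(io.StringIO(csv_text), na_values=['-'])
print(df_custom.isnull().sum().tolist())
# [1, 3]
```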
Infinity inf is not considered a missing value by default
In Python, inf represents infinity in floating-point numbers (float).
Infinity inf is not considered a missing value by default.
s_inf = pd.Series([float('inf'), -float('inf')])
print(s_inf)
# 0 inf
# 1 -inf
# dtype: float64
print(s_inf.isnull())
# 0 False
# 1 False
# dtype: bool
If pd.options.mode.use_inf_as_na is set to True, inf in a DataFrame or Series is treated as a missing value and shown as NaN when printed. Unlike None, inf in an object column is also shown as NaN.
pd.options.mode.use_inf_as_na = True
print(s_inf)
# 0 NaN
# 1 NaN
# dtype: float64
print(s_inf.isnull())
# 0 True
# 1 True
# dtype: bool
s_inf_object = pd.Series([float('inf'), -float('inf'), 'abc'])
print(s_inf_object)
# 0 NaN
# 1 NaN
# 2 abc
# dtype: object
print(s_inf_object.isnull())
# 0 True
# 1 True
# 2 False
# dtype: bool
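An alternative that does not rely on a global option is to replace inf with nan explicitly. This may be preferable because, to the best of my knowledge, use_inf_as_na is deprecated in pandas versions later than the 2.0.3 used in this article. The following is a minimal sketch with an illustrative Series.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.inf, -np.inf])

# np.isinf() locates the infinite values without changing any option.
inf_mask = np.isinf(s)
print(inf_mask.tolist())
# [False, True, True]

# replace() converts them to nan so that isnull(), dropna(), and fillna() apply.
s_clean = s.replace([np.inf, -np.inf], np.nan)
print(s_clean.isnull().tolist())
# [False, True, True]
```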
See the following article on how to set options in pandas.
pd.NA is an experimental value (as of 2.0.3)
pd.NA was introduced as an experimental NA scalar in pandas 1.0.0.
print(pd.NA)
# <NA>
print(type(pd.NA))
# <class 'pandas._libs.missing.NAType'>
While nan == nan is False, pd.NA == pd.NA evaluates to pd.NA, as NA does in the R language.
print(float('nan') == float('nan'))
# False
print(pd.NA == pd.NA)
# <NA>
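Because a comparison with pd.NA yields pd.NA rather than True or False, the result cannot be used directly in an if statement. The following sketch shows this, along with a few logical operations whose result is determined regardless of the missing value. pd.isna() is used here to test a scalar.

```python
import pandas as pd

# Logical operations return a definite result only when it does not
# depend on the missing value.
print(pd.NA | True)
# True
print(pd.NA & False)
# False
print(pd.NA & True)
# <NA>

# Converting pd.NA to bool raises TypeError, so use pd.isna() to test for it.
try:
    bool(pd.NA == pd.NA)
    raised = False
except TypeError:
    raised = True
print(raised)
# True
print(pd.isna(pd.NA))
# True
```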
Of course, pd.NA is treated as a missing value.
s_na = pd.Series([None, 1, 2], dtype='Int64')
print(s_na)
# 0 <NA>
# 1 1
# 2 2
# dtype: Int64
print(s_na.isnull())
# 0 True
# 1 False
# 2 False
# dtype: bool
print(s_na.fillna(0))
# 0 0
# 1 1
# 2 2
# dtype: Int64
See the pandas documentation on the nullable integer data type for the Int64 used in the sample code above. With this dtype, the other integer values are not converted to floating-point numbers even if the column contains missing values.
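An existing float64 column that holds only whole numbers and nan can also be converted to Int64 with astype(). The following is a minimal sketch with an illustrative Series.

```python
import pandas as pd

s_float = pd.Series([None, 1, 2])  # float64: [nan, 1.0, 2.0]

# astype('Int64') turns nan into pd.NA and restores integer values.
s_nullable = s_float.astype('Int64')
print(s_nullable.dtype)
# Int64
print(s_nullable.isnull().tolist())
# [True, False, False]

# Arithmetic keeps the Int64 dtype; sum() skips the missing value by default.
s_plus = s_nullable + 1
print(s_plus.dtype)
# Int64
print(s_nullable.sum())
# 3
```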
Note that as of pandas 2.0.3 (June 2023), pd.NA is still marked "Experimental", and its behavior may change. The official documentation carries the following warning:
Warning: Experimental: the behaviour of pd.NA can still change without warning.
Source: Working with missing data - Experimental NA scalar to denote missing values, pandas 2.0.3 documentation