Missing values in pandas (nan, None, pd.NA)

Modified: 2023-08-02 | Tags: Python, pandas

In pandas, a missing value (NA: not available) is mainly represented by nan (not a number). None is also considered a missing value.

Working with missing data — pandas 2.0.3 documentation

Contents

Missing values caused by reading files, etc.
nan (not a number) is considered a missing value
None is also considered a missing value
String is not considered a missing value
Infinity inf is not considered a missing value by default
pd.NA is an experimental value (as of 2.0.3)

The sample code in this article uses pandas version 2.0.3. NumPy and math are also imported.

import math

import numpy as np
import pandas as pd

print(pd.__version__)
# 2.0.3

source: pandas_nan_none_na.py

Missing values caused by reading files, etc.

Reading a CSV file with missing values generates nan. When printed with print(), this missing value is represented as NaN.

sample_pandas_normal_nan.csv

df = pd.read_csv('data/src/sample_pandas_normal_nan.csv')[:3]
print(df)
#       name   age state  point  other
# 0    Alice  24.0    NY    NaN    NaN
# 1      NaN   NaN   NaN    NaN    NaN
# 2  Charlie   NaN    CA    NaN    NaN

source: pandas_nan_none_na.py

You can use methods like isnull(), dropna(), and fillna() to detect, remove, and replace missing values.

print(df.isnull())
#     name    age  state  point  other
# 0  False  False  False   True   True
# 1   True   True   True   True   True
# 2  False   True  False   True   True

print(df.dropna(how='all'))
#       name   age state  point  other
# 0    Alice  24.0    NY    NaN    NaN
# 2  Charlie   NaN    CA    NaN    NaN

print(df.fillna(0))
#       name   age state  point  other
# 0    Alice  24.0    NY    0.0    0.0
# 1        0   0.0     0    0.0    0.0
# 2  Charlie   0.0    CA    0.0    0.0

source: pandas_nan_none_na.py
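As a small self-contained sketch of the three methods above, the following builds a DataFrame by hand, counts the missing values per column, drops only the rows where a given column is missing, and fills each column with its own replacement value. The data and the variable names (df_demo, na_counts, and so on) are made up for illustration and are not part of the sample CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only
df_demo = pd.DataFrame({
    'name': ['Alice', np.nan, 'Charlie'],
    'age': [24, np.nan, np.nan],
})

# Count missing values per column
na_counts = df_demo.isnull().sum()
print(na_counts)

# Drop only the rows where 'name' is missing
df_named = df_demo.dropna(subset=['name'])
print(df_named)

# Fill each column with its own replacement value
df_filled = df_demo.fillna({'name': 'unknown', 'age': 0})
print(df_filled)
```

Passing a dict to fillna() is one way to avoid putting a numeric 0 into a string column, as happens with fillna(0) above.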

The type of nan depends on the column's dtype: in an object column it is a Python built-in float, and in a floatXX column it is a NumPy numpy.floatXX. Both are treated as missing values.

print(df.dtypes)
# name      object
# age      float64
# state     object
# point    float64
# other    float64
# dtype: object

print(df.at[1, 'name'])
# nan

print(type(df.at[1, 'name']))
# <class 'float'>

print(df.at[1, 'age'])
# nan

print(type(df.at[1, 'age']))
# <class 'numpy.float64'>

source: pandas_nan_none_na.py

In addition to reading a file, nan is used to represent a missing value when an element does not exist in the result of methods like reindex(), merge(), and others.
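A minimal sketch of both cases, using small hand-made objects (the labels, keys, and values are hypothetical): reindex() with a label that does not exist in the original index, and an outer merge() where some keys exist on only one side.

```python
import pandas as pd

# reindex(): the label 'd' does not exist in the original index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s_reindexed = s.reindex(['a', 'b', 'd'])
print(s_reindexed)
# a    10.0
# b    20.0
# d     NaN
# dtype: float64

# merge(): keys 'y' and 'z' exist on only one side
left = pd.DataFrame({'key': ['x', 'y'], 'left_val': [1, 2]})
right = pd.DataFrame({'key': ['x', 'z'], 'right_val': [3, 4]})
merged = pd.merge(left, right, on='key', how='outer')
print(merged)
print(merged.isnull())
```

Note that the integer values become floats in both results, because nan is a float.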

`nan` (not a number) is considered a missing value

In Python, you can create nan with float('nan'), math.nan, or np.nan. nan is considered a missing value in pandas.

nan (not a number) in Python

s_nan = pd.Series([float('nan'), math.nan, np.nan])
print(s_nan)
# 0   NaN
# 1   NaN
# 2   NaN
# dtype: float64

print(s_nan.isnull())
# 0    True
# 1    True
# 2    True
# dtype: bool

source: pandas_nan_none_na.py

`None` is also considered a missing value

In pandas, None is also treated as a missing value. None is a built-in constant in Python.

None in Python

print(None)
# None

print(type(None))
# <class 'NoneType'>

source: pandas_nan_none_na.py

For numeric columns, None is converted to nan both when a DataFrame or Series containing None is created and when None is assigned to an element.

s_none_float = pd.Series([None, 0.1, 0.2])
s_none_float[2] = None
print(s_none_float)
# 0    NaN
# 1    0.1
# 2    NaN
# dtype: float64

print(s_none_float.isnull())
# 0     True
# 1    False
# 2     True
# dtype: bool

source: pandas_nan_none_na.py

Since nan is a floating-point number (float), converting None to nan changes the data type (dtype) of the column to float, even if all the other values are integers (int).

s_none_int = pd.Series([None, 1, 2])
print(s_none_int)
# 0    NaN
# 1    1.0
# 2    2.0
# dtype: float64

source: pandas_nan_none_na.py

Although None in the object column remains as None, it is detected as a missing value by isnull(). Of course, it is also handled by methods such as dropna() and fillna().

s_none_object = pd.Series([None, 'abc', 'xyz'])
print(s_none_object)
# 0    None
# 1     abc
# 2     xyz
# dtype: object

print(s_none_object.isnull())
# 0     True
# 1    False
# 2    False
# dtype: bool

print(s_none_object.fillna(0))
# 0      0
# 1    abc
# 2    xyz
# dtype: object

source: pandas_nan_none_na.py

String is not considered a missing value

Although they look the same as missing values when printed, the strings 'NaN' and 'None' are not treated as missing values. The empty string '' is not considered a missing value either.

s_str = pd.Series(['NaN', 'None', ''])
print(s_str)
# 0     NaN
# 1    None
# 2        
# dtype: object

print(s_str.isnull())
# 0    False
# 1    False
# 2    False
# dtype: bool

source: pandas_nan_none_na.py

If you want to treat certain values as missing, you can use the replace() method to replace them with float('nan'), np.nan, or math.nan.

pandas: Replace values in DataFrame and Series with replace()

s_replace = s_str.replace(['NaN', 'None', ''], float('nan'))
print(s_replace)
# 0   NaN
# 1   NaN
# 2   NaN
# dtype: float64

print(s_replace.isnull())
# 0    True
# 1    True
# 2    True
# dtype: bool

source: pandas_nan_none_na.py

Note that functions to read files such as read_csv() consider '', 'NaN', 'null', etc., as missing values by default and replace them with nan.

pandas: Read CSV into DataFrame with read_csv()
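The default behavior can be checked without a file by reading CSV text from io.StringIO. The sketch below uses made-up CSV text and also shows the keep_default_na and na_values arguments of read_csv(), which control which strings are treated as missing.

```python
import io

import pandas as pd

# Hypothetical CSV text for illustration only
csv_text = 'name,state\nAlice,NaN\nBob,\nCharlie,null\nDave,CA\n'

# Default: 'NaN', the empty field, and 'null' are all read as nan
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default['state'].isnull().tolist())
# [True, True, True, False]

# keep_default_na=False without na_values: no string is parsed as nan
df_keep = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(df_keep['state'].tolist())
# ['NaN', '', 'null', 'CA']

# Treat only the string 'null' as missing
df_null = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=['null'])
print(df_null['state'].isnull().tolist())
# [False, False, True, False]
```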

Infinity `inf` is not considered a missing value by default

In Python, inf represents infinity in floating-point numbers (float).

Infinity (inf) in Python

Infinity inf is not considered a missing value by default.

s_inf = pd.Series([float('inf'), -float('inf')])
print(s_inf)
# 0    inf
# 1   -inf
# dtype: float64

print(s_inf.isnull())
# 0    False
# 1    False
# dtype: bool

source: pandas_nan_none_na.py

If pd.options.mode.use_inf_as_na is set to True, inf in DataFrame and Series is converted to nan and treated as a missing value. Unlike None, inf in the object column is also converted to nan.

pd.options.mode.use_inf_as_na = True

print(s_inf)
# 0   NaN
# 1   NaN
# dtype: float64

print(s_inf.isnull())
# 0    True
# 1    True
# dtype: bool

s_inf_object = pd.Series([float('inf'), -float('inf'), 'abc'])
print(s_inf_object)
# 0    NaN
# 1    NaN
# 2    abc
# dtype: object

print(s_inf_object.isnull())
# 0     True
# 1     True
# 2    False
# dtype: bool

source: pandas_nan_none_na.py

See the following article on how to set options in pandas.

pandas: Get and set options for display, data behavior, etc.
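If you would rather not change a global option, one alternative is to replace inf with nan explicitly using replace(). This is a sketch of a different technique, not the option described above, and the sample values are made up. It may also be the safer choice on newer versions, since the use_inf_as_na option is deprecated from pandas 2.1.0.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.inf, -np.inf])

# Replace positive and negative infinity with nan explicitly
s_clean = s.replace([np.inf, -np.inf], np.nan)

print(s_clean.isnull().tolist())
# [False, True, True]
```

Because replace() returns a new object, only s_clean is affected; the original s and all other objects still contain inf.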

`pd.NA` is an experimental value (as of 2.0.3)

pd.NA was introduced as an experimental NA scalar in pandas 1.0.0.

What’s new in 1.0.0 (January 29, 2020) - Experimental NA scalar to denote missing values — pandas 2.0.3 documentation

print(pd.NA)
# <NA>

print(type(pd.NA))
# <class 'pandas._libs.missing.NAType'>

source: pandas_nan_none_na.py

While nan == nan evaluates to False, pd.NA == pd.NA evaluates to pd.NA, as in the R language.

print(float('nan') == float('nan'))
# False

print(pd.NA == pd.NA)
# <NA>

source: pandas_nan_none_na.py
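Because a comparison with pd.NA returns pd.NA rather than True or False, the result cannot be used directly in an if statement. The following sketch shows the behavior of pd.NA in logical operations and in a boolean context, as I understand it from the pandas documentation. Since pd.NA is experimental, treat the results as a snapshot rather than a guarantee.

```python
import pandas as pd

# Logical operations follow three-valued logic
print(pd.NA | True)
# True

print(pd.NA & False)
# False

print(pd.NA & True)
# <NA>

# pd.NA has no truth value, so using it as a condition raises TypeError
try:
    bool(pd.NA)
    raised = False
except TypeError:
    raised = True
print(raised)
# True

# Use pd.isna() (or isnull()) instead of == to test for pd.NA
print(pd.isna(pd.NA))
# True
```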

Of course, pd.NA is treated as a missing value.

s_na = pd.Series([None, 1, 2], dtype='Int64')
print(s_na)
# 0    <NA>
# 1       1
# 2       2
# dtype: Int64

print(s_na.isnull())
# 0     True
# 1    False
# 2    False
# dtype: bool

print(s_na.fillna(0))
# 0    0
# 1    1
# 2    2
# dtype: Int64

source: pandas_nan_none_na.py

Int64 in the sample code above is a nullable integer data type. Even if the column contains missing values, the other integer values are not converted to floating-point numbers. See the following document for details.

Nullable integer data type — pandas 2.0.3 documentation
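An existing float64 Series that contains nan can also be converted to Int64 afterwards with astype(). This is a minimal sketch that assumes every non-missing value is a whole number; values with a fractional part such as 1.5 cannot be cast safely.

```python
import pandas as pd

# None is converted to nan, so the dtype becomes float64
s_float = pd.Series([None, 1, 2])
print(s_float.dtype)
# float64

# Convert to the nullable integer type; nan becomes pd.NA
s_int = s_float.astype('Int64')
print(s_int.dtype)
# Int64

print(s_int.isnull().tolist())
# [True, False, False]

# Missing values are skipped by default in aggregations
print(s_int.sum())
# 3
```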

Note that as of 2.0.3 (June 2023), it is still "Experimental", and its behavior may change.

Warning
Experimental: the behaviour of pd.NA can still change without warning.

Working with missing data - Experimental NA scalar to denote missing values — pandas 2.0.3 documentation
