pandas: Replace values based on conditions with where(), mask()
This article explains how to replace values based on conditions in pandas. You can perform conditional operations like if then ...
or if then ... else ...
on DataFrame
or Series
.
Use the where()
method to replace values where the condition is False
, and the mask()
method where it is True
. For replacing both True
and False
, use NumPy'snp.where()
function.
Additionally, you can use Boolean indexing with loc
or iloc
to assign values based on conditions.
For information on replacing specific values or replacing and deleting missing values NaN
, see the following articles.
- pandas: Replace values in DataFrame and Series with replace()
- pandas: Replace NaN (missing values) with fillna()
- pandas: Remove NaN (missing values) with dropna()
The pandas and NumPy versions used in this article are as follows. Note that functionality may vary between versions.
import pandas as pd
import numpy as np
print(pd.__version__)
# 2.1.4
print(np.__version__)
# 1.26.2
where()
replaces False
, keeps True
unchanged
The where()
method is provided for both DataFrame
and Series
.
- pandas.DataFrame.where — pandas 2.1.4 documentation
- pandas.Series.where — pandas 2.1.4 documentation
where()
in Series
Consider the following Series
.
s = pd.Series([-2, -1, 0, 1, 2])
print(s)
# 0 -2
# 1 -1
# 2 0
# 3 1
# 4 2
# dtype: int64
First argument, cond
When you specify a bool
type Series
as the first argument, cond
, in where()
, True
values remain unchanged, while False
values become NaN
.
For example, a comparison operation on a Series
returns a bool
type Series
, which can be specified as the first argument.
print(s < 0)
# 0 True
# 1 True
# 2 False
# 3 False
# 4 False
# dtype: bool
print(s.where(s < 0))
# 0 -2.0
# 1 -1.0
# 2 NaN
# 3 NaN
# 4 NaN
# dtype: float64
You can specify not only comparison operations but also conditions on strings. See the examples with mask()
in the next section.
Array-like objects such as a list or NumPy array (ndarray
) containing bool
values can also be specified.
print(s.where([False, True, False, True, False]))
# 0 NaN
# 1 -1.0
# 2 NaN
# 3 1.0
# 4 NaN
# dtype: float64
Second argument, other
By specifying a scalar as the second argument, other
, False
values are replaced with it.
print(s.where(s < 0, 10))
# 0 -2
# 1 -1
# 2 10
# 3 10
# 4 10
# dtype: int64
Array-like objects like a list or NumPy array (ndarray
) can also be specified. False
values are replaced with the corresponding elements in these objects.
print(s.where(s < 0, [0, 10, 20, 30, 40]))
# 0 -2
# 1 -1
# 2 20
# 3 30
# 4 40
# dtype: int64
A Series
can also be specified. The replacement occurs for values matching in index (label), not position.
s2 = pd.Series([0, 10, 20, 30, 40], index=[4, 3, 2, 1, 0])
print(s2)
# 4 0
# 3 10
# 2 20
# 1 30
# 0 40
# dtype: int64
print(s.where(s < 0, s2))
# 0 -2
# 1 -1
# 2 20
# 3 10
# 4 0
# dtype: int64
It is also possible to apply a process to the original Series
for replacement.
print(s * 100 + 10)
# 0 -190
# 1 -90
# 2 10
# 3 110
# 4 210
# dtype: int64
print(s.where(s < 0, s * 100 + 10))
# 0 -2
# 1 -1
# 2 10
# 3 110
# 4 210
# dtype: int64
inplace
By default, a new object is returned without changing the original object. If the argument inplace
is set to True
, the original object is updated.
s.where(s < 0, 10, inplace=True)
print(s)
# 0 -2
# 1 -1
# 2 10
# 3 10
# 4 10
# dtype: int64
where()
in DataFrame
Consider the following DataFrame
.
df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
# A B
# 0 -2 0
# 1 -1 10
# 2 0 20
# 3 1 30
# 4 2 40
Basic usage
The basic usage of where()
in DataFrame
is the same as in Series
.
When you specify a bool
type DataFrame
as the first argument, cond
, in where()
, True
values remain unchanged, while False
values become NaN
.
print((df < 0) | (df > 20))
# A B
# 0 True False
# 1 True False
# 2 False False
# 3 False True
# 4 False True
print(df.where((df < 0) | (df > 20)))
# A B
# 0 -2.0 NaN
# 1 -1.0 NaN
# 2 NaN NaN
# 3 NaN 30.0
# 4 NaN 40.0
In the above example, multiple conditions are combined using |
(OR). Note that using and
, or
instead of &
, |
, or omitting parentheses will result in an error.
By specifying a scalar value or DataFrame
as the second argument, other
, False
values are replaced with them.
print(df.where((df < 0) | (df > 20), 100))
# A B
# 0 -2 100
# 1 -1 100
# 2 100 100
# 3 100 30
# 4 100 40
print(df * 100 + 10)
# A B
# 0 -190 10
# 1 -90 1010
# 2 10 2010
# 3 110 3010
# 4 210 4010
print(df.where((df < 0) | (df > 20), df * 100 + 10))
# A B
# 0 -2 10
# 1 -1 1010
# 2 10 2010
# 3 110 30
# 4 210 40
Although examples are not provided here, it is possible to specify two-dimensional arrays for the first and second arguments, just as in where()
for Series
. Additionally, the inplace
argument can be used.
Process specific columns in DataFrame
For example, performing a comparison operation on a DataFrame
with a mix of numeric and string columns will result in an error.
df['C'] = ['A', 'B', 'C', 'D', 'E']
print(df)
# A B C
# 0 -2 0 A
# 1 -1 10 B
# 2 0 20 C
# 3 1 30 D
# 4 2 40 E
# print(df < 0)
# TypeError: '<' not supported between instances of 'str' and 'int'
Since DataFrame
columns are Series
, they can be processed individually. It is also possible to add them as new columns.
- pandas: Select rows/columns by index (numbers and names)
- pandas: Add rows/columns to DataFrame with assign(), insert()
print(df['C'].where(df['A'] < 0, 'X'))
# 0 A
# 1 B
# 2 X
# 3 X
# 4 X
# Name: C, dtype: object
df['D'] = df['C'].where(df['A'] < 0, 'X')
print(df)
# A B C D
# 0 -2 0 A A
# 1 -1 10 B B
# 2 0 20 C X
# 3 1 30 D X
# 4 2 40 E X
You can also extract and process only numeric columns and then concatenate them with non-numeric columns. Use the select_dtypes()
method and the pd.concat()
function.
- pandas: Select columns by dtype with select_dtypes()
- pandas: Sort DataFrame, Series with sort_values(), sort_index()
df_num = df.select_dtypes('number')
print(df_num.where(df_num > 0, -10))
# A B
# 0 -10 -10
# 1 -10 10
# 2 -10 20
# 3 1 30
# 4 2 40
print(df.select_dtypes(exclude='number'))
# C D
# 0 A A
# 1 B B
# 2 C X
# 3 D X
# 4 E X
print(pd.concat([df_num.where(df_num > 0, -10), df.select_dtypes(exclude='number')],
axis=1))
# A B C D
# 0 -10 -10 A A
# 1 -10 10 B B
# 2 -10 20 C X
# 3 1 30 D X
# 4 2 40 E X
mask()
replaces True
, keeps False
unchanged
The mask()
method is provided for both DataFrame
and Series
.
The mask()
method works inversely compared to where()
: it keeps values unchanged where the condition in the first argument is False
and replaces True
values with NaN
or a value specified in the second argument.
The usage of arguments is the same as in where()
. See the where()
section above for details.
Below are some examples.
s = pd.Series(['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen'])
print(s)
# 0 Alice
# 1 Bob
# 2 Charlie
# 3 Dave
# 4 Ellen
# dtype: object
print(s.mask(s.str.endswith('e')))
# 0 NaN
# 1 Bob
# 2 NaN
# 3 NaN
# 4 Ellen
# dtype: object
print(s.mask(s.str.endswith('e'), 'X'))
# 0 X
# 1 Bob
# 2 X
# 3 X
# 4 Ellen
# dtype: object
print(s.mask(s.str.endswith('e'), s.str.upper()))
# 0 ALICE
# 1 Bob
# 2 CHARLIE
# 3 DAVE
# 4 Ellen
# dtype: object
df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
# A B
# 0 -2 0
# 1 -1 10
# 2 0 20
# 3 1 30
# 4 2 40
print(df.mask((df < 0) | (df > 20)))
# A B
# 0 NaN 0.0
# 1 NaN 10.0
# 2 0.0 20.0
# 3 1.0 NaN
# 4 2.0 NaN
print(df.mask((df < 0) | (df > 20), 100))
# A B
# 0 100 0
# 1 100 10
# 2 0 20
# 3 1 100
# 4 2 100
print(df.mask((df < 0) | (df > 20), df * 100 + 10))
# A B
# 0 -190 0
# 1 -90 10
# 2 0 20
# 3 1 3010
# 4 2 4010
In the above examples for Series
, conditions check if strings end with certain characters. Exact matching and regular expression-based matching are also possible. See the following articles for details.
np.where()
replace both True
and False
The np.where()
function from NumPy can be used to replace values in DataFrame
or Series
according to conditions.
The where()
and mask()
methods in pandas allow for replacing either True
or False
values, but not both simultaneously. Untouched values retain their original state.
In contrast, NumPy's np.where()
function enables simultaneous replacement of both True
and False
values. Specify the condition as the first argument, the replacement for True
values as the second, and the replacement for False
values as the third. For these second and third arguments, scalars, arrays, Series
, or DataFrame
can be used.
s = pd.Series([-2, -1, 0, 1, 2])
print(s)
# 0 -2
# 1 -1
# 2 0
# 3 1
# 4 2
# dtype: int64
print(np.where(s < 0, -100, 1))
# [-100 -100 1 1 1]
print(np.where(s < 0, s * 10, s * 100 + 10))
# [-20 -10 10 110 210]
print(type(np.where(s < 0, -100, 1)))
# <class 'numpy.ndarray'>
df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
# A B
# 0 -2 0
# 1 -1 10
# 2 0 20
# 3 1 30
# 4 2 40
print(np.where((df < 0) | (df > 20), -100, 1))
# [[-100 1]
# [-100 1]
# [ 1 1]
# [ 1 -100]
# [ 1 -100]]
print(np.where((df < 0) | (df > 20), df * 10, df * 100 + 10))
# [[ -20 10]
# [ -10 1010]
# [ 10 2010]
# [ 110 300]
# [ 210 400]]
print(type(np.where((df < 0) | (df > 20), -100, 1)))
# <class 'numpy.ndarray'>
np.where()
returns a NumPy array (ndarray
). Using the index
and columns
attributes of the original DataFrame
or Series
, you can generate a new DataFrame
or Series
.
print(pd.Series(np.where(s < 0, -100, 1), index=s.index))
# 0 -100
# 1 -100
# 2 1
# 3 1
# 4 1
# dtype: int64
print(pd.DataFrame(np.where((df < 0) | (df > 20), -100, 1),
index=df.index, columns=df.columns))
# A B
# 0 -100 1
# 1 -100 1
# 2 1 1
# 3 1 -100
# 4 1 -100
When adding as a new column, the ndarray
can be specified directly on the right-hand side.
df['C'] = np.where(df['A'] < 0, -100, 1)
print(df)
# A B C
# 0 -2 0 -100
# 1 -1 10 -100
# 2 0 20 1
# 3 1 30 1
# 4 2 40 1
Assign values using Boolean indexing with loc
, iloc
Specifying a bool
type Series
or array in loc
or iloc
allows for extracting values at positions where True
.
df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
# A B
# 0 -2 0
# 1 -1 10
# 2 0 20
# 3 1 30
# 4 2 40
print(df.loc[df['A'] < 0, 'A'])
# 0 -2
# 1 -1
# Name: A, dtype: int64
loc
and iloc
can be used for assigning values. Scalars, Series
, or arrays can be specified on the right-hand side.
df.loc[df['A'] < 0, 'A'] = -10
print(df)
# A B
# 0 -10 0
# 1 -10 10
# 2 0 20
# 3 1 30
# 4 2 40
df.loc[df['A'] >= 0, 'A'] = df['B'] * 10
print(df)
# A B
# 0 -10 0
# 1 -10 10
# 2 200 20
# 3 300 30
# 4 400 40
Specifying a new column name adds a new column. Values that do not meet the condition become missing values NaN
. Note that the data type (dtype
) of a column containing NaN
becomes float
.
df.loc[df['A'] < 0, 'C'] = -100
print(df)
# A B C
# 0 -10 0 -100.0
# 1 -10 10 -100.0
# 2 200 20 NaN
# 3 300 30 NaN
# 4 400 40 NaN