pandas: Replace values based on conditions with where(), mask()
This article explains how to replace values based on conditions in pandas. You can perform conditional operations like if then ... or if then ... else ... on DataFrame or Series.
Use the where() method to replace values where the condition is False, and the mask() method where it is True. For replacing both True and False, use NumPy'snp.where() function.
Additionally, you can use Boolean indexing with loc or iloc to assign values based on conditions.
For information on replacing specific values or replacing and deleting missing values NaN, see the following articles.
- pandas: Replace values in DataFrame and Series with replace()
- pandas: Replace NaN (missing values) with fillna()
- pandas: Remove NaN (missing values) with dropna()
The pandas and NumPy versions used in this article are as follows. Note that functionality may vary between versions.
import pandas as pd
import numpy as np
print(pd.__version__)
# 2.1.4
print(np.__version__)
# 1.26.2
where() replaces False, keeps True unchanged
The where() method is provided for both DataFrame and Series.
- pandas.DataFrame.where — pandas 2.1.4 documentation
- pandas.Series.where — pandas 2.1.4 documentation
where() in Series
Consider the following Series.
s = pd.Series([-2, -1, 0, 1, 2])
print(s)
# 0 -2
# 1 -1
# 2 0
# 3 1
# 4 2
# dtype: int64
First argument, cond
When you specify a bool type Series as the first argument, cond, in where(), True values remain unchanged, while False values become NaN.
For example, a comparison operation on a Series returns a bool type Series, which can be specified as the first argument.
print(s < 0)
# 0 True
# 1 True
# 2 False
# 3 False
# 4 False
# dtype: bool
print(s.where(s < 0))
# 0 -2.0
# 1 -1.0
# 2 NaN
# 3 NaN
# 4 NaN
# dtype: float64
You can specify not only comparison operations but also conditions on strings. See the examples with mask() in the next section.
Array-like objects such as a list or NumPy array (ndarray) containing bool values can also be specified.
print(s.where([False, True, False, True, False]))
# 0 NaN
# 1 -1.0
# 2 NaN
# 3 1.0
# 4 NaN
# dtype: float64
Second argument, other
By specifying a scalar as the second argument, other, False values are replaced with it.
print(s.where(s < 0, 10))
# 0 -2
# 1 -1
# 2 10
# 3 10
# 4 10
# dtype: int64
Array-like objects like a list or NumPy array (ndarray) can also be specified. False values are replaced with the corresponding elements in these objects.
print(s.where(s < 0, [0, 10, 20, 30, 40]))
# 0 -2
# 1 -1
# 2 20
# 3 30
# 4 40
# dtype: int64
A Series can also be specified. The replacement occurs for values matching in index (label), not position.
s2 = pd.Series([0, 10, 20, 30, 40], index=[4, 3, 2, 1, 0])
print(s2)
# 4 0
# 3 10
# 2 20
# 1 30
# 0 40
# dtype: int64
print(s.where(s < 0, s2))
# 0 -2
# 1 -1
# 2 20
# 3 10
# 4 0
# dtype: int64
It is also possible to apply a process to the original Series for replacement.
print(s * 100 + 10)
# 0 -190
# 1 -90
# 2 10
# 3 110
# 4 210
# dtype: int64
print(s.where(s < 0, s * 100 + 10))
# 0 -2
# 1 -1
# 2 10
# 3 110
# 4 210
# dtype: int64
inplace
By default, a new object is returned without changing the original object. If the argument inplace is set to True, the original object is updated.
s.where(s < 0, 10, inplace=True)
print(s)
# 0 -2
# 1 -1
# 2 10
# 3 10
# 4 10
# dtype: int64
where() in DataFrame
Consider the following DataFrame.
df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
# A B
# 0 -2 0
# 1 -1 10
# 2 0 20
# 3 1 30
# 4 2 40
Basic usage
The basic usage of where() in DataFrame is the same as in Series.
When you specify a bool type DataFrame as the first argument, cond, in where(), True values remain unchanged, while False values become NaN.
print((df < 0) | (df > 20))
# A B
# 0 True False
# 1 True False
# 2 False False
# 3 False True
# 4 False True
print(df.where((df < 0) | (df > 20)))
# A B
# 0 -2.0 NaN
# 1 -1.0 NaN
# 2 NaN NaN
# 3 NaN 30.0
# 4 NaN 40.0
In the above example, multiple conditions are combined using | (OR). Note that using and, or instead of &, |, or omitting parentheses will result in an error.
By specifying a scalar value or DataFrame as the second argument, other, False values are replaced with them.
print(df.where((df < 0) | (df > 20), 100))
# A B
# 0 -2 100
# 1 -1 100
# 2 100 100
# 3 100 30
# 4 100 40
print(df * 100 + 10)
# A B
# 0 -190 10
# 1 -90 1010
# 2 10 2010
# 3 110 3010
# 4 210 4010
print(df.where((df < 0) | (df > 20), df * 100 + 10))
# A B
# 0 -2 10
# 1 -1 1010
# 2 10 2010
# 3 110 30
# 4 210 40
Although examples are not provided here, it is possible to specify two-dimensional arrays for the first and second arguments, just as in where() for Series. Additionally, the inplace argument can be used.
Process specific columns in DataFrame
For example, performing a comparison operation on a DataFrame with a mix of numeric and string columns will result in an error.
df['C'] = ['A', 'B', 'C', 'D', 'E']
print(df)
# A B C
# 0 -2 0 A
# 1 -1 10 B
# 2 0 20 C
# 3 1 30 D
# 4 2 40 E
# print(df < 0)
# TypeError: '<' not supported between instances of 'str' and 'int'
Since DataFrame columns are Series, they can be processed individually. It is also possible to add them as new columns.
- pandas: Select rows/columns by index (numbers and names)
- pandas: Add rows/columns to DataFrame with assign(), insert()
print(df['C'].where(df['A'] < 0, 'X'))
# 0 A
# 1 B
# 2 X
# 3 X
# 4 X
# Name: C, dtype: object
df['D'] = df['C'].where(df['A'] < 0, 'X')
print(df)
# A B C D
# 0 -2 0 A A
# 1 -1 10 B B
# 2 0 20 C X
# 3 1 30 D X
# 4 2 40 E X
You can also extract and process only numeric columns and then concatenate them with non-numeric columns. Use the select_dtypes() method and the pd.concat() function.
- pandas: Select columns by dtype with select_dtypes()
- pandas: Sort DataFrame, Series with sort_values(), sort_index()
df_num = df.select_dtypes('number')
print(df_num.where(df_num > 0, -10))
# A B
# 0 -10 -10
# 1 -10 10
# 2 -10 20
# 3 1 30
# 4 2 40
print(df.select_dtypes(exclude='number'))
# C D
# 0 A A
# 1 B B
# 2 C X
# 3 D X
# 4 E X
print(pd.concat([df_num.where(df_num > 0, -10), df.select_dtypes(exclude='number')],
axis=1))
# A B C D
# 0 -10 -10 A A
# 1 -10 10 B B
# 2 -10 20 C X
# 3 1 30 D X
# 4 2 40 E X
mask() replaces True, keeps False unchanged
The mask() method is provided for both DataFrame and Series.
The mask() method works inversely compared to where(): it keeps values unchanged where the condition in the first argument is False and replaces True values with NaN or a value specified in the second argument.
The usage of arguments is the same as in where(). See the where() section above for details.
Below are some examples.
s = pd.Series(['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen'])
print(s)
# 0 Alice
# 1 Bob
# 2 Charlie
# 3 Dave
# 4 Ellen
# dtype: object
print(s.mask(s.str.endswith('e')))
# 0 NaN
# 1 Bob
# 2 NaN
# 3 NaN
# 4 Ellen
# dtype: object
print(s.mask(s.str.endswith('e'), 'X'))
# 0 X
# 1 Bob
# 2 X
# 3 X
# 4 Ellen
# dtype: object
print(s.mask(s.str.endswith('e'), s.str.upper()))
# 0 ALICE
# 1 Bob
# 2 CHARLIE
# 3 DAVE
# 4 Ellen
# dtype: object
df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
# A B
# 0 -2 0
# 1 -1 10
# 2 0 20
# 3 1 30
# 4 2 40
print(df.mask((df < 0) | (df > 20)))
# A B
# 0 NaN 0.0
# 1 NaN 10.0
# 2 0.0 20.0
# 3 1.0 NaN
# 4 2.0 NaN
print(df.mask((df < 0) | (df > 20), 100))
# A B
# 0 100 0
# 1 100 10
# 2 0 20
# 3 1 100
# 4 2 100
print(df.mask((df < 0) | (df > 20), df * 100 + 10))
# A B
# 0 -190 0
# 1 -90 10
# 2 0 20
# 3 1 3010
# 4 2 4010
In the above examples for Series, conditions check if strings end with certain characters. Exact matching and regular expression-based matching are also possible. See the following articles for details.
np.where() replace both True and False
The np.where() function from NumPy can be used to replace values in DataFrame or Series according to conditions.
The where() and mask() methods in pandas allow for replacing either True or False values, but not both simultaneously. Untouched values retain their original state.
In contrast, NumPy's np.where() function enables simultaneous replacement of both True and False values. Specify the condition as the first argument, the replacement for True values as the second, and the replacement for False values as the third. For these second and third arguments, scalars, arrays, Series, or DataFrame can be used.
s = pd.Series([-2, -1, 0, 1, 2])
print(s)
# 0 -2
# 1 -1
# 2 0
# 3 1
# 4 2
# dtype: int64
print(np.where(s < 0, -100, 1))
# [-100 -100 1 1 1]
print(np.where(s < 0, s * 10, s * 100 + 10))
# [-20 -10 10 110 210]
print(type(np.where(s < 0, -100, 1)))
# <class 'numpy.ndarray'>
df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
# A B
# 0 -2 0
# 1 -1 10
# 2 0 20
# 3 1 30
# 4 2 40
print(np.where((df < 0) | (df > 20), -100, 1))
# [[-100 1]
# [-100 1]
# [ 1 1]
# [ 1 -100]
# [ 1 -100]]
print(np.where((df < 0) | (df > 20), df * 10, df * 100 + 10))
# [[ -20 10]
# [ -10 1010]
# [ 10 2010]
# [ 110 300]
# [ 210 400]]
print(type(np.where((df < 0) | (df > 20), -100, 1)))
# <class 'numpy.ndarray'>
np.where() returns a NumPy array (ndarray). Using the index and columns attributes of the original DataFrame or Series, you can generate a new DataFrame or Series.
print(pd.Series(np.where(s < 0, -100, 1), index=s.index))
# 0 -100
# 1 -100
# 2 1
# 3 1
# 4 1
# dtype: int64
print(pd.DataFrame(np.where((df < 0) | (df > 20), -100, 1),
index=df.index, columns=df.columns))
# A B
# 0 -100 1
# 1 -100 1
# 2 1 1
# 3 1 -100
# 4 1 -100
When adding as a new column, the ndarray can be specified directly on the right-hand side.
df['C'] = np.where(df['A'] < 0, -100, 1)
print(df)
# A B C
# 0 -2 0 -100
# 1 -1 10 -100
# 2 0 20 1
# 3 1 30 1
# 4 2 40 1
Assign values using Boolean indexing with loc, iloc
Specifying a bool type Series or array in loc or iloc allows for extracting values at positions where True.
df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
# A B
# 0 -2 0
# 1 -1 10
# 2 0 20
# 3 1 30
# 4 2 40
print(df.loc[df['A'] < 0, 'A'])
# 0 -2
# 1 -1
# Name: A, dtype: int64
loc and iloc can be used for assigning values. Scalars, Series, or arrays can be specified on the right-hand side.
df.loc[df['A'] < 0, 'A'] = -10
print(df)
# A B
# 0 -10 0
# 1 -10 10
# 2 0 20
# 3 1 30
# 4 2 40
df.loc[df['A'] >= 0, 'A'] = df['B'] * 10
print(df)
# A B
# 0 -10 0
# 1 -10 10
# 2 200 20
# 3 300 30
# 4 400 40
Specifying a new column name adds a new column. Values that do not meet the condition become missing values NaN. Note that the data type (dtype) of a column containing NaN becomes float.
df.loc[df['A'] < 0, 'C'] = -100
print(df)
# A B C
# 0 -10 0 -100.0
# 1 -10 10 -100.0
# 2 200 20 NaN
# 3 300 30 NaN
# 4 400 40 NaN