pandas: Replace values based on conditions with where(), mask()

Posted: 2024-01-17 | Tags: Python, pandas

This article explains how to replace values based on conditions in pandas. You can perform conditional operations like if then ... or if then ... else ... on DataFrame or Series.

Use the where() method to replace values where the condition is False, and the mask() method where it is True. For replacing both True and False, use NumPy'snp.where() function.

Additionally, you can use Boolean indexing with loc or iloc to assign values based on conditions.

Contents

where() replaces False, keeps True unchanged
- where() in Series
- where() in DataFrame
mask() replaces True, keeps False unchanged
np.where() replace both True and False
Assign values using Boolean indexing with loc, iloc

For information on replacing specific values or replacing and deleting missing values NaN, see the following articles.

The pandas and NumPy versions used in this article are as follows. Note that functionality may vary between versions.

import pandas as pd
import numpy as np

print(pd.__version__)
# 2.1.4

print(np.__version__)
# 1.26.2

source: pandas_where_mask.py

`where()` replaces `False`, keeps `True` unchanged

The where() method is provided for both DataFrame and Series.

`where()` in `Series`

Consider the following Series.

s = pd.Series([-2, -1, 0, 1, 2])
print(s)
# 0   -2
# 1   -1
# 2    0
# 3    1
# 4    2
# dtype: int64

source: pandas_where_mask.py

First argument, `cond`

When you specify a bool type Series as the first argument, cond, in where(), True values remain unchanged, while False values become NaN.

For example, a comparison operation on a Series returns a bool type Series, which can be specified as the first argument.

print(s < 0)
# 0     True
# 1     True
# 2    False
# 3    False
# 4    False
# dtype: bool

print(s.where(s < 0))
# 0   -2.0
# 1   -1.0
# 2    NaN
# 3    NaN
# 4    NaN
# dtype: float64

source: pandas_where_mask.py

You can specify not only comparison operations but also conditions on strings. See the examples with mask() in the next section.

Array-like objects such as a list or NumPy array (ndarray) containing bool values can also be specified.

print(s.where([False, True, False, True, False]))
# 0    NaN
# 1   -1.0
# 2    NaN
# 3    1.0
# 4    NaN
# dtype: float64

source: pandas_where_mask.py

Second argument, `other`

By specifying a scalar as the second argument, other, False values are replaced with it.

print(s.where(s < 0, 10))
# 0    -2
# 1    -1
# 2    10
# 3    10
# 4    10
# dtype: int64

source: pandas_where_mask.py

Array-like objects like a list or NumPy array (ndarray) can also be specified. False values are replaced with the corresponding elements in these objects.

print(s.where(s < 0, [0, 10, 20, 30, 40]))
# 0    -2
# 1    -1
# 2    20
# 3    30
# 4    40
# dtype: int64

source: pandas_where_mask.py

A Series can also be specified. The replacement occurs for values matching in index (label), not position.

s2 = pd.Series([0, 10, 20, 30, 40], index=[4, 3, 2, 1, 0])
print(s2)
# 4     0
# 3    10
# 2    20
# 1    30
# 0    40
# dtype: int64

print(s.where(s < 0, s2))
# 0    -2
# 1    -1
# 2    20
# 3    10
# 4     0
# dtype: int64

source: pandas_where_mask.py

It is also possible to apply a process to the original Series for replacement.

print(s * 100 + 10)
# 0   -190
# 1    -90
# 2     10
# 3    110
# 4    210
# dtype: int64

print(s.where(s < 0, s * 100 + 10))
# 0     -2
# 1     -1
# 2     10
# 3    110
# 4    210
# dtype: int64

source: pandas_where_mask.py

`inplace`

By default, a new object is returned without changing the original object. If the argument inplace is set to True, the original object is updated.

s.where(s < 0, 10, inplace=True)
print(s)
# 0    -2
# 1    -1
# 2    10
# 3    10
# 4    10
# dtype: int64

source: pandas_where_mask.py

`where()` in `DataFrame`

Consider the following DataFrame.

df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
#    A   B
# 0 -2   0
# 1 -1  10
# 2  0  20
# 3  1  30
# 4  2  40

source: pandas_where_mask.py

Basic usage

The basic usage of where() in DataFrame is the same as in Series.

When you specify a bool type DataFrame as the first argument, cond, in where(), True values remain unchanged, while False values become NaN.

print((df < 0) | (df > 20))
#        A      B
# 0   True  False
# 1   True  False
# 2  False  False
# 3  False   True
# 4  False   True

print(df.where((df < 0) | (df > 20)))
#      A     B
# 0 -2.0   NaN
# 1 -1.0   NaN
# 2  NaN   NaN
# 3  NaN  30.0
# 4  NaN  40.0

source: pandas_where_mask.py

In the above example, multiple conditions are combined using | (OR). Note that using and, or instead of &, |, or omitting parentheses will result in an error.

How to fix "ValueError: The truth value ... is ambiguous" in NumPy, pandas

By specifying a scalar value or DataFrame as the second argument, other, False values are replaced with them.

print(df.where((df < 0) | (df > 20), 100))
#      A    B
# 0   -2  100
# 1   -1  100
# 2  100  100
# 3  100   30
# 4  100   40

print(df * 100 + 10)
#      A     B
# 0 -190    10
# 1  -90  1010
# 2   10  2010
# 3  110  3010
# 4  210  4010

print(df.where((df < 0) | (df > 20), df * 100 + 10))
#      A     B
# 0   -2    10
# 1   -1  1010
# 2   10  2010
# 3  110    30
# 4  210    40

source: pandas_where_mask.py

Although examples are not provided here, it is possible to specify two-dimensional arrays for the first and second arguments, just as in where() for Series. Additionally, the inplace argument can be used.

Process specific columns in `DataFrame`

For example, performing a comparison operation on a DataFrame with a mix of numeric and string columns will result in an error.

df['C'] = ['A', 'B', 'C', 'D', 'E']
print(df)
#    A   B  C
# 0 -2   0  A
# 1 -1  10  B
# 2  0  20  C
# 3  1  30  D
# 4  2  40  E

# print(df < 0)
# TypeError: '<' not supported between instances of 'str' and 'int'

source: pandas_where_mask.py

Since DataFrame columns are Series, they can be processed individually. It is also possible to add them as new columns.

print(df['C'].where(df['A'] < 0, 'X'))
# 0    A
# 1    B
# 2    X
# 3    X
# 4    X
# Name: C, dtype: object

df['D'] = df['C'].where(df['A'] < 0, 'X')
print(df)
#    A   B  C  D
# 0 -2   0  A  A
# 1 -1  10  B  B
# 2  0  20  C  X
# 3  1  30  D  X
# 4  2  40  E  X

source: pandas_where_mask.py

You can also extract and process only numeric columns and then concatenate them with non-numeric columns. Use the select_dtypes() method and the pd.concat() function.

df_num = df.select_dtypes('number')
print(df_num.where(df_num > 0, -10))
#     A   B
# 0 -10 -10
# 1 -10  10
# 2 -10  20
# 3   1  30
# 4   2  40

print(df.select_dtypes(exclude='number'))
#    C  D
# 0  A  A
# 1  B  B
# 2  C  X
# 3  D  X
# 4  E  X

print(pd.concat([df_num.where(df_num > 0, -10), df.select_dtypes(exclude='number')],
                axis=1))
#     A   B  C  D
# 0 -10 -10  A  A
# 1 -10  10  B  B
# 2 -10  20  C  X
# 3   1  30  D  X
# 4   2  40  E  X

source: pandas_where_mask.py

`mask()` replaces `True`, keeps `False` unchanged

The mask() method is provided for both DataFrame and Series.

The mask() method works inversely compared to where(): it keeps values unchanged where the condition in the first argument is False and replaces True values with NaN or a value specified in the second argument.

The usage of arguments is the same as in where(). See the where() section above for details.

Below are some examples.

s = pd.Series(['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen'])
print(s)
# 0      Alice
# 1        Bob
# 2    Charlie
# 3       Dave
# 4      Ellen
# dtype: object

print(s.mask(s.str.endswith('e')))
# 0      NaN
# 1      Bob
# 2      NaN
# 3      NaN
# 4    Ellen
# dtype: object

print(s.mask(s.str.endswith('e'), 'X'))
# 0        X
# 1      Bob
# 2        X
# 3        X
# 4    Ellen
# dtype: object

print(s.mask(s.str.endswith('e'), s.str.upper()))
# 0      ALICE
# 1        Bob
# 2    CHARLIE
# 3       DAVE
# 4      Ellen
# dtype: object

source: pandas_where_mask.py

df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
#    A   B
# 0 -2   0
# 1 -1  10
# 2  0  20
# 3  1  30
# 4  2  40

print(df.mask((df < 0) | (df > 20)))
#      A     B
# 0  NaN   0.0
# 1  NaN  10.0
# 2  0.0  20.0
# 3  1.0   NaN
# 4  2.0   NaN

print(df.mask((df < 0) | (df > 20), 100))
#      A    B
# 0  100    0
# 1  100   10
# 2    0   20
# 3    1  100
# 4    2  100

print(df.mask((df < 0) | (df > 20), df * 100 + 10))
#      A     B
# 0 -190     0
# 1  -90    10
# 2    0    20
# 3    1  3010
# 4    2  4010

source: pandas_where_mask.py

In the above examples for Series, conditions check if strings end with certain characters. Exact matching and regular expression-based matching are also possible. See the following articles for details.

pandas: Extract rows that contain specific strings from a DataFrame

`np.where()` replace both `True` and `False`

The np.where() function from NumPy can be used to replace values in DataFrame or Series according to conditions.

numpy.where(): Manipulate elements depending on conditions

The where() and mask() methods in pandas allow for replacing either True or False values, but not both simultaneously. Untouched values retain their original state.

In contrast, NumPy's np.where() function enables simultaneous replacement of both True and False values. Specify the condition as the first argument, the replacement for True values as the second, and the replacement for False values as the third. For these second and third arguments, scalars, arrays, Series, or DataFrame can be used.

s = pd.Series([-2, -1, 0, 1, 2])
print(s)
# 0   -2
# 1   -1
# 2    0
# 3    1
# 4    2
# dtype: int64

print(np.where(s < 0, -100, 1))
# [-100 -100    1    1    1]

print(np.where(s < 0, s * 10, s * 100 + 10))
# [-20 -10  10 110 210]

print(type(np.where(s < 0, -100, 1)))
# <class 'numpy.ndarray'>

source: pandas_where_mask.py

df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
#    A   B
# 0 -2   0
# 1 -1  10
# 2  0  20
# 3  1  30
# 4  2  40

print(np.where((df < 0) | (df > 20), -100, 1))
# [[-100    1]
#  [-100    1]
#  [   1    1]
#  [   1 -100]
#  [   1 -100]]

print(np.where((df < 0) | (df > 20), df * 10, df * 100 + 10))
# [[ -20   10]
#  [ -10 1010]
#  [  10 2010]
#  [ 110  300]
#  [ 210  400]]

print(type(np.where((df < 0) | (df > 20), -100, 1)))
# <class 'numpy.ndarray'>

source: pandas_where_mask.py

np.where() returns a NumPy array (ndarray). Using the index and columns attributes of the original DataFrame or Series, you can generate a new DataFrame or Series.

print(pd.Series(np.where(s < 0, -100, 1), index=s.index))
# 0   -100
# 1   -100
# 2      1
# 3      1
# 4      1
# dtype: int64

source: pandas_where_mask.py

print(pd.DataFrame(np.where((df < 0) | (df > 20), -100, 1),
                   index=df.index, columns=df.columns))
#      A    B
# 0 -100    1
# 1 -100    1
# 2    1    1
# 3    1 -100
# 4    1 -100

source: pandas_where_mask.py

When adding as a new column, the ndarray can be specified directly on the right-hand side.

df['C'] = np.where(df['A'] < 0, -100, 1)
print(df)
#    A   B    C
# 0 -2   0 -100
# 1 -1  10 -100
# 2  0  20    1
# 3  1  30    1
# 4  2  40    1

source: pandas_where_mask.py

Assign values using Boolean indexing with `loc`, `iloc`

Specifying a bool type Series or array in loc or iloc allows for extracting values at positions where True.

pandas: Get/Set values with loc, iloc, at, iat

df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
#    A   B
# 0 -2   0
# 1 -1  10
# 2  0  20
# 3  1  30
# 4  2  40

print(df.loc[df['A'] < 0, 'A'])
# 0   -2
# 1   -1
# Name: A, dtype: int64

source: pandas_where_mask.py

loc and iloc can be used for assigning values. Scalars, Series, or arrays can be specified on the right-hand side.

df.loc[df['A'] < 0, 'A'] = -10
print(df)
#     A   B
# 0 -10   0
# 1 -10  10
# 2   0  20
# 3   1  30
# 4   2  40

df.loc[df['A'] >= 0, 'A'] = df['B'] * 10
print(df)
#      A   B
# 0  -10   0
# 1  -10  10
# 2  200  20
# 3  300  30
# 4  400  40

source: pandas_where_mask.py

Specifying a new column name adds a new column. Values that do not meet the condition become missing values NaN. Note that the data type (dtype) of a column containing NaN becomes float.

Missing values in pandas (nan, None, pd.NA)

df.loc[df['A'] < 0, 'C'] = -100
print(df)
#      A   B      C
# 0  -10   0 -100.0
# 1  -10  10 -100.0
# 2  200  20    NaN
# 3  300  30    NaN
# 4  400  40    NaN

source: pandas_where_mask.py

Related Categories

Related Articles