pandas: Replace values based on conditions with where(), mask()

Posted: | Tags: Python, pandas

This article explains how to replace values based on conditions in pandas. You can perform conditional operations like if then ... or if then ... else ... on DataFrame or Series.

Use the where() method to replace values where the condition is False, and the mask() method where it is True. For replacing both True and False, use NumPy'snp.where() function.

Additionally, you can use Boolean indexing with loc or iloc to assign values based on conditions.

For information on replacing specific values or replacing and deleting missing values NaN, see the following articles.

The pandas and NumPy versions used in this article are as follows. Note that functionality may vary between versions.

import pandas as pd
import numpy as np

print(pd.__version__)
# 2.1.4

print(np.__version__)
# 1.26.2

where() replaces False, keeps True unchanged

The where() method is provided for both DataFrame and Series.

where() in Series

Consider the following Series.

s = pd.Series([-2, -1, 0, 1, 2])
print(s)
# 0   -2
# 1   -1
# 2    0
# 3    1
# 4    2
# dtype: int64

First argument, cond

When you specify a bool type Series as the first argument, cond, in where(), True values remain unchanged, while False values become NaN.

For example, a comparison operation on a Series returns a bool type Series, which can be specified as the first argument.

print(s < 0)
# 0     True
# 1     True
# 2    False
# 3    False
# 4    False
# dtype: bool

print(s.where(s < 0))
# 0   -2.0
# 1   -1.0
# 2    NaN
# 3    NaN
# 4    NaN
# dtype: float64

You can specify not only comparison operations but also conditions on strings. See the examples with mask() in the next section.

Array-like objects such as a list or NumPy array (ndarray) containing bool values can also be specified.

print(s.where([False, True, False, True, False]))
# 0    NaN
# 1   -1.0
# 2    NaN
# 3    1.0
# 4    NaN
# dtype: float64

Second argument, other

By specifying a scalar as the second argument, other, False values are replaced with it.

print(s.where(s < 0, 10))
# 0    -2
# 1    -1
# 2    10
# 3    10
# 4    10
# dtype: int64

Array-like objects like a list or NumPy array (ndarray) can also be specified. False values are replaced with the corresponding elements in these objects.

print(s.where(s < 0, [0, 10, 20, 30, 40]))
# 0    -2
# 1    -1
# 2    20
# 3    30
# 4    40
# dtype: int64

A Series can also be specified. The replacement occurs for values matching in index (label), not position.

s2 = pd.Series([0, 10, 20, 30, 40], index=[4, 3, 2, 1, 0])
print(s2)
# 4     0
# 3    10
# 2    20
# 1    30
# 0    40
# dtype: int64

print(s.where(s < 0, s2))
# 0    -2
# 1    -1
# 2    20
# 3    10
# 4     0
# dtype: int64

It is also possible to apply a process to the original Series for replacement.

print(s * 100 + 10)
# 0   -190
# 1    -90
# 2     10
# 3    110
# 4    210
# dtype: int64

print(s.where(s < 0, s * 100 + 10))
# 0     -2
# 1     -1
# 2     10
# 3    110
# 4    210
# dtype: int64

inplace

By default, a new object is returned without changing the original object. If the argument inplace is set to True, the original object is updated.

s.where(s < 0, 10, inplace=True)
print(s)
# 0    -2
# 1    -1
# 2    10
# 3    10
# 4    10
# dtype: int64

where() in DataFrame

Consider the following DataFrame.

df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
#    A   B
# 0 -2   0
# 1 -1  10
# 2  0  20
# 3  1  30
# 4  2  40

Basic usage

The basic usage of where() in DataFrame is the same as in Series.

When you specify a bool type DataFrame as the first argument, cond, in where(), True values remain unchanged, while False values become NaN.

print((df < 0) | (df > 20))
#        A      B
# 0   True  False
# 1   True  False
# 2  False  False
# 3  False   True
# 4  False   True

print(df.where((df < 0) | (df > 20)))
#      A     B
# 0 -2.0   NaN
# 1 -1.0   NaN
# 2  NaN   NaN
# 3  NaN  30.0
# 4  NaN  40.0

In the above example, multiple conditions are combined using | (OR). Note that using and, or instead of &, |, or omitting parentheses will result in an error.

By specifying a scalar value or DataFrame as the second argument, other, False values are replaced with them.

print(df.where((df < 0) | (df > 20), 100))
#      A    B
# 0   -2  100
# 1   -1  100
# 2  100  100
# 3  100   30
# 4  100   40

print(df * 100 + 10)
#      A     B
# 0 -190    10
# 1  -90  1010
# 2   10  2010
# 3  110  3010
# 4  210  4010

print(df.where((df < 0) | (df > 20), df * 100 + 10))
#      A     B
# 0   -2    10
# 1   -1  1010
# 2   10  2010
# 3  110    30
# 4  210    40

Although examples are not provided here, it is possible to specify two-dimensional arrays for the first and second arguments, just as in where() for Series. Additionally, the inplace argument can be used.

Process specific columns in DataFrame

For example, performing a comparison operation on a DataFrame with a mix of numeric and string columns will result in an error.

df['C'] = ['A', 'B', 'C', 'D', 'E']
print(df)
#    A   B  C
# 0 -2   0  A
# 1 -1  10  B
# 2  0  20  C
# 3  1  30  D
# 4  2  40  E

# print(df < 0)
# TypeError: '<' not supported between instances of 'str' and 'int'

Since DataFrame columns are Series, they can be processed individually. It is also possible to add them as new columns.

print(df['C'].where(df['A'] < 0, 'X'))
# 0    A
# 1    B
# 2    X
# 3    X
# 4    X
# Name: C, dtype: object

df['D'] = df['C'].where(df['A'] < 0, 'X')
print(df)
#    A   B  C  D
# 0 -2   0  A  A
# 1 -1  10  B  B
# 2  0  20  C  X
# 3  1  30  D  X
# 4  2  40  E  X

You can also extract and process only numeric columns and then concatenate them with non-numeric columns. Use the select_dtypes() method and the pd.concat() function.

df_num = df.select_dtypes('number')
print(df_num.where(df_num > 0, -10))
#     A   B
# 0 -10 -10
# 1 -10  10
# 2 -10  20
# 3   1  30
# 4   2  40

print(df.select_dtypes(exclude='number'))
#    C  D
# 0  A  A
# 1  B  B
# 2  C  X
# 3  D  X
# 4  E  X

print(pd.concat([df_num.where(df_num > 0, -10), df.select_dtypes(exclude='number')],
                axis=1))
#     A   B  C  D
# 0 -10 -10  A  A
# 1 -10  10  B  B
# 2 -10  20  C  X
# 3   1  30  D  X
# 4   2  40  E  X

mask() replaces True, keeps False unchanged

The mask() method is provided for both DataFrame and Series.

The mask() method works inversely compared to where(): it keeps values unchanged where the condition in the first argument is False and replaces True values with NaN or a value specified in the second argument.

The usage of arguments is the same as in where(). See the where() section above for details.

Below are some examples.

s = pd.Series(['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen'])
print(s)
# 0      Alice
# 1        Bob
# 2    Charlie
# 3       Dave
# 4      Ellen
# dtype: object

print(s.mask(s.str.endswith('e')))
# 0      NaN
# 1      Bob
# 2      NaN
# 3      NaN
# 4    Ellen
# dtype: object

print(s.mask(s.str.endswith('e'), 'X'))
# 0        X
# 1      Bob
# 2        X
# 3        X
# 4    Ellen
# dtype: object

print(s.mask(s.str.endswith('e'), s.str.upper()))
# 0      ALICE
# 1        Bob
# 2    CHARLIE
# 3       DAVE
# 4      Ellen
# dtype: object
df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
#    A   B
# 0 -2   0
# 1 -1  10
# 2  0  20
# 3  1  30
# 4  2  40

print(df.mask((df < 0) | (df > 20)))
#      A     B
# 0  NaN   0.0
# 1  NaN  10.0
# 2  0.0  20.0
# 3  1.0   NaN
# 4  2.0   NaN

print(df.mask((df < 0) | (df > 20), 100))
#      A    B
# 0  100    0
# 1  100   10
# 2    0   20
# 3    1  100
# 4    2  100

print(df.mask((df < 0) | (df > 20), df * 100 + 10))
#      A     B
# 0 -190     0
# 1  -90    10
# 2    0    20
# 3    1  3010
# 4    2  4010

In the above examples for Series, conditions check if strings end with certain characters. Exact matching and regular expression-based matching are also possible. See the following articles for details.

np.where() replace both True and False

The np.where() function from NumPy can be used to replace values in DataFrame or Series according to conditions.

The where() and mask() methods in pandas allow for replacing either True or False values, but not both simultaneously. Untouched values retain their original state.

In contrast, NumPy's np.where() function enables simultaneous replacement of both True and False values. Specify the condition as the first argument, the replacement for True values as the second, and the replacement for False values as the third. For these second and third arguments, scalars, arrays, Series, or DataFrame can be used.

s = pd.Series([-2, -1, 0, 1, 2])
print(s)
# 0   -2
# 1   -1
# 2    0
# 3    1
# 4    2
# dtype: int64

print(np.where(s < 0, -100, 1))
# [-100 -100    1    1    1]

print(np.where(s < 0, s * 10, s * 100 + 10))
# [-20 -10  10 110 210]

print(type(np.where(s < 0, -100, 1)))
# <class 'numpy.ndarray'>
df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
#    A   B
# 0 -2   0
# 1 -1  10
# 2  0  20
# 3  1  30
# 4  2  40

print(np.where((df < 0) | (df > 20), -100, 1))
# [[-100    1]
#  [-100    1]
#  [   1    1]
#  [   1 -100]
#  [   1 -100]]

print(np.where((df < 0) | (df > 20), df * 10, df * 100 + 10))
# [[ -20   10]
#  [ -10 1010]
#  [  10 2010]
#  [ 110  300]
#  [ 210  400]]

print(type(np.where((df < 0) | (df > 20), -100, 1)))
# <class 'numpy.ndarray'>

np.where() returns a NumPy array (ndarray). Using the index and columns attributes of the original DataFrame or Series, you can generate a new DataFrame or Series.

print(pd.Series(np.where(s < 0, -100, 1), index=s.index))
# 0   -100
# 1   -100
# 2      1
# 3      1
# 4      1
# dtype: int64
print(pd.DataFrame(np.where((df < 0) | (df > 20), -100, 1),
                   index=df.index, columns=df.columns))
#      A    B
# 0 -100    1
# 1 -100    1
# 2    1    1
# 3    1 -100
# 4    1 -100

When adding as a new column, the ndarray can be specified directly on the right-hand side.

df['C'] = np.where(df['A'] < 0, -100, 1)
print(df)
#    A   B    C
# 0 -2   0 -100
# 1 -1  10 -100
# 2  0  20    1
# 3  1  30    1
# 4  2  40    1

Assign values using Boolean indexing with loc, iloc

Specifying a bool type Series or array in loc or iloc allows for extracting values at positions where True.

df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
#    A   B
# 0 -2   0
# 1 -1  10
# 2  0  20
# 3  1  30
# 4  2  40

print(df.loc[df['A'] < 0, 'A'])
# 0   -2
# 1   -1
# Name: A, dtype: int64

loc and iloc can be used for assigning values. Scalars, Series, or arrays can be specified on the right-hand side.

df.loc[df['A'] < 0, 'A'] = -10
print(df)
#     A   B
# 0 -10   0
# 1 -10  10
# 2   0  20
# 3   1  30
# 4   2  40

df.loc[df['A'] >= 0, 'A'] = df['B'] * 10
print(df)
#      A   B
# 0  -10   0
# 1  -10  10
# 2  200  20
# 3  300  30
# 4  400  40

Specifying a new column name adds a new column. Values that do not meet the condition become missing values NaN. Note that the data type (dtype) of a column containing NaN becomes float.

df.loc[df['A'] < 0, 'C'] = -100
print(df)
#      A   B      C
# 0  -10   0 -100.0
# 1  -10  10 -100.0
# 2  200  20    NaN
# 3  300  30    NaN
# 4  400  40    NaN

Related Categories

Related Articles