pandas: Replacing values based on conditions (where, mask)

Modified: 2023-12-14 | Tags: Python, pandas

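This article explains how to replace values based on conditions in pandas. No if statement is involved, but you can apply conditional logic of the form "if ... then ..." or "if ... then ... else ..." to a DataFrame or Series.

To replace the elements where a condition is False, use the where() method; to replace the elements where it is True, use the mask() method; to replace both the True and the False elements, use NumPy's np.where() function.

You can also assign values conditionally using boolean indexing with loc and iloc.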
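As a quick orientation, the four approaches named above can be compared side by side on one small Series. This is only a minimal sketch; the variable names cond and s_assigned are chosen here for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([-2, -1, 0, 1, 2])
cond = s < 0  # bool Series: True, True, False, False, False

# where(): keep elements where cond is True, replace the rest
print(s.where(cond, 10).tolist())
# [-2, -1, 10, 10, 10]

# mask(): replace elements where cond is True, keep the rest
print(s.mask(cond, 10).tolist())
# [10, 10, 0, 1, 2]

# np.where(): choose a value for both the True and the False elements
print(np.where(cond, -100, 1).tolist())
# [-100, -100, 1, 1, 1]

# Boolean indexing with loc: assign to the True elements of a copy
s_assigned = s.copy()
s_assigned.loc[cond] = 10
print(s_assigned.tolist())
# [10, 10, 0, 1, 2]
```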

Contents

where() method: keep True, replace False
- where() on a Series
- where() on a DataFrame
mask() method: replace True, keep False
np.where() function: replace both True and False
Assigning values with boolean indexing in loc and iloc

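For replacing specific values, and for replacing or removing missing values (NaN), see the following articles.

Related article: Replacing element values in pandas.DataFrame and Series with replace
Related article: Replacing (filling) missing values NaN in pandas with fillna
Related article: Removing (excluding) missing values NaN in pandas with dropna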

The versions of pandas and NumPy used for the sample code in this article are as follows. Note that behavior may differ between versions.

import pandas as pd
import numpy as np

print(pd.__version__)
# 2.1.4

print(np.__version__)
# 1.26.2

source: pandas_where_mask.py

where() method: keep True, replace False

where() is provided as a method of both DataFrame and Series.

where() on a Series

The following Series is used as an example.

s = pd.Series([-2, -1, 0, 1, 2])
print(s)
# 0   -2
# 1   -1
# 2    0
# 3    1
# 4    2
# dtype: int64

source: pandas_where_mask.py

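First argument: cond

When a Series of bool dtype is passed as the first argument, cond, of where(), elements where it is True keep the value of the calling object, and elements where it is False become NaN. Because NaN is a floating-point value, an integer Series is converted to float64 in that case.

For example, a comparison on a Series returns a Series of bool dtype, which can be passed as the first argument.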

print(s < 0)
# 0     True
# 1     True
# 2    False
# 3    False
# 4    False
# dtype: bool

print(s.where(s < 0))
# 0   -2.0
# 1   -1.0
# 2    NaN
# 3    NaN
# 4    NaN
# dtype: float64

source: pandas_where_mask.py

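Besides comparisons, conditions on strings can also be used. See the mask() examples in the next section.

An array-like object of bool values, such as a list or a NumPy array (ndarray), can also be specified.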

print(s.where([False, True, False, True, False]))
# 0    NaN
# 1   -1.0
# 2    NaN
# 3    1.0
# 4    NaN
# dtype: float64

source: pandas_where_mask.py

Second argument: other

When a scalar is passed as the second argument, other, the False elements are replaced with that value.

print(s.where(s < 0, 10))
# 0    -2
# 1    -1
# 2    10
# 3    10
# 4    10
# dtype: int64

source: pandas_where_mask.py

Array-like objects such as lists and NumPy arrays (ndarray) can also be specified. Each False element is replaced with the element at the same position.

print(s.where(s < 0, [0, 10, 20, 30, 40]))
# 0    -2
# 1    -1
# 2    20
# 3    30
# 4    40
# dtype: int64

source: pandas_where_mask.py

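A Series can also be specified. In this case, each False element is replaced with the element that has the same index label, not the same position.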

s2 = pd.Series([0, 10, 20, 30, 40], index=[4, 3, 2, 1, 0])
print(s2)
# 4     0
# 3    10
# 2    20
# 1    30
# 0    40
# dtype: int64

print(s.where(s < 0, s2))
# 0    -2
# 1    -1
# 2    20
# 3    10
# 4     0
# dtype: int64

source: pandas_where_mask.py

You can also use a transformed version of the original Series as the replacement.

print(s * 100 + 10)
# 0   -190
# 1    -90
# 2     10
# 3    110
# 4    210
# dtype: int64

print(s.where(s < 0, s * 100 + 10))
# 0     -2
# 1     -1
# 2     10
# 3    110
# 4    210
# dtype: int64

source: pandas_where_mask.py

The inplace argument

By default, the original object is not modified and a new object is returned. Setting the inplace argument to True updates the original object.

s.where(s < 0, 10, inplace=True)
print(s)
# 0    -2
# 1    -1
# 2    10
# 3    10
# 4    10
# dtype: int64

source: pandas_where_mask.py

where() on a DataFrame

The following DataFrame is used as an example.

df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
#    A   B
# 0 -2   0
# 1 -1  10
# 2  0  20
# 3  1  30
# 4  2  40

source: pandas_where_mask.py

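Basic usage

The basic usage of where() on a DataFrame is the same as on a Series.

When a DataFrame of bool dtype is passed as the first argument, cond, of where(), elements where it is True keep the value of the calling object, and elements where it is False are replaced with NaN.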

print((df < 0) | (df > 20))
#        A      B
# 0   True  False
# 1   True  False
# 2  False  False
# 3  False   True
# 4  False   True

print(df.where((df < 0) | (df > 20)))
#      A     B
# 0 -2.0   NaN
# 1 -1.0   NaN
# 2  NaN   NaN
# 3  NaN  30.0
# 4  NaN  40.0

source: pandas_where_mask.py

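The example above combines multiple conditions with | (or). Note that using and / or instead of & / |, or omitting the parentheses, raises an error.

Related article: How to handle ValueError: The truth value ... is ambiguous in NumPy and pandas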
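The two pitfalls mentioned above can be reproduced as follows. This is a minimal sketch; the errors list is introduced here only to record which exception each attempt raises.

```python
import pandas as pd

df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})

errors = []

# `or` asks each DataFrame for a single truth value, which is ambiguous
try:
    (df < 0) or (df > 20)
except ValueError as e:
    errors.append(type(e).__name__)

# Without parentheses, | binds more tightly than < and >, so the expression
# is parsed as the chained comparison df < (0 | df) > 20, which again needs
# a single truth value and therefore fails
try:
    df < 0 | df > 20
except (ValueError, TypeError) as e:
    errors.append(type(e).__name__)

print(errors)

# Correct form: parenthesize each comparison and combine with |
cond = (df < 0) | (df > 20)
print(int(cond.to_numpy().sum()))
# 4
```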

Passing a scalar or a DataFrame as the second argument, other, replaces the False elements.

print(df.where((df < 0) | (df > 20), 100))
#      A    B
# 0   -2  100
# 1   -1  100
# 2  100  100
# 3  100   30
# 4  100   40

print(df * 100 + 10)
#      A     B
# 0 -190    10
# 1  -90  1010
# 2   10  2010
# 3  110  3010
# 4  210  4010

print(df.where((df < 0) | (df > 20), df * 100 + 10))
#      A     B
# 0   -2    10
# 1   -1  1010
# 2   10  2010
# 3  110    30
# 4  210    40

source: pandas_where_mask.py

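Examples are omitted here, but as with where() on a Series, a two-dimensional array or a list of lists can be passed as the first and second arguments. The inplace argument is also available.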
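For reference, here is a minimal sketch of that usage. The condition is given as a list of lists and the replacement values as a two-dimensional ndarray; both have the same shape as the DataFrame. The values in cond and other are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})

# Condition as a list of lists with the same shape as df (5 rows x 2 columns)
cond = [[True, False], [False, True], [True, True], [False, False], [True, False]]

# Replacement values as a two-dimensional ndarray with the same shape
other = np.arange(10).reshape(5, 2) * 1000

result = df.where(cond, other)
print(result['A'].tolist())
# [-2, 2000, 0, 6000, 2]
print(result['B'].tolist())
# [1000, 10, 20, 7000, 9000]

# inplace=True updates the calling object instead of returning a new one
df_inplace = df.copy()
df_inplace.where(cond, other, inplace=True)
print(df_inplace['B'].tolist())
# [1000, 10, 20, 7000, 9000]
```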

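Processing DataFrame columns individually

If a DataFrame has both numeric and string columns, for example, comparing the whole DataFrame with a number attempts to compare strings with numbers and raises an error.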

df['C'] = ['A', 'B', 'C', 'D', 'E']
print(df)
#    A   B  C
# 0 -2   0  A
# 1 -1  10  B
# 2  0  20  C
# 3  1  30  D
# 4  2  40  E

# print(df < 0)
# TypeError: '<' not supported between instances of 'str' and 'int'

source: pandas_where_mask.py

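Since each column of a DataFrame is a Series, you can select columns and process them individually. The result can also be added as a new column.

Related article: Selecting rows and columns in pandas by index
Related article: Adding columns and rows to pandas.DataFrame (assign, append, etc.)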

print(df['C'].where(df['A'] < 0, 'X'))
# 0    A
# 1    B
# 2    X
# 3    X
# 4    X
# Name: C, dtype: object

df['D'] = df['C'].where(df['A'] < 0, 'X')
print(df)
#    A   B  C  D
# 0 -2   0  A  A
# 1 -1  10  B  B
# 2  0  20  C  X
# 3  1  30  D  X
# 4  2  40  E  X

source: pandas_where_mask.py

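You can also extract only the numeric columns, process them, and then concatenate the result with the non-numeric columns, using the select_dtypes() method and the pd.concat() function.

Related article: Extracting or excluding columns of specific dtypes from pandas.DataFrame with select_dtypes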
Related article: Concatenating pandas.DataFrame and Series with concat

df_num = df.select_dtypes('number')
print(df_num.where(df_num > 0, -10))
#     A   B
# 0 -10 -10
# 1 -10  10
# 2 -10  20
# 3   1  30
# 4   2  40

print(df.select_dtypes(exclude='number'))
#    C  D
# 0  A  A
# 1  B  B
# 2  C  X
# 3  D  X
# 4  E  X

print(pd.concat([df_num.where(df_num > 0, -10), df.select_dtypes(exclude='number')],
                axis=1))
#     A   B  C  D
# 0 -10 -10  A  A
# 1 -10  10  B  B
# 2 -10  20  C  X
# 3   1  30  D  X
# 4   2  40  E  X

source: pandas_where_mask.py

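mask() method: replace True, keep False

mask() is provided as a method of both DataFrame and Series.

mask() is the opposite of where(): elements where the condition given as the first argument is False keep the value of the calling object, and elements where it is True are replaced with NaN or with the value given as the second argument.

The arguments are used in the same way as for where(). See the where() section above for details.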
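The relationship between the two methods can be checked directly: calling mask() with a condition should give the same result as calling where() with the negated condition. A minimal sketch, with variable names chosen here for illustration:

```python
import pandas as pd

s = pd.Series([-2, -1, 0, 1, 2])
cond = s < 0

# mask() replaces where cond is True ...
masked = s.mask(cond, 10)

# ... which matches where() applied to the negated condition
inverted = s.where(~cond, 10)

print(masked.tolist())
# [10, 10, 0, 1, 2]
print(masked.equals(inverted))
# True
```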

Some examples follow.

s = pd.Series(['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen'])
print(s)
# 0      Alice
# 1        Bob
# 2    Charlie
# 3       Dave
# 4      Ellen
# dtype: object

print(s.mask(s.str.endswith('e')))
# 0      NaN
# 1      Bob
# 2      NaN
# 3      NaN
# 4    Ellen
# dtype: object

print(s.mask(s.str.endswith('e'), 'X'))
# 0        X
# 1      Bob
# 2        X
# 3        X
# 4    Ellen
# dtype: object

print(s.mask(s.str.endswith('e'), s.str.upper()))
# 0      ALICE
# 1        Bob
# 2    CHARLIE
# 3       DAVE
# 4      Ellen
# dtype: object

source: pandas_where_mask.py

df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
#    A   B
# 0 -2   0
# 1 -1  10
# 2  0  20
# 3  1  30
# 4  2  40

print(df.mask((df < 0) | (df > 20)))
#      A     B
# 0  NaN   0.0
# 1  NaN  10.0
# 2  0.0  20.0
# 3  1.0   NaN
# 4  2.0   NaN

print(df.mask((df < 0) | (df > 20), 100))
#      A    B
# 0  100    0
# 1  100   10
# 2    0   20
# 3    1  100
# 4    2  100

print(df.mask((df < 0) | (df > 20), df * 100 + 10))
#      A     B
# 0 -190     0
# 1  -90    10
# 2    0    20
# 3    1  3010
# 4    2  4010

source: pandas_where_mask.py

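The Series example uses a suffix match on strings as the condition. Exact matches, prefix matches, and regular-expression matches can also be used. See the following article.

Related article: Extracting rows that contain a specific string in pandas (exact match, partial match)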
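As a brief sketch of those other string conditions, the following uses ==, str.startswith(), and str.contains() with a regular expression. The names and the pattern are chosen only for illustration.

```python
import pandas as pd

s = pd.Series(['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen'])

# Exact match
exact = s.mask(s == 'Bob', 'X')
print(exact.tolist())
# ['Alice', 'X', 'Charlie', 'Dave', 'Ellen']

# Prefix match
prefix = s.mask(s.str.startswith('A'), 'X')
print(prefix.tolist())
# ['X', 'Bob', 'Charlie', 'Dave', 'Ellen']

# Regular expression: names starting with A, B, or C
# (str.contains() treats the pattern as a regex by default)
regex = s.mask(s.str.contains(r'^[A-C]'), 'X')
print(regex.tolist())
# ['X', 'X', 'X', 'Dave', 'Ellen']
```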

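np.where() function: replace both True and False

NumPy's np.where() function can be used to replace values in a DataFrame or Series based on a condition.

Related article: Using np.where for conditional processing in NumPy

With the where() and mask() methods of pandas, only the elements where the condition is True, or only those where it is False, are replaced; the others keep the values of the original object. A single call cannot replace both the True and the False elements.

With np.where(), pass the condition as the first argument, the value for the True elements as the second, and the value for the False elements as the third. The second and third arguments can be scalars, arrays, Series, or DataFrames.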

s = pd.Series([-2, -1, 0, 1, 2])
print(s)
# 0   -2
# 1   -1
# 2    0
# 3    1
# 4    2
# dtype: int64

print(np.where(s < 0, -100, 1))
# [-100 -100    1    1    1]

print(np.where(s < 0, s * 10, s * 100 + 10))
# [-20 -10  10 110 210]

print(type(np.where(s < 0, -100, 1)))
# <class 'numpy.ndarray'>

source: pandas_where_mask.py

df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
#    A   B
# 0 -2   0
# 1 -1  10
# 2  0  20
# 3  1  30
# 4  2  40

print(np.where((df < 0) | (df > 20), -100, 1))
# [[-100    1]
#  [-100    1]
#  [   1    1]
#  [   1 -100]
#  [   1 -100]]

print(np.where((df < 0) | (df > 20), df * 10, df * 100 + 10))
# [[ -20   10]
#  [ -10 1010]
#  [  10 2010]
#  [ 110  300]
#  [ 210  400]]

print(type(np.where((df < 0) | (df > 20), -100, 1)))
# <class 'numpy.ndarray'>

source: pandas_where_mask.py

np.where() returns a NumPy array (ndarray). You can build a DataFrame or Series from it using the index and columns attributes of the original object.

Related article: The structure of pandas.DataFrame and how to create one

print(pd.Series(np.where(s < 0, -100, 1), index=s.index))
# 0   -100
# 1   -100
# 2      1
# 3      1
# 4      1
# dtype: int64

source: pandas_where_mask.py

print(pd.DataFrame(np.where((df < 0) | (df > 20), -100, 1),
                   index=df.index, columns=df.columns))
#      A    B
# 0 -100    1
# 1 -100    1
# 2    1    1
# 3    1 -100
# 4    1 -100

source: pandas_where_mask.py

When adding the result as a new column, the ndarray can be assigned directly on the right-hand side.

df['C'] = np.where(df['A'] < 0, -100, 1)
print(df)
#    A   B    C
# 0 -2   0 -100
# 1 -1  10 -100
# 2  0  20    1
# 3  1  30    1
# 4  2  40    1

source: pandas_where_mask.py

Assigning values with boolean indexing in loc and iloc

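Passing a bool-dtype Series or array to loc extracts the elements at the True positions (boolean indexing). iloc works the same way, but takes a bool array or list rather than a Series.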

Related article: Getting and setting values at arbitrary positions in pandas with at, iat, loc, iloc

df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})
print(df)
#    A   B
# 0 -2   0
# 1 -1  10
# 2  0  20
# 3  1  30
# 4  2  40

print(df.loc[df['A'] < 0, 'A'])
# 0   -2
# 1   -1
# Name: A, dtype: int64

source: pandas_where_mask.py
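The example above uses loc. For iloc, a minimal sketch might look like the following, assuming pandas 2.x behavior: the bool Series is converted to an ndarray with to_numpy() and the column is given by position. The names rows, selected, and series_accepted are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})

# iloc is position-based, so convert the bool Series to a plain ndarray first
rows = (df['A'] < 0).to_numpy()

# Column 'A' is at position 0
selected = df.iloc[rows, 0]
print(selected.tolist())
# [-2, -1]

# Passing the bool Series itself is expected to be rejected by iloc
# (assumption: behavior of the pandas 2.x versions this sketch targets)
try:
    df.iloc[df['A'] < 0]
    series_accepted = True
except (NotImplementedError, ValueError):
    series_accepted = False
print(series_accepted)
```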

Indexing with loc and iloc can be used not only to get values but also to assign them. The right-hand side can be a scalar, a Series, an array, and so on.

df.loc[df['A'] < 0, 'A'] = -10
print(df)
#     A   B
# 0 -10   0
# 1 -10  10
# 2   0  20
# 3   1  30
# 4   2  40

df.loc[df['A'] >= 0, 'A'] = df['B'] * 10
print(df)
#      A   B
# 0  -10   0
# 1  -10  10
# 2  200  20
# 3  300  30
# 4  400  40

source: pandas_where_mask.py

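Specifying a new column name adds a new column. Elements that do not satisfy the condition become the missing value NaN. Note that a numeric column containing NaN has a float dtype.

Related article: Missing values in pandas (nan, None, pd.NA)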

df.loc[df['A'] < 0, 'C'] = -100
print(df)
#      A   B      C
# 0  -10   0 -100.0
# 1  -10  10 -100.0
# 2  200  20    NaN
# 3  300  30    NaN
# 4  400  40    NaN

source: pandas_where_mask.py
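If the float conversion is not wanted, one possible approach (a sketch, not the only option) is to create the column with a default value first and then assign to the matching rows, so that no NaN is ever introduced. The column name D and the default value 0 are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [-2, -1, 0, 1, 2], 'B': [0, 10, 20, 30, 40]})

# Assigning to a new column through loc leaves NaN in the other rows,
# so the column becomes float64
df.loc[df['A'] < 0, 'C'] = -100
print(df['C'].dtype)
# float64

# Creating the column with a default value first avoids NaN,
# so the integer dtype is kept
df['D'] = 0
df.loc[df['A'] < 0, 'D'] = -100
print(df['D'].tolist())
# [-100, -100, 0, 0, 0]
print(df['D'].dtype.kind)
# i
```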
