Apply functions to elements, rows, and columns in pandas: map, apply, applymap

Modified: 2023-12-04 | Tags: Python, pandas

To apply an arbitrary function to the elements, rows, or columns of a DataFrame or Series in pandas, use the map(), apply(), and applymap() methods.

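Apply a function to the elements of a Series: map(), apply()
- How to use map()
- How to use apply()
Apply a function to the elements of a DataFrame: map(), applymap()
Apply a function to the rows or columns of a DataFrame: apply()
Use DataFrame and Series methods and arithmetic operators
Use NumPy functions
Speed comparison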

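As described in the second half of this article, common operations are provided as DataFrame and Series methods. You can also pass a DataFrame or Series to NumPy functions. Because map() and apply() are slow, prefer a dedicated method or a NumPy function whenever one is available.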

The sample code in this article uses the following versions of pandas and NumPy. Note that behavior may differ between versions.

import pandas as pd
import numpy as np

print(pd.__version__)
# 2.1.2

print(np.__version__)
# 1.26.1

source: pandas_numpy_function.py

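Apply a function to the elements of a Series: map(), apply()

To apply a function to each element of a Series, use the map() or apply() method.

How to use map()

If you pass a function to map(), it returns a new Series in which the function has been applied to each element. The example below uses the built-in function hex(), which converts an integer to a hexadecimal string.

Related: Convert between binary, octal, and hexadecimal numbers and strings in Python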

s = pd.Series([1, 10, 100])
print(s)
# 0      1
# 1     10
# 2    100
# dtype: int64

print(s.map(hex))
# 0     0x1
# 1     0xa
# 2    0x64
# dtype: object

source: pandas_series_map_apply.py

Functions defined with def and lambda expressions can also be specified.

Related: Define and call functions in Python (def, return)
Related: Lambda expressions (anonymous functions) in Python

def my_func(x):
    return x * 10

print(s.map(my_func))
# 0      10
# 1     100
# 2    1000
# dtype: int64

print(s.map(lambda x: x * 10))
# 0      10
# 1     100
# 2    1000
# dtype: int64

source: pandas_series_map_apply.py

The example above is only for illustration; simple arithmetic like this can be performed on the Series directly.

print(s * 10)
# 0      10
# 1     100
# 2    1000
# dtype: int64

source: pandas_series_map_apply.py

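By default, missing values (NaN) are also passed to the function. If the second argument, na_action, is set to 'ignore', NaN is not passed to the function and the result stays NaN. Because a Series containing NaN has a floating-point dtype (float), the example converts each value to an integer with int() before passing it to hex().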

s_nan = pd.Series([1, float('nan'), 100])
print(s_nan)
# 0      1.0
# 1      NaN
# 2    100.0
# dtype: float64

# print(s_nan.map(lambda x: hex(int(x))))
# ValueError: cannot convert float NaN to integer

print(s_nan.map(lambda x: hex(int(x)), na_action='ignore'))
# 0     0x1
# 1     NaN
# 2    0x64
# dtype: object

source: pandas_series_map_apply.py

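A dictionary (dict) can also be passed to map(), in which case elements are replaced according to the dictionary. See the following article for details.

Related: Replace elements of a column with the map method of pandas.Series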
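
As a minimal sketch of the dictionary form, the snippet below maps two of the three values to labels. The labels 'one' and 'ten' and the variable name replaced are made up for illustration.

```python
import pandas as pd

s = pd.Series([1, 10, 100])

# Each element is looked up as a key in the dict and replaced by its value.
# Elements with no matching key (here 100) become NaN.
replaced = s.map({1: 'one', 10: 'ten'})
print(replaced)
```

Because unmatched elements turn into NaN, the dictionary needs to cover every value you want to keep.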

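How to use apply()

Like map(), apply() applies the function given as its first argument to each element. The difference is that apply() lets you specify arguments to pass to the function.

With map(), you need a lambda expression or similar to pass arguments to the function. As an example, specify the base argument of int(), which converts a string to an integer.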

s = pd.Series(['11', 'AA', 'FF'])
print(s)
# 0    11
# 1    AA
# 2    FF
# dtype: object

# print(s.map(int, base=16))
# TypeError: Series.map() got an unexpected keyword argument 'base'

print(s.map(lambda x: int(x, 16)))
# 0     17
# 1    170
# 2    255
# dtype: int64

source: pandas_series_map_apply.py

With apply(), keyword arguments are passed through to the function as they are. Positional arguments can also be given via the args argument.

print(s.apply(int, base=16))
# 0     17
# 1    170
# 2    255
# dtype: int64

print(s.apply(int, args=(16,)))
# 0     17
# 1    170
# 2    255
# dtype: int64

source: pandas_series_map_apply.py

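Note that args must be given as a tuple or list even when there is only one positional argument. A tuple with a single element requires a trailing comma.

Related: A tuple with one element requires a trailing comma in Python

As of pandas 2.1.2, apply() has no argument equivalent to na_action in map().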
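
One possible way to get similar behavior with apply() is to handle missing values inside the function itself. The helper name to_hex below is hypothetical, and this is only a sketch of the idea, not a pandas feature.

```python
import pandas as pd

s_nan = pd.Series([1, float('nan'), 100])

def to_hex(x):
    # Hypothetical helper: return missing values unchanged
    # instead of passing them to int(), which would raise ValueError
    if pd.isna(x):
        return x
    return hex(int(x))

result = s_nan.apply(to_hex)
print(result)
```

When no extra arguments are needed, map() with na_action='ignore' is the shorter option.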

Apply a function to the elements of a DataFrame: map(), applymap()

To apply a function to each element of a DataFrame, use the map() or applymap() method.

In pandas 2.1.0, applymap() was renamed to map(), and applymap() was deprecated.

As of pandas 2.1.2, applymap() still works but emits a FutureWarning.

df = pd.DataFrame([[1, 10, 100], [2, 20, 200]])
print(df)
#    0   1    2
# 0  1  10  100
# 1  2  20  200

print(df.map(hex))
#      0     1     2
# 0  0x1   0xa  0x64
# 1  0x2  0x14  0xc8

print(df.applymap(hex))
#      0     1     2
# 0  0x1   0xa  0x64
# 1  0x2  0x14  0xc8
# 
# /var/folders/rf/b7l8_vgj5mdgvghn_326rn_c0000gn/T/ipykernel_36685/2076800564.py:1: FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.

source: pandas_dataframe_map_applymap.py

The examples below use map(), but applymap() is used the same way and offers the same functionality. In versions before pandas 2.1.0, use applymap() instead.
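
If the same code has to run on pandas versions both before and after 2.1.0, one option is a small wrapper that picks whichever method exists. The function name elementwise is hypothetical and not part of pandas; treat this as a sketch.

```python
import pandas as pd

df = pd.DataFrame([[1, 10, 100], [2, 20, 200]])

def elementwise(frame, func, **kwargs):
    # Hypothetical helper: prefer DataFrame.map (pandas 2.1.0 and later),
    # otherwise fall back to DataFrame.applymap
    if hasattr(pd.DataFrame, 'map'):
        return frame.map(func, **kwargs)
    return frame.applymap(func, **kwargs)

result = elementwise(df, hex)
print(result)
#      0     1     2
# 0  0x1   0xa  0x64
# 1  0x2  0x14  0xc8
```

The check is made on the class rather than on the instance so that a column named 'map' cannot affect the result.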

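As with the Series map(), the DataFrame map() accepts the na_action argument. By default, missing values (NaN) are also passed to the function; if na_action is set to 'ignore', NaN is not passed to the function and the result stays NaN.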

df_nan = pd.DataFrame([[1, float('nan'), 100], [2, 20, 200]])
print(df_nan)
#    0     1    2
# 0  1   NaN  100
# 1  2  20.0  200

# print(df_nan.map(lambda x: hex(int(x))))
# ValueError: cannot convert float NaN to integer

print(df_nan.map(lambda x: hex(int(x)), na_action='ignore'))
#      0     1     2
# 0  0x1   NaN  0x64
# 1  0x2  0x14  0xc8

source: pandas_dataframe_map_applymap.py

Unlike the Series map(), the DataFrame map() passes any keyword arguments through to the function.

df = pd.DataFrame([['1', 'A', 'F'], ['11', 'AA', 'FF']])
print(df)
#     0   1   2
# 0   1   A   F
# 1  11  AA  FF

print(df.map(int, base=16))
#     0    1    2
# 0   1   10   15
# 1  17  170  255

source: pandas_dataframe_map_applymap.py

As of pandas 2.1.2, the DataFrame map() has no argument equivalent to args in the Series apply(), so arguments cannot be passed positionally.

Apply a function to the rows or columns of a DataFrame: apply()

To apply a function to the rows or columns of a DataFrame, use the apply() method.

pandas.DataFrame.apply — pandas 2.1.3 documentation

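For the agg() method, which applies multiple operations at once, see the following article.

Related: How to use agg() and aggregate() in pandas

Basic usage

Specify the function to apply as the first argument. The built-in function sum() is used here for illustration; to compute totals, the sum() method described later is the better choice.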

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
#     A   B   C
# X  10  20  30
# Y  40  50  60

print(df.apply(sum))
# A    50
# B    70
# C    90
# dtype: int64

source: pandas_dataframe_apply.py

By default, each column is passed to the function as a Series. A function that cannot accept a Series as its argument raises an error.

print(df.apply(lambda x: type(x)))
# A    <class 'pandas.core.series.Series'>
# B    <class 'pandas.core.series.Series'>
# C    <class 'pandas.core.series.Series'>
# dtype: object

# print(hex(df['A']))
# TypeError: 'Series' object cannot be interpreted as an integer

# print(df.apply(hex))
# TypeError: 'Series' object cannot be interpreted as an integer

source: pandas_dataframe_apply.py

Specify rows or columns: the axis argument

By default, each column is passed to the function; if axis is set to 1 or 'columns', each row is passed instead.

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
#     A   B   C
# X  10  20  30
# Y  40  50  60

print(df.apply(sum, axis=1))
# X     60
# Y    150
# dtype: int64

source: pandas_dataframe_apply.py
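
With axis=1, each row reaches the function as a Series indexed by the column names, so individual values can be picked by label. The following sketch uses the same data; the variable name row_result is illustrative.

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])

# Each row is a Series whose index is the column names A, B, C
row_result = df.apply(lambda row: row['C'] - row['A'], axis=1)
print(row_result)
# X    20
# Y    20
# dtype: int64

# axis='columns' is equivalent to axis=1
print(df.apply(sum, axis='columns').equals(df.apply(sum, axis=1)))
# True
```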

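Pass arguments to the function: keyword arguments and the args argument

Keyword arguments given to apply() are passed to the function being applied. Positional arguments can also be given via the args argument.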

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
#     A   B   C
# X  10  20  30
# Y  40  50  60

def my_func(x, y, z):
    return sum(x) + y + z * 2

print(df.apply(my_func, y=100, z=1000))
# A    2150
# B    2170
# C    2190
# dtype: int64

print(df.apply(my_func, args=(100, 1000)))
# A    2150
# B    2170
# C    2190
# dtype: int64

source: pandas_dataframe_apply.py

Pass an ndarray instead of a Series to the function: the raw argument

By default, each column or row is passed as a Series; if raw is set to True, it is passed as a NumPy array (ndarray) instead.

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
#     A   B   C
# X  10  20  30
# Y  40  50  60

print(df.apply(lambda x: type(x), raw=True))
# A    <class 'numpy.ndarray'>
# B    <class 'numpy.ndarray'>
# C    <class 'numpy.ndarray'>
# dtype: object

source: pandas_dataframe_apply.py

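If the function does not need a Series, raw=True is faster because the conversion is skipped. Speed is discussed later.

Operations that rely on Series-specific methods or attributes raise an error with raw=True.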

print(df.apply(lambda x: x.name * 3))
# A    AAA
# B    BBB
# C    CCC
# dtype: object

# print(df.apply(lambda x: x.name * 3, raw=True))
# AttributeError: 'numpy.ndarray' object has no attribute 'name'

source: pandas_dataframe_apply.py
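
Conversely, a function that only uses operations available on ndarray works with raw=True. A minimal sketch, with value_range as an illustrative name:

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])

# x is a numpy.ndarray here, so only ndarray methods (max, min) are used
value_range = df.apply(lambda x: x.max() - x.min(), raw=True)
print(value_range)
# A    30
# B    30
# C    30
# dtype: int64
```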

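Apply a function to the elements of a specific row or column

With apply(), the function receives and processes a whole row or column. To apply a function to the elements of a specific row or column, extract that row or column as a Series and use the Series map() or apply() method.

Related: Select rows and columns by index in pandas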

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
#     A   B   C
# X  10  20  30
# Y  40  50  60

print(df['A'].map(lambda x: x**2))
# X     100
# Y    1600
# Name: A, dtype: int64

print(df.loc['Y'].map(hex))
# A    0x28
# B    0x32
# C    0x3c
# Name: Y, dtype: object

source: pandas_dataframe_apply.py

The result can also be added as a new row or column. Specifying an existing row or column name overwrites it.

Related: Add columns and rows to pandas.DataFrame (assign, append, etc.)

df['A'] = df['A'].map(lambda x: x**2)
df.loc['Y_hex'] = df.loc['Y'].map(hex)
print(df)
#            A     B     C
# X        100    20    30
# Y       1600    50    60
# Y_hex  0x640  0x32  0x3c

source: pandas_dataframe_apply.py

Use DataFrame and Series methods and arithmetic operators

In pandas, common operations are provided as DataFrame and Series methods, so there is no need to use map() or apply() for them.

df = pd.DataFrame([[1, -2, 3], [-4, 5, -6]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
#    A  B  C
# X  1 -2  3
# Y -4  5 -6

print(df.abs())
#    A  B  C
# X  1  2  3
# Y  4  5  6

print(df.sum())
# A   -3
# B    3
# C   -3
# dtype: int64

print(df.sum(axis=1))
# X    2
# Y   -5
# dtype: int64

source: pandas_numpy_function.py

See the official documentation for the list of available methods.

DataFrames and Series can also be processed directly with arithmetic operators.

print(df * 10)
#     A   B   C
# X  10 -20  30
# Y -40  50 -60

print(df['A'].abs() + df['B'] * 100)
# X   -199
# Y    504
# dtype: int64

source: pandas_numpy_function.py

String methods are also available through the str accessor of a Series.

Related: Replace strings, strip whitespace, and more with pandas string methods

df = pd.DataFrame([['a', 'ab', 'abc'], ['x', 'xy', 'xyz']], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
#    A   B    C
# X  a  ab  abc
# Y  x  xy  xyz

print(df['A'] + '-' + df['B'].str.upper() + '-' + df['C'].str.title())
# X    a-AB-Abc
# Y    x-XY-Xyz
# dtype: object

source: pandas_numpy_function.py

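Use NumPy functions

You can also pass a DataFrame or Series as an argument to NumPy functions.

For example, pandas does not provide a method that rounds down to an integer (floor), but np.floor() can be used instead. It returns a DataFrame for a DataFrame and a Series for a Series.

Related: Round down and up the decimals of a NumPy array ndarray: floor, trunc, ceil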

df = pd.DataFrame([[0.1, 0.5, 0.9], [-0.1, -0.5, -0.9]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
#      A    B    C
# X  0.1  0.5  0.9
# Y -0.1 -0.5 -0.9

print(np.floor(df))
#      A    B    C
# X  0.0  0.0  0.0
# Y -1.0 -1.0 -1.0

print(type(np.floor(df)))
# <class 'pandas.core.frame.DataFrame'>

print(np.floor(df['A']))
# X    0.0
# Y   -1.0
# Name: A, dtype: float64

print(type(np.floor(df['A'])))
# <class 'pandas.core.series.Series'>

source: pandas_numpy_function.py
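
The related functions np.trunc() and np.ceil() can be applied in the same way. The sketch below contrasts the three on the same data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.1, 0.5, 0.9], [-0.1, -0.5, -0.9]], index=['X', 'Y'], columns=['A', 'B', 'C'])

# np.floor() rounds toward negative infinity, np.trunc() rounds toward zero,
# and np.ceil() rounds toward positive infinity
floored = np.floor(df)
truncated = np.trunc(df)
ceiled = np.ceil(df)

print(type(truncated))
# <class 'pandas.core.frame.DataFrame'>
```

For negative values the results differ: -0.1 becomes -1.0 with np.floor() but rounds to zero with np.trunc() and np.ceil().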

The axis argument can also be specified in NumPy functions.

print(np.sum(df, axis=0))
# A    0.0
# B    0.0
# C    0.0
# dtype: float64

print(np.sum(df, axis=1))
# X    1.5
# Y   -1.5
# dtype: float64

print(type(np.sum(df, axis=0)))
# <class 'pandas.core.series.Series'>

source: pandas_numpy_function.py

Speed comparison

Compare the speed of the DataFrame map() and apply() methods with that of dedicated methods and NumPy functions.

A DataFrame with 100 rows and 100 columns is used as the example.

df = pd.DataFrame(np.arange(-5000, 5000).reshape(100, 100))

print(df.shape)
# (100, 100)

source: pandas_map_apply_timeit.py

The examples below use the Jupyter Notebook magic command %%timeit; note that they cannot be measured by running them as a plain Python script.

Related: Measure execution time with the timeit module in Python
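
For a plain Python script, the standard library timeit module can be used for a rough measurement instead. The following is a sketch only: the loop count of 100 and the variable names are arbitrary choices, and the figures will vary by environment.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(-5000, 5000).reshape(100, 100))

n = 100  # number of repetitions (arbitrary)

# timeit.timeit() returns the total time in seconds for n calls
t_apply = timeit.timeit(lambda: df.apply(sum), number=n)
t_sum = timeit.timeit(lambda: df.sum(), number=n)

print(f'df.apply(sum): {t_apply / n * 1e6:.1f} µs per loop')
print(f'df.sum():      {t_sum / n * 1e6:.1f} µs per loop')
```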

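The results for passing the built-in function abs() to map(), and for using the DataFrame abs() method and the np.abs() function, are as follows. map() is clearly slower, by a factor of roughly 400 compared with abs() in this measurement.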

%%timeit
df.map(abs)
# 2.07 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.abs()
# 5.06 µs ± 55 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%%timeit
np.abs(df)
# 7.81 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

source: pandas_map_apply_timeit.py

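The results for passing the built-in function sum() to apply(), and for using the DataFrame sum() method and the np.sum() function, are as follows. apply() is clearly slower. Setting raw=True speeds it up, but it is still more than ten times slower than sum() and np.sum().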

%%timeit
df.apply(sum)
# 932 µs ± 95.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
df.apply(sum, raw=True)
# 427 µs ± 4.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
df.sum()
# 35 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%%timeit
np.sum(df, axis=0)
# 37.3 µs ± 66.9 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

source: pandas_map_apply_timeit.py

map() and apply() are meant for complex operations that cannot be achieved any other way; wherever possible, it is better to use other methods or NumPy functions.
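
As an illustration of this advice, a row-wise apply() can often be rewritten with column operations. The sketch below reuses the small DataFrame from the methods section; the names via_apply and vectorized are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[1, -2, 3], [-4, 5, -6]], index=['X', 'Y'], columns=['A', 'B', 'C'])

# Row-wise version: the lambda is called once per row
via_apply = df.apply(lambda row: abs(row['A']) + row['B'] * 100, axis=1)

# Vectorized version: operates on whole columns at once
vectorized = df['A'].abs() + df['B'] * 100

print(vectorized)
# X   -199
# Y    504
# dtype: int64

print(via_apply.equals(vectorized))
# True
```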
