Select or exclude columns of specific dtypes from a pandas.DataFrame with select_dtypes

Modified: 2023-12-21 | Tags: Python, pandas

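A pandas DataFrame stores a data type (dtype) for each column. To select or exclude columns of specific dtypes, use the select_dtypes() method. For example, you can extract only the numeric columns, or drop only the numeric columns.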

pandas.DataFrame.select_dtypes — pandas 2.1.4 documentation

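- Basic usage of select_dtypes()
  - Specify the dtypes to select: the include argument
  - Specify the dtypes to exclude: the exclude argument
- How to specify dtypes in select_dtypes()
- Caveats when selecting string columns with select_dtypes()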

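For details on pandas dtypes, see the following article.

Related article: List of pandas dtypes and conversion (casting) with astype

To select columns by conditions on their names rather than their dtypes, see the following article.

Related article: Selecting rows and columns of a pandas.DataFrame by row and column name conditions with filter

The pandas and NumPy versions used for the sample code in this article are shown below. Note that behavior may differ between versions. The following DataFrame is used as an example.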

import pandas as pd
import numpy as np

print(pd.__version__)
# 2.1.4

print(np.__version__)
# 1.26.2

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': np.array([10, 20, 30], dtype=np.int32),
                   'c': [0.1, 0.2, 0.3],
                   'd': ['X', 'Y', 'Z'],
                   'e': [[0, 0], [1, 1], [2, 2]],
                   'f': [True, True, False],
                   'g': pd.to_datetime(['2023-12-01', '2023-12-02', '2023-12-03'])})
print(df)
#    a   b    c  d       e      f          g
# 0  1  10  0.1  X  [0, 0]   True 2023-12-01
# 1  2  20  0.2  Y  [1, 1]   True 2023-12-02
# 2  3  30  0.3  Z  [2, 2]  False 2023-12-03

print(df.dtypes)
# a             int64
# b             int32
# c           float64
# d            object
# e            object
# f              bool
# g    datetime64[ns]
# dtype: object

source: pandas_select_dtypes.py

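Basic usage of select_dtypes()

Specify the dtypes to select: the include argument

Specify the dtypes to select with the include argument. In the example below, the built-in int matches both the int64 column a and the int32 column b.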

print(df.select_dtypes(include=int))
#    a   b
# 0  1  10
# 1  2  20
# 2  3  30

source: pandas_select_dtypes.py
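As a small supplementary sketch, the difference between the built-in int and an explicit width can be checked by comparing the selected column names. The DataFrame df_int is made up for illustration and is not part of the original sample file. The expectation, based on the output above, is that int matches both integer widths while 'int64' or np.int32 matches only that width. This may depend on the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical example data: one int64 column and one int32 column.
df_int = pd.DataFrame({'a': np.array([1, 2, 3], dtype=np.int64),
                       'b': np.array([10, 20, 30], dtype=np.int32)})

# Built-in int versus explicit widths given as a type name and a NumPy type.
cols_int = df_int.select_dtypes(include=int).columns.tolist()
cols_int64 = df_int.select_dtypes(include='int64').columns.tolist()
cols_int32 = df_int.select_dtypes(include=np.int32).columns.tolist()

print(cols_int)
print(cols_int64)
print(cols_int32)
```

If only one integer width is wanted, passing the explicit name avoids relying on how int is interpreted.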

Multiple dtypes can be specified as a list.

print(df.select_dtypes(include=['int32', bool]))
#     b      f
# 0  10   True
# 1  20   True
# 2  30  False

source: pandas_select_dtypes.py

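If no column matches, an empty DataFrame is returned. It has no columns, but the index is retained.

Related article: Checking whether a pandas.DataFrame or Series is empty with empty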

print(df.select_dtypes(include='float32'))
# Empty DataFrame
# Columns: []
# Index: [0, 1, 2]

source: pandas_select_dtypes.py
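Because a missing dtype yields an empty DataFrame rather than an error, it can be worth checking the result before using it. The following is a minimal sketch using the empty attribute. The DataFrame df_f and the printed message are hypothetical.

```python
import pandas as pd

# Hypothetical example data without any float32 column.
df_f = pd.DataFrame({'a': [1, 2, 3], 'c': [0.1, 0.2, 0.3]})

selected = df_f.select_dtypes(include='float32')

# empty is True when the DataFrame has no columns (or no rows).
if selected.empty:
    print('no float32 columns')
else:
    print(selected)

# The rows are kept, but there are zero columns.
print(selected.shape)
```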

How to specify dtypes is described later in this article. You can also select only the numeric columns with 'number'.

print(df.select_dtypes(include='number'))
#    a   b    c
# 0  1  10  0.1
# 1  2  20  0.2
# 2  3  30  0.3

source: pandas_select_dtypes.py
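One common use of 'number' is to obtain the names of the numeric columns and then operate on just those columns. The sketch below assumes a made-up DataFrame df_mix. Scaling by 10 is only an illustrative operation.

```python
import pandas as pd

# Hypothetical example data with numeric, string, and bool columns.
df_mix = pd.DataFrame({'a': [1, 2, 3],
                       'c': [0.1, 0.2, 0.3],
                       'd': ['X', 'Y', 'Z'],
                       'f': [True, True, False]})

# Names of the numeric columns (bool is not treated as a number here).
num_cols = df_mix.select_dtypes(include='number').columns.tolist()
print(num_cols)

# Apply an operation to the numeric columns only.
df_mix[num_cols] = df_mix[num_cols] * 10
print(df_mix)
```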

Specify the dtypes to exclude: the exclude argument

Specify the dtypes to exclude with the exclude argument. Multiple dtypes can be specified as a list.

print(df.select_dtypes(exclude=int))
#      c  d       e      f          g
# 0  0.1  X  [0, 0]   True 2023-12-01
# 1  0.2  Y  [1, 1]   True 2023-12-02
# 2  0.3  Z  [2, 2]  False 2023-12-03

print(df.select_dtypes(exclude=['int32', bool]))
#    a    c  d       e          g
# 0  1  0.1  X  [0, 0] 2023-12-01
# 1  2  0.2  Y  [1, 1] 2023-12-02
# 2  3  0.3  Z  [2, 2] 2023-12-03

source: pandas_select_dtypes.py

include and exclude can be specified together, but specifying the same dtype in both raises an error.

print(df.select_dtypes(include='number', exclude='int32'))
#    a    c
# 0  1  0.1
# 1  2  0.2
# 2  3  0.3

# print(df.select_dtypes(include=['int32', bool], exclude='int32'))
# ValueError: include and exclude overlap on frozenset({<class 'numpy.int32'>})

source: pandas_select_dtypes.py
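The error cases can be explored safely by wrapping the call in try/except. The helper try_select and the DataFrame df_err below are hypothetical names used only for this sketch. Besides overlapping dtypes, calling select_dtypes() with neither include nor exclude also raises a ValueError.

```python
import pandas as pd

# Hypothetical example data.
df_err = pd.DataFrame({'a': [1, 2, 3], 'c': [0.1, 0.2, 0.3]})


def try_select(**kwargs):
    """Return the selected column names, or the name of the raised exception."""
    try:
        return df_err.select_dtypes(**kwargs).columns.tolist()
    except (TypeError, ValueError) as e:
        return type(e).__name__


# include and exclude together: numeric columns except float64.
ok = try_select(include='number', exclude='float64')
# The same dtype in both arguments.
overlap = try_select(include='float64', exclude='float64')
# Neither argument given.
neither = try_select()

print(ok)
print(overlap)
print(neither)
```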

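How to specify dtypes in select_dtypes()

In select_dtypes(), dtypes can be specified as type objects such as int or np.int64, or as strings containing a type name or type code such as 'int64' or 'i8'.

Related article: List of pandas dtypes and conversion (casting) with astype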

print(df.select_dtypes(include=['i8', 'int32', np.float64]))
#    a   b    c
# 0  1  10  0.1
# 1  2  20  0.2
# 2  3  30  0.3

source: pandas_select_dtypes.py
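If it is unclear which dtype a type code such as 'i8' stands for, one way to check is to pass it to np.dtype() and look at the name. This is a supplementary sketch. The list of codes is just an illustrative selection.

```python
import numpy as np

# A few NumPy type codes: 'i' is signed integer, 'f' is float,
# and the digit is the size in bytes.
codes = ['i8', 'i4', 'f8', 'f4']

names = {code: np.dtype(code).name for code in codes}
for code, name in names.items():
    print(code, '->', name)

# A type code, a type name, and a type object can refer to the same dtype.
same = np.dtype('i8') == np.dtype('int64') == np.dtype(np.int64)
print(same)
```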

dtypes can also be specified as follows. 'number', which covers all numeric types at once, is particularly handy.

To select all numeric types, use np.number or 'number'

To select datetimes, use np.datetime64, 'datetime' or 'datetime64'

To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'

To select Pandas categorical dtypes, use 'category'

To select Pandas datetimetz dtypes, use 'datetimetz' or 'datetime64[ns, tz]'

(Quoted from the pandas.DataFrame.select_dtypes page of the pandas 2.1.4 documentation.)

print(df.select_dtypes(include=['number', 'datetime']))
#    a   b    c          g
# 0  1  10  0.1 2023-12-01
# 1  2  20  0.2 2023-12-02
# 2  3  30  0.3 2023-12-03

source: pandas_select_dtypes.py
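The example DataFrame in this article has no categorical, timedelta, or timezone-aware datetime columns, so the following sketch builds a separate, made-up DataFrame df_misc to try the other keywords from the documentation quote. The column names are hypothetical, and the results are collected in a dict rather than asserted in comments, since details may vary between pandas versions.

```python
import pandas as pd

dates = pd.to_datetime(['2023-12-01', '2023-12-02', '2023-12-03'])

# Hypothetical example data covering the special dtype keywords.
df_misc = pd.DataFrame({
    'cat': pd.Categorical(['x', 'y', 'x']),       # category
    'td': pd.to_timedelta([1, 2, 3], unit='D'),   # timedelta64
    'dt': dates,                                  # timezone-naive datetime64
    'dt_tz': dates.tz_localize('UTC'),            # timezone-aware datetime
})
print(df_misc.dtypes)

# Column names selected by each keyword.
selected = {
    key: df_misc.select_dtypes(include=key).columns.tolist()
    for key in ['category', 'timedelta', 'datetime', 'datetimetz']
}
for key, cols in selected.items():
    print(key, cols)
```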

Caveats when selecting string columns with select_dtypes()

A column whose elements are strings (str) has dtype object, so specifying str or 'str' raises an error.

# print(df.select_dtypes(include=str))
# TypeError: string dtypes are not allowed, use 'object' instead

source: pandas_select_dtypes.py

Not only str but also columns whose elements are other Python built-in types, such as list or dict, have dtype object. Note that specifying object selects those columns as well.

print(df.select_dtypes(include=object))
#    d       e
# 0  X  [0, 0]
# 1  Y  [1, 1]
# 2  Z  [2, 2]

print(type(df.at[0, 'd']))
# <class 'str'>

print(type(df.at[0, 'e']))
# <class 'list'>

source: pandas_select_dtypes.py

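To select only the columns whose elements are str, apply the built-in type() function to each element of some row and check whether the result equals str. The resulting boolean Series can then be used as the column selector in loc[]. This relies on the chosen row being representative of each column.

Related article: Applying functions to elements, rows, and columns in pandas with map, apply, applymap
Related article: Getting and setting values at arbitrary positions in pandas with at, iat, loc, iloc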

print(df.iloc[0].map(type) == str)
# a    False
# b    False
# c    False
# d     True
# e    False
# f    False
# g    False
# Name: 0, dtype: bool

print(df.loc[:, df.iloc[0].map(type) == str])
#    d
# 0  X
# 1  Y
# 2  Z

source: pandas_select_dtypes.py
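Checking a single row can misjudge a column that mixes strings with other values or contains missing values. As an alternative sketch (not the approach used in the original sample file), every element of each column can be checked instead. The helper is_str_column and the DataFrame df_s are hypothetical names, and the check is slower on large data because it visits every element.

```python
import pandas as pd

# Hypothetical example data: 'h' mixes str with other values.
df_s = pd.DataFrame({'a': [1, 2, 3],
                     'd': ['X', 'Y', 'Z'],
                     'e': [[0, 0], [1, 1], [2, 2]],
                     'h': ['X', 1, None]})


def is_str_column(s):
    """Return True only if every element of the Series is a str."""
    return bool(s.map(lambda x: isinstance(x, str)).all())


# Keep the columns in which all elements are str.
str_cols = [col for col in df_s.columns if is_str_column(df_s[col])]
print(str_cols)
print(df_s[str_cols])
```

With this data, checking only the first row would also pick up column h, since its first element happens to be a string.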
