pandas: Select columns by dtype with select_dtypes()
In pandas, each column of a DataFrame has a specific data type (dtype). To select columns based on their data types, use the select_dtypes() method. For example, you can extract only numerical columns.
The pandas and NumPy versions used in this article are as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.
import pandas as pd
import numpy as np
print(pd.__version__)
# 2.1.4
print(np.__version__)
# 1.26.2
df = pd.DataFrame({'a': [1, 2, 3],
                   'b': np.array([10, 20, 30], dtype=np.int32),
                   'c': [0.1, 0.2, 0.3],
                   'd': ['X', 'Y', 'Z'],
                   'e': [[0, 0], [1, 1], [2, 2]],
                   'f': [True, True, False],
                   'g': pd.to_datetime(['2023-12-01', '2023-12-02', '2023-12-03'])})
print(df)
#    a   b    c  d       e      f          g
# 0  1  10  0.1  X  [0, 0]   True 2023-12-01
# 1  2  20  0.2  Y  [1, 1]   True 2023-12-02
# 2  3  30  0.3  Z  [2, 2]  False 2023-12-03
print(df.dtypes)
# a             int64
# b             int32
# c           float64
# d            object
# e            object
# f              bool
# g    datetime64[ns]
# dtype: object
Basic usage of select_dtypes()
Specify data types to include: include
Use the include argument to specify the data types to include. You can specify data types with type objects or strings. Details are explained later.
print(df.select_dtypes(include=int))
# a b
# 0 1 10
# 1 2 20
# 2 3 30
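The output above suggests that the built-in int is treated as a generic integer that matches both int32 and int64, whereas a width-specific string such as 'int64' matches only that dtype. The following is a minimal sketch to check this on a small frame. The names df_int, int_any, and int64_only are illustrative and not part of pandas.

```python
import numpy as np
import pandas as pd

# Small frame with explicit integer widths (illustrative data)
df_int = pd.DataFrame({
    'a': np.array([1, 2, 3], dtype=np.int64),
    'b': np.array([10, 20, 30], dtype=np.int32),
    'c': [0.1, 0.2, 0.3],
})

# Built-in int: selects integer columns of both widths
int_any = df_int.select_dtypes(include=int).columns.tolist()
print(int_any)
# ['a', 'b']

# Width-specific string: selects only the int64 column
int64_only = df_int.select_dtypes(include='int64').columns.tolist()
print(int64_only)
# ['a']
```

If only one integer width is wanted, the width-specific name is the safer choice.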
You can specify multiple data types in a list.
print(df.select_dtypes(include=['int32', bool]))
# b f
# 0 10 True
# 1 20 True
# 2 30 False
If no column has the specified data type, an empty DataFrame is returned. It has no columns but keeps the original index.
print(df.select_dtypes(include='float32'))
# Empty DataFrame
# Columns: []
# Index: [0, 1, 2]
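Since an empty result is returned rather than an error, code that depends on the selection may want to check for it explicitly. A minimal sketch using the empty and shape attributes (the frame and the name selected are illustrative):

```python
import pandas as pd

df_small = pd.DataFrame({'a': [1, 2, 3], 'c': [0.1, 0.2, 0.3]})

# No float32 column exists, so the result has zero columns
selected = df_small.select_dtypes(include='float32')

print(selected.empty)
# True
print(selected.shape)
# (3, 0)

if selected.empty:
    print('no matching columns')
# no matching columns
```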
You can use 'number' to extract only numeric columns.
print(df.select_dtypes(include='number'))
# a b c
# 0 1 10 0.1
# 1 2 20 0.2
# 2 3 30 0.3
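A common follow-up is to take only the names of the numeric columns and reuse them elsewhere. The sketch below assumes a reduced version of the example DataFrame. The names df_num, numeric_cols, and means are illustrative. Note that the bool column is not treated as 'number', which matches the output above.

```python
import numpy as np
import pandas as pd

df_num = pd.DataFrame({
    'a': [1, 2, 3],
    'b': np.array([10, 20, 30], dtype=np.int32),
    'c': [0.1, 0.2, 0.3],
    'd': ['X', 'Y', 'Z'],
    'f': [True, True, False],
})

# Column labels of the numeric columns as a plain list
numeric_cols = df_num.select_dtypes(include='number').columns.tolist()
print(numeric_cols)
# ['a', 'b', 'c']

# Reuse the labels, e.g. to compute per-column means on numeric data only
means = df_num[numeric_cols].mean()
print(means['a'], means['b'])
# 2.0 20.0
```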
Specify data types to exclude: exclude
Use the exclude argument to specify the data types to exclude. Multiple data types can be specified in a list.
print(df.select_dtypes(exclude=int))
# c d e f g
# 0 0.1 X [0, 0] True 2023-12-01
# 1 0.2 Y [1, 1] True 2023-12-02
# 2 0.3 Z [2, 2] False 2023-12-03
print(df.select_dtypes(exclude=['int32', bool]))
# a c d e g
# 0 1 0.1 X [0, 0] 2023-12-01
# 1 2 0.2 Y [1, 1] 2023-12-02
# 2 3 0.3 Z [2, 2] 2023-12-03
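One way to read exclude is as "drop the columns that include would have selected". The following sketch compares the two approaches on a small frame. The names df_ex, by_exclude, and by_drop are illustrative.

```python
import pandas as pd

df_ex = pd.DataFrame({
    'a': [1, 2, 3],
    'c': [0.1, 0.2, 0.3],
    'd': ['X', 'Y', 'Z'],
    'f': [True, True, False],
})

# Directly exclude numeric columns
by_exclude = df_ex.select_dtypes(exclude='number')

# Equivalent here: select the numeric columns, then drop them by label
by_drop = df_ex.drop(columns=df_ex.select_dtypes(include='number').columns)

print(by_exclude.columns.tolist())
# ['d', 'f']
print(by_exclude.equals(by_drop))
# True
```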
include and exclude can be specified at the same time, but specifying the same type in both raises a ValueError.
print(df.select_dtypes(include='number', exclude='int32'))
# a c
# 0 1 0.1
# 1 2 0.2
# 2 3 0.3
# print(df.select_dtypes(include=['int32', bool], exclude='int32'))
# ValueError: include and exclude overlap on frozenset({<class 'numpy.int32'>})
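If the type lists come from user input or configuration, the overlap error can be caught instead of letting it propagate. A minimal sketch (the frame and the name message are illustrative):

```python
import numpy as np
import pandas as pd

df_ov = pd.DataFrame({
    'a': [1, 2, 3],
    'b': np.array([10, 20, 30], dtype=np.int32),
    'f': [True, True, False],
})

message = None
try:
    # 'int32' appears in both include and exclude
    df_ov.select_dtypes(include=['int32', bool], exclude='int32')
except ValueError as e:
    message = str(e)

# The message names the overlapping type(s)
print(message is not None and 'overlap' in message)
# True
```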
How to specify data types in select_dtypes()
In select_dtypes(), data types can be specified with type objects such as int or np.int64, or with type name/type code strings like 'int64' or 'i8'.
print(df.select_dtypes(include=['i8', 'int32', np.float64]))
# a b c
# 0 1 10 0.1
# 1 2 20 0.2
# 2 3 30 0.3
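The type objects, type names, and type codes above are different spellings of the same NumPy dtypes. The sketch below only illustrates that naming with np.dtype(). It is not a description of how select_dtypes() resolves its arguments internally, since special strings such as 'number' are handled by pandas itself.

```python
import numpy as np

# Three spellings of the same 64-bit integer dtype
specs = ['i8', 'int64', np.int64]
all_same = all(np.dtype(s) == np.dtype('int64') for s in specs)
print(all_same)
# True

# Type codes are a kind character plus the size in bytes
print(np.dtype('i4').name)
# int32
print(np.dtype('f8').name)
# float64
```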
In addition, data types can be specified as follows. 'number' is useful for specifying all numeric types at once.
- To select all numeric types, use np.number or 'number'
- To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
- To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
- To select Pandas categorical dtypes, use 'category'
- To select Pandas datetimetz dtypes, use 'datetimetz' or 'datetime64[ns, tz]'
Source: pandas.DataFrame.select_dtypes, pandas 2.1.4 documentation
print(df.select_dtypes(include=['number', 'datetime']))
# a b c g
# 0 1 10 0.1 2023-12-01
# 1 2 20 0.2 2023-12-02
# 2 3 30 0.3 2023-12-03
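The example DataFrame has no categorical, timedelta, or timezone-aware columns, so here is a small sketch with a separate, hypothetical frame (df_kinds and its column names are made up for illustration). Under these assumptions, 'datetime' picks up only the timezone-naive column, while 'datetimetz' picks up the timezone-aware one.

```python
import pandas as pd

dates = ['2023-12-01', '2023-12-02', '2023-12-03']

df_kinds = pd.DataFrame({
    'cat': pd.Categorical(['x', 'y', 'x']),
    'delta': pd.to_timedelta([1, 2, 3], unit='D'),
    'naive': pd.to_datetime(dates),
    'aware': pd.to_datetime(dates).tz_localize('UTC'),
})

def cols(kind):
    # Helper for this sketch: column labels selected by one dtype spec
    return df_kinds.select_dtypes(include=kind).columns.tolist()

print(cols('category'))
# ['cat']
print(cols('timedelta'))
# ['delta']
print(cols('datetime'))
# ['naive']
print(cols('datetimetz'))
# ['aware']
```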
Caution when specifying string columns in select_dtypes()
By default, columns containing strings (str) have the data type object, so specifying str or 'str' in select_dtypes() raises an error.
# print(df.select_dtypes(include=str))
# TypeError: string dtypes are not allowed, use 'object' instead
Columns holding other Python built-in types such as list or dict are also stored as object, not only columns of str. Note that if object is specified in select_dtypes(), these columns are selected as well.
print(df.select_dtypes(include=object))
# d e
# 0 X [0, 0]
# 1 Y [1, 1]
# 2 Z [2, 2]
print(type(df.at[0, 'd']))
# <class 'str'>
print(type(df.at[0, 'e']))
# <class 'list'>
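To extract only the columns whose elements are str, apply the built-in type() function to each element of a row and compare the result with str. The resulting boolean Series can then be used in loc[] for column selection. Note that this inspects only one row (the first row in the example below), so it assumes every element in a column has the same type.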
- pandas: Apply functions to values, rows, columns with map(), apply()
- pandas: Get/Set values with loc, iloc, at, iat
print(df.iloc[0].map(type) == str)
# a False
# b False
# c False
# d True
# e False
# f False
# g False
# Name: 0, dtype: bool
print(df.loc[:, df.iloc[0].map(type) == str])
# d
# 0 X
# 1 Y
# 2 Z
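The first-row check can be misled by a column that mixes strings with other types. As a more defensive variant, every element of each column can be tested with isinstance(). This is only a sketch: the helper all_str, the frame df_mix, and the mixed column 'm' are hypothetical additions for illustration, and the element-wise check may be slow on large data.

```python
import pandas as pd

df_mix = pd.DataFrame({
    'd': ['X', 'Y', 'Z'],
    'e': [[0, 0], [1, 1], [2, 2]],
    'm': ['X', 1, 2],  # hypothetical column mixing str and int
    'n': [1, 2, 3],
})

def all_str(col):
    # True only if every element of the column is a str
    return bool(col.map(lambda v: isinstance(v, str)).all())

# Checking only the first row also picks up the mixed column 'm'
first_row_cols = [c for c in df_mix.columns if type(df_mix[c].iloc[0]) is str]
print(first_row_cols)
# ['d', 'm']

# Checking every element keeps only the pure string column
str_cols = [c for c in df_mix.columns if all_str(df_mix[c])]
print(str_cols)
# ['d']

print(df_mix[str_cols].columns.tolist())
# ['d']
```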