pandas: Select columns by dtype with select_dtypes()
In pandas, each column of a DataFrame
has a specific data type (dtype
). To select columns based on their data types, use the select_dtypes()
method. For example, you can extract only numerical columns.
For more details on data types (dtype
) in pandas, see the following article.
To extract columns based on conditions for column names rather than data types, see the following article.
The pandas and NumPy versions used in this article are as follows. Note that functionality may vary between versions. The following DataFrame
is used as an example.
import pandas as pd
import numpy as np
print(pd.__version__)
# 2.1.4
print(np.__version__)
# 1.26.2
df = pd.DataFrame({'a': [1, 2, 3],
'b': np.array([10, 20, 30], dtype=np.int32),
'c': [0.1, 0.2, 0.3],
'd': ['X', 'Y', 'Z'],
'e': [[0, 0], [1, 1], [2, 2]],
'f': [True, True, False],
'g': pd.to_datetime(['2023-12-01', '2023-12-02', '2023-12-03'])})
print(df)
# a b c d e f g
# 0 1 10 0.1 X [0, 0] True 2023-12-01
# 1 2 20 0.2 Y [1, 1] True 2023-12-02
# 2 3 30 0.3 Z [2, 2] False 2023-12-03
print(df.dtypes)
# a int64
# b int32
# c float64
# d object
# e object
# f bool
# g datetime64[ns]
# dtype: object
Basic usage of select_dtypes()
Specify data types to include: include
Use the include
argument to specify the data types to include. You can specify data types with type objects or strings. Details are explained later.
print(df.select_dtypes(include=int))
# a b
# 0 1 10
# 1 2 20
# 2 3 30
You can specify multiple data types in a list.
print(df.select_dtypes(include=['int32', bool]))
# b f
# 0 10 True
# 1 20 True
# 2 30 False
If a column of the specified data type does not exist, an empty DataFrame
is returned.
print(df.select_dtypes(include='float32'))
# Empty DataFrame
# Columns: []
# Index: [0, 1, 2]
You can use 'number'
to extract only numeric columns.
print(df.select_dtypes(include='number'))
# a b c
# 0 1 10 0.1
# 1 2 20 0.2
# 2 3 30 0.3
Specify data types to exclude: exclude
Use the exclude
argument to specify the data types to exclude. Multiple data types can be specified in a list.
print(df.select_dtypes(exclude=int))
# c d e f g
# 0 0.1 X [0, 0] True 2023-12-01
# 1 0.2 Y [1, 1] True 2023-12-02
# 2 0.3 Z [2, 2] False 2023-12-03
print(df.select_dtypes(exclude=['int32', bool]))
# a c d e g
# 0 1 0.1 X [0, 0] 2023-12-01
# 1 2 0.2 Y [1, 1] 2023-12-02
# 2 3 0.3 Z [2, 2] 2023-12-03
include
and exclude
can be specified at the same time, but specifying the same type will result in an error.
print(df.select_dtypes(include='number', exclude='int32'))
# a c
# 0 1 0.1
# 1 2 0.2
# 2 3 0.3
# print(df.select_dtypes(include=['int32', bool], exclude='int32'))
# ValueError: include and exclude overlap on frozenset({<class 'numpy.int32'>})
How to specify data types in select_dtypes()
In select_dtypes()
, data types can be specified with type objects such as int
or np.int64
, or with type name/type code strings like 'int64'
or 'i8'
.
print(df.select_dtypes(include=['i8', 'int32', np.float64]))
# a b c
# 0 1 10 0.1
# 1 2 20 0.2
# 2 3 30 0.3
In addition, data types can be specified as follows. 'number'
is useful for specifying all numeric types at once.
- To select all numeric types, use
np.number
or'number'
- To select datetimes, use
np.datetime64
,'datetime'
or'datetime64'
- To select timedeltas, use
np.timedelta64
,'timedelta'
or'timedelta64'
- To select Pandas categorical dtypes, use
'category'
- To select Pandas datetimetz dtypes, use
'datetimetz'
or'datetime64[ns, tz]'
pandas.DataFrame.select_dtypes — pandas 2.1.4 documentation
print(df.select_dtypes(include=['number', 'datetime']))
# a b c g
# 0 1 10 0.1 2023-12-01
# 1 2 20 0.2 2023-12-02
# 2 3 30 0.3 2023-12-03
Caution when specifying string columns in select_dtypes()
Since the data type of columns containing strings (str
) is object
, specifying str
or 'str'
in select_dtypes()
will cause an error.
# print(df.select_dtypes(include=str))
# TypeError: string dtypes are not allowed, use 'object' instead
Columns containing not only str
but also other Python built-in types such as list
or dict
are categorized as object
type. Note that if object
is specified in select_dtypes()
, these columns will also be selected.
print(df.select_dtypes(include=object))
# d e
# 0 X [0, 0]
# 1 Y [1, 1]
# 2 Z [2, 2]
print(type(df.at[0, 'd']))
# <class 'str'>
print(type(df.at[0, 'e']))
# <class 'list'>
To exclusively extract columns with str
elements, apply the built-in type()
function to each element in a row and check for a match with str
. The resulting boolean index can then be used in loc[]
for column selection.
- pandas: Apply functions to values, rows, columns with map(), apply()
- pandas: Get/Set values with loc, iloc, at, iat
print(df.iloc[0].map(type) == str)
# a False
# b False
# c False
# d True
# e False
# f False
# g False
# Name: 0, dtype: bool
print(df.loc[:, df.iloc[0].map(type) == str])
# d
# 0 X
# 1 Y
# 2 Z