pandas: Select columns by dtype with select_dtypes()

Modified: | Tags: Python, pandas

In pandas, each column of a DataFrame has a specific data type (dtype). To select columns based on their data types, use the select_dtypes() method. For example, you can extract only numerical columns.

For more details on data types (dtype) in pandas, see the following article.

To extract columns based on conditions for column names rather than data types, see the following article.

The pandas and NumPy versions used in this article are as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.

import pandas as pd
import numpy as np

print(pd.__version__)
# 2.1.4

print(np.__version__)
# 1.26.2

df = pd.DataFrame({'a': [1, 2, 3],
                   'b': np.array([10, 20, 30], dtype=np.int32),
                   'c': [0.1, 0.2, 0.3],
                   'd': ['X', 'Y', 'Z'],
                   'e': [[0, 0], [1, 1], [2, 2]],
                   'f': [True, True, False],
                   'g': pd.to_datetime(['2023-12-01', '2023-12-02', '2023-12-03'])})
print(df)
#    a   b    c  d       e      f          g
# 0  1  10  0.1  X  [0, 0]   True 2023-12-01
# 1  2  20  0.2  Y  [1, 1]   True 2023-12-02
# 2  3  30  0.3  Z  [2, 2]  False 2023-12-03

print(df.dtypes)
# a             int64
# b             int32
# c           float64
# d            object
# e            object
# f              bool
# g    datetime64[ns]
# dtype: object

Basic usage of select_dtypes()

Specify data types to include: include

Use the include argument to specify the data types to include. You can specify data types with type objects or strings. Details are explained later.

print(df.select_dtypes(include=int))
#    a   b
# 0  1  10
# 1  2  20
# 2  3  30

You can specify multiple data types in a list.

print(df.select_dtypes(include=['int32', bool]))
#     b      f
# 0  10   True
# 1  20   True
# 2  30  False

If a column of the specified data type does not exist, an empty DataFrame is returned.

print(df.select_dtypes(include='float32'))
# Empty DataFrame
# Columns: []
# Index: [0, 1, 2]

You can use 'number' to extract only numeric columns.

print(df.select_dtypes(include='number'))
#    a   b    c
# 0  1  10  0.1
# 1  2  20  0.2
# 2  3  30  0.3

Specify data types to exclude: exclude

Use the exclude argument to specify the data types to exclude. Multiple data types can be specified in a list.

print(df.select_dtypes(exclude=int))
#      c  d       e      f          g
# 0  0.1  X  [0, 0]   True 2023-12-01
# 1  0.2  Y  [1, 1]   True 2023-12-02
# 2  0.3  Z  [2, 2]  False 2023-12-03

print(df.select_dtypes(exclude=['int32', bool]))
#    a    c  d       e          g
# 0  1  0.1  X  [0, 0] 2023-12-01
# 1  2  0.2  Y  [1, 1] 2023-12-02
# 2  3  0.3  Z  [2, 2] 2023-12-03

include and exclude can be specified at the same time, but specifying the same type will result in an error.

print(df.select_dtypes(include='number', exclude='int32'))
#    a    c
# 0  1  0.1
# 1  2  0.2
# 2  3  0.3

# print(df.select_dtypes(include=['int32', bool], exclude='int32'))
# ValueError: include and exclude overlap on frozenset({<class 'numpy.int32'>})

How to specify data types in select_dtypes()

In select_dtypes(), data types can be specified with type objects such as int or np.int64, or with type name/type code strings like 'int64' or 'i8'.

print(df.select_dtypes(include=['i8', 'int32', np.float64]))
#    a   b    c
# 0  1  10  0.1
# 1  2  20  0.2
# 2  3  30  0.3

In addition, data types can be specified as follows. 'number' is useful for specifying all numeric types at once.

  • To select all numeric types, use np.number or 'number'
  • To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
  • To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
  • To select Pandas categorical dtypes, use 'category'
  • To select Pandas datetimetz dtypes, use 'datetimetz' or 'datetime64[ns, tz]' pandas.DataFrame.select_dtypes — pandas 2.1.4 documentation
print(df.select_dtypes(include=['number', 'datetime']))
#    a   b    c          g
# 0  1  10  0.1 2023-12-01
# 1  2  20  0.2 2023-12-02
# 2  3  30  0.3 2023-12-03

Caution when specifying string columns in select_dtypes()

Since the data type of columns containing strings (str) is object, specifying str or 'str' in select_dtypes() will cause an error.

# print(df.select_dtypes(include=str))
# TypeError: string dtypes are not allowed, use 'object' instead

Columns containing not only str but also other Python built-in types such as list or dict are categorized as object type. Note that if object is specified in select_dtypes(), these columns will also be selected.

print(df.select_dtypes(include=object))
#    d       e
# 0  X  [0, 0]
# 1  Y  [1, 1]
# 2  Z  [2, 2]

print(type(df.at[0, 'd']))
# <class 'str'>

print(type(df.at[0, 'e']))
# <class 'list'>

To exclusively extract columns with str elements, apply the built-in type() function to each element in a row and check for a match with str. The resulting boolean index can then be used in loc[] for column selection.

print(df.iloc[0].map(type) == str)
# a    False
# b    False
# c    False
# d     True
# e    False
# f    False
# g    False
# Name: 0, dtype: bool

print(df.loc[:, df.iloc[0].map(type) == str])
#    d
# 0  X
# 1  Y
# 2  Z

Related Categories

Related Articles