note.nkmk.me

pandas: Extract columns from pandas.DataFrame based on dtype

Posted: 2021-10-05 / Tags: Python, pandas

Each column of a pandas.DataFrame has its own data type (dtype).

To extract only the columns with a specific dtype, use the select_dtypes() method of pandas.DataFrame.

This article describes the following contents.

  • Basic usage of select_dtypes()
    • Specify dtype to extract: include
    • Specify dtype to exclude: exclude

The following pandas.DataFrame with columns of various data types is used as an example.

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 3],
                   'b': [0.4, 1.1, 0.1, 0.8],
                   'c': ['X', 'Y', 'X', 'Z'],
                   'd': [[0, 0], [0, 1], [1, 0], [1, 1]],
                   'e': [True, True, False, True]})

df['f'] = pd.to_datetime(['2018-01-01', '2018-03-15', '2018-02-20', '2018-03-15'])

print(df)
#    a    b  c       d      e          f
# 0  1  0.4  X  [0, 0]   True 2018-01-01
# 1  2  1.1  Y  [0, 1]   True 2018-03-15
# 2  1  0.1  X  [1, 0]  False 2018-02-20
# 3  3  0.8  Z  [1, 1]   True 2018-03-15

print(df.dtypes)
# a             int64
# b           float64
# c            object
# d            object
# e              bool
# f    datetime64[ns]
# dtype: object

Basic usage of select_dtypes()

Specify dtype to extract: include

Specify the dtype to extract with the include parameter.

print(df.select_dtypes(include=int))
#    a
# 0  1
# 1  2
# 2  1
# 3  3

Python built-in types such as int and float can be specified as is. You can also pass the type name as a string, such as 'int', or include the bit width explicitly, such as 'int64'. The default bit width of integers depends on the environment.

print(df.select_dtypes(include='int'))
#    a
# 0  1
# 1  2
# 2  1
# 3  3

print(df.select_dtypes(include='int64'))
#    a
# 0  1
# 1  2
# 2  1
# 3  3

If you include the bit width, a column is selected only when its bit width matches.

print(df.select_dtypes(include='int32'))
# Empty DataFrame
# Columns: []
# Index: [0, 1, 2, 3]
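To see the bit-width match from the other side, here is a minimal sketch that casts an integer column to int32 with astype() and selects again. The names df_small and df32 are made up for this illustration and are not part of the example above; the int64 dtype is set explicitly so the result does not depend on the environment.

```python
import pandas as pd

# Small frame with an explicit 64-bit integer column and a float column.
df_small = pd.DataFrame({'a': pd.Series([1, 2, 1, 3], dtype='int64'),
                         'b': [0.4, 1.1, 0.1, 0.8]})

# Cast only column 'a' to 32-bit integers; 'b' stays float64.
df32 = df_small.astype({'a': 'int32'})

cols_int32 = df32.select_dtypes(include='int32').columns.tolist()
cols_int64 = df32.select_dtypes(include='int64').columns.tolist()

print(cols_int32)
# ['a']

print(cols_int64)
# []
```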

Multiple dtypes can be specified with a list. Columns of datetime64[ns] can be selected with 'datetime'.

print(df.select_dtypes(include=[int, float, 'datetime']))
#    a    b          f
# 0  1  0.4 2018-01-01
# 1  2  1.1 2018-03-15
# 2  1  0.1 2018-02-20
# 3  3  0.8 2018-03-15
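Besides 'datetime', the pandas documentation also lists 'datetime64' and np.datetime64 as ways to select datetime columns. The sketch below compares the three on a small frame; df_dt, specs, and results are names chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# One integer column and one datetime column.
df_dt = pd.DataFrame({'a': [1, 2],
                      'f': pd.to_datetime(['2018-01-01', '2018-03-15'])})

# Three spellings that should all refer to datetime columns.
specs = ['datetime', 'datetime64', np.datetime64]
results = [df_dt.select_dtypes(include=spec).columns.tolist() for spec in specs]

print(results)
# [['f'], ['f'], ['f']]
```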

Numeric types such as int and float can be selected together with 'number'.

print(df.select_dtypes(include='number'))
#    a    b
# 0  1  0.4
# 1  2  1.1
# 2  1  0.1
# 3  3  0.8
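As a small sketch of the same selection, np.number can be passed instead of the string 'number', and .columns.tolist() returns just the column names. The frame df_num is a cut-down stand-in for the example above, not the original data. As in the output above, the bool column is not treated as numeric.

```python
import numpy as np
import pandas as pd

df_num = pd.DataFrame({'a': [1, 2],
                       'b': [0.4, 1.1],
                       'c': ['X', 'Y'],
                       'e': [True, False]})

# 'number' and np.number select the same columns.
by_string = df_num.select_dtypes(include='number').columns.tolist()
by_numpy = df_num.select_dtypes(include=np.number).columns.tolist()

print(by_string)
# ['a', 'b']

print(by_numpy)
# ['a', 'b']
```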

The dtype of a column whose elements are str is object, but an object column can also hold Python objects other than str. Note that include=object therefore also extracts columns whose elements are lists, as in the example, although such columns are uncommon.

print(df.select_dtypes(include=object))
#    c       d
# 0  X  [0, 0]
# 1  Y  [0, 1]
# 2  X  [1, 0]
# 3  Z  [1, 1]

print(type(df.at[0, 'c']))
# <class 'str'>

print(type(df.at[0, 'd']))
# <class 'list'>

Unless you deliberately store such objects, you usually do not need to worry about this, because objects other than str rarely appear as elements of a pandas.DataFrame.
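If you do need only the object columns whose elements are all str, one possible approach is to check the element types after select_dtypes(). This is a sketch, not a built-in pandas feature: holds_only_str is a hypothetical helper, and the columns are created with dtype=object explicitly so the frame has the object dtype described above.

```python
import pandas as pd

df_obj = pd.DataFrame({
    'c': pd.Series(['X', 'Y', 'X', 'Z'], dtype=object),
    'd': pd.Series([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=object),
})

obj = df_obj.select_dtypes(include=object)


def holds_only_str(col):
    """Return True if every element of the column is a str (hypothetical helper)."""
    return all(isinstance(v, str) for v in col)


# Keep only the object columns whose elements are all strings.
str_cols = [name for name in obj.columns if holds_only_str(obj[name])]

print(str_cols)
# ['c']
```

Checking every element is slow on large frames, so this is meant for occasional inspection rather than routine use.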

Specify dtype to exclude: exclude

Specify the dtype to exclude with the exclude parameter.

Multiple dtypes can be specified with a list.

print(df.select_dtypes(exclude='number'))
#    c       d      e          f
# 0  X  [0, 0]   True 2018-01-01
# 1  Y  [0, 1]   True 2018-03-15
# 2  X  [1, 0]  False 2018-02-20
# 3  Z  [1, 1]   True 2018-03-15

print(df.select_dtypes(exclude=[bool, 'datetime']))
#    a    b  c       d
# 0  1  0.4  X  [0, 0]
# 1  2  1.1  Y  [0, 1]
# 2  1  0.1  X  [1, 0]
# 3  3  0.8  Z  [1, 1]

include and exclude can be specified at the same time, but an error is raised if the same type is specified for both.

print(df.select_dtypes(include='number', exclude=int))
#      b
# 0  0.4
# 1  1.1
# 2  0.1
# 3  0.8

# print(df.select_dtypes(include=[int, bool], exclude=int))
# ValueError: include and exclude overlap on frozenset({<class 'numpy.int64'>})
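The overlap error can be caught like any other ValueError. The sketch below also shows a second case: calling select_dtypes() with neither include nor exclude raises a ValueError as well. The frame df_err and the flag names are made up for this illustration, and the exact message text may differ between pandas versions, so it is printed rather than shown.

```python
import pandas as pd

df_err = pd.DataFrame({'a': [1, 2], 'b': [0.4, 1.1], 'e': [True, False]})

# Case 1: int appears in both include and exclude.
try:
    df_err.select_dtypes(include=[int, bool], exclude=int)
    overlap_raised = False
except ValueError as exc:
    overlap_raised = True
    print(exc)

# Case 2: neither include nor exclude is given.
try:
    df_err.select_dtypes()
    empty_raised = False
except ValueError as exc:
    empty_raised = True
    print(exc)

print(overlap_raised, empty_raised)
# True True
```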