pandas dtype list and type conversion (casting) with astype

Modified: 2023-07-25 | Tags: Python, pandas

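A pandas.Series holds a single data type (dtype), while a pandas.DataFrame holds a dtype for each column.

You can specify the dtype when creating an object with a constructor or when reading data from a file such as CSV. You can also convert (cast) it with the astype() method.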

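List of the main pandas dtypes
The object type and strings
Converting (casting) dtypes with astype()
Specifying dtypes when reading a CSV file
- Specifying the same dtype for all columns
- Specifying a dtype for each column
Implicit type conversion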

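For how to select columns by dtype, see the following article.

Related: Selecting and excluding columns of specific types from a pandas.DataFrame with select_dtypes

For NumPy dtypes and astype(), see the following article.

Related: NumPy dtype list and type conversion (casting) with astype

The sample code in this article uses pandas 2.0.3. Note that behavior may differ depending on the version. NumPy is also imported.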

import pandas as pd
import numpy as np

print(pd.__version__)
# 2.0.3

source: pandas_dtype.py

List of the main pandas dtypes

The main pandas dtypes are as follows.

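Data type `dtype`	Type code	Description
`int8`	`i1`	Signed 8-bit integer
`int16`	`i2`	Signed 16-bit integer
`int32`	`i4`	Signed 32-bit integer
`int64`	`i8`	Signed 64-bit integer
`uint8`	`u1`	Unsigned 8-bit integer
`uint16`	`u2`	Unsigned 16-bit integer
`uint32`	`u4`	Unsigned 32-bit integer
`uint64`	`u8`	Unsigned 64-bit integer
`float16`	`f2`	Half-precision floating point (1 sign bit, 5 exponent bits, 10 mantissa bits)
`float32`	`f4`	Single-precision floating point (1 sign bit, 8 exponent bits, 23 mantissa bits)
`float64`	`f8`	Double-precision floating point (1 sign bit, 11 exponent bits, 52 mantissa bits)
`float128`	`f16`	Extended-precision floating point (NumPy `longdouble`; platform-dependent and unavailable on some platforms such as Windows; on x86-64 it is typically 80-bit extended precision padded to 128 bits, not IEEE quadruple precision)
`complex64`	`c8`	Complex number (real and imaginary parts are each `float32`)
`complex128`	`c16`	Complex number (real and imaginary parts are each `float64`)
`complex256`	`c32`	Complex number (real and imaginary parts are each `float128`; same platform dependence)
`bool`	`?`	Boolean (`True` or `False`)
`object`	`O`	Python object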

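The number at the end of a type name is the size in bits, while the number at the end of a type code is the size in bytes. Note that the two numbers differ for the same type: int64, for example, has the type code i8.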
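The bit/byte relationship can be checked with NumPy directly. The following is a minimal sketch; the variable names and the selection of types are chosen here for illustration only.

```python
import numpy as np

# Pairs of (type name, type code) taken from the table above.
pairs = [('int8', 'i1'), ('int64', 'i8'), ('float32', 'f4'), ('complex128', 'c16')]

for name, code in pairs:
    dt = np.dtype(code)
    # dt.name is the type name (size in bits); dt.itemsize is the size in bytes.
    print(code, dt.name, dt.itemsize, dt.itemsize * 8)

# The type name and the type code refer to the same dtype.
same_dtype = np.dtype('int64') == np.dtype('i8')
print(same_dtype)
# True
```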

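The type code ? for bool does not mean "unknown"; the character ? is literally assigned to it.

For the datetime64 type, which represents dates and times, see the following article.

Related: Processing pandas.DataFrame and Series as time-series data

When specifying a dtype as an argument of a function or method, float64, for example, can be specified in the following three ways.

Type object: np.float64
Type name as a string: 'float64'
Type code as a string: 'f8'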

s = pd.Series([0, 1, 2], dtype=np.float64)
print(s.dtype)
# float64

s = pd.Series([0, 1, 2], dtype='float64')
print(s.dtype)
# float64

s = pd.Series([0, 1, 2], dtype='f8')
print(s.dtype)
# float64

source: pandas_dtype.py

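You can also specify Python types such as int, float, and str. In this case, they are automatically converted to the equivalent dtype. Examples for Python 3 in a 64-bit environment are shown below. uint is not a Python type but is listed here for convenience.

Python type	Example of equivalent `dtype`
`int`	`int64`
`float`	`float64`
`str`	`object` (each element is a `str`)
(`uint`)	`uint64`

As an argument, you can pass either the types int and float or the strings 'int' and 'float'. uint, which is not a Python type, can only be specified as the string 'uint'.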

s = pd.Series([0, 1, 2], dtype='float')
print(s.dtype)
# float64

s = pd.Series([0, 1, 2], dtype=float)
print(s.dtype)
# float64

s = pd.Series([0, 1, 2], dtype='uint')
print(s.dtype)
# uint64

source: pandas_dtype.py

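The range of values (minimum and maximum) that each integer and floating-point type can take can be checked with np.iinfo() and np.finfo(). See the following article for details.

Related: NumPy dtype list and type conversion (casting) with astype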
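As a brief illustration of np.iinfo() and np.finfo(), the sketch below looks up the limits of a few types. The types shown are picked arbitrarily for this example.

```python
import numpy as np

# Integer types: np.iinfo() exposes the minimum and maximum values.
ii8 = np.iinfo(np.int8)
print(ii8.min, ii8.max)
# -128 127

print(np.iinfo(np.uint8).max)
# 255

# Floating-point types: np.finfo() exposes the bit width, the largest
# finite value, and the machine epsilon, among other attributes.
f32 = np.finfo(np.float32)
f64 = np.finfo(np.float64)
print(f32.bits, f32.max)
print(f64.bits, f64.eps)
```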

Note that the list above basically follows NumPy, but pandas also has data types of its own that extend it.

Essential basic functionality - dtypes — pandas 2.0.3 documentation
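As one illustration of such pandas-specific types, the sketch below uses the nullable integer type 'Int64' and the 'category' type. These two were chosen here as examples and are not otherwise covered in this article.

```python
import pandas as pd

# With the default inference, a missing value forces the integers to float64.
s_default = pd.Series([1, 2, None])
print(s_default.dtype)
# float64

# The nullable integer extension type 'Int64' (capital I) keeps integers
# and represents the missing value as pd.NA.
s_int = pd.Series([1, 2, None], dtype='Int64')
print(s_int.dtype)
# Int64

# The 'category' extension type stores repeated labels as categories.
s_cat = pd.Series(['a', 'b', 'a'], dtype='category')
print(s_cat.dtype)
# category
```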

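The object type and strings

This section describes the string type str and the object type.

Note that StringDtype was introduced in pandas 1.0.0 as a data type for strings. It may become the mainstream in the future, but it is not covered here. See the official documentation for details.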

Working with text data — pandas 2.0.3 documentation

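The special data type object

The object type is a special data type that stores pointers to Python objects. Each element may therefore have a different type.

In pandas, a Series or DataFrame column containing strings has the object dtype, but each element has its own type, and not all elements are necessarily str.

An example follows. Here, the built-in function type() is applied to each element with the map() method to check its type. np.nan represents a missing value.

Related: Applying functions to elements, rows, and columns in pandas with map, apply, applymap
Related: Getting and checking types in Python with type() and isinstance()
Related: Missing values in pandas (nan, None, pd.NA)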

s_object = pd.Series([0, 'abcde', np.nan])
print(s_object)
# 0        0
# 1    abcde
# 2      NaN
# dtype: object

print(s_object.map(type))
# 0      <class 'int'>
# 1      <class 'str'>
# 2    <class 'float'>
# dtype: object

source: pandas_dtype.py

If you specify str in astype() (described in detail later), all elements including NaN are converted to str. The dtype remains object in this case as well.

s_str_astype = s_object.astype(str)
print(s_str_astype)
# 0        0
# 1    abcde
# 2      nan
# dtype: object

print(s_str_astype.map(type))
# 0    <class 'str'>
# 1    <class 'str'>
# 2    <class 'str'>
# dtype: object

source: pandas_dtype.py

If you specify str for dtype in the constructor, NaN remains a float. Note that this is the behavior in version 2.0.3; in version 0.22.0, NaN was also converted to str.

s_str_constructor = pd.Series([0, 'abcde', np.nan], dtype=str)
print(s_str_constructor)
# 0        0
# 1    abcde
# 2      NaN
# dtype: object

print(s_str_constructor.map(type))
# 0      <class 'str'>
# 1      <class 'str'>
# 2    <class 'float'>
# dtype: object

source: pandas_dtype.py

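Note: string methods

Even when the dtype is the same object type, the results of string methods called through the str accessor differ depending on the types of the elements.

For example, str.len(), which returns the number of characters in a string, returns NaN for numeric elements.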

s_object = pd.Series([0, 'abcde', np.nan])
print(s_object)
# 0        0
# 1    abcde
# 2      NaN
# dtype: object

print(s_object.str.len())
# 0    NaN
# 1    5.0
# 2    NaN
# dtype: float64

source: pandas_dtype.py

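If the result of a string method contains NaN, some elements may not be str even though the column dtype is object. In that case, apply astype(str) before calling the string method. Note that the missing value then becomes the string 'nan', whose length is 3.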

s_str_astype = s_object.astype(str)
print(s_str_astype)
# 0        0
# 1    abcde
# 2      nan
# dtype: object

print(s_str_astype.str.len())
# 0    1
# 1    5
# 2    3
# dtype: int64

source: pandas_dtype.py
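To find out in advance whether an object column contains elements that are not str, one option is to test each element with isinstance(). This is only a sketch; the variable name is_str is introduced here for illustration.

```python
import numpy as np
import pandas as pd

s_object = pd.Series([0, 'abcde', np.nan])

# True where the element is a str, False otherwise.
is_str = s_object.map(lambda x: isinstance(x, str))
print(is_str.tolist())
# [False, True, False]

# Whether every element is a str.
print(bool(is_str.all()))
# False

# The elements that are not str.
print(s_object[~is_str].tolist())
```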

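See also the following articles on using string methods.

Related: Replacing strings, stripping whitespace, and other processing with pandas string methods
Related: Extracting rows containing a specific string in pandas (exact match, partial match)
Related: Splitting pandas strings into multiple columns by delimiter or regular expression

Note: missing values (NaN)

Missing values (NaN) can be detected with isnull() and removed with dropna().

Related: Detecting and counting missing values (NaN) in pandas
Related: Removing (excluding) missing values (NaN) with dropna in pandas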

s_object = pd.Series([0, 'abcde', np.nan])
print(s_object)
# 0        0
# 1    abcde
# 2      NaN
# dtype: object

print(s_object.map(type))
# 0      <class 'int'>
# 1      <class 'str'>
# 2    <class 'float'>
# dtype: object

print(s_object.isnull())
# 0    False
# 1    False
# 2     True
# dtype: bool

print(s_object.dropna())
# 0        0
# 1    abcde
# dtype: object

source: pandas_dtype.py

Note that casting to str turns a missing value into the string 'nan', which is no longer handled by the methods that process missing values.

s_str_astype = s_object.astype(str)
print(s_str_astype)
# 0        0
# 1    abcde
# 2      nan
# dtype: object

print(s_str_astype.map(type))
# 0    <class 'str'>
# 1    <class 'str'>
# 2    <class 'str'>
# dtype: object

print(s_str_astype.isnull())
# 0    False
# 1    False
# 2    False
# dtype: bool

print(s_str_astype.dropna())
# 0        0
# 1    abcde
# 2      nan
# dtype: object

source: pandas_dtype.py

Either handle missing values before casting, or replace the string 'nan' with a missing value using replace().

Related: Replacing element values in pandas.DataFrame and Series with replace

s_str_astype_nan = s_str_astype.replace('nan', np.nan)
print(s_str_astype_nan)
# 0        0
# 1    abcde
# 2      NaN
# dtype: object

print(s_str_astype_nan.map(type))
# 0      <class 'str'>
# 1      <class 'str'>
# 2    <class 'float'>
# dtype: object

print(s_str_astype_nan.isnull())
# 0    False
# 1    False
# 2     True
# dtype: bool

source: pandas_dtype.py
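One drawback of replacing 'nan' afterwards is that a genuine string 'nan' in the data would also be turned into a missing value. As an alternative sketch (not from the original sample code), the missing positions can be recorded before the cast and restored with mask(), or the missing values can simply be dropped first.

```python
import numpy as np
import pandas as pd

s_object = pd.Series([0, 'abcde', np.nan])

# Cast to str, then put a missing value back wherever the original was missing.
s_str_keep_nan = s_object.astype(str).mask(s_object.isnull())
print(s_str_keep_nan.isnull().tolist())
# [False, False, True]

# Alternatively, drop the missing values before casting.
s_str_dropped = s_object.dropna().astype(str)
print(s_str_dropped.tolist())
# ['0', 'abcde']
```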

Converting (casting) dtypes with astype()

The astype() method of pandas.DataFrame and pandas.Series changes the dtype.

astype() returns a new pandas.DataFrame or pandas.Series with the changed dtype; the original object is not modified.
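The following minimal sketch confirms that the original object keeps its dtype. To keep the converted result, assign the return value to a variable (possibly the same one). The variable names are chosen here for illustration.

```python
import pandas as pd

s_orig = pd.Series([1, 2, 3])
s_conv = s_orig.astype('float64')

# The returned Series has the new dtype.
print(s_conv.dtype)
# float64

# The original Series still has its integer dtype.
print(s_orig.dtype)

# Reassign to replace the original with the converted result.
s_orig = s_orig.astype('float64')
print(s_orig.dtype)
# float64
```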

Changing the dtype of a pandas.Series

Passing a dtype to the astype() method of a pandas.Series returns a new pandas.Series converted to that dtype.

s = pd.Series([1, 2, 3])
print(s)
# 0    1
# 1    2
# 2    3
# dtype: int64

s_f = s.astype('float64')
print(s_f)
# 0    1.0
# 1    2.0
# 2    3.0
# dtype: float64

source: pandas_astype.py

As described above, the dtype can be specified in various forms.

s_f = s.astype('float')
print(s_f.dtype)
# float64

s_f = s.astype(float)
print(s_f.dtype)
# float64

s_f = s.astype('f8')
print(s_f.dtype)
# float64

source: pandas_astype.py

Changing the dtype of an entire pandas.DataFrame at once

A pandas.DataFrame holds a dtype for each column.

The dtype of each column can be obtained and checked with the dtypes attribute.

df = pd.DataFrame({'a': [11, 21, 31], 'b': [12, 22, 32], 'c': [13, 23, 33]})
print(df)
#     a   b   c
# 0  11  12  13
# 1  21  22  23
# 2  31  32  33

print(df.dtypes)
# a    int64
# b    int64
# c    int64
# dtype: object

source: pandas_astype.py

Passing a dtype to the astype() method of a pandas.DataFrame returns a new pandas.DataFrame in which the dtype of every column has been changed.

df_f = df.astype('float64')
print(df_f)
#       a     b     c
# 0  11.0  12.0  13.0
# 1  21.0  22.0  23.0
# 2  31.0  32.0  33.0

print(df_f.dtypes)
# a    float64
# b    float64
# c    float64
# dtype: object

source: pandas_astype.py

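Changing the dtype of individual columns of a pandas.DataFrame

Passing a dictionary of the form {column name: data type} to astype() changes the dtype of individual columns.

You can specify a single column or multiple columns.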

df = pd.DataFrame({'a': [11, 21, 31], 'b': [12, 22, 32], 'c': [13, 23, 33]})
print(df)
#     a   b   c
# 0  11  12  13
# 1  21  22  23
# 2  31  32  33

print(df.dtypes)
# a    int64
# b    int64
# c    int64
# dtype: object

df_fcol = df.astype({'a': float})
print(df_fcol)
#       a   b   c
# 0  11.0  12  13
# 1  21.0  22  23
# 2  31.0  32  33

print(df_fcol.dtypes)
# a    float64
# b      int64
# c      int64
# dtype: object

df_fcol2 = df.astype({'a': 'float32', 'c': 'int8'})
print(df_fcol2)
#       a   b   c
# 0  11.0  12  13
# 1  21.0  22  23
# 2  31.0  32  33

print(df_fcol2.dtypes)
# a    float32
# b      int64
# c       int8
# dtype: object

source: pandas_astype.py
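The keys of the dictionary must be existing column names. As far as I know, a key that is not a column raises a KeyError, which the sketch below demonstrates; the column name 'x' is a deliberately nonexistent name used only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [11, 21, 31], 'b': [12, 22, 32], 'c': [13, 23, 33]})

raised = False
try:
    # 'x' is not a column of df, so the mapping is rejected.
    df.astype({'x': float})
except KeyError as e:
    raised = True
    print('KeyError:', e)

# Filtering the mapping by the existing columns avoids the error.
mapping = {'a': 'float32', 'x': 'float32'}
valid = {col: dt for col, dt in mapping.items() if col in df.columns}
df_valid = df.astype(valid)
print(df_valid.dtypes)
```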

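Specifying dtypes when reading a CSV file

In pandas, CSV files can be read with the read_csv() function. Its dtype argument lets you specify the data types.

Related: Reading CSV/TSV files with pandas (read_csv, read_table)

The following CSV file is used as an example.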

,a,b,c,d
ONE,1,"001",100,x
TWO,2,"020",,y
THREE,3,"300",300,z

source: sample_header_index_dtype.csv

If the dtype argument is omitted, a data type is selected automatically for each column.

df = pd.read_csv('data/src/sample_header_index_dtype.csv', index_col=0)
print(df)
#        a    b      c  d
# ONE    1    1  100.0  x
# TWO    2   20    NaN  y
# THREE  3  300  300.0  z

print(df.dtypes)
# a      int64
# b      int64
# c    float64
# d     object
# dtype: object

source: pandas_read_csv_dtype.py

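Specifying the same dtype for all columns

If you pass a single data type to the dtype argument, all columns, including the column specified by index_col, are converted to that type.

An error is raised if any column cannot be converted to the specified data type.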

# pd.read_csv('data/src/sample_header_index_dtype.csv',
#             index_col=0, dtype=float)
# ValueError: could not convert string to float: 'ONE'

source: pandas_read_csv_dtype.py

With dtype=str, missing values (NaN) are not converted to str.

df_str = pd.read_csv('data/src/sample_header_index_dtype.csv',
                     index_col=0, dtype=str)
print(df_str)
#        a    b    c  d
# ONE    1  001  100  x
# TWO    2  020  NaN  y
# THREE  3  300  300  z

print(df_str.dtypes)
# a    object
# b    object
# c    object
# d    object
# dtype: object

print(df_str.applymap(type))
#                    a              b                c              d
# ONE    <class 'str'>  <class 'str'>    <class 'str'>  <class 'str'>
# TWO    <class 'str'>  <class 'str'>  <class 'float'>  <class 'str'>
# THREE  <class 'str'>  <class 'str'>    <class 'str'>  <class 'str'>

source: pandas_read_csv_dtype.py

If you read the file without specifying dtype and then cast to str with astype(), missing values are also converted to the string 'nan'.

df = pd.read_csv('data/src/sample_header_index_dtype.csv', index_col=0)
print(df.astype(str))
#        a    b      c  d
# ONE    1    1  100.0  x
# TWO    2   20    nan  y
# THREE  3  300  300.0  z

print(df.astype(str).applymap(type))
#                    a              b              c              d
# ONE    <class 'str'>  <class 'str'>  <class 'str'>  <class 'str'>
# TWO    <class 'str'>  <class 'str'>  <class 'str'>  <class 'str'>
# THREE  <class 'str'>  <class 'str'>  <class 'str'>  <class 'str'>

source: pandas_read_csv_dtype.py

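Specifying a dtype for each column

As with astype(), the dtype argument of read_csv() also accepts a dictionary to specify types per column. Columns that are not specified get the automatically selected type.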

df_col = pd.read_csv('data/src/sample_header_index_dtype.csv',
                     index_col=0, dtype={'a': float, 'b': str})
print(df_col)
#          a    b      c  d
# ONE    1.0  001  100.0  x
# TWO    2.0  020    NaN  y
# THREE  3.0  300  300.0  z

print(df_col.dtypes)
# a    float64
# b     object
# c    float64
# d     object
# dtype: object

source: pandas_read_csv_dtype.py
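A typical reason to specify str for a column is to preserve leading zeros. The sketch below reproduces the sample file as an in-memory string with io.StringIO (an assumption made here so that the example runs without the file under data/src/) and compares column b with and without the dtype specification.

```python
import io

import pandas as pd

# Same content as sample_header_index_dtype.csv, embedded as a string.
csv_text = ''',a,b,c,d
ONE,1,"001",100,x
TWO,2,"020",,y
THREE,3,"300",300,z
'''

# Without dtype, column b is parsed as integers and the leading zeros are lost.
df_auto = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_auto['b'].tolist())
# [1, 20, 300]

# With dtype={'b': str}, the values are kept as the original strings.
df_b_str = pd.read_csv(io.StringIO(csv_text), index_col=0, dtype={'b': str})
print(df_b_str['b'].tolist())
# ['001', '020', '300']
```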

Columns can also be specified by column number. Note that when an index column is specified, the column numbers must count the index column as well.

df_col = pd.read_csv('data/src/sample_header_index_dtype.csv',
                     index_col=0, dtype={1: float, 2: str})
print(df_col)
#          a    b      c  d
# ONE    1.0  001  100.0  x
# TWO    2.0  020    NaN  y
# THREE  3.0  300  300.0  z

print(df_col.dtypes)
# a    float64
# b     object
# c    float64
# d     object
# dtype: object

source: pandas_read_csv_dtype.py

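Implicit type conversion

In addition to explicit type conversion with astype(), implicit type conversion may occur as a result of operations and the like.

A pandas.DataFrame with a column of integers (int) and a column of floating-point numbers (float) is used as an example.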

df_mix = pd.DataFrame({'col_int': [0, 1, 2], 'col_float': [0.0, 0.1, 0.2]}, index=['A', 'B', 'C'])
print(df_mix)
#    col_int  col_float
# A        0        0.0
# B        1        0.1
# C        2        0.2

print(df_mix.dtypes)
# col_int        int64
# col_float    float64
# dtype: object

source: pandas_implicit_type_conversion.py

Implicit type conversion by arithmetic operations

For example, adding an int column and a float column with the + operator yields float.

print(df_mix['col_int'] + df_mix['col_float'])
# A    0.0
# B    1.1
# C    2.2
# dtype: float64

source: pandas_implicit_type_conversion.py

Implicit type conversion also occurs in operations with scalar values. The result of division with the / operator is float.

print(df_mix / 1)
#    col_int  col_float
# A      0.0        0.0
# B      1.0        0.1
# C      2.0        0.2

print((df_mix / 1).dtypes)
# col_int      float64
# col_float    float64
# dtype: object

source: pandas_implicit_type_conversion.py

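With +, -, *, //, and **, the result is int if both operands are int, and float if a float is involved. This follows the implicit type conversion of NumPy arrays (ndarray).

Related: NumPy dtype list and type conversion (casting) with astype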

print(df_mix * 1)
#    col_int  col_float
# A        0        0.0
# B        1        0.1
# C        2        0.2

print((df_mix * 1).dtypes)
# col_int        int64
# col_float    float64
# dtype: object

print(df_mix * 1.0)
#    col_int  col_float
# A      0.0        0.0
# B      1.0        0.1
# C      2.0        0.2

print((df_mix * 1.0).dtypes)
# col_int      float64
# col_float    float64
# dtype: object

source: pandas_implicit_type_conversion.py
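Since the behavior follows NumPy, the resulting type for a pair of dtypes can be looked up with np.result_type(). The sketch below also checks one case with pandas.Series; the combinations were picked here for illustration.

```python
import numpy as np
import pandas as pd

# The dtype NumPy selects when combining two dtypes.
print(np.result_type(np.int64, np.float64))
# float64
print(np.result_type(np.int8, np.int64))
# int64

# The same promotion applies to arithmetic between pandas.Series.
s_i8 = pd.Series([1, 2, 3], dtype='int8')
s_i64 = pd.Series([1, 2, 3], dtype='int64')
s_f64 = pd.Series([0.5, 0.5, 0.5], dtype='float64')

print((s_i8 + s_i64).dtype)
# int64
print((s_i64 * s_f64).dtype)
# float64
```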

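Implicit type conversion by row retrieval and transposition

Type conversion may also occur when retrieving a single row as a pandas.Series with loc or iloc, or when transposing with T or transpose(). In the example, the int elements are converted to float.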

print(df_mix.loc['A'])
# col_int      0.0
# col_float    0.0
# Name: A, dtype: float64

print(df_mix.T)
#              A    B    C
# col_int    0.0  1.0  2.0
# col_float  0.0  0.1  0.2

print(df_mix.T.dtypes)
# A    float64
# B    float64
# C    float64
# dtype: object

source: pandas_implicit_type_conversion.py

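This is because row retrieval or transposition would otherwise produce a column or pandas.Series containing elements of multiple types. A pandas.Series and each column of a pandas.DataFrame must have a single dtype, so type conversion occurs.

See the following articles for details.

Related: Getting and setting values at arbitrary positions in pandas with at, iat, loc, iloc
Related: Swapping rows and columns of a pandas.DataFrame (transpose)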
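If the original type of a value matters, one way to avoid the conversion is to narrow the selection to columns of a single dtype before taking the row. A minimal sketch follows, using the same data as df_mix above and variable names introduced only for illustration.

```python
import pandas as pd

df_mix = pd.DataFrame({'col_int': [0, 1, 2], 'col_float': [0.0, 0.1, 0.2]},
                      index=['A', 'B', 'C'])

# Taking a row across int and float columns converts everything to float.
row_all = df_mix.loc['A']
print(row_all.dtype)
# float64

# Selecting only the integer column first keeps the integer dtype.
row_int = df_mix[['col_int']].loc['A']
print(row_int.dtype)

# Selecting the column first and then the row also avoids the row-wise conversion.
value = df_mix['col_int'].loc['A']
print(value)
# 0
```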

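Implicit type conversion by assignment to elements

Type conversion may also occur when assigning a value to an element.

For example, assigning a float element to an int column converts that column to float. Assigning an int element to a float column converts that element to float.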

df_mix.at['A', 'col_int'] = 10.1
df_mix.at['A', 'col_float'] = 10
print(df_mix)
#    col_int  col_float
# A     10.1       10.0
# B      1.0        0.1
# C      2.0        0.2

print(df_mix.dtypes)
# col_int      float64
# col_float    float64
# dtype: object

source: pandas_implicit_type_conversion.py

Assigning a string element to a numeric column changes the column to object, with each element having its own type.

df_mix.at['A', 'col_float'] = 'abc'
print(df_mix)
#    col_int col_float
# A     10.1       abc
# B      1.0       0.1
# C      2.0       0.2

print(df_mix.dtypes)
# col_int      float64
# col_float     object
# dtype: object

print(df_mix.applymap(type))
#            col_int        col_float
# A  <class 'float'>    <class 'str'>
# B  <class 'float'>  <class 'float'>
# C  <class 'float'>  <class 'float'>

source: pandas_implicit_type_conversion.py

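Note that the sample code above shows the results in version 2.0.3. In version 0.22.0, assigning an element of a different type did not convert the column's type; instead, the type of the assigned element was changed. Be aware that behavior may differ between versions.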
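Because this behavior has varied between versions, one way to avoid depending on implicit conversion is to cast the column explicitly with astype() before assigning. The following is only a sketch of that approach, not part of the original sample code.

```python
import pandas as pd

df_mix = pd.DataFrame({'col_int': [0, 1, 2], 'col_float': [0.0, 0.1, 0.2]},
                      index=['A', 'B', 'C'])

# Convert the int column to float explicitly, then assign the float value.
df_mix['col_int'] = df_mix['col_int'].astype('float64')
df_mix.at['A', 'col_int'] = 10.1

print(df_mix['col_int'].tolist())
# [10.1, 1.0, 2.0]
print(df_mix.dtypes)
```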
