pandas.DataFrame: structure and how to create one

Modified: 2023-07-23 | Tags: Python, pandas

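pandas.DataFrame is the basic pandas type for representing two-dimensional tabular data.

This article first explains the structure of pandas.DataFrame and its basic operations, then describes how to create one with the constructor pandas.DataFrame() and how to read one from a file.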

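Structure of pandas.DataFrame
Basic usage of the constructor pandas.DataFrame()
Create a DataFrame from a two-dimensional array or list
Create a DataFrame from multiple one-dimensional arrays or lists
- Assign an array or list to each column
- Assign an array or list to each row
Create a DataFrame from a list of dictionaries or a dictionary
Read from CSV and Excel files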

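For how to create a pandas.DataFrame from pandas.Series, which represents one-dimensional data, see the following article.

Related article: Convert between pandas.DataFrame and Series

The sample code in this article uses the pandas version shown below. Behavior may differ between versions. NumPy is also imported.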

import pandas as pd
import numpy as np

print(pd.__version__)
# 2.0.3

source: pandas_dataframe_basic.py

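Structure of pandas.DataFrame

Three components: values, columns, index

A DataFrame consists of three components: values (the data itself), columns (the column names, or column labels), and index (the row names, or row labels).

The simplest DataFrame looks like the following. How to create a DataFrame is explained later, so the constructor call can be ignored for now.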

df_simple = pd.DataFrame(np.arange(12).reshape(3, 4))
print(df_simple)
#    0  1   2   3
# 0  0  1   2   3
# 1  4  5   6   7
# 2  8  9  10  11

source: pandas_dataframe_basic.py

values, columns, and index can be accessed as attributes of the DataFrame.

Data values: values

values is a NumPy array (ndarray).

print(df_simple.values)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

print(type(df_simple.values))
# <class 'numpy.ndarray'>

source: pandas_dataframe_basic.py

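Column names (columns) and row names (index)

Because columns and index are not set explicitly here, both are of type RangeIndex, a sequence of integers starting at 0. They can be converted to a list with the tolist() method.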

print(df_simple.columns)
# RangeIndex(start=0, stop=4, step=1)

print(type(df_simple.columns))
# <class 'pandas.core.indexes.range.RangeIndex'>

print(df_simple.index)
# RangeIndex(start=0, stop=3, step=1)

print(type(df_simple.index))
# <class 'pandas.core.indexes.range.RangeIndex'>

print(df_simple.columns.tolist())
# [0, 1, 2, 3]

print(type(df_simple.columns.tolist()))
# <class 'list'>

source: pandas_dataframe_basic.py

Setting the columns and index arguments at creation time assigns arbitrary names (labels) to the columns and rows.

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['col_0', 'col_1', 'col_2', 'col_3'],
                  index=['row_0', 'row_1', 'row_2'])
print(df)
#        col_0  col_1  col_2  col_3
# row_0      0      1      2      3
# row_1      4      5      6      7
# row_2      8      9     10     11

source: pandas_dataframe_basic.py

In this case, columns and index are of type Index. These can also be converted to a list with the tolist() method.

print(df.columns)
# Index(['col_0', 'col_1', 'col_2', 'col_3'], dtype='object')

print(type(df.columns))
# <class 'pandas.core.indexes.base.Index'>

print(df.index)
# Index(['row_0', 'row_1', 'row_2'], dtype='object')

print(type(df.index))
# <class 'pandas.core.indexes.base.Index'>

print(df.columns.tolist())
# ['col_0', 'col_1', 'col_2', 'col_3']

print(type(df.columns.tolist()))
# <class 'list'>

source: pandas_dataframe_basic.py

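columns and index of type Index can basically be thought of as arrays holding the row and column names (labels), and an element can be retrieved by indexing with []. Unlike an ordinary list or a NumPy ndarray, however, an element cannot be changed by assigning a new value to it.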

print(df.columns[0])
# col_0

# df.columns[0] = 'Col_0'
# TypeError: Index does not support mutable operations

source: pandas_dataframe_basic.py

To change row or column names, use DataFrame methods such as rename() or set_axis().

Related article: Rename rows and columns of pandas.DataFrame
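As a rough illustration of the two methods named above, the following sketch rebuilds the example DataFrame and renames some labels. The new label names ('Col_0', 'Row_0', 'a' to 'd') are made up for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['col_0', 'col_1', 'col_2', 'col_3'],
                  index=['row_0', 'row_1', 'row_2'])

# rename() returns a new DataFrame; only the labels listed in the dicts change.
df_renamed = df.rename(columns={'col_0': 'Col_0'}, index={'row_0': 'Row_0'})
print(df_renamed.columns.tolist())
# ['Col_0', 'col_1', 'col_2', 'col_3']

print(df_renamed.index.tolist())
# ['Row_0', 'row_1', 'row_2']

# set_axis() replaces every label along one axis at once.
df_axis = df.set_axis(['a', 'b', 'c', 'd'], axis=1)
print(df_axis.columns.tolist())
# ['a', 'b', 'c', 'd']

# The original DataFrame keeps its labels.
print(df.columns.tolist())
# ['col_0', 'col_1', 'col_2', 'col_3']
```

Both calls leave the original object untouched, so the result has to be assigned to a variable to be used.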

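As a more advanced topic, a DataFrame can also have a hierarchical index. See the following article.

Related article: pandas MultiIndex: set, add, reset, sort, and change levels

Selecting, extracting, and modifying rows, columns, and elements

The following DataFrame is used as an example.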

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['col_0', 'col_1', 'col_2', 'col_3'],
                  index=['row_0', 'row_1', 'row_2'])
print(df)
#        col_0  col_1  col_2  col_3
# row_0      0      1      2      3
# row_1      4      5      6      7
# row_2      8      9     10     11

source: pandas_dataframe_basic.py

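Selecting and extracting rows, columns, and elements

A column can be selected with df['column name']. The selected column is a pandas.Series, which represents one-dimensional data.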

print(df['col_1'])
# row_0    1
# row_1    5
# row_2    9
# Name: col_1, dtype: int64

print(type(df['col_1']))
# <class 'pandas.core.series.Series'>

source: pandas_dataframe_basic.py

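A column can also be selected like an attribute, as df.column_name. Note that this does not work when the column name is the same as a DataFrame method or attribute, or when it is not a valid Python identifier.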

print(df.col_1)
# row_0    1
# row_1    5
# row_2    9
# Name: col_1, dtype: int64

source: pandas_dataframe_basic.py
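The caveat about attribute-style access can be seen in a small sketch like the one below. The DataFrame df_attr and its column names ('count', 'col 1') are hypothetical, chosen only because one clashes with the DataFrame.count() method and the other is not a valid identifier.

```python
import pandas as pd

df_attr = pd.DataFrame({'count': [1, 2, 3], 'col 1': [4, 5, 6]})

# df_attr.count refers to the DataFrame.count() method, not to the column.
print(callable(df_attr.count))
# True

# Bracket notation always refers to the column.
print(df_attr['count'].tolist())
# [1, 2, 3]

# A name containing a space cannot be written as an attribute at all.
print(df_attr['col 1'].tolist())
# [4, 5, 6]
```

For this reason df['column name'] is the safer choice in code that has to work with arbitrary column names.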

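For more general selection of rows, columns, and elements, use loc[row name, column name]. Besides a single row or column name, a range can be specified with a slice or a list.

The slice : means the full range, so a row can be selected with loc[row name, :]. In this case the trailing , : can be omitted. This is the same idea as slicing an ndarray.

Related article: Select and assign subarrays of a NumPy ndarray with slices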

print(df.loc['row_1'])
# col_0    4
# col_1    5
# col_2    6
# col_3    7
# Name: row_1, dtype: int64

source: pandas_dataframe_basic.py

When multiple rows and multiple columns are specified with lists or slices, the result is a pandas.DataFrame.

print(df.loc[['row_0', 'row_2'], ['col_1', 'col_3']])
#        col_1  col_3
# row_0      1      3
# row_2      9     11

print(type(df.loc[['row_0', 'row_2'], ['col_1', 'col_3']]))
# <class 'pandas.core.frame.DataFrame'>

source: pandas_dataframe_basic.py

A single element can be selected with loc[], but at[] is faster.

print(df.at['row_0', 'col_1'])
# 1

print(type(df.at['row_0', 'col_1']))
# <class 'numpy.int64'>

source: pandas_dataframe_basic.py

loc[] and at[] specify positions by row and column names; iloc[] and iat[] specify them by row and column numbers.

print(df.iloc[[0, 2], [1, 3]])
#        col_1  col_3
# row_0      1      3
# row_2      9     11

print(df.iat[0, 1])
# 1

source: pandas_dataframe_basic.py
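The examples so far use single labels and lists. Slices, which the text also mentions, might look like the following sketch on the same example DataFrame. One point worth noting: a label slice in loc[] includes its end label, while a position slice in iloc[] excludes its stop position, as with ndarray.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['col_0', 'col_1', 'col_2', 'col_3'],
                  index=['row_0', 'row_1', 'row_2'])

# Label slices in loc[] include both endpoints.
df_loc = df.loc['row_0':'row_1', 'col_1':'col_2']
print(df_loc)
#        col_1  col_2
# row_0      1      2
# row_1      5      6

# Position slices in iloc[] exclude the stop position, like ndarray slices.
df_iloc = df.iloc[0:2, 1:3]
print(df_iloc.equals(df_loc))
# True

# A slice and a list can be mixed.
print(df.loc[:, ['col_0', 'col_3']].shape)
# (3, 2)
```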

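For details on loc[], iloc[], at[], and iat[], and on selecting rows and columns, see the following articles.

Related article: Get and set values at arbitrary positions in pandas with at, iat, loc, iloc
Related article: Extract rows and columns by index in pandas

Rows can also be extracted by condition rather than by row or column name or number.

Related article: Extract rows of pandas.DataFrame by condition with query
Related article: Extract (select) rows in pandas with multiple conditions using AND, OR, NOT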

print(df.query('col_0 > 2'))
#        col_0  col_1  col_2  col_3
# row_1      4      5      6      7
# row_2      8      9     10     11

source: pandas_dataframe_basic.py
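The same kind of extraction can also be written with a boolean Series as the selector instead of a query string. The following is a minimal sketch on the example DataFrame; the second condition (col_3 < 10) is added only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['col_0', 'col_1', 'col_2', 'col_3'],
                  index=['row_0', 'row_1', 'row_2'])

# A comparison on a column yields a boolean Series, one value per row.
mask = df['col_0'] > 2
print(mask.tolist())
# [False, True, True]

# Passing the boolean Series to [] keeps only the rows where it is True.
print(df[mask].index.tolist())
# ['row_1', 'row_2']

# Combine conditions with & (AND), | (OR), ~ (NOT); each needs parentheses.
df_both = df[(df['col_0'] > 2) & (df['col_3'] < 10)]
print(df_both.index.tolist())
# ['row_1']
```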

Modifying rows, columns, and elements

A new value can be assigned to a range selected with loc[], iloc[], at[], or iat[].

df.at['row_1', 'col_2'] = 600
print(df)
#        col_0  col_1  col_2  col_3
# row_0      0      1      2      3
# row_1      4      5    600      7
# row_2      8      9     10     11

df.iloc[:, 1] = [10, 50, 90]
print(df)
#        col_0  col_1  col_2  col_3
# row_0      0     10      2      3
# row_1      4     50    600      7
# row_2      8     90     10     11

source: pandas_dataframe_basic.py
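Assignment through loc[] works the same way. The sketch below, with values chosen arbitrarily, assigns a scalar to a whole row and a list to part of a column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=['col_0', 'col_1', 'col_2', 'col_3'],
                  index=['row_0', 'row_1', 'row_2'])

# A scalar is broadcast to every element of the selected row.
df.loc['row_0'] = -1

# A list is assigned element by element; its length must match the selection.
df.loc[['row_1', 'row_2'], 'col_3'] = [70, 110]

print(df.values.tolist())
# [[-1, -1, -1, -1], [4, 5, 6, 70], [8, 9, 10, 110]]
```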

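DataFrame with a different type in each column

pandas.DataFrame can store data of a different type in each column. The data type of each column can be checked with the dtypes attribute.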

df_multi = pd.DataFrame({'col_0': [0, 1, 2],
                         'col_1': [0.0, 0.1, 0.2],
                         'col_2': ['A', 'B', 'C']})
print(df_multi)
#    col_0  col_1 col_2
# 0      0    0.0     A
# 1      1    0.1     B
# 2      2    0.2     C

print(df_multi.dtypes)
# col_0      int64
# col_1    float64
# col_2     object
# dtype: object

source: pandas_dataframe_basic.py

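Data types can also be changed. See the following article.

Related article: List of pandas data types (dtype) and conversion (casting) with astype

A major advantage of pandas.DataFrame is that it can conveniently handle data containing strings and dates and times as well as numbers.

Related article: Extract rows containing a specific string in pandas (exact match, partial match)
Related article: Replace text, strip whitespace, and more with pandas string methods
Related article: Process date and time columns in pandas (string conversion, extracting year, month, day, etc.)

Basic usage of the constructor pandas.DataFrame()

In the constructor pandas.DataFrame(), the first argument data is the list, array, dictionary, or other object that holds the data. A two-dimensional NumPy ndarray is used as the example here. Other examples are described later.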

print(np.arange(9).reshape(3, 3))
# [[0 1 2]
#  [3 4 5]
#  [6 7 8]]

print(pd.DataFrame(np.arange(9).reshape(3, 3)))
#    0  1  2
# 0  0  1  2
# 1  3  4  5
# 2  6  7  8

source: pandas_dataframe_constructor.py

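As in the example above, by default the column names (columns) and row names (index) are sequential integers. Arbitrary column and row names can be specified with the columns and index arguments.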

print(pd.DataFrame(np.arange(9).reshape(3, 3),
                   columns=['col_0', 'col_1', 'col_2'],
                   index=['row_0', 'row_1', 'row_2']))
#        col_0  col_1  col_2
# row_0      0      1      2
# row_1      3      4      5
# row_2      6      7      8

source: pandas_dataframe_constructor.py

By default, the data type of each column is chosen automatically. Any data type can also be specified with the dtype argument.

print(pd.DataFrame(np.arange(9).reshape(3, 3),
                   dtype=float))
#      0    1    2
# 0  0.0  1.0  2.0
# 1  3.0  4.0  5.0
# 2  6.0  7.0  8.0

source: pandas_dataframe_constructor.py

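Only a single data type can be specified for the dtype argument, and every column takes that type. To give each column a different data type, use astype() after creation.

Related article: List of pandas data types (dtype) and conversion (casting) with astype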
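One way to set per-column types after creation is to pass astype() a dictionary that maps column names to types. The following is a minimal sketch; the target types (float and np.int8) are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3),
                  columns=['col_0', 'col_1', 'col_2'])

# Map column names to target types; columns not listed keep their type.
df_cast = df.astype({'col_0': float, 'col_2': np.int8})

print(df_cast['col_0'].dtype)
# float64

print(df_cast['col_2'].dtype)
# int8

# col_1 was not listed, so its type is the same as before.
print(df_cast['col_1'].dtype == df['col_1'].dtype)
# True
```

astype() returns a new DataFrame, so the original df is left unchanged.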

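Create a DataFrame from a two-dimensional array or list

When a two-dimensional NumPy ndarray is passed as the first argument data of the constructor pandas.DataFrame(), a pandas.DataFrame with that array as its values is created.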

print(np.arange(9).reshape(3, 3))
# [[0 1 2]
#  [3 4 5]
#  [6 7 8]]

print(pd.DataFrame(np.arange(9).reshape(3, 3),
                   columns=['col_0', 'col_1', 'col_2'],
                   index=['row_0', 'row_1', 'row_2']))
#        col_0  col_1  col_2
# row_0      0      1      2
# row_1      3      4      5
# row_2      6      7      8

source: pandas_dataframe_constructor.py

A two-dimensional list (a list of lists) also works.

print(pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]],
                   columns=['col_0', 'col_1', 'col_2'],
                   index=['row_0', 'row_1', 'row_2']))
#        col_0  col_1  col_2
# row_0      0      1      2
# row_1      3      4      5
# row_2      6      7      8

source: pandas_dataframe_constructor.py

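If the inner lists have different numbers of elements, the missing positions become the missing value NaN. Columns that contain NaN are stored as float, as col_3 and col_4 show.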

print(pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8, 9, 10]],
                   columns=['col_0', 'col_1', 'col_2', 'col_3', 'col_4'],
                   index=['row_0', 'row_1', 'row_2']))
#        col_0  col_1  col_2  col_3  col_4
# row_0      0      1      2    NaN    NaN
# row_1      3      4      5    NaN    NaN
# row_2      6      7      8    9.0   10.0

source: pandas_dataframe_constructor.py

For how to handle the missing value NaN, see the following article.

Related article: Remove, replace, and extract missing values (NaN) in pandas
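As a brief sketch of what such handling can look like, the following builds a DataFrame from the same ragged list (with default integer column labels) and then counts, drops, and fills the missing values. The fill value 0 is an arbitrary choice.

```python
import pandas as pd

# Ragged rows: the shorter rows are padded with NaN.
df_nan = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8, 9, 10]])

# Count missing values per column.
print(df_nan.isna().sum().tolist())
# [0, 0, 0, 2, 2]

# Drop every column that contains at least one NaN.
print(df_nan.dropna(axis=1).shape)
# (3, 3)

# Drop every row that contains at least one NaN.
print(df_nan.dropna().index.tolist())
# [2]

# Replace NaN with a fixed value; the column remains float.
print(df_nan.fillna(0)[3].tolist())
# [0.0, 0.0, 9.0]
```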

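Create a DataFrame from multiple one-dimensional arrays or lists

Assign an array or list to each column

The first argument data of the constructor pandas.DataFrame() can be a dictionary (dict) whose keys are column names and whose values are lists, arrays, or similar sequences.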

print(pd.DataFrame({'col_0': [0, 1, 2],
                    'col_1': np.arange(3, 6),
                    'col_2': (6, 7, 8)},
                   index=['row_0', 'row_1', 'row_2']))
#        col_0  col_1  col_2
# row_0      0      3      6
# row_1      1      4      7
# row_2      2      5      8

source: pandas_dataframe_constructor.py

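If the lists or arrays given as dictionary values do not all have the same number of elements, an error is raised. Unlike the two-dimensional list (list of lists) described above, the missing positions are not filled with NaN.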

# print(pd.DataFrame({'col_0': [0, 1, 2, 100],
#                     'col_1': np.arange(3, 6),
#                     'col_2': (6, 7, 8)}))
# ValueError: All arrays must be of the same length

source: pandas_dataframe_constructor.py

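Assign an array or list to each row

One way to assign multiple lists or arrays to rows instead of columns is to transpose with .T. Create a pandas.DataFrame by assigning an array or list to each column as described above, then transpose it.

Related article: Swap rows and columns of pandas.DataFrame (transpose)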

print(pd.DataFrame({'row_0': [0, 1, 2],
                    'row_1': np.arange(3, 6),
                    'row_2': (6, 7, 8)},
                   index=['col_0', 'col_1', 'col_2']).T)
#        col_0  col_1  col_2
# row_0      0      1      2
# row_1      3      4      5
# row_2      6      7      8

source: pandas_dataframe_constructor.py

Another way is to use pandas.DataFrame.from_dict() with the argument orient='index'. Pass the dictionary as the first argument.

pandas.DataFrame.from_dict — pandas 2.0.3 documentation

print(pd.DataFrame.from_dict({'row_0': [0, 1, 2],
                              'row_1': np.array([3, 4, 5]),
                              'row_2': [6, 7, 8]},
                             orient='index',
                             columns=['col_0', 'col_1', 'col_2']))
#        col_0  col_1  col_2
# row_0      0      1      2
# row_1      3      4      5
# row_2      6      7      8

source: pandas_dataframe_constructor.py

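Create a DataFrame from a list of dictionaries or a dictionary

The first argument data of the constructor pandas.DataFrame() can also be a list of dictionaries.

The keys of each dictionary become the column names.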

print(pd.DataFrame([{'col_0': 0, 'col_1': 1, 'col_2': 2},
                    {'col_0': 3, 'col_1': 4, 'col_2': 5},
                    {'col_0': 6, 'col_1': 7, 'col_2': 8}],
                   index=['row_0', 'row_1', 'row_2']))
#        col_0  col_1  col_2
# row_0      0      1      2
# row_1      3      4      5
# row_2      6      7      8

source: pandas_dataframe_constructor.py

If the keys of the dictionaries do not match, the values that do not exist are filled with the missing value NaN.

print(pd.DataFrame([{'col_0': 0, 'col_1': 1, 'col_2': 2},
                    {'col_0': 3, 'col_2': 5, 'col_3': 100},
                    {'col_0': 6, 'col_1': 7, 'col_2': 8}],
                   index=['row_0', 'row_1', 'row_2']))
#        col_0  col_1  col_2  col_3
# row_0      0    1.0      2    NaN
# row_1      3    NaN      5  100.0
# row_2      6    7.0      8    NaN

source: pandas_dataframe_constructor.py

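A dictionary whose values are dictionaries also works. The keys of the outer dictionary become the column names, and the keys of the inner dictionaries become the row names.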

print(pd.DataFrame({'col_0': {'row_0': 0, 'row_1': 1, 'row_2': 2},
                    'col_1': {'row_0': 3, 'row_2': 4, 'row_3': 5},
                    'col_2': {'row_0': 6, 'row_1': 7, 'row_2': 8}}))
#        col_0  col_1  col_2
# row_0    0.0    3.0    6.0
# row_1    1.0    NaN    7.0
# row_2    2.0    4.0    8.0
# row_3    NaN    5.0    NaN

source: pandas_dataframe_constructor.py

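A realistic use case for creating a DataFrame from dictionaries is processing data obtained from a web API or a similar source. In such cases, pandas.json_normalize() can also handle more complex structures such as nested dictionaries. See the following article.

Related article: Convert a list of dictionaries to a DataFrame with pandas json_normalize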
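A minimal sketch of pandas.json_normalize() on nested dictionaries follows. The records (keys 'id', 'user', 'name', 'age') are invented for illustration and stand in for a decoded web API response.

```python
import pandas as pd

# Hypothetical records with a nested dictionary under 'user'.
data = [{'id': 1, 'user': {'name': 'Alice', 'age': 30}},
        {'id': 2, 'user': {'name': 'Bob'}}]

df_flat = pd.json_normalize(data)

# Nested keys are flattened into column names joined with '.'.
print(sorted(df_flat.columns))
# ['id', 'user.age', 'user.name']

print(df_flat['user.name'].tolist())
# ['Alice', 'Bob']

# A key missing from one record becomes NaN in that row.
print(df_flat['user.age'].isna().tolist())
# [False, True]
```

Passing the same list directly to pandas.DataFrame() would instead leave the inner dictionaries as single objects in a 'user' column.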

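JSON strings and files can also be read directly.

Related article: Read JSON strings and files in pandas (read_json)

Read from CSV and Excel files

In practice, a pandas.DataFrame is more often read from a file than built with the constructor.

In addition to the JSON files mentioned above, CSV files and Excel files (xls, xlsx) can be read as a pandas.DataFrame.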

Related article: Read CSV/TSV files in pandas (read_csv, read_table)
Related article: Read Excel files (xlsx, xls) in pandas (read_excel)
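To give a feel for reading CSV data, here is a small sketch with pandas.read_csv(). To keep it self-contained, io.StringIO stands in for a file on disk; the CSV text and its column names are made up for illustration. With a real file, a path would be passed instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; the first line is used as the column names.
csv_text = 'name,age\nAlice,24\nBob,42\n'

df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv)
#     name  age
# 0  Alice   24
# 1    Bob   42

# index_col selects a column to use as the row names (index).
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col='name')
print(df_indexed.index.tolist())
# ['Alice', 'Bob']

print(df_indexed.at['Bob', 'age'])
# 42
```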
