pandas: Convert between DataFrame and Series
This article explains how to convert between pandas.DataFrame and pandas.Series.
While the term "convert" is used for convenience, it actually refers to the process of generating a DataFrame from a Series, or retrieving a column or row of a DataFrame as a Series.
It is important to note, as explained at the end, that the original and the generated or retrieved objects may share memory. Consequently, changing a value in one could affect the other.
For converting DataFrame and Series to and from NumPy arrays (ndarray) and Python's built-in lists, refer to the following articles.
- Convert pandas.DataFrame, Series and numpy.ndarray to each other
- Convert pandas.DataFrame, Series and list to each other
The pandas version used in this article is as follows. Note that functionality may vary between versions.
import pandas as pd
print(pd.__version__)
# 2.1.4
Convert Series to DataFrame
To convert a Series to a DataFrame, use the to_frame() method or the pd.DataFrame() constructor.
to_frame()
The to_frame() method returns a DataFrame with the calling Series as a column. A column name can be specified as the first argument.
s = pd.Series([0, 1, 2], index=['A', 'B', 'C'])
print(s)
# A 0
# B 1
# C 2
# dtype: int64
print(s.to_frame())
# 0
# A 0
# B 1
# C 2
print(s.to_frame('X'))
# X
# A 0
# B 1
# C 2
If the name attribute is set for the Series, it becomes the column name. If a first argument is specified in to_frame(), it takes precedence.
s_name = pd.Series([0, 1, 2], index=['A', 'B', 'C'], name='X')
print(s_name)
# A 0
# B 1
# C 2
# Name: X, dtype: int64
print(s_name.to_frame())
# X
# A 0
# B 1
# C 2
print(s_name.to_frame('Y'))
# Y
# A 0
# B 1
# C 2
pd.DataFrame()
Passing a Series to the pd.DataFrame() constructor creates a DataFrame with the Series as a column, while passing a list of Series creates a DataFrame with the Series as rows.
s = pd.Series([0, 1, 2], index=['A', 'B', 'C'])
print(s)
# A 0
# B 1
# C 2
# dtype: int64
print(pd.DataFrame(s))
# 0
# A 0
# B 1
# C 2
print(pd.DataFrame([s]))
# A B C
# 0 0 1 2
If the name attribute is set for the Series, it becomes the column or row name.
s_name = pd.Series([0, 1, 2], index=['A', 'B', 'C'], name='X')
print(s_name)
# A 0
# B 1
# C 2
# Name: X, dtype: int64
print(pd.DataFrame(s_name))
# X
# A 0
# B 1
# C 2
print(pd.DataFrame([s_name]))
# A B C
# X 0 1 2
Generate DataFrame from multiple Series
A DataFrame can be generated from multiple Series using either the pd.DataFrame() constructor or the pd.concat() function. The following example uses two Series, but the same process applies when using three or more Series.
When indexes are common
An example using the pd.DataFrame() constructor is as follows. Note that implicit type conversion occurs when Series of different data types (dtype) are used as rows, because each resulting column then holds values from every Series. In the example below, the int values are converted to float.
s1 = pd.Series([0, 1, 2], index=['A', 'B', 'C'])
s2 = pd.Series([0.0, 0.1, 0.2], index=['A', 'B', 'C'])
print(pd.DataFrame({'col1': s1, 'col2': s2}))
# col1 col2
# A 0 0.0
# B 1 0.1
# C 2 0.2
print(pd.DataFrame([s1, s2]))
# A B C
# 0 0.0 1.0 2.0
# 1 0.0 0.1 0.2
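The implicit conversion is easier to see by inspecting dtypes directly. The following is a minimal sketch using the same two Series; the variable names df_cols and df_rows are only illustrative.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2], index=['A', 'B', 'C'])        # int64
s2 = pd.Series([0.0, 0.1, 0.2], index=['A', 'B', 'C'])  # float64

# As columns, each Series keeps its own dtype.
df_cols = pd.DataFrame({'col1': s1, 'col2': s2})
print(df_cols.dtypes)

# As rows, every column holds one int and one float,
# so all columns are converted to float64.
df_rows = pd.DataFrame([s1, s2])
print(df_rows.dtypes)
```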
You can also use the pd.concat() function.
print(pd.concat([s1, s2], axis=1))
# 0 1
# A 0 0.0
# B 1 0.1
# C 2 0.2
If name attributes are set for the original Series, they will be used as column or row names in the resulting DataFrame. Note that column names must be explicitly provided when using a dictionary to specify the data in the constructor.
s1_name = pd.Series([0, 1, 2], index=['A', 'B', 'C'], name='X')
s2_name = pd.Series([0.0, 0.1, 0.2], index=['A', 'B', 'C'], name='Y')
print(pd.DataFrame({s1_name.name: s1_name, s2_name.name: s2_name}))
# X Y
# A 0 0.0
# B 1 0.1
# C 2 0.2
print(pd.DataFrame([s1_name, s2_name]))
# A B C
# X 0.0 1.0 2.0
# Y 0.0 0.1 0.2
print(pd.concat([s1_name, s2_name], axis=1))
# X Y
# A 0 0.0
# B 1 0.1
# C 2 0.2
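When the Series have no name attribute, pd.concat() can also take the column labels through its keys argument. This is a small sketch of that option; the labels 'col1' and 'col2' are arbitrary.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2], index=['A', 'B', 'C'])
s2 = pd.Series([0.0, 0.1, 0.2], index=['A', 'B', 'C'])

# With axis=1, the entries of keys are used as the column names.
df = pd.concat([s1, s2], axis=1, keys=['col1', 'col2'])
print(df)
```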
When indexes are different
A DataFrame is generated based on the indexes of the Series. If the Series have different indexes, missing values (NaN) will occur, and an integer column that receives NaN is converted to float.
s1 = pd.Series([0, 1, 2], index=['A', 'B', 'C'])
s3 = pd.Series([0.1, 0.2, 0.3], index=['B', 'C', 'D'])
print(pd.DataFrame({'col1': s1, 'col3': s3}))
# col1 col3
# A 0.0 NaN
# B 1.0 0.1
# C 2.0 0.2
# D NaN 0.3
print(pd.DataFrame([s1, s3]))
# A B C D
# 0 0.0 1.0 2.0 NaN
# 1 NaN 0.1 0.2 0.3
print(pd.concat([s1, s3], axis=1))
# 0 1
# A 0.0 NaN
# B 1.0 0.1
# C 2.0 0.2
# D NaN 0.3
Missing values can be removed or replaced afterwards with methods such as dropna() and fillna().
Using pd.concat() with join='inner' retains only the common indexes.
print(pd.concat([s1, s3], axis=1, join='inner'))
# 0 1
# B 1 0.1
# C 2 0.2
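Dropping incomplete rows after an outer join selects the same rows as join='inner', but the dtypes can differ. The sketch below compares the two approaches; df_dropna and df_inner are illustrative names.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2], index=['A', 'B', 'C'])
s3 = pd.Series([0.1, 0.2, 0.3], index=['B', 'C', 'D'])

# Outer join (default), then drop rows containing NaN.
# Column 0 was already converted to float when NaN was introduced.
df_dropna = pd.concat([s1, s3], axis=1).dropna()
print(df_dropna.dtypes)

# Inner join: no NaN is introduced, so column 0 stays int64.
df_inner = pd.concat([s1, s3], axis=1, join='inner')
print(df_inner.dtypes)
```

If the integer dtype matters, join='inner' avoids the detour through float.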
To change indexes, use methods like set_axis().
print(s3.set_axis(s1.index))
# A 0.1
# B 0.2
# C 0.3
# dtype: float64
print(pd.DataFrame({'col1': s1, 'col3': s3.set_axis(s1.index)}))
# col1 col3
# A 0 0.1
# B 1 0.2
# C 2 0.3
To ignore the indexes, you can specify the Series as a NumPy array (ndarray) using the values attribute. Note that passing ndarray objects to pd.concat() results in an error, because it accepts only Series and DataFrame objects.
print(s1.values)
# [0 1 2]
print(type(s1.values))
# <class 'numpy.ndarray'>
print(pd.DataFrame({'col1': s1.values, 'col3': s3.values}))
# col1 col3
# 0 0 0.1
# 1 1 0.2
# 2 2 0.3
print(pd.DataFrame([s1.values, s3.values]))
# 0 1 2
# 0 0.0 1.0 2.0
# 1 0.1 0.2 0.3
# print(pd.concat([s1.values, s3.values], axis=1))
# TypeError: cannot concatenate object of type '<class 'numpy.ndarray'>'; only Series and DataFrame objs are valid
The pd.DataFrame() constructor allows specifying any row and column names with the index and columns arguments.
print(pd.DataFrame([s1.values, s3.values], index=['X', 'Y'], columns=['A', 'B', 'C']))
# A B C
# X 0.0 1.0 2.0
# Y 0.1 0.2 0.3
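As an alternative to the values attribute, the to_numpy() method also returns an ndarray. To combine Series by position with pd.concat(), one option is to discard the labels with reset_index(drop=True) first. The following is a sketch of both approaches, not the only way to do it; the column labels are arbitrary.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2], index=['A', 'B', 'C'])
s3 = pd.Series([0.1, 0.2, 0.3], index=['B', 'C', 'D'])

# to_numpy() returns the values as an ndarray, like the values attribute.
df_np = pd.DataFrame({'col1': s1.to_numpy(), 'col3': s3.to_numpy()})
print(df_np)

# For pd.concat(), replace the labels with 0, 1, 2, ... first
# so that the rows are matched by position.
df_concat = pd.concat(
    [s1.reset_index(drop=True), s3.reset_index(drop=True)],
    axis=1,
    keys=['col1', 'col3'],
)
print(df_concat)
```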
When the number of values differs
Even when combining Series with differing numbers of values, a DataFrame is generated based on the index. Any missing elements are filled with NaN.
s1 = pd.Series([0, 1, 2], index=['A', 'B', 'C'])
s4 = pd.Series([0.1, 0.3], index=['B', 'D'])
print(pd.DataFrame({'col1': s1, 'col4': s4}))
# col1 col4
# A 0.0 NaN
# B 1.0 0.1
# C 2.0 NaN
# D NaN 0.3
print(pd.DataFrame([s1, s4]))
# A B C D
# 0 0.0 1.0 2.0 NaN
# 1 NaN 0.1 NaN 0.3
print(pd.concat([s1, s4], axis=1))
# 0 1
# A 0.0 NaN
# B 1.0 0.1
# C 2.0 NaN
# D NaN 0.3
print(pd.concat([s1, s4], axis=1, join='inner'))
# 0 1
# B 1 0.1
As mentioned above, use methods like set_axis() to change indexes.
print(pd.DataFrame({'col1': s1, 'col4': s4.set_axis(['A', 'B'])}))
# col1 col4
# A 0 0.1
# B 1 0.3
# C 2 NaN
The behavior of using the values attribute (ndarray) in the constructor varies depending on how it is used. When used as values in a dictionary, it results in an error if the arrays are of different lengths. However, using values as elements in a list is acceptable; the shorter row is padded with NaN.
# print(pd.DataFrame({'col1': s1.values, 'col4': s4.values}))
# ValueError: All arrays must be of the same length
print(pd.DataFrame([s1.values, s4.values]))
# 0 1 2
# 0 0.0 1.0 2.0
# 1 0.1 0.3 NaN
Convert DataFrame to Series
Rows and columns of a DataFrame can be retrieved as Series using [], loc[], or iloc[]. Refer to the following articles for details.
- pandas: Select rows/columns by index (numbers and names)
- pandas: Get/Set values with loc, iloc, at, iat
Retrieve DataFrame columns as Series
Specifying a column name with [] or loc[], or a column number with iloc[] as a scalar value, retrieves that column as a Series.
df = pd.DataFrame({'col0': [0, 1, 2], 'col1': [3, 4, 5], 'col2': [6, 7, 8]},
index=['row0', 'row1', 'row2'])
print(df)
# col0 col1 col2
# row0 0 3 6
# row1 1 4 7
# row2 2 5 8
print(df['col0'])
# row0 0
# row1 1
# row2 2
# Name: col0, dtype: int64
print(df.loc[:, 'col0'])
# row0 0
# row1 1
# row2 2
# Name: col0, dtype: int64
print(df.iloc[:, 0])
# row0 0
# row1 1
# row2 2
# Name: col0, dtype: int64
With loc[] or iloc[], it is also possible to select specific rows using a list or slice.
print(df.iloc[[0, 2], 0])
# row0 0
# row2 2
# Name: col0, dtype: int64
print(df.iloc[:2, 0])
# row0 0
# row1 1
# Name: col0, dtype: int64
Selecting a single column with a list or slice results in a DataFrame with one column, not a Series.
print(df.loc[:, ['col0']])
# col0
# row0 0
# row1 1
# row2 2
print(df.iloc[:, :1])
# col0
# row0 0
# row1 1
# row2 2
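If a one-column DataFrame has already been obtained this way, the squeeze() method can turn it into a Series. The following is a minimal sketch; df_one is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'col0': [0, 1, 2], 'col1': [3, 4, 5], 'col2': [6, 7, 8]},
                  index=['row0', 'row1', 'row2'])

df_one = df.loc[:, ['col0']]        # one-column DataFrame
s = df_one.squeeze(axis='columns')  # collapse the column axis into a Series

print(type(df_one))
print(type(s))
print(s.name)
```

Specifying axis='columns' keeps the result a Series even when the DataFrame happens to have only one row; without the axis argument, a 1x1 DataFrame would be squeezed to a scalar.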
Retrieve DataFrame rows as Series
Specifying a row name with loc[], or a row number with iloc[] as a scalar value, retrieves that row as a Series.
df = pd.DataFrame({'col0': [0, 1, 2], 'col1': [3, 4, 5], 'col2': [6, 7, 8]},
index=['row0', 'row1', 'row2'])
print(df)
# col0 col1 col2
# row0 0 3 6
# row1 1 4 7
# row2 2 5 8
print(df.loc['row0', :])
# col0 0
# col1 3
# col2 6
# Name: row0, dtype: int64
print(df.iloc[0, :])
# col0 0
# col1 3
# col2 6
# Name: row0, dtype: int64
When selecting an entire row, the column specification : can be omitted.
print(df.loc['row0'])
# col0 0
# col1 3
# col2 6
# Name: row0, dtype: int64
print(df.iloc[0])
# col0 0
# col1 3
# col2 6
# Name: row0, dtype: int64
It is also possible to select specific columns using a list or slice.
print(df.iloc[0, [0, 2]])
# col0 0
# col2 6
# Name: row0, dtype: int64
print(df.iloc[0, :2])
# col0 0
# col1 3
# Name: row0, dtype: int64
Selecting a single row with a list or slice results in a DataFrame with one row, not a Series.
print(df.loc[['row0']])
# col0 col1 col2
# row0 0 3 6
print(df.iloc[:1])
# col0 col1 col2
# row0 0 3 6
Pay attention to data types (dtype)
While a DataFrame has a data type (dtype) for each column, a Series has a single data type.
Be careful when retrieving a row of a DataFrame as a Series.
For example, retrieving a row from a DataFrame that has columns of integer (int) and floating-point number (float) types as a Series results in a float data type, with the values in the int column converted to float.
df_multi = pd.DataFrame({'col0': [0, 1, 2], 'col1': [0.0, 0.1, 0.2]},
index=['row0', 'row1', 'row2'])
print(df_multi)
# col0 col1
# row0 0 0.0
# row1 1 0.1
# row2 2 0.2
s_row = df_multi.loc['row2']
print(s_row)
# col0 2.0
# col1 0.2
# Name: row2, dtype: float64
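When the per-column data types need to be preserved, one option is to keep the row as a one-row DataFrame by selecting it with a list, or to read individual elements with at[]. The sketch below contrasts these with scalar selection.

```python
import pandas as pd

df_multi = pd.DataFrame({'col0': [0, 1, 2], 'col1': [0.0, 0.1, 0.2]},
                        index=['row0', 'row1', 'row2'])

# Scalar label: Series with a single dtype (int is converted to float).
print(df_multi.loc['row2'].dtype)

# List label: one-row DataFrame that keeps each column's dtype.
print(df_multi.loc[['row2']].dtypes)

# A single element read with at[] keeps the column's integer type.
print(type(df_multi.at['row2', 'col0']))
```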
If a DataFrame includes columns of type object, retrieving a row as a Series results in an object data type.
df_multi['col2'] = ['a', 'b', 'c']
print(df_multi)
# col0 col1 col2
# row0 0 0.0 a
# row1 1 0.1 b
# row2 2 0.2 c
print(df_multi.dtypes)
# col0 int64
# col1 float64
# col2 object
# dtype: object
s_row = df_multi.loc['row2']
print(s_row)
# col0 2
# col1 0.2
# col2 c
# Name: row2, dtype: object
With the object type, values retain their original types.
print(type(s_row['col0']))
# <class 'numpy.int64'>
print(type(s_row['col1']))
# <class 'numpy.float64'>
print(type(s_row['col2']))
# <class 'str'>
Views and copies
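During conversion between DataFrame and Series, the resulting object may either be a view or a copy of the original. A view shares memory with the original object, and changing one affects the other. The results in this section are for pandas 2.1.4 with default settings. With Copy-on-Write enabled (pd.options.mode.copy_on_write = True in pandas 2.x, and described in the pandas documentation as the default from pandas 3.0), modifying one object no longer changes the other.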
Convert Series to DataFrame
to_frame()
The to_frame() method returns a view if possible. A copy can be created with copy().
s = pd.Series([0, 1], index=['A', 'B'])
df = s.to_frame()
s['A'] = 100
print(df)
# 0
# A 100
# B 1
s = pd.Series([0, 1], index=['A', 'B'])
df_copy = s.copy().to_frame()
s['A'] = 100
print(df_copy)
# 0
# A 0
# B 1
pd.DataFrame()
The pd.DataFrame() constructor returns a view by default if possible. Setting the copy argument to True returns a copy.
s = pd.Series([0, 1], index=['A', 'B'])
df = pd.DataFrame(s)
s['A'] = 100
print(df)
# 0
# A 100
# B 1
s = pd.Series([0, 1], index=['A', 'B'])
df_copy = pd.DataFrame(s, copy=True)
s['A'] = 100
print(df_copy)
# 0
# A 0
# B 1
pd.concat()
The pd.concat() function returns a copy by default. Setting the copy argument to False returns a view if possible.
s1 = pd.Series([0, 1], index=['A', 'B'])
s2 = pd.Series([0.0, 0.1], index=['A', 'B'])
df = pd.concat([s1, s2], axis=1)
s1['A'] = 100
print(df)
# 0 1
# A 0 0.0
# B 1 0.1
s1 = pd.Series([0, 1], index=['A', 'B'])
s2 = pd.Series([0.0, 0.1], index=['A', 'B'])
df_copy_false = pd.concat([s1, s2], axis=1, copy=False)
s1['A'] = 100
print(df_copy_false)
# 0 1
# A 100 0.0
# B 1 0.1
Note that setting copy=True in functions like pd.DataFrame() and pd.concat() ensures a copy is made, while copy=False tries to create a view if possible.
Even with copy=False, a copy might be generated instead of a view depending on the memory layout. Be aware that it is not guaranteed that a view will always be created.
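Because a view is not guaranteed, it can help to check whether two objects actually share memory. One way, sketched here, is to compare the underlying arrays with np.shares_memory(). The first result is printed without an expected value, since it may depend on the pandas version and settings.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1], index=['A', 'B'])

df = s.to_frame()                     # may share memory with s
df_copy = pd.DataFrame(s, copy=True)  # explicit copy

# Result may vary with the pandas version and settings.
print(np.shares_memory(s.to_numpy(), df[0].to_numpy()))

# An explicit copy does not share its buffer with the original.
print(np.shares_memory(s.to_numpy(), df_copy[0].to_numpy()))
```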
Convert DataFrame to Series
Retrieving either a row or a column from a DataFrame as a Series generally results in a view of the original DataFrame.
df = pd.DataFrame({'col0': [0, 1, 2], 'col1': [3, 4, 5], 'col2': [6, 7, 8]},
index=['row0', 'row1', 'row2'])
print(df)
# col0 col1 col2
# row0 0 3 6
# row1 1 4 7
# row2 2 5 8
s = df['col0']
s['row0'] = 10
print(s)
# row0 10
# row1 1
# row2 2
# Name: col0, dtype: int64
print(df)
# col0 col1 col2
# row0 10 3 6
# row1 1 4 7
# row2 2 5 8
To work with the Series independently of the DataFrame, create a copy with copy().
s_copy = df['col1'].copy()
s_copy['row0'] = 100
print(s_copy)
# row0 100
# row1 4
# row2 5
# Name: col1, dtype: int64
print(df)
# col0 col1 col2
# row0 10 3 6
# row1 1 4 7
# row2 2 5 8
When using a list for selection, a copy is created instead of a view.
s_list = df.loc[['row0', 'row2'], 'col2']
s_list['row0'] = 1000
print(s_list)
# row0 1000
# row2 8
# Name: col2, dtype: int64
print(df)
# col0 col1 col2
# row0 10 3 6
# row1 1 4 7
# row2 2 5 8
When selecting a portion of a DataFrame with loc[] or iloc[] to create a new DataFrame, whether a view or a copy is created depends on the type of range specification used, such as scalar values, lists, or slices.
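Since view-or-copy behavior depends on the selection method and may vary between versions, code that does not rely on it is easier to reason about. The following sketch shows two patterns that state the intent explicitly: assigning through the DataFrame itself when the DataFrame should change, and calling copy() when the Series should be independent.

```python
import pandas as pd

df = pd.DataFrame({'col0': [0, 1, 2], 'col1': [3, 4, 5]},
                  index=['row0', 'row1', 'row2'])

# To change the DataFrame, assign through the DataFrame itself.
df.loc['row0', 'col0'] = 10

# To work on a column independently, take an explicit copy first.
s_copy = df['col1'].copy()
s_copy['row0'] = 100

print(df)
print(s_copy)
```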