Convert between pandas DataFrame/Series and NumPy array
This article explains how to convert between pandas DataFrame/Series and NumPy arrays (ndarray).
To convert a DataFrame or Series to an ndarray, use the to_numpy() method or the values attribute. To convert an ndarray to a DataFrame or Series, use their constructors.
For conversions between DataFrame/Series and Python built-in lists, as well as between DataFrame and Series, refer to the following articles.
- Convert between pandas DataFrame/Series and Python list
- pandas: Convert between DataFrame and Series
The pandas and NumPy versions used in this article are as follows. Note that functionality may vary between versions.
import pandas as pd
import numpy as np
print(pd.__version__)
# 2.1.4
print(np.__version__)
# 1.26.2
Convert DataFrame and Series to NumPy arrays
To convert a DataFrame or Series to a NumPy array (ndarray), use the to_numpy() method or the values attribute.
to_numpy()
To convert a DataFrame or Series to an ndarray, use the to_numpy() method.
- pandas.DataFrame.to_numpy — pandas 2.1.4 documentation
- pandas.Series.to_numpy — pandas 2.1.4 documentation
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
print(df)
#    A  B
# X  1  3
# Y  2  4
print(df.to_numpy())
# [[1 3]
#  [2 4]]
print(type(df.to_numpy()))
# <class 'numpy.ndarray'>
s = df['A']
print(s)
# X    1
# Y    2
# Name: A, dtype: int64
print(s.to_numpy())
# [1 2]
print(type(s.to_numpy()))
# <class 'numpy.ndarray'>
Row names (index) and column names (columns) are ignored, and only the data values are converted to an ndarray. If you want to treat the index as data, use reset_index() to move it into a regular column first.
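As a minimal sketch of the reset_index() approach, the example below moves the index into a regular column before converting. The sample DataFrame mirrors the one above; because the resulting columns mix strings and integers, the array is expected to have the object type.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])

# reset_index() turns the row labels into an ordinary column,
# so they become part of the data passed to to_numpy()
a = df.reset_index().to_numpy()
print(a)
# [['X' 1 3]
#  ['Y' 2 4]]
print(a.shape)
# (2, 3)
```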
The following examples focus on DataFrame, but note that the basic usage is similar for Series.
Specify data type: dtype
If the data type (dtype) of each column in the DataFrame is the same, an ndarray of that data type is created.
df_int = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
print(df_int)
#    A  B
# X  1  3
# Y  2  4
print(df_int.dtypes)
# A    int64
# B    int64
# dtype: object
print(df_int.to_numpy())
# [[1 3]
#  [2 4]]
print(df_int.to_numpy().dtype)
# int64
If the dtype of each column in the DataFrame differs, an ndarray of a common convertible type is created. For example, a DataFrame with mixed integer (int) and floating-point number (float) columns is converted into a float type ndarray.
df_int_float = pd.DataFrame({'A': [1, 2], 'B': [0.1, 0.2]}, index=['X', 'Y'])
print(df_int_float)
#    A    B
# X  1  0.1
# Y  2  0.2
print(df_int_float.dtypes)
# A      int64
# B    float64
# dtype: object
print(df_int_float.to_numpy())
# [[1.  0.1]
#  [2.  0.2]]
print(df_int_float.to_numpy().dtype)
# float64
If there is no common convertible type, for example, a DataFrame with mixed numeric and string columns, the ndarray will be of object type. In this object type, each element is treated as a Python object.
df_int_str = pd.DataFrame({'A': [1, 2], 'B': ['abc', 'xyz']}, index=['X', 'Y'])
print(df_int_str)
#    A    B
# X  1  abc
# Y  2  xyz
print(df_int_str.dtypes)
# A     int64
# B    object
# dtype: object
print(df_int_str.to_numpy())
# [[1 'abc']
#  [2 'xyz']]
print(df_int_str.to_numpy().dtype)
# object
print(df_int_str.to_numpy()[0, 0])
# 1
print(type(df_int_str.to_numpy()[0, 0]))
# <class 'int'>
print(df_int_str.to_numpy()[0, 1])
# abc
print(type(df_int_str.to_numpy()[0, 1]))
# <class 'str'>
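Use the dtype argument of to_numpy() to specify the data type. A conversion that is possible but lossy, such as float to int, truncates the values without raising an error, while a conversion that is not possible, such as a non-numeric string to int, raises an error.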
print(df_int_float.to_numpy(dtype='float32'))
# [[1.  0.1]
#  [2.  0.2]]
print(df_int_float.to_numpy(dtype=int))
# [[1 0]
#  [2 0]]
# print(df_int_str.to_numpy(dtype=int))
# ValueError: invalid literal for int() with base 10: 'abc'
For more details on data types (dtype) in pandas and NumPy, refer to the following articles.
- pandas: How to use astype() to cast dtype of DataFrame
- NumPy: Cast ndarray to a specific dtype with astype()
To convert only the columns of certain data types in a DataFrame to an ndarray, use select_dtypes(). For example, passing 'number' extracts only the numeric columns.
df_int_float_str = pd.DataFrame({'A': [1, 2], 'B': [0.1, 0.2], 'C': ['abc', 'xyz']}, index=['X', 'Y'])
print(df_int_float_str)
#    A    B    C
# X  1  0.1  abc
# Y  2  0.2  xyz
print(df_int_float_str.select_dtypes('number').to_numpy())
# [[1.  0.1]
#  [2.  0.2]]
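As a further sketch, select_dtypes() also accepts the include and exclude keyword arguments, which can narrow the selection to a single type or drop the numeric columns instead. The variable names a_int and a_other are only illustrative.

```python
import pandas as pd

df_int_float_str = pd.DataFrame(
    {'A': [1, 2], 'B': [0.1, 0.2], 'C': ['abc', 'xyz']}, index=['X', 'Y']
)

# Keep only the int64 column, so no upcast to float is needed
a_int = df_int_float_str.select_dtypes(include='int64').to_numpy()
print(a_int)
# [[1]
#  [2]]

# Keep everything except the numeric columns
a_other = df_int_float_str.select_dtypes(exclude='number').to_numpy()
print(a_other)
# [['abc']
#  ['xyz']]
```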
Specify whether to create a copy: copy
By default, to_numpy() creates a view if possible. If the created ndarray is a view of the original DataFrame, the two objects share memory, and changing one will change the other.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
a = df.to_numpy()
print(np.shares_memory(df, a))
# True
a[0, 0] = 100
print(a)
# [[100   3]
#  [  2   4]]
print(df)
#      A  B
# X  100  3
# Y    2  4
Setting the copy argument of to_numpy() to True creates a copy that does not share memory.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
a_copy = df.to_numpy(copy=True)
print(np.shares_memory(df, a_copy))
# False
a_copy[0, 0] = 100
print(a_copy)
# [[100   3]
#  [  2   4]]
print(df)
#    A  B
# X  1  3
# Y  2  4
Note that copy=True always results in a copy. In contrast, copy=False (default) does not always create a view. For example, when the columns have different dtypes and must be converted to a common type, a copy is created since a view is not possible.
df_int_float = pd.DataFrame({'A': [1, 2], 'B': [0.1, 0.2]}, index=['X', 'Y'])
a_float = df_int_float.to_numpy()
print(np.shares_memory(df_int_float, a_float))
# False
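A type conversion requested through the dtype argument should have the same effect. The sketch below assumes a DataFrame whose columns are all int64, which would otherwise allow a view, and checks that asking for float yields an array that does not share memory with the original.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])

# int64 -> float64 needs newly allocated memory, so the result cannot be a view
a_float = df.to_numpy(dtype=float)
print(a_float.dtype)
# float64
print(np.shares_memory(df, a_float))
# False
```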
For more details on views and copies in pandas and NumPy, refer to the following articles.
- pandas: Views and copies in DataFrame
- NumPy: Determine if ndarray is view or copy and if it shares memory
values
The values attribute of DataFrame and Series can also be used to convert to an ndarray.
- pandas.DataFrame.values — pandas 2.1.4 documentation
- pandas.Series.values — pandas 2.1.4 documentation
Although the official documentation recommends using the to_numpy() method, no warning is issued when using the values attribute as of pandas version 2.1.4.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
print(df)
#    A  B
# X  1  3
# Y  2  4
print(df.values)
# [[1 3]
#  [2 4]]
print(type(df.values))
# <class 'numpy.ndarray'>
s = df['A']
print(s)
# X    1
# Y    2
# Name: A, dtype: int64
print(s.values)
# [1 2]
print(type(s.values))
# <class 'numpy.ndarray'>
For the NumPy-backed data types used in this article, the behavior of the values attribute is the same as the default behavior of to_numpy().
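One caveat worth checking in your own environment: for a Series with an extension dtype, values may return a pandas array rather than an ndarray, whereas to_numpy() returns an ndarray. The sketch below uses a categorical Series, which is not part of the examples above, purely as an illustration.

```python
import numpy as np
import pandas as pd

# A categorical Series is backed by a pandas extension array, not an ndarray
s_cat = pd.Series(['a', 'b', 'a'], dtype='category')

print(isinstance(s_cat.values, np.ndarray))
# False
print(isinstance(s_cat.values, pd.Categorical))
# True
print(isinstance(s_cat.to_numpy(), np.ndarray))
# True
```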
If the dtype of each column in the DataFrame differs, an ndarray of a common convertible type is created. If no such common type exists, an object type ndarray is created.
df_int_float = pd.DataFrame({'A': [1, 2], 'B': [0.1, 0.2]}, index=['X', 'Y'])
print(df_int_float.values)
# [[1.  0.1]
#  [2.  0.2]]
print(df_int_float.values.dtype)
# float64
df_int_str = pd.DataFrame({'A': [1, 2], 'B': ['abc', 'xyz']}, index=['X', 'Y'])
print(df_int_str.values)
# [[1 'abc']
#  [2 'xyz']]
print(df_int_str.values.dtype)
# object
A view is created if possible; otherwise, a copy is created.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
print(np.shares_memory(df, df.values))
# True
df_int_float = pd.DataFrame({'A': [1, 2], 'B': [0.1, 0.2]}, index=['X', 'Y'])
print(np.shares_memory(df_int_float, df_int_float.values))
# False
Use the copy() method of the resulting ndarray to explicitly create a copy.
print(np.shares_memory(df, df.values.copy()))
# False
Convert NumPy arrays to DataFrame and Series
To convert a NumPy array (ndarray) to a DataFrame or Series, use their constructors. The ndarray can be specified in the first argument of the constructor.
pd.DataFrame()
The pd.DataFrame() constructor can create a DataFrame by specifying an ndarray in the first argument.
a_2d = np.array([[1, 2], [3, 4]])
print(a_2d)
# [[1 2]
#  [3 4]]
print(pd.DataFrame(a_2d))
#    0  1
# 0  1  2
# 1  3  4
A one-dimensional array creates a single-column DataFrame, but a three-dimensional or higher array results in an error.
a_1d = np.array([1, 2])
print(a_1d)
# [1 2]
print(pd.DataFrame(a_1d))
#    0
# 0  1
# 1  2
a_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(a_3d)
# [[[1 2]
#   [3 4]]
#
#  [[5 6]
#   [7 8]]]
# print(pd.DataFrame(a_3d))
# ValueError: Must pass 2-d input. shape=(2, 2, 2)
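If the data really is three-dimensional, one possible workaround is to reshape it to two dimensions before calling the constructor. The sketch below flattens everything after the first axis into columns; whether that layout is appropriate depends on what the axes mean in your data, and the names a_flat and df_flat are only illustrative.

```python
import numpy as np
import pandas as pd

a_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

# Keep the first axis as rows and flatten the remaining axes into columns
a_flat = a_3d.reshape(a_3d.shape[0], -1)
print(a_flat.shape)
# (2, 4)

df_flat = pd.DataFrame(a_flat)
print(df_flat)
#    0  1  2  3
# 0  1  2  3  4
# 1  5  6  7  8
```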
Row names (index), column names (columns), and data type (dtype) can be specified in the arguments.
print(pd.DataFrame(data=a_2d, index=['X', 'Y'], columns=['A', 'B'], dtype=float))
#      A    B
# X  1.0  2.0
# Y  3.0  4.0
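Because to_numpy() drops the labels, a common pattern is to pass the original index and columns back to the constructor after processing the array with NumPy. The following is a minimal sketch of such a round trip; the multiplication by 10 simply stands in for any NumPy computation.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])

# Do some work on the raw array; the multiplication produces a new array
scaled = df.to_numpy() * 10

# Reattach the original row and column labels
df_new = pd.DataFrame(scaled, index=df.index, columns=df.columns)
print(df_new)
#     A   B
# X  10  30
# Y  20  40
```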
Views and copies
When specifying an ndarray as the first argument, by default, pd.DataFrame() creates a view if possible. If the created DataFrame is a view of the original ndarray, the two objects share memory, and changing one will change the other.
a = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(a)
print(np.shares_memory(a, df))
# True
a[0, 0] = 100
print(a)
# [[100   2]
#  [  3   4]]
print(df)
#      0  1
# 0  100  2
# 1    3  4
Setting the copy argument of pd.DataFrame() to True creates a copy.
a = np.array([[1, 2], [3, 4]])
df_copy = pd.DataFrame(a, copy=True)
print(np.shares_memory(a, df_copy))
# False
a[0, 0] = 100
print(a)
# [[100   2]
#  [  3   4]]
print(df_copy)
#    0  1
# 0  1  2
# 1  3  4
Note that setting copy=True guarantees a copy, whereas copy=False (default for ndarray) may not always result in a view. For example, if a type conversion occurs due to the dtype argument, a copy is created because a view is not possible.
a = np.array([[1, 2], [3, 4]])
df_float = pd.DataFrame(a, dtype=float)
print(np.shares_memory(a, df_float))
# False
pd.Series()
The pd.Series() constructor has a basic usage similar to pd.DataFrame().
A Series can be created by specifying a one-dimensional ndarray in the first argument.
a_1d = np.array([1, 2])
print(a_1d)
# [1 2]
print(pd.Series(a_1d))
# 0    1
# 1    2
# dtype: int64
An error occurs for two-dimensional or higher arrays.
a_2d = np.array([[1, 2], [3, 4]])
print(a_2d)
# [[1 2]
#  [3 4]]
# print(pd.Series(a_2d))
# ValueError: Data must be 1-dimensional, got ndarray of shape (2, 2) instead
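To build a Series from two-dimensional data, the array first has to be reduced to one dimension. Two possible approaches, sketched below, are flattening the whole array with ravel() or selecting a single column by slicing. The int64 dtype is set explicitly here only so that the printed dtype is the same on every platform.

```python
import numpy as np
import pandas as pd

a_2d = np.array([[1, 2], [3, 4]], dtype=np.int64)

# Flatten all elements in row-major order
s_flat = pd.Series(a_2d.ravel())
print(s_flat)
# 0    1
# 1    2
# 2    3
# 3    4
# dtype: int64

# Or take just one column (here, the first)
s_col = pd.Series(a_2d[:, 0])
print(s_col)
# 0    1
# 1    3
# dtype: int64
```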
Labels (index), name (name), and data type (dtype) can also be specified in the arguments.
print(pd.Series(a_1d, index=['A', 'B'], name='my_series', dtype=float))
# A    1.0
# B    2.0
# Name: my_series, dtype: float64
In terms of views and copies, pd.Series() also behaves similarly to pd.DataFrame().
By default, pd.Series() attempts to create a view. Setting the copy argument to True forces it to create a copy instead. However, copy=False (default) does not guarantee a view. For example, if you specify a dtype that requires a type conversion, pd.Series() will create a copy, since a view is not feasible in this case.
a = np.array([1, 2])
print(np.shares_memory(a, pd.Series(a)))
# True
print(np.shares_memory(a, pd.Series(a, copy=True)))
# False
print(np.shares_memory(a, pd.Series(a, dtype=float)))
# False