Convert between pandas DataFrame/Series and NumPy array

Tags: Python, pandas, NumPy

This article explains how to convert between pandas DataFrame/Series and NumPy arrays (ndarray).

To convert a DataFrame or Series to an ndarray, use the to_numpy() method or the values attribute. To convert an ndarray to a DataFrame or Series, use their constructors.

Conversions between DataFrame/Series and Python built-in lists, and between DataFrame and Series, are not covered in this article.

The pandas and NumPy versions used in this article are as follows. Note that functionality may vary between versions.

import pandas as pd
import numpy as np

print(pd.__version__)
# 2.1.4

print(np.__version__)
# 1.26.2

Convert DataFrame and Series to NumPy arrays

To convert a DataFrame or Series to a NumPy array (ndarray), use the to_numpy() method or the values attribute.

to_numpy()

To convert a DataFrame or Series to an ndarray, use the to_numpy() method.

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
print(df)
#    A  B
# X  1  3
# Y  2  4

print(df.to_numpy())
# [[1 3]
#  [2 4]]

print(type(df.to_numpy()))
# <class 'numpy.ndarray'>

s = df['A']
print(s)
# X    1
# Y    2
# Name: A, dtype: int64

print(s.to_numpy())
# [1 2]

print(type(s.to_numpy()))
# <class 'numpy.ndarray'>

Row names (index) and column names (columns) are ignored, and only the values are converted to an ndarray. To include the index as data, call reset_index() before converting.
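
As a minimal sketch of the reset_index() approach, the following moves the index into a regular column before converting. Because the index labels here are strings and the other columns are integers, the resulting ndarray is of object type.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])

# reset_index() moves the index into a regular column named 'index'.
df_reset = df.reset_index()
print(df_reset)
#   index  A  B
# 0     X  1  3
# 1     Y  2  4

a = df_reset.to_numpy()
print(a)
# [['X' 1 3]
#  ['Y' 2 4]]

print(a.dtype)
# object
```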

The following examples focus on DataFrame, but note that the basic usage is similar for Series.

Specify data type: dtype

If the data type (dtype) of each column in the DataFrame is the same, an ndarray of that data type is created.

df_int = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
print(df_int)
#    A  B
# X  1  3
# Y  2  4

print(df_int.dtypes)
# A    int64
# B    int64
# dtype: object

print(df_int.to_numpy())
# [[1 3]
#  [2 4]]

print(df_int.to_numpy().dtype)
# int64

If the dtype of each column in the DataFrame differs, an ndarray of a common convertible type is created. For example, a DataFrame with mixed integer (int) and floating-point number (float) columns is converted into a float type ndarray.

df_int_float = pd.DataFrame({'A': [1, 2], 'B': [0.1, 0.2]}, index=['X', 'Y'])
print(df_int_float)
#    A    B
# X  1  0.1
# Y  2  0.2

print(df_int_float.dtypes)
# A      int64
# B    float64
# dtype: object

print(df_int_float.to_numpy())
# [[1.  0.1]
#  [2.  0.2]]

print(df_int_float.to_numpy().dtype)
# float64

If there is no common convertible type, for example, a DataFrame with mixed numeric and string columns, the ndarray will be of object type. In this object type, each element is treated as a Python object.

df_int_str = pd.DataFrame({'A': [1, 2], 'B': ['abc', 'xyz']}, index=['X', 'Y'])
print(df_int_str)
#    A    B
# X  1  abc
# Y  2  xyz

print(df_int_str.dtypes)
# A     int64
# B    object
# dtype: object

print(df_int_str.to_numpy())
# [[1 'abc']
#  [2 'xyz']]

print(df_int_str.to_numpy().dtype)
# object

print(df_int_str.to_numpy()[0, 0])
# 1

print(type(df_int_str.to_numpy()[0, 0]))
# <class 'int'>

print(df_int_str.to_numpy()[0, 1])
# abc

print(type(df_int_str.to_numpy()[0, 1]))
# <class 'str'>

Use the dtype argument of to_numpy() to specify the data type. Converting float to int truncates the fractional part, and specifying an incompatible type raises an error.

print(df_int_float.to_numpy(dtype='float32'))
# [[1.  0.1]
#  [2.  0.2]]

print(df_int_float.to_numpy(dtype=int))
# [[1 0]
#  [2 0]]

# print(df_int_str.to_numpy(dtype=int))
# ValueError: invalid literal for int() with base 10: 'abc'

To convert only the columns of certain data types to an ndarray, select them with select_dtypes() first. For example, 'number' extracts only the numeric columns.

df_int_float_str = pd.DataFrame({'A': [1, 2], 'B': [0.1, 0.2], 'C': ['abc', 'xyz']}, index=['X', 'Y'])
print(df_int_float_str)
#    A    B    C
# X  1  0.1  abc
# Y  2  0.2  xyz

print(df_int_float_str.select_dtypes('number').to_numpy())
# [[1.  0.1]
#  [2.  0.2]]
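
select_dtypes() also accepts the include and exclude keyword arguments. The following sketch, which reuses the same three-column data under the shorter name df, shows how the selection affects the converted result. Selecting a single dtype avoids the conversion to a common type.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [0.1, 0.2], 'C': ['abc', 'xyz']}, index=['X', 'Y'])

# Keep only the int64 column; no conversion to float is needed.
a_int = df.select_dtypes(include='int64').to_numpy()
print(a_int)
# [[1]
#  [2]]

print(a_int.dtype)
# int64

# Drop the numeric columns, keeping only the string column.
a_str = df.select_dtypes(exclude='number').to_numpy()
print(a_str)
# [['abc']
#  ['xyz']]
```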

Specify whether to create a copy: copy

By default, to_numpy() creates a view if possible. If the created ndarray is a view of the original DataFrame, the two objects share memory, and changing one will change the other.

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
a = df.to_numpy()

print(np.shares_memory(df, a))
# True

a[0, 0] = 100
print(a)
# [[100   3]
#  [  2   4]]

print(df)
#      A  B
# X  100  3
# Y    2  4

Setting the copy argument of to_numpy() to True creates a copy that does not share memory.

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
a_copy = df.to_numpy(copy=True)

print(np.shares_memory(df, a_copy))
# False

a_copy[0, 0] = 100
print(a_copy)
# [[100   3]
#  [  2   4]]

print(df)
#    A  B
# X  1  3
# Y  2  4

Note that copy=True always results in a copy. In contrast, copy=False (default) does not always create a view. For example, when the columns have different dtypes and must be converted to a common type, or when the dtype argument forces a type conversion, a copy is created because a view is not possible.

df_int_float = pd.DataFrame({'A': [1, 2], 'B': [0.1, 0.2]}, index=['X', 'Y'])
a_float = df_int_float.to_numpy()

print(np.shares_memory(df_int_float, a_float))
# False
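
The same applies when the dtype argument forces a conversion. As a small sketch, converting an all-integer DataFrame to float produces an array with its own memory, so changing the array leaves the DataFrame untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])

# int64 -> float64 needs newly allocated memory, so the result cannot be a view.
a_float = df.to_numpy(dtype=float)
print(a_float.dtype)
# float64

print(np.shares_memory(df, a_float))
# False

a_float[0, 0] = 100
print(df)
#    A  B
# X  1  3
# Y  2  4
```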

values

The values attribute of DataFrame and Series can also be used to convert to an ndarray.

Although the official documentation recommends using the to_numpy() method, no warning is issued when using the values attribute as of pandas version 2.1.4.

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
print(df)
#    A  B
# X  1  3
# Y  2  4

print(df.values)
# [[1 3]
#  [2 4]]

print(type(df.values))
# <class 'numpy.ndarray'>

s = df['A']
print(s)
# X    1
# Y    2
# Name: A, dtype: int64

print(s.values)
# [1 2]

print(type(s.values))
# <class 'numpy.ndarray'>

For the NumPy-backed dtypes used in these examples, the values attribute behaves the same as to_numpy() with its default arguments.

If the dtype of each column in the DataFrame differs, an ndarray of a common convertible type is created. If no such common type exists, an object type ndarray is created.

df_int_float = pd.DataFrame({'A': [1, 2], 'B': [0.1, 0.2]}, index=['X', 'Y'])
print(df_int_float.values)
# [[1.  0.1]
#  [2.  0.2]]

print(df_int_float.values.dtype)
# float64

df_int_str = pd.DataFrame({'A': [1, 2], 'B': ['abc', 'xyz']}, index=['X', 'Y'])
print(df_int_str.values)
# [[1 'abc']
#  [2 'xyz']]

print(df_int_str.values.dtype)
# object

A view is created if possible; otherwise, a copy is created.

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}, index=['X', 'Y'])
print(np.shares_memory(df, df.values))
# True

df_int_float = pd.DataFrame({'A': [1, 2], 'B': [0.1, 0.2]}, index=['X', 'Y'])
print(np.shares_memory(df_int_float, df_int_float.values))
# False

Use copy() to explicitly create a copy.

print(np.shares_memory(df, df.values.copy()))
# False
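
One likely reason to prefer to_numpy() is that values does not always return an ndarray. As a minimal sketch, assuming a Series with the category dtype (an example that does not appear elsewhere in this article), values returns a pandas Categorical, while to_numpy() returns an ndarray.

```python
import numpy as np
import pandas as pd

s_cat = pd.Series(['a', 'b', 'a'], dtype='category')

# values returns the underlying pandas Categorical, not an ndarray.
print(isinstance(s_cat.values, np.ndarray))
# False

# to_numpy() returns a NumPy array.
print(isinstance(s_cat.to_numpy(), np.ndarray))
# True
```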

Convert NumPy arrays to DataFrame and Series

To convert a NumPy array (ndarray) to a DataFrame or Series, use their constructors. The ndarray can be specified in the first argument of the constructor.

pd.DataFrame()

The pd.DataFrame() constructor can create a DataFrame by specifying an ndarray in the first argument.

a_2d = np.array([[1, 2], [3, 4]])
print(a_2d)
# [[1 2]
#  [3 4]]

print(pd.DataFrame(a_2d))
#    0  1
# 0  1  2
# 1  3  4

A one-dimensional array creates a single-column DataFrame, but a three-dimensional or higher array results in an error.

a_1d = np.array([1, 2])
print(a_1d)
# [1 2]

print(pd.DataFrame(a_1d))
#    0
# 0  1
# 1  2

a_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(a_3d)
# [[[1 2]
#   [3 4]]
# 
#  [[5 6]
#   [7 8]]]

# print(pd.DataFrame(a_3d))
# ValueError: Must pass 2-d input. shape=(2, 2, 2)
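
One way to work around this error, shown here as a sketch rather than the only option, is to reshape the array to two dimensions first. The name a_flat is introduced for this illustration. Which axes to keep as rows depends on what the data represents.

```python
import numpy as np
import pandas as pd

a_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

# Keep the first axis as rows and flatten the remaining axes into columns.
a_flat = a_3d.reshape(a_3d.shape[0], -1)
print(a_flat)
# [[1 2 3 4]
#  [5 6 7 8]]

print(pd.DataFrame(a_flat))
#    0  1  2  3
# 0  1  2  3  4
# 1  5  6  7  8
```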

Row names (index), column names (columns), and data type (dtype) can be specified in the arguments.

print(pd.DataFrame(data=a_2d, index=['X', 'Y'], columns=['A', 'B'], dtype=float))
#      A    B
# X  1.0  2.0
# Y  3.0  4.0

Views and copies

When specifying an ndarray as the first argument, by default, pd.DataFrame() creates a view if possible. If the created DataFrame is a view of the original ndarray, the two objects share memory, and changing one will change the other.

a = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(a)

print(np.shares_memory(a, df))
# True

a[0, 0] = 100
print(a)
# [[100   2]
#  [  3   4]]

print(df)
#      0  1
# 0  100  2
# 1    3  4

Setting the copy argument of pd.DataFrame() to True creates a copy.

a = np.array([[1, 2], [3, 4]])
df_copy = pd.DataFrame(a, copy=True)

print(np.shares_memory(a, df_copy))
# False

a[0, 0] = 100
print(a)
# [[100   2]
#  [  3   4]]

print(df_copy)
#    0  1
# 0  1  2
# 1  3  4

Note that setting copy=True guarantees a copy, whereas copy=False (default for ndarray) does not guarantee a view. For example, if the dtype argument causes a type conversion, a copy is created because a view is not possible.

a = np.array([[1, 2], [3, 4]])
df_float = pd.DataFrame(a, dtype=float)

print(np.shares_memory(a, df_float))
# False

pd.Series()

The basic usage of the pd.Series() constructor is similar to that of pd.DataFrame().

A Series can be created by specifying a one-dimensional ndarray in the first argument.

a_1d = np.array([1, 2])
print(a_1d)
# [1 2]

print(pd.Series(a_1d))
# 0    1
# 1    2
# dtype: int64

An error occurs for two-dimensional or higher arrays.

a_2d = np.array([[1, 2], [3, 4]])
print(a_2d)
# [[1 2]
#  [3 4]]

# print(pd.Series(a_2d))
# ValueError: Data must be 1-dimensional, got ndarray of shape (2, 2) instead
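
To build a Series from two-dimensional data, convert it to one dimension first. The sketch below shows two possible approaches: flattening every element with ravel(), or slicing out a single column. The names s_all and s_col are introduced for this illustration, and the dtype is set explicitly only so that the output is the same on every platform.

```python
import numpy as np
import pandas as pd

a_2d = np.array([[1, 2], [3, 4]], dtype='int64')

# Flatten all elements in row-major order.
s_all = pd.Series(a_2d.ravel())
print(s_all)
# 0    1
# 1    2
# 2    3
# 3    4
# dtype: int64

# Or take a single column as a one-dimensional slice.
s_col = pd.Series(a_2d[:, 0])
print(s_col)
# 0    1
# 1    3
# dtype: int64
```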

Labels (index), name (name), and data type (dtype) can also be specified in the arguments.

print(pd.Series(a_1d, index=['A', 'B'], name='my_series', dtype=float))
# A    1.0
# B    2.0
# Name: my_series, dtype: float64

In terms of views and copies, pd.Series() also behaves similarly to pd.DataFrame().

By default, pd.Series() attempts to create a view. Setting the copy argument to True forces it to create a copy instead. However, copy=False (default) does not guarantee a view. For example, if you specify a dtype that requires a type conversion, pd.Series() will create a copy, since a view is not feasible in this case.

a = np.array([1, 2])
print(np.shares_memory(a, pd.Series(a)))
# True

print(np.shares_memory(a, pd.Series(a, copy=True)))
# False

print(np.shares_memory(a, pd.Series(a, dtype=float)))
# False
