pandas: Views and copies in DataFrame

Tags: Python, pandas

This article explains views and copies in pandas.

When selecting a part of an existing DataFrame using loc[] or iloc[] to create a new one, the result may be a view, sharing memory with the original, or a copy, with independently allocated memory.

Since views refer to the same memory, any modification in one object will also be reflected in the other.

Views and copies in NumPy are covered in a separate article, which also introduces np.shares_memory(), a function used in the sample code of this article.
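As a minimal NumPy-only sketch of the view/copy distinction (the array a and the variable names are illustrative assumptions, not taken from the other article): basic slicing returns a view, indexing with a list returns a copy, and np.shares_memory() reports the difference.

```python
import numpy as np

a = np.array([0, 1, 2, 3])

a_view = a[:2]      # basic slicing returns a view in NumPy
a_copy = a[[0, 1]]  # indexing with a list returns a copy in NumPy

print(np.shares_memory(a, a_view))
# True

print(np.shares_memory(a, a_copy))
# False

# Modify the original array: only the view sees the change.
a[0] = 100
print(a_view)
# [100   1]

print(a_copy)
# [0 1]
```

In NumPy these rules are well defined. The rest of this article shows that pandas offers no comparable guarantee.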

The pandas and NumPy versions used in this article are as follows. Note that functionality may vary between versions.

import pandas as pd
import numpy as np

print(pd.__version__)
# 2.1.4

print(np.__version__)
# 1.26.2

Important points about views and copies in pandas.DataFrame

A key point to understand about views and copies in pandas is that, as of version 2.1.4, there's no definitive method to determine whether a DataFrame is a view or a copy.

Outside of simple cases, it’s very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees)
Indexing and selecting data — pandas 2.1.4 documentation

As demonstrated in the examples below, neither np.shares_memory() nor the _is_view attribute of a DataFrame always provides accurate results.

While calling the copy() method of a DataFrame reliably creates a copy, there is no certain way to ensure a view is created, making it risky to assume that views are generated when processing various data.
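One defensive pattern that follows from this, sketched here under the assumption of a small all-integer DataFrame (the names df and df_part are illustrative): call copy() whenever an independent object is needed, and assign through the original DataFrame itself whenever the original should change, rather than relying on a selection being a view.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})

# An independent object is needed: copy explicitly.
df_part = df.iloc[:2].copy()
df_part.iat[0, 0] = 99
print(df)
#    A  B
# 0  0  3
# 1  1  4
# 2  2  5

# The original should change: assign on the original itself.
df.loc[0, 'A'] = 100
print(df)
#      A  B
# 0  100  3
# 1    1  4
# 2    2  5

print(df_part)
#     A  B
# 0  99  3
# 1   1  4
```

Neither step depends on whether iloc[:2] happens to return a view or a copy.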

The main point of this article is not to specify when a view or a copy is returned, but rather to highlight the importance of caution, as it's often unclear whether you're dealing with a view or a copy.

Partial selection with loc and iloc

loc[] allows selection by row and column names, whereas iloc[] allows selection by row and column numbers. Both methods support selections using scalar values, slices, lists, etc.

This section creates new objects through various selection methods, checks the result of np.shares_memory() and the _is_view attribute for each, and finally tests whether changes in the original DataFrame affect them.

Keep in mind that the following sample code and results are just examples, and whether a view or a copy is generated depends on various conditions.

When all columns are of the same data type (dtype)

Consider the following DataFrame, in which both columns are int64.

df_homo = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
print(df_homo)
#    A  B
# 0  0  3
# 1  1  4
# 2  2  5

print(df_homo.dtypes)
# A    int64
# B    int64
# dtype: object

Select using a slice:

df_homo_slice = df_homo.iloc[:2]
print(df_homo_slice)
#    A  B
# 0  0  3
# 1  1  4

print(np.shares_memory(df_homo, df_homo_slice))
# True

print(df_homo_slice._is_view)
# True

Select using a list:

df_homo_list = df_homo.iloc[[0, 1]]
print(df_homo_list)
#    A  B
# 0  0  3
# 1  1  4

print(np.shares_memory(df_homo, df_homo_list))
# False

print(df_homo_list._is_view)
# False

Boolean indexing:

df_homo_bool = df_homo.loc[[True, False, True]]
print(df_homo_bool)
#    A  B
# 0  0  3
# 2  2  5

print(np.shares_memory(df_homo, df_homo_bool))
# False

print(df_homo_bool._is_view)
# False

Select using a scalar value:

s_homo_scalar = df_homo.iloc[0]
print(s_homo_scalar)
# A    0
# B    3
# Name: 0, dtype: int64

print(np.shares_memory(df_homo, s_homo_scalar))
# True

print(s_homo_scalar._is_view)
# True

Select using [column_name]:

s_homo_col = df_homo['A']
print(s_homo_col)
# 0    0
# 1    1
# 2    2
# Name: A, dtype: int64

print(np.shares_memory(df_homo, s_homo_col))
# True

print(s_homo_col._is_view)
# True

Select using a list of column names:

df_homo_col_list = df_homo[['A', 'B']]
print(df_homo_col_list)
#    A  B
# 0  0  3
# 1  1  4
# 2  2  5

print(np.shares_memory(df_homo, df_homo_col_list))
# False

print(df_homo_col_list._is_view)
# False

Modify the value in the original DataFrame and check if the value in the created DataFrame has changed.

df_homo.iat[0, 0] = 100
print(df_homo)
#      A  B
# 0  100  3
# 1    1  4
# 2    2  5

print(df_homo_slice)
#      A  B
# 0  100  3
# 1    1  4

print(df_homo_list)
#    A  B
# 0  0  3
# 1  1  4

print(df_homo_bool)
#    A  B
# 0  0  3
# 2  2  5

print(s_homo_scalar)
# A    100
# B      3
# Name: 0, dtype: int64

print(s_homo_col)
# 0    100
# 1      1
# 2      2
# Name: A, dtype: int64

print(df_homo_col_list)
#    A  B
# 0  0  3
# 1  1  4
# 2  2  5

In these examples, the observed behavior matches both np.shares_memory() and the _is_view attribute: selecting with a list (of positions or of column names) or with boolean indexing creates a copy, while selecting with a slice, a scalar, or a single column name results in a view.

When columns of different data types exist

The situation is more complex when columns of different data types (dtype) are present.

The following answer on Stack Overflow indicates that a copy is always returned, but there may be exceptions.

An indexer that gets on a multiple-dtyped object is always a copy.
python - What rules does Pandas use to generate a view vs a copy? - Stack Overflow

Consider the following DataFrame.

df_hetero = pd.DataFrame({'A': [0, 1, 2], 'B': ['x', 'y', 'z']})
print(df_hetero)
#    A  B
# 0  0  x
# 1  1  y
# 2  2  z

print(df_hetero.dtypes)
# A     int64
# B    object
# dtype: object

Select using a slice:

df_hetero_slice_row = df_hetero.iloc[:2]
print(df_hetero_slice_row)
#    A  B
# 0  0  x
# 1  1  y

print(np.shares_memory(df_hetero, df_hetero_slice_row))
# False

print(df_hetero_slice_row._is_view)
# False

df_hetero_slice_row_col = df_hetero.iloc[:2, 0:]
print(df_hetero_slice_row_col)
#    A  B
# 0  0  x
# 1  1  y

print(np.shares_memory(df_hetero, df_hetero_slice_row_col))
# False

print(df_hetero_slice_row_col._is_view)
# False

Select using a list:

df_hetero_list = df_hetero.iloc[[0, 1]]
print(df_hetero_list)
#    A  B
# 0  0  x
# 1  1  y

print(np.shares_memory(df_hetero, df_hetero_list))
# False

print(df_hetero_list._is_view)
# False

Boolean indexing:

df_hetero_bool = df_hetero.loc[[True, False, True]]
print(df_hetero_bool)
#    A  B
# 0  0  x
# 2  2  z

print(np.shares_memory(df_hetero, df_hetero_bool))
# False

print(df_hetero_bool._is_view)
# False

Select using a scalar value:

s_hetero_scalar = df_hetero.iloc[0]
print(s_hetero_scalar)
# A    0
# B    x
# Name: 0, dtype: object

print(np.shares_memory(df_hetero, s_hetero_scalar))
# False

print(s_hetero_scalar._is_view)
# False

Select using [column_name]:

s_hetero_col = df_hetero['A']
print(s_hetero_col)
# 0    0
# 1    1
# 2    2
# Name: A, dtype: int64

print(np.shares_memory(df_hetero, s_hetero_col))
# False

print(s_hetero_col._is_view)
# True

Select using a list of column names:

df_hetero_col_list = df_hetero[['A', 'B']]
print(df_hetero_col_list)
#    A  B
# 0  0  x
# 1  1  y
# 2  2  z

print(np.shares_memory(df_hetero, df_hetero_col_list))
# False

print(df_hetero_col_list._is_view)
# False

Modify the value in the original DataFrame and check if the value in the created DataFrame has changed.

df_hetero.iat[0, 0] = 100
print(df_hetero)
#      A  B
# 0  100  x
# 1    1  y
# 2    2  z

print(df_hetero_slice_row)
#      A  B
# 0  100  x
# 1    1  y

print(df_hetero_slice_row_col)
#    A  B
# 0  0  x
# 1  1  y

print(df_hetero_list)
#    A  B
# 0  0  x
# 1  1  y

print(df_hetero_bool)
#    A  B
# 0  0  x
# 2  2  z

print(s_hetero_scalar)
# A    0
# B    x
# Name: 0, dtype: object

print(s_hetero_col)
# 0    100
# 1      1
# 2      2
# Name: A, dtype: int64

print(df_hetero_col_list)
#    A  B
# 0  0  x
# 1  1  y
# 2  2  z

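When selecting only rows with a slice (df_hetero_slice_row), both np.shares_memory() and the _is_view attribute return False, yet the change in the original DataFrame is reflected, so the memory is actually shared. In contrast, slicing both rows and columns (df_hetero_slice_row_col) gives the same two False results and the change is not reflected.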

Additionally, when selecting using [column_name], np.shares_memory() returns False and the _is_view attribute returns True. However, changes in the original DataFrame are still reflected, suggesting that np.shares_memory() is not accurately indicating a shared memory situation in this context.
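Since neither indicator can be trusted here, one option is to check the behavior empirically. The following is a minimal sketch, not an established pandas API: reflects_change is a hypothetical helper that works on a copy of the DataFrame, changes the top-left cell, and reports whether the selection saw the change. It assumes the selection includes the cell at position (0, 0) and that new_value differs from the current value and fits the column's dtype.

```python
import pandas as pd


def reflects_change(df, select, new_value=100):
    """Return True if changing the top-left cell shows up in select(df).

    Hypothetical helper: it runs on a private copy of df, and it only
    detects sharing when the selection includes position (0, 0).
    """
    base = df.copy()            # never touch the caller's DataFrame
    derived = select(base)      # the selection under test
    before = derived.copy()     # snapshot of its current contents
    base.iat[0, 0] = new_value  # modify the (copied) original
    return not derived.equals(before)


df_hetero = pd.DataFrame({'A': [0, 1, 2], 'B': ['x', 'y', 'z']})

checks = {
    'iloc[:2]': lambda d: d.iloc[:2],
    'iloc[[0, 1]]': lambda d: d.iloc[[0, 1]],
    "['A']": lambda d: d['A'],
    'iloc[:2].copy()': lambda d: d.iloc[:2].copy(),
}

# Only the explicit copy() case is guaranteed to print False;
# the other results depend on the pandas version and the data.
for label, select in checks.items():
    print(label, reflects_change(df_hetero, select))
```

Because the helper operates on a copy, whose memory layout may differ from that of the real DataFrame, its result is only an indication for the pandas version and data at hand, not a guarantee.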

Memory sharing between numpy.ndarray and pandas.DataFrame

DataFrame and ndarray can be converted to each other, and memory may be shared between them.

In this case, the results from np.shares_memory() are generally reliable.

Both DataFrame and ndarray have their own copy() methods, allowing for the creation of copies in each type.
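As a minimal sketch of detaching the two objects in both directions with copy() (the names df_from_a and a_from_df are illustrative assumptions):

```python
import numpy as np
import pandas as pd

a = np.array([[0, 1, 2], [3, 4, 5]])

# ndarray -> DataFrame, detached with DataFrame.copy()
df_from_a = pd.DataFrame(a).copy()

# DataFrame -> ndarray, detached with ndarray.copy()
a_from_df = df_from_a.values.copy()

# Changing the source array does not affect the copied DataFrame.
a[0, 0] = 100
print(df_from_a.iat[0, 0])
# 0

# Changing the DataFrame does not affect the copied array.
df_from_a.iat[0, 0] = -1
print(a_from_df[0, 0])
# 0
```

Copying costs extra memory, but it removes any dependence on whether the conversion itself would have shared memory.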

Generate pandas.DataFrame from numpy.ndarray

When generating a DataFrame from an ndarray:

a = np.array([[0, 1, 2], [3, 4, 5]])
print(a)
# [[0 1 2]
#  [3 4 5]]

df = pd.DataFrame(a)
print(df)
#    0  1  2
# 0  0  1  2
# 1  3  4  5

Both np.shares_memory() and the _is_view attribute of DataFrame return True.

print(np.shares_memory(a, df))
# True

print(df._is_view)
# True

Changing the value of the ndarray is reflected in the DataFrame, confirming that it is indeed a view.

a[0, 0] = 100
print(a)
# [[100   1   2]
#  [  3   4   5]]

print(df)
#      0  1  2
# 0  100  1  2
# 1    3  4  5

It's not always a view, though; in the case of strings, it becomes a copy.

a_str = np.array([['a', 'b', 'c'], ['x', 'y', 'z']])
print(a_str)
# [['a' 'b' 'c']
#  ['x' 'y' 'z']]

df_str = pd.DataFrame(a_str)
print(df_str)
#    0  1  2
# 0  a  b  c
# 1  x  y  z

print(np.shares_memory(a_str, df_str))
# False

print(df_str._is_view)
# False

a_str[0, 0] = 'A'
print(a_str)
# [['A' 'b' 'c']
#  ['x' 'y' 'z']]

print(df_str)
#    0  1  2
# 0  a  b  c
# 1  x  y  z

Generate numpy.ndarray from pandas.DataFrame

When generating an ndarray from a DataFrame, a view is created in the following example, where all columns of the DataFrame have the same data type (dtype).

df_homo = pd.DataFrame([[0, 1, 2], [3, 4, 5]])
print(df_homo)
#    0  1  2
# 0  0  1  2
# 1  3  4  5

print(df_homo.dtypes)
# 0    int64
# 1    int64
# 2    int64
# dtype: object

a_homo = df_homo.values
print(a_homo)
# [[0 1 2]
#  [3 4 5]]

print(np.shares_memory(a_homo, df_homo))
# True

df_homo.iat[0, 0] = 100
print(df_homo)
#      0  1  2
# 0  100  1  2
# 1    3  4  5

print(a_homo)
# [[100   1   2]
#  [  3   4   5]]

A copy is created if the data types are different.

df_hetero = pd.DataFrame([[0, 'x'], [1, 'y']])
print(df_hetero)
#    0  1
# 0  0  x
# 1  1  y

print(df_hetero.dtypes)
# 0     int64
# 1    object
# dtype: object

a_hetero = df_hetero.values
print(a_hetero)
# [[0 'x']
#  [1 'y']]

print(np.shares_memory(a_hetero, df_hetero))
# False

df_hetero.iat[0, 0] = 100
print(df_hetero)
#      0  1
# 0  100  x
# 1    1  y

print(a_hetero)
# [[0 'x']
#  [1 'y']]
