pandas: Views and copies in DataFrame
This article explains views and copies in pandas.
When selecting a part of an existing DataFrame using loc[] or iloc[] to create a new one, the result may be a view, sharing memory with the original, or a copy, with independently allocated memory.
Since views refer to the same memory, any modification in one object will also be reflected in the other.
For views and copies in NumPy, see the separate article on that topic. It also introduces np.shares_memory(), which is used in the sample code of this article.
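As a brief reminder of the NumPy behavior that the pandas examples build on, the following sketch contrasts a view with a copy. The variable names are illustrative only.

```python
import numpy as np

a = np.arange(6)

# Basic slicing returns a view that shares memory with `a`.
a_view = a[:3]
# Indexing with a list (fancy indexing) returns a copy.
a_copy = a[[0, 1, 2]]

print(np.shares_memory(a, a_view))
# True
print(np.shares_memory(a, a_copy))
# False

# A write through the original is visible in the view but not in the copy.
a[0] = 100
print(a_view)
# [100   1   2]
print(a_copy)
# [0 1 2]
```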
The pandas and NumPy versions used in this article are as follows. Note that functionality may vary between versions.
import pandas as pd
import numpy as np
print(pd.__version__)
# 2.1.4
print(np.__version__)
# 1.26.2
Important points about views and copies in pandas.DataFrame
A key point to understand about views and copies in pandas is that, as of version 2.1.4, there's no definitive method to determine whether a DataFrame is a view or a copy.
Outside of simple cases, it’s very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees)
Indexing and selecting data — pandas 2.1.4 documentation
As demonstrated in the examples below, neither np.shares_memory() nor the _is_view attribute of a DataFrame always provides an accurate answer.
While calling the copy() method of a DataFrame reliably creates a copy, there is no certain way to ensure a view is created, making it risky to assume that views are generated when processing various data.
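A minimal sketch of the safe pattern: call copy() on the selection whenever the result must be independent of the original. The DataFrame here is a small illustrative example, not data from a real workload.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})

# Without copy(), this selection may or may not share memory with df.
# With copy(), the result owns its data regardless of how it was selected.
df_part = df.iloc[:2].copy()

# A change to the original does not reach the copy...
df.iat[0, 0] = 100
print(df_part.iat[0, 0])
# 0

# ...and a change to the copy does not reach the original.
df_part.iat[1, 1] = -1
print(df.iat[1, 1])
# 4
```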
The main point of this article is not to specify when a view or a copy is returned, but rather to highlight the importance of caution, as it's often unclear whether you're dealing with a view or a copy.
Partial selection with loc and iloc
loc[] selects by row and column labels, whereas iloc[] selects by integer row and column positions. Both support selection with scalar values, slices, lists, etc.
This section creates new objects through various selection methods, checks the result of np.shares_memory() and the _is_view attribute for each, and finally tests whether changes to the original DataFrame affect them.
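The difference between the two indexers is easier to see with a non-default index. The sketch below uses an illustrative DataFrame with string row labels; it shows only what is selected, not whether the result is a view or a copy.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]}, index=['x', 'y', 'z'])

# Scalar: a single row comes back as a Series.
print(df.loc['x'].tolist())
# [0, 3]
print(df.iloc[0].tolist())
# [0, 3]

# Slice: loc includes the end label, iloc excludes the end position.
print(df.loc['x':'y'].index.tolist())
# ['x', 'y']
print(df.iloc[0:1].index.tolist())
# ['x']

# List: rows (and optionally columns) are picked in the given order.
print(df.loc[['z', 'x'], ['B']].values.tolist())
# [[5], [3]]
print(df.iloc[[2, 0], [1]].values.tolist())
# [[5], [3]]
```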
Keep in mind that the following sample code and results are just examples, and whether a view or a copy is generated depends on various conditions.
When all columns are of the same data type (dtype)
In the case where all columns are of the same data type (dtype):
df_homo = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
print(df_homo)
# A B
# 0 0 3
# 1 1 4
# 2 2 5
print(df_homo.dtypes)
# A int64
# B int64
# dtype: object
Select using a slice:
df_homo_slice = df_homo.iloc[:2]
print(df_homo_slice)
# A B
# 0 0 3
# 1 1 4
print(np.shares_memory(df_homo, df_homo_slice))
# True
print(df_homo_slice._is_view)
# True
Select using a list:
df_homo_list = df_homo.iloc[[0, 1]]
print(df_homo_list)
# A B
# 0 0 3
# 1 1 4
print(np.shares_memory(df_homo, df_homo_list))
# False
print(df_homo_list._is_view)
# False
Boolean indexing:
df_homo_bool = df_homo.loc[[True, False, True]]
print(df_homo_bool)
# A B
# 0 0 3
# 2 2 5
print(np.shares_memory(df_homo, df_homo_bool))
# False
print(df_homo_bool._is_view)
# False
Select using a scalar value:
s_homo_scalar = df_homo.iloc[0]
print(s_homo_scalar)
# A 0
# B 3
# Name: 0, dtype: int64
print(np.shares_memory(df_homo, s_homo_scalar))
# True
print(s_homo_scalar._is_view)
# True
Select using [column_name]:
s_homo_col = df_homo['A']
print(s_homo_col)
# 0 0
# 1 1
# 2 2
# Name: A, dtype: int64
print(np.shares_memory(df_homo, s_homo_col))
# True
print(s_homo_col._is_view)
# True
Select using a list of column names:
df_homo_col_list = df_homo[['A', 'B']]
print(df_homo_col_list)
# A B
# 0 0 3
# 1 1 4
# 2 2 5
print(np.shares_memory(df_homo, df_homo_col_list))
# False
print(df_homo_col_list._is_view)
# False
Modify a value in the original DataFrame and check whether it changes in each of the objects created above.
df_homo.iat[0, 0] = 100
print(df_homo)
# A B
# 0 100 3
# 1 1 4
# 2 2 5
print(df_homo_slice)
# A B
# 0 100 3
# 1 1 4
print(df_homo_list)
# A B
# 0 0 3
# 1 1 4
print(df_homo_bool)
# A B
# 0 0 3
# 2 2 5
print(s_homo_scalar)
# A 100
# B 3
# Name: 0, dtype: int64
print(s_homo_col)
# 0 100
# 1 1
# 2 2
# Name: A, dtype: int64
print(df_homo_col_list)
# A B
# 0 0 3
# 1 1 4
# 2 2 5
In these examples, the observed behavior matches np.shares_memory() and the _is_view attribute: selecting with a list or with boolean indexing creates a copy, while the other methods return a view.
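The manual test used above (modify the original, then inspect the selection) can be wrapped in a small function for experimentation. reflects_parent_change() is a hypothetical helper, not a pandas API. It assumes that the selection contains the cell at row 0, column 0 and that this cell is numeric. It reports what happens in the current environment only and guarantees nothing about other data, versions, or settings.

```python
import pandas as pd


def reflects_parent_change(df, select):
    """Return True if writing to the parent's first cell changes the selection.

    Hypothetical helper for experimentation only. It assumes that the
    selection contains the cell at row 0, column 0 and that this cell is numeric.
    """
    parent = df.copy()          # never touch the caller's DataFrame
    child = select(parent)      # object under test
    snapshot = child.copy()     # independent record of the current values
    parent.iat[0, 0] = parent.iat[0, 0] + 1
    return not child.equals(snapshot)


df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})

# The first two results depend on the pandas version and settings.
print(reflects_parent_change(df, lambda d: d.iloc[:2]))
print(reflects_parent_change(df, lambda d: d.iloc[[0, 1]]))

# An explicit copy is never affected.
print(reflects_parent_change(df, lambda d: d.iloc[:2].copy()))
# False
```

With the versions used in this article, the first two calls would be expected to mirror the slice and list results shown above.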
When columns of different data types exist
The situation is more complex when columns of different data types (dtype) are present.
The following answer on Stack Overflow indicates that a copy is always returned, but there may be exceptions.
An indexer that gets on a multiple-dtyped object is always a copy.
python - What rules does Pandas use to generate a view vs a copy? - Stack Overflow
Consider the following DataFrame.
df_hetero = pd.DataFrame({'A': [0, 1, 2], 'B': ['x', 'y', 'z']})
print(df_hetero)
# A B
# 0 0 x
# 1 1 y
# 2 2 z
print(df_hetero.dtypes)
# A int64
# B object
# dtype: object
Select using a slice:
df_hetero_slice_row = df_hetero.iloc[:2]
print(df_hetero_slice_row)
# A B
# 0 0 x
# 1 1 y
print(np.shares_memory(df_hetero, df_hetero_slice_row))
# False
print(df_hetero_slice_row._is_view)
# False
df_hetero_slice_row_col = df_hetero.iloc[:2, 0:]
print(df_hetero_slice_row_col)
# A B
# 0 0 x
# 1 1 y
print(np.shares_memory(df_hetero, df_hetero_slice_row_col))
# False
print(df_hetero_slice_row_col._is_view)
# False
Select using a list:
df_hetero_list = df_hetero.iloc[[0, 1]]
print(df_hetero_list)
# A B
# 0 0 x
# 1 1 y
print(np.shares_memory(df_hetero, df_hetero_list))
# False
print(df_hetero_list._is_view)
# False
Boolean indexing:
df_hetero_bool = df_hetero.loc[[True, False, True]]
print(df_hetero_bool)
# A B
# 0 0 x
# 2 2 z
print(np.shares_memory(df_hetero, df_hetero_bool))
# False
print(df_hetero_bool._is_view)
# False
Select using a scalar value:
s_hetero_scalar = df_hetero.iloc[0]
print(s_hetero_scalar)
# A 0
# B x
# Name: 0, dtype: object
print(np.shares_memory(df_hetero, s_hetero_scalar))
# False
print(s_hetero_scalar._is_view)
# False
Select using [column_name]:
s_hetero_col = df_hetero['A']
print(s_hetero_col)
# 0 0
# 1 1
# 2 2
# Name: A, dtype: int64
print(np.shares_memory(df_hetero, s_hetero_col))
# False
print(s_hetero_col._is_view)
# True
Select using a list of column names:
df_hetero_col_list = df_hetero[['A', 'B']]
print(df_hetero_col_list)
# A B
# 0 0 x
# 1 1 y
# 2 2 z
print(np.shares_memory(df_hetero, df_hetero_col_list))
# False
print(df_hetero_col_list._is_view)
# False
Modify a value in the original DataFrame and check whether it changes in each of the objects created above.
df_hetero.iat[0, 0] = 100
print(df_hetero)
# A B
# 0 100 x
# 1 1 y
# 2 2 z
print(df_hetero_slice_row)
# A B
# 0 100 x
# 1 1 y
print(df_hetero_slice_row_col)
# A B
# 0 0 x
# 1 1 y
print(df_hetero_list)
# A B
# 0 0 x
# 1 1 y
print(df_hetero_bool)
# A B
# 0 0 x
# 2 2 z
print(s_hetero_scalar)
# A 0
# B x
# Name: 0, dtype: object
print(s_hetero_col)
# 0 100
# 1 1
# 2 2
# Name: A, dtype: int64
print(df_hetero_col_list)
# A B
# 0 0 x
# 1 1 y
# 2 2 z
When only rows are selected with a slice (df_hetero_slice_row), both np.shares_memory() and the _is_view attribute return False, yet the change to the original is reflected, so the memory is actually shared.
Likewise, when selecting with [column_name], np.shares_memory() returns False while the _is_view attribute returns True. The change to the original DataFrame is reflected here as well, so np.shares_memory() does not reliably detect shared memory in this situation.
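One likely reason for the False results is that np.shares_memory() first converts each DataFrame to an ndarray, and a mixed-dtype DataFrame has to be assembled into a new object array every time. The sketch below illustrates this with an illustrative DataFrame named df_mixed. The column-wise comparison at the end is only a heuristic added here for illustration, not a documented way to detect views, and its result may vary between versions.

```python
import numpy as np
import pandas as pd

df_mixed = pd.DataFrame({'A': [0, 1, 2], 'B': ['x', 'y', 'z']})

# Converting a mixed-dtype DataFrame builds a fresh object array each time.
print(np.asarray(df_mixed).dtype)
# object

# Consequently the check fails even when a DataFrame is compared with itself.
print(np.shares_memory(df_mixed, df_mixed))
# False

# Heuristic: compare one column's underlying array at a time instead.
df_rows = df_mixed.iloc[:2]
col_shared = np.shares_memory(df_mixed['A'].to_numpy(), df_rows['A'].to_numpy())
print(col_shared)
```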
Memory sharing between numpy.ndarray and pandas.DataFrame
DataFrame and ndarray can be converted to each other, and memory may be shared between them.
In this case, the results from np.shares_memory() are generally reliable.
Both DataFrame and ndarray have their own copy() methods, allowing for the creation of copies in each type.
Generate pandas.DataFrame from numpy.ndarray
When generating a DataFrame from an ndarray:
a = np.array([[0, 1, 2], [3, 4, 5]])
print(a)
# [[0 1 2]
# [3 4 5]]
df = pd.DataFrame(a)
print(df)
# 0 1 2
# 0 0 1 2
# 1 3 4 5
Both np.shares_memory() and the _is_view attribute of DataFrame return True.
print(np.shares_memory(a, df))
# True
print(df._is_view)
# True
Changing the value of the ndarray is reflected in the DataFrame, confirming that it is indeed a view.
a[0, 0] = 100
print(a)
# [[100 1 2]
# [ 3 4 5]]
print(df)
# 0 1 2
# 0 100 1 2
# 1 3 4 5
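If the DataFrame must not follow later changes to the ndarray, the copy parameter of the DataFrame constructor can be used, as in this minimal sketch. Passing a.copy() to the constructor would be an alternative.

```python
import numpy as np
import pandas as pd

a = np.array([[0, 1, 2], [3, 4, 5]])

# copy=True asks the constructor to copy the input data.
df_copy = pd.DataFrame(a, copy=True)

print(np.shares_memory(a, df_copy))
# False

# The DataFrame keeps its own values when the ndarray changes.
a[0, 0] = 100
print(df_copy.iat[0, 0])
# 0
```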
It's not always a view, though; in the case of strings, it becomes a copy.
a_str = np.array([['a', 'b', 'c'], ['x', 'y', 'z']])
print(a_str)
# [['a' 'b' 'c']
# ['x' 'y' 'z']]
df_str = pd.DataFrame(a_str)
print(df_str)
# 0 1 2
# 0 a b c
# 1 x y z
print(np.shares_memory(a_str, df_str))
# False
print(df_str._is_view)
# False
a_str[0, 0] = 'A'
print(a_str)
# [['A' 'b' 'c']
# ['x' 'y' 'z']]
print(df_str)
# 0 1 2
# 0 a b c
# 1 x y z
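A likely explanation for the copy is a dtype conversion. NumPy stores these strings in a fixed-width Unicode dtype, which pandas does not keep for its columns (with pandas 2.1.4 they become object), so the data cannot simply be reused. The sketch below only inspects the dtypes; the printed column dtypes depend on the pandas version.

```python
import numpy as np
import pandas as pd

a_str = np.array([['a', 'b', 'c'], ['x', 'y', 'z']])
df_str = pd.DataFrame(a_str)

# NumPy stores the strings as fixed-width Unicode (kind 'U').
print(a_str.dtype.kind)
# U

# The DataFrame columns use a different dtype, so the values were converted.
print(df_str.dtypes.tolist())
```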
Generate numpy.ndarray from pandas.DataFrame
When generating an ndarray from a DataFrame, a view is typically created if all columns of the DataFrame have the same data type (dtype), as in the following example.
df_homo = pd.DataFrame([[0, 1, 2], [3, 4, 5]])
print(df_homo)
# 0 1 2
# 0 0 1 2
# 1 3 4 5
print(df_homo.dtypes)
# 0 int64
# 1 int64
# 2 int64
# dtype: object
a_homo = df_homo.values
print(a_homo)
# [[0 1 2]
# [3 4 5]]
print(np.shares_memory(a_homo, df_homo))
# True
df_homo.iat[0, 0] = 100
print(df_homo)
# 0 1 2
# 0 100 1 2
# 1 3 4 5
print(a_homo)
# [[100 1 2]
# [ 3 4 5]]
A copy is created if the data types are different.
df_hetero = pd.DataFrame([[0, 'x'], [1, 'y']])
print(df_hetero)
# 0 1
# 0 0 x
# 1 1 y
print(df_hetero.dtypes)
# 0 int64
# 1 object
# dtype: object
a_hetero = df_hetero.values
print(a_hetero)
# [[0 'x']
# [1 'y']]
print(np.shares_memory(a_hetero, df_hetero))
# False
df_hetero.iat[0, 0] = 100
print(df_hetero)
# 0 1
# 0 100 x
# 1 1 y
print(a_hetero)
# [[0 'x']
# [1 'y']]
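When the ndarray must be independent of the DataFrame regardless of the column dtypes, to_numpy() with copy=True can be used instead of relying on the behavior of values. A minimal sketch with an illustrative DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5]])

# copy=True ensures that the returned array is not a view on df's data.
a = df.to_numpy(copy=True)

print(np.shares_memory(a, df))
# False

# The array keeps its own values when the DataFrame changes.
df.iat[0, 0] = 100
print(a[0, 0])
# 0
```

Note that the default copy=False does not promise a view either; it only means that no copy is forced.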