pandas: Views and copies in DataFrame
This article explains views and copies in pandas.
When selecting a part of an existing DataFrame using loc[] or iloc[] to create a new one, the result may be a view, sharing memory with the original, or a copy, with independently allocated memory.
Since views refer to the same memory, any modification in one object will also be reflected in the other.
For views and copies in NumPy, refer to the following article. It also introduces np.shares_memory(), which is used in the sample code of this article.
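As a quick reminder of how np.shares_memory() behaves on plain arrays, here is a minimal sketch. The array and variable names are illustrative and not taken from the referenced article.

```python
import numpy as np

a = np.arange(6)

# Basic slicing returns a view that shares memory with the original array.
a_slice = a[:3]
print(np.shares_memory(a, a_slice))
# True

# Fancy indexing with a list of positions returns a copy.
a_list = a[[0, 1, 2]]
print(np.shares_memory(a, a_list))
# False

# A write through the view is visible in the original; the copy is unaffected.
a_slice[0] = 100
print(a[0], a_list[0])
# 100 0
```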
The pandas and NumPy versions used in this article are as follows. Note that functionality may vary between versions.
import pandas as pd
import numpy as np
print(pd.__version__)
# 2.1.4
print(np.__version__)
# 1.26.2
Important points about views and copies in pandas.DataFrame
A key point to understand about views and copies in pandas is that, as of version 2.1.4, there's no definitive method to determine whether a DataFrame is a view or a copy.
Outside of simple cases, it’s very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees)
Indexing and selecting data — pandas 2.1.4 documentation
As demonstrated in the examples below, neither np.shares_memory() nor the _is_view attribute of a DataFrame is always accurate.
While calling the copy() method of a DataFrame reliably creates a copy, there is no certain way to ensure a view is created, making it risky to assume that views are generated when processing various data.
The main point of this article is not to specify when a view or a copy is returned, but rather to highlight the importance of caution, as it's often unclear whether you're dealing with a view or a copy.
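When a selection must be independent of its source, an explicit copy() is the one approach described above as reliable. A minimal sketch (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})

# Take an explicit copy of the selection so that it owns its data.
df_sub = df.iloc[:2].copy()

# Modify the original after the selection was made.
df.iat[0, 0] = 100

print(df_sub.iat[0, 0])
# 0
```

The cost is extra memory for the copied data, which is usually acceptable when correctness matters more than avoiding an allocation.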
Partial selection with loc and iloc
loc[] selects by row and column labels, whereas iloc[] selects by integer row and column positions. Both support selection using scalar values, slices, lists, etc.
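To make the difference between the two indexers concrete, here is a small sketch that selects the same data both ways. The string index labels 'x', 'y', 'z' are an assumption added for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]}, index=['x', 'y', 'z'])

# loc[] selects by label; a label slice includes its end point.
by_label = df.loc['x':'y', ['A']]

# iloc[] selects by integer position; a position slice excludes its end point.
by_position = df.iloc[0:2, [0]]

print(by_label.equals(by_position))
# True
```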
This section creates new objects from a DataFrame through various selection methods, checks the result of np.shares_memory() and the _is_view attribute for each, and finally tests whether changes in the original DataFrame affect them.
Keep in mind that the following sample code and results are just examples, and whether a view or a copy is generated depends on various conditions.
When all columns are of the same data type (dtype)
Consider a DataFrame in which all columns have the same data type (dtype):
df_homo = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
print(df_homo)
# A B
# 0 0 3
# 1 1 4
# 2 2 5
print(df_homo.dtypes)
# A int64
# B int64
# dtype: object
Select using a slice:
df_homo_slice = df_homo.iloc[:2]
print(df_homo_slice)
# A B
# 0 0 3
# 1 1 4
print(np.shares_memory(df_homo, df_homo_slice))
# True
print(df_homo_slice._is_view)
# True
Select using a list:
df_homo_list = df_homo.iloc[[0, 1]]
print(df_homo_list)
# A B
# 0 0 3
# 1 1 4
print(np.shares_memory(df_homo, df_homo_list))
# False
print(df_homo_list._is_view)
# False
Boolean indexing:
df_homo_bool = df_homo.loc[[True, False, True]]
print(df_homo_bool)
# A B
# 0 0 3
# 2 2 5
print(np.shares_memory(df_homo, df_homo_bool))
# False
print(df_homo_bool._is_view)
# False
Select using a scalar value:
s_homo_scalar = df_homo.iloc[0]
print(s_homo_scalar)
# A 0
# B 3
# Name: 0, dtype: int64
print(np.shares_memory(df_homo, s_homo_scalar))
# True
print(s_homo_scalar._is_view)
# True
Select using [column_name]:
s_homo_col = df_homo['A']
print(s_homo_col)
# 0 0
# 1 1
# 2 2
# Name: A, dtype: int64
print(np.shares_memory(df_homo, s_homo_col))
# True
print(s_homo_col._is_view)
# True
Select using a list of column names:
df_homo_col_list = df_homo[['A', 'B']]
print(df_homo_col_list)
# A B
# 0 0 3
# 1 1 4
# 2 2 5
print(np.shares_memory(df_homo, df_homo_col_list))
# False
print(df_homo_col_list._is_view)
# False
Modify a value in the original DataFrame and check whether the value in each of the created objects has changed.
df_homo.iat[0, 0] = 100
print(df_homo)
# A B
# 0 100 3
# 1 1 4
# 2 2 5
print(df_homo_slice)
# A B
# 0 100 3
# 1 1 4
print(df_homo_list)
# A B
# 0 0 3
# 1 1 4
print(df_homo_bool)
# A B
# 0 0 3
# 2 2 5
print(s_homo_scalar)
# A 100
# B 3
# Name: 0, dtype: int64
print(s_homo_col)
# 0 100
# 1 1
# 2 2
# Name: A, dtype: int64
print(df_homo_col_list)
# A B
# 0 0 3
# 1 1 4
# 2 2 5
In these examples, the observed behavior matches np.shares_memory() and the _is_view attribute: selection with a list or with boolean indexing creates a copy, while the other methods return a view.
When columns of different data types exist
The situation is more complex when columns of different data types (dtype) are present.
The following answer on Stack Overflow indicates that a copy is always returned, but there may be exceptions.
An indexer that gets on a multiple-dtyped object is always a copy.
python - What rules does Pandas use to generate a view vs a copy? - Stack Overflow
Consider the following DataFrame.
df_hetero = pd.DataFrame({'A': [0, 1, 2], 'B': ['x', 'y', 'z']})
print(df_hetero)
# A B
# 0 0 x
# 1 1 y
# 2 2 z
print(df_hetero.dtypes)
# A int64
# B object
# dtype: object
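Because the behavior differs between single-dtype and mixed-dtype data, it can help to check which case applies before reasoning about views. One possible check is sketched below; has_single_dtype is a hypothetical helper name, and counting distinct column dtypes is assumed to be a sufficient test. Sharing a dtype does not by itself guarantee that a view is returned.

```python
import pandas as pd


def has_single_dtype(df: pd.DataFrame) -> bool:
    """Hypothetical helper: True if every column has the same dtype."""
    return df.dtypes.nunique() == 1


print(has_single_dtype(pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})))
# True
print(has_single_dtype(pd.DataFrame({'A': [0, 1, 2], 'B': ['x', 'y', 'z']})))
# False
```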
Select using a slice:
df_hetero_slice_row = df_hetero.iloc[:2]
print(df_hetero_slice_row)
# A B
# 0 0 x
# 1 1 y
print(np.shares_memory(df_hetero, df_hetero_slice_row))
# False
print(df_hetero_slice_row._is_view)
# False
df_hetero_slice_row_col = df_hetero.iloc[:2, 0:]
print(df_hetero_slice_row_col)
# A B
# 0 0 x
# 1 1 y
print(np.shares_memory(df_hetero, df_hetero_slice_row_col))
# False
print(df_hetero_slice_row_col._is_view)
# False
Select using a list:
df_hetero_list = df_hetero.iloc[[0, 1]]
print(df_hetero_list)
# A B
# 0 0 x
# 1 1 y
print(np.shares_memory(df_hetero, df_hetero_list))
# False
print(df_hetero_list._is_view)
# False
Boolean indexing:
df_hetero_bool = df_hetero.loc[[True, False, True]]
print(df_hetero_bool)
# A B
# 0 0 x
# 2 2 z
print(np.shares_memory(df_hetero, df_hetero_bool))
# False
print(df_hetero_bool._is_view)
# False
Select using a scalar value:
s_hetero_scalar = df_hetero.iloc[0]
print(s_hetero_scalar)
# A 0
# B x
# Name: 0, dtype: object
print(np.shares_memory(df_hetero, s_hetero_scalar))
# False
print(s_hetero_scalar._is_view)
# False
Select using [column_name]:
s_hetero_col = df_hetero['A']
print(s_hetero_col)
# 0 0
# 1 1
# 2 2
# Name: A, dtype: int64
print(np.shares_memory(df_hetero, s_hetero_col))
# False
print(s_hetero_col._is_view)
# True
Select using a list of column names:
df_hetero_col_list = df_hetero[['A', 'B']]
print(df_hetero_col_list)
# A B
# 0 0 x
# 1 1 y
# 2 2 z
print(np.shares_memory(df_hetero, df_hetero_col_list))
# False
print(df_hetero_col_list._is_view)
# False
Modify a value in the original DataFrame and check whether the value in each of the created objects has changed.
df_hetero.iat[0, 0] = 100
print(df_hetero)
# A B
# 0 100 x
# 1 1 y
# 2 2 z
print(df_hetero_slice_row)
# A B
# 0 100 x
# 1 1 y
print(df_hetero_slice_row_col)
# A B
# 0 0 x
# 1 1 y
print(df_hetero_list)
# A B
# 0 0 x
# 1 1 y
print(df_hetero_bool)
# A B
# 0 0 x
# 2 2 z
print(s_hetero_scalar)
# A 0
# B x
# Name: 0, dtype: object
print(s_hetero_col)
# 0 100
# 1 1
# 2 2
# Name: A, dtype: int64
print(df_hetero_col_list)
# A B
# 0 0 x
# 1 1 y
# 2 2 z
When only rows are selected with a slice (df_hetero_slice_row), both np.shares_memory() and the _is_view attribute return False, yet the memory is actually shared: the change to the original DataFrame appears in the selection. By contrast, the selection that also slices the columns (df_hetero_slice_row_col) does not reflect the change.
Additionally, when selecting using [column_name], np.shares_memory() returns False while the _is_view attribute returns True. The change in the original DataFrame is still reflected, so np.shares_memory() does not reliably detect shared memory in this case.
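Since neither indicator is dependable here, one option is to test the behavior directly, as the examples above do by hand. The following is a minimal sketch of that idea. reflects_parent_change is a hypothetical helper, not a pandas API, and it assumes that the selection includes the top-left cell of the DataFrame and that this cell is numeric. The result for a plain slice can differ between pandas versions and settings (for example, with Copy-on-Write enabled), so no expected output is shown for that case.

```python
import pandas as pd


def reflects_parent_change(select, df):
    """Hypothetical helper: does writing to the parent change the selection?

    Works on a scratch copy so the caller's DataFrame is left untouched.
    Assumes the selection contains the cell at position (0, 0).
    """
    scratch = df.copy()
    selected = select(scratch)
    before = selected.copy()  # independent snapshot of the selection
    scratch.iat[0, 0] = scratch.iat[0, 0] + 1
    return not selected.equals(before)


df = pd.DataFrame({'A': [0, 1, 2], 'B': ['x', 'y', 'z']})

# Version-dependent: may print True or False.
print(reflects_parent_change(lambda d: d.iloc[:2], df))

# An explicit copy never reflects later changes to the parent.
print(reflects_parent_change(lambda d: d.iloc[:2].copy(), df))
# False
```

A check like this only describes the data and pandas version it was run against, so it is better suited to tests than to branching logic in production code.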
Memory sharing between numpy.ndarray and pandas.DataFrame
DataFrame and ndarray can be converted to each other, and memory may be shared between them.
In this case, the results from np.shares_memory() are generally reliable.
Both DataFrame and ndarray have their own copy() methods, allowing for the creation of copies in each type.
Generate pandas.DataFrame from numpy.ndarray
When generating a DataFrame from an ndarray:
a = np.array([[0, 1, 2], [3, 4, 5]])
print(a)
# [[0 1 2]
# [3 4 5]]
df = pd.DataFrame(a)
print(df)
# 0 1 2
# 0 0 1 2
# 1 3 4 5
Both np.shares_memory() and the _is_view attribute of DataFrame return True.
print(np.shares_memory(a, df))
# True
print(df._is_view)
# True
Changing the value of the ndarray is reflected in the DataFrame, confirming that it is indeed a view.
a[0, 0] = 100
print(a)
# [[100 1 2]
# [ 3 4 5]]
print(df)
# 0 1 2
# 0 100 1 2
# 1 3 4 5
It's not always a view, though; in the case of strings, it becomes a copy.
a_str = np.array([['a', 'b', 'c'], ['x', 'y', 'z']])
print(a_str)
# [['a' 'b' 'c']
# ['x' 'y' 'z']]
df_str = pd.DataFrame(a_str)
print(df_str)
# 0 1 2
# 0 a b c
# 1 x y z
print(np.shares_memory(a_str, df_str))
# False
print(df_str._is_view)
# False
a_str[0, 0] = 'A'
print(a_str)
# [['A' 'b' 'c']
# ['x' 'y' 'z']]
print(df_str)
# 0 1 2
# 0 a b c
# 1 x y z
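If the DataFrame must not be affected by later changes to the source array, a copy can be requested at construction time. A minimal sketch using the copy parameter of the pd.DataFrame constructor (variable names are illustrative):

```python
import numpy as np
import pandas as pd

a = np.array([[0, 1, 2], [3, 4, 5]])

# copy=True asks pandas to copy the input data instead of referencing it.
df_copied = pd.DataFrame(a, copy=True)

a[0, 0] = 100

print(df_copied.iat[0, 0])
# 0
```

Passing a.copy() to the constructor is an alternative that relies on the ndarray copy() method instead.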
Generate numpy.ndarray from pandas.DataFrame
When generating an ndarray from a DataFrame, a view is created if all columns of the DataFrame have the same data type (dtype).
df_homo = pd.DataFrame([[0, 1, 2], [3, 4, 5]])
print(df_homo)
# 0 1 2
# 0 0 1 2
# 1 3 4 5
print(df_homo.dtypes)
# 0 int64
# 1 int64
# 2 int64
# dtype: object
a_homo = df_homo.values
print(a_homo)
# [[0 1 2]
# [3 4 5]]
print(np.shares_memory(a_homo, df_homo))
# True
df_homo.iat[0, 0] = 100
print(df_homo)
# 0 1 2
# 0 100 1 2
# 1 3 4 5
print(a_homo)
# [[100 1 2]
# [ 3 4 5]]
A copy is created if the data types are different.
df_hetero = pd.DataFrame([[0, 'x'], [1, 'y']])
print(df_hetero)
# 0 1
# 0 0 x
# 1 1 y
print(df_hetero.dtypes)
# 0 int64
# 1 object
# dtype: object
a_hetero = df_hetero.values
print(a_hetero)
# [[0 'x']
# [1 'y']]
print(np.shares_memory(a_hetero, df_hetero))
# False
df_hetero.iat[0, 0] = 100
print(df_hetero)
# 0 1
# 0 100 x
# 1 1 y
print(a_hetero)
# [[0 'x']
# [1 'y']]
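When the array must be independent of the DataFrame regardless of its dtypes, a copy can be requested explicitly. A minimal sketch using to_numpy() with copy=True (variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5]])

# copy=True guarantees the returned array does not share memory with df.
a_copied = df.to_numpy(copy=True)

df.iat[0, 0] = 100

print(a_copied[0, 0])
# 0
```

Calling copy() on the array returned by values should achieve the same result.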