pandas: Random sampling from DataFrame with sample()
You can get a random sample from pandas.DataFrame and Series by the sample() method. This is useful for checking data in a large pandas.DataFrame, Series.
- pandas.DataFrame.sample — pandas 1.4.2 documentation
- pandas.Series.sample — pandas 1.4.2 documentation
This article describes the following contents.
- Default behavior of
sample() - Rows or columns:
axis - The number of rows and columns:
n - The fraction of rows and columns:
frac - The seed for the random number generator:
random_state - With or without replacement:
replace - Reset index:
ignore_index,reset_index()
Use the iris data set included as a sample in seaborn.
import pandas as pd
import seaborn as sns
df = sns.load_dataset("iris")
print(df.shape)
# (150, 5)
The following examples are for pandas.DataFrame, but pandas.Series also has sample(). The usage is the same for both.
Note that you can check large size pandas.DataFrame and Series with head() and tail(), which return the first/last n rows.
Default behavior of sample()
By default, one row is randomly selected.
print(df.sample())
# sepal_length sepal_width petal_length petal_width species
# 133 6.3 2.8 5.1 1.5 virginica
Rows or columns: axis
If the axis parameter is set to 1, a column is randomly extracted instead of a row.
print(df.sample(axis=1))
# petal_width
# 0 0.2
# 1 0.2
# 2 0.2
# 3 0.2
# 4 0.2
# .. ...
# 145 2.3
# 146 1.9
# 147 2.0
# 148 2.3
# 149 1.8
#
# [150 rows x 1 columns]
The number of rows and columns: n
The number of rows or columns to be selected can be specified in the n parameter.
print(df.sample(n=3))
# sepal_length sepal_width petal_length petal_width species
# 29 4.7 3.2 1.6 0.2 setosa
# 67 5.8 2.7 4.1 1.0 versicolor
# 18 5.7 3.8 1.7 0.3 setosa
The fraction of rows and columns: frac
The fraction of rows and columns to be selected can be specified in the frac parameter. frac=1 means 100%.
print(df.sample(frac=0.04))
# sepal_length sepal_width petal_length petal_width species
# 15 5.7 4.4 1.5 0.4 setosa
# 66 5.6 3.0 4.5 1.5 versicolor
# 131 7.9 3.8 6.4 2.0 virginica
# 64 5.6 2.9 3.6 1.3 versicolor
# 81 5.5 2.4 3.7 1.0 versicolor
# 137 6.4 3.1 5.5 1.8 virginica
You cannot specify n and frac at the same time.
# print(df.sample(n=3, frac=0.04))
# ValueError: Please enter a value for `frac` OR `n`, not both
The seed for the random number generator: random_state
The seed for the random number generator can be specified in the random_state parameter. The same rows/columns are returned for the same random_state.
print(df.sample(n=3, random_state=0))
# sepal_length sepal_width petal_length petal_width species
# 114 5.8 2.8 5.1 2.4 virginica
# 62 6.0 2.2 4.0 1.0 versicolor
# 33 5.5 4.2 1.4 0.2 setosa
print(df.sample(n=3, random_state=0))
# sepal_length sepal_width petal_length petal_width species
# 114 5.8 2.8 5.1 2.4 virginica
# 62 6.0 2.2 4.0 1.0 versicolor
# 33 5.5 4.2 1.4 0.2 setosa
With or without replacement: replace
If the replace parameter is set to True, rows and columns are sampled with replacement. The same row/column may be selected repeatedly.
The default value for replace is False (sampling without replacement).
print(df.head(3))
# sepal_length sepal_width petal_length petal_width species
# 0 5.1 3.5 1.4 0.2 setosa
# 1 4.9 3.0 1.4 0.2 setosa
# 2 4.7 3.2 1.3 0.2 setosa
print(df.head(3).sample(n=3, replace=True))
# sepal_length sepal_width petal_length petal_width species
# 0 5.1 3.5 1.4 0.2 setosa
# 0 5.1 3.5 1.4 0.2 setosa
# 2 4.7 3.2 1.3 0.2 setosa
If replace=True, you can specify a value greater than the original number of rows/columns in n or a value greater than 1 in frac.
print(df.head(3).sample(n=5, replace=True))
# sepal_length sepal_width petal_length petal_width species
# 1 4.9 3.0 1.4 0.2 setosa
# 2 4.7 3.2 1.3 0.2 setosa
# 0 5.1 3.5 1.4 0.2 setosa
# 0 5.1 3.5 1.4 0.2 setosa
# 1 4.9 3.0 1.4 0.2 setosa
print(df.head(3).sample(frac=2, replace=True))
# sepal_length sepal_width petal_length petal_width species
# 2 4.7 3.2 1.3 0.2 setosa
# 1 4.9 3.0 1.4 0.2 setosa
# 2 4.7 3.2 1.3 0.2 setosa
# 2 4.7 3.2 1.3 0.2 setosa
# 0 5.1 3.5 1.4 0.2 setosa
# 2 4.7 3.2 1.3 0.2 setosa
Reset index: ignore_index, reset_index()
If you want to reindex the result (0, 1, ... , n-1), set the ignore_index parameter of sample() to True.
print(df.sample(n=3, ignore_index=True))
# sepal_length sepal_width petal_length petal_width species
# 0 5.2 2.7 3.9 1.4 versicolor
# 1 6.3 2.5 4.9 1.5 versicolor
# 2 5.7 3.0 4.2 1.2 versicolor
The ignore_index was added in pandas 1.3.0. For earlier versions, you can use the reset_index() method. Set the drop parameter to True to delete the original index.
print(df.sample(n=3).reset_index(drop=True))
# sepal_length sepal_width petal_length petal_width species
# 0 4.9 3.1 1.5 0.2 setosa
# 1 7.9 3.8 6.4 2.0 virginica
# 2 6.3 2.8 5.1 1.5 virginica