note.nkmk.me

pandas: Random sampling of rows, columns from DataFrame with sample()

Posted: 2019-07-12 / Tags: Python, pandas

The sample() method, which selects rows or columns at random (random sampling), is useful for checking the data in a pandas.DataFrame or pandas.Series with many rows.

This article covers the following topics.

  • Default behavior of sample()
  • The number of rows and columns: n
  • The fraction of rows and columns: frac
  • The seed for the random number generator: random_state
  • With or without replacement: replace
  • Rows or columns: axis

Other methods are also useful for inspecting a large pandas.DataFrame or pandas.Series, such as head() and tail(), which return the first and last n rows respectively.

The examples use the iris data set, which seaborn provides as a sample data set.

import pandas as pd
import seaborn as sns

df = sns.load_dataset("iris")
print(df.shape)
# (150, 5)

The following examples use pandas.DataFrame, but pandas.Series also has sample(), and the usage is the same for both.
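
As a minimal sketch of the Series case, the snippet below calls sample() on a small pandas.Series. The data and the names s and picked are hypothetical stand-ins chosen for illustration, not part of the iris example.

```python
import pandas as pd

# Hypothetical stand-in data; the article itself uses the iris data set.
s = pd.Series(range(10, 20), name="value")

# sample() on a Series takes n and random_state just as on a DataFrame.
picked = s.sample(n=3, random_state=0)
print(picked)

# The result is still a Series, and its index labels come from the original.
print(type(picked).__name__)
print(picked.index.isin(s.index).all())
```

The selected elements keep their original index labels, which makes it easy to trace them back to the source data.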

Default behavior of sample()

By default, sample() returns one randomly selected row.

print(df.sample())
#      sepal_length  sepal_width  petal_length  petal_width    species
# 108           6.7          2.5           5.8          1.8  virginica

The number of rows and columns: n

The number of rows or columns to be selected can be specified in the parameter n.

print(df.sample(n=3))
#     sepal_length  sepal_width  petal_length  petal_width     species
# 3            4.6          3.1           1.5          0.2      setosa
# 1            4.9          3.0           1.4          0.2      setosa
# 96           5.7          2.9           4.2          1.3  versicolor

The fraction of rows and columns: frac

The fraction of rows or columns to be selected can be specified in the parameter frac. frac=1 means 100%.

You cannot specify n and frac at the same time.

print(df.sample(frac=0.04))
#      sepal_length  sepal_width  petal_length  petal_width     species
# 119           6.0          2.2           5.0          1.5   virginica
# 97            6.2          2.9           4.3          1.3  versicolor
# 46            5.1          3.8           1.6          0.2      setosa
# 137           6.4          3.1           5.5          1.8   virginica
# 56            6.3          3.3           4.7          1.6  versicolor
# 62            6.0          2.2           4.0          1.0  versicolor
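
The behavior of frac can be checked with a short sketch. It uses a synthetic 150-row DataFrame (the hypothetical df_demo) in place of iris so that it runs without seaborn. It shows the row count for frac=0.04, the use of frac=1 to return every row in random order, and the error raised when n and frac are given together.

```python
import pandas as pd

# Hypothetical stand-in with the same number of rows as iris (150).
df_demo = pd.DataFrame({"a": range(150), "b": range(150, 300)})

# 4% of 150 rows is 6 rows.
sampled = df_demo.sample(frac=0.04)
print(len(sampled))

# frac=1 returns every row exactly once, in random order (a shuffle).
shuffled = df_demo.sample(frac=1)
print(sorted(shuffled.index) == list(df_demo.index))

# Passing both n and frac raises a ValueError.
both_error = None
try:
    df_demo.sample(n=3, frac=0.04)
except ValueError as err:
    both_error = err
print(type(both_error).__name__)
```

After shuffling with frac=1, the original index labels are kept; calling reset_index(drop=True) on the result renumbers the rows if that is preferred.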

The seed for the random number generator: random_state

The seed for the random number generator can be specified in the parameter random_state. With a fixed seed, the same rows or columns are returned on every call.

print(df.sample(n=3, random_state=0))
#      sepal_length  sepal_width  petal_length  petal_width     species
# 114           5.8          2.8           5.1          2.4   virginica
# 62            6.0          2.2           4.0          1.0  versicolor
# 33            5.5          4.2           1.4          0.2      setosa
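
The reproducibility can be confirmed by sampling twice with the same seed and comparing the results. This is a small sketch on a synthetic DataFrame (the hypothetical df_demo), not on the iris data.

```python
import pandas as pd

# Hypothetical stand-in data.
df_demo = pd.DataFrame({"a": range(150)})

# Two calls with the same seed on the same data select the same rows.
first = df_demo.sample(n=3, random_state=0)
second = df_demo.sample(n=3, random_state=0)
print(first.equals(second))
```

Fixing the seed is handy when a sampled subset needs to be the same across repeated runs of a script.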

With or without replacement: replace

If the argument replace is set to True, rows or columns are sampled with replacement, so the same row or column may be selected more than once.

The default value for replace is False (sampling without replacement).

If replace=True, you can specify a value greater than the original number of rows or columns in n, or a value greater than 1 in frac.

print(df.head(3).sample(n=3, replace=True))
#    sepal_length  sepal_width  petal_length  petal_width species
# 2           4.7          3.2           1.3          0.2  setosa
# 1           4.9          3.0           1.4          0.2  setosa
# 1           4.9          3.0           1.4          0.2  setosa

print(df.head(3).sample(n=5, replace=True))
#    sepal_length  sepal_width  petal_length  petal_width species
# 1           4.9          3.0           1.4          0.2  setosa
# 0           5.1          3.5           1.4          0.2  setosa
# 1           4.9          3.0           1.4          0.2  setosa
# 0           5.1          3.5           1.4          0.2  setosa
# 0           5.1          3.5           1.4          0.2  setosa
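
For contrast, the sketch below shows what happens with the default replace=False when n exceeds the number of rows, and how frac greater than 1 behaves with replace=True. The three-row DataFrame small is a hypothetical example.

```python
import pandas as pd

# Hypothetical three-row DataFrame.
small = pd.DataFrame({"a": [1, 2, 3]})

# Without replacement, asking for more rows than exist raises a ValueError.
oversample_error = None
try:
    small.sample(n=5)
except ValueError as err:
    oversample_error = err
print(type(oversample_error).__name__)

# With replacement, frac=2 returns twice as many rows as the original.
doubled = small.sample(frac=2, replace=True)
print(len(doubled))
```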

Rows or columns: axis

With axis=1, you can randomly sample columns. The default is axis=0, which samples rows, as in the previous examples.

print(df.head().sample(n=2, axis=1))
#    sepal_width species
# 0          3.5  setosa
# 1          3.0  setosa
# 2          3.2  setosa
# 3          3.1  setosa
# 4          3.6  setosa
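
A quick way to see the effect of axis is to check the shape of the result. The following sketch uses a hypothetical four-column DataFrame named wide, and also shows that the axis can be given by name as axis="columns".

```python
import pandas as pd

# Hypothetical DataFrame with 2 rows and 4 columns.
wide = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

# axis=1 samples columns: all rows are kept and 2 of the 4 columns remain.
cols = wide.sample(n=2, axis=1, random_state=0)
print(cols.shape)

# The axis can also be specified by name.
cols_by_name = wide.sample(n=2, axis="columns", random_state=0)
print(cols_by_name.shape)
```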