pandas: Find the quantile with quantile()

Posted: 2024-01-19 | Tags: Python, pandas

In pandas, the quantile() method allows you to find the quantiles for columns or rows in a DataFrame. This method is also available on Series.

Quantiles are defined as follows:

In statistics and probability, quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way.
...
Common quantiles have special names, such as quartiles (four groups), deciles (ten groups), and percentiles (100 groups). Quantile - Wikipedia

Contents

Basic usage of quantile()
Specify the quantile(s) to compute: q
Specify interpolation method: interpolation
- Differences in data type (dtype) by interpolation method
Specify column-wise or row-wise: axis
Specify whether to process non-numeric data: numeric_only
quantile() for strings (str)
quantile() for datetime
quantile() for bool

The describe() method is useful to compute summary statistics including quartiles.

pandas: Get summary statistics for each column with describe()

The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.

import pandas as pd

print(pd.__version__)
# 2.1.4

df = pd.DataFrame({'col_1': range(11), 'col_2': [i**2 for i in range(11)]})
print(df)
#     col_1  col_2
# 0       0      0
# 1       1      1
# 2       2      4
# 3       3      9
# 4       4     16
# 5       5     25
# 6       6     36
# 7       7     49
# 8       8     64
# 9       9     81
# 10     10    100

source: pandas_quantile.py

Basic usage of `quantile()`

By default, the quantile() method on a DataFrame returns the median (the second quartile or 50th percentile) for each column. This result is presented as a Series. How it handles columns containing non-numeric data is described later.

print(df.quantile())
# col_1     5.0
# col_2    25.0
# Name: 0.5, dtype: float64

print(type(df.quantile()))
# <class 'pandas.core.series.Series'>

source: pandas_quantile.py

When calling quantile() on a Series, the median is returned as a scalar value.

print(df['col_1'].quantile())
# 5.0

print(type(df['col_1'].quantile()))
# <class 'numpy.float64'>

source: pandas_quantile.py

The type of the values depends on the original data type and the interpolation argument described later.

Specify the quantile(s) to compute: `q`

Specify the desired quantile using the first argument, q, which ranges from 0.0 to 1.0.

print(df.quantile(0.2))
# col_1    2.0
# col_2    4.0
# Name: 0.2, dtype: float64

source: pandas_quantile.py

Multiple values can also be specified as a list. In this case, the return value is a DataFrame.

print(df.quantile([0, 0.25, 0.5, 0.75, 1.0]))
#       col_1  col_2
# 0.00    0.0    0.0
# 0.25    2.5    6.5
# 0.50    5.0   25.0
# 0.75    7.5   56.5
# 1.00   10.0  100.0

print(type(df.quantile([0, 0.25, 0.5, 0.75, 1.0])))
# <class 'pandas.core.frame.DataFrame'>

source: pandas_quantile.py

When specifying a list for quantile() on a Series, the return value is a Series.

print(df['col_1'].quantile([0, 0.25, 0.5, 0.75, 1.0]))
# 0.00     0.0
# 0.25     2.5
# 0.50     5.0
# 0.75     7.5
# 1.00    10.0
# Name: col_1, dtype: float64

print(type(df['col_1'].quantile([0, 0.25, 0.5, 0.75, 1.0])))
# <class 'pandas.core.series.Series'>

source: pandas_quantile.py

Specify interpolation method: `interpolation`

The interpolation method can be specified using the interpolation argument. The default is 'linear', where the value is linearly interpolated between the surrounding values.

print(df.quantile(0.21))
# col_1    2.1
# col_2    4.5
# Name: 0.21, dtype: float64

print(df.quantile(0.21, interpolation='linear'))
# col_1    2.1
# col_2    4.5
# Name: 0.21, dtype: float64

source: pandas_quantile.py

For 'lower', the smaller value is used; for 'higher', the larger value; and for 'nearest', the closest value is used.

print(df.quantile(0.21, interpolation='lower'))
# col_1    2
# col_2    4
# Name: 0.21, dtype: int64

print(df.quantile(0.21, interpolation='higher'))
# col_1    3
# col_2    9
# Name: 0.21, dtype: int64

print(df.quantile(0.21, interpolation='nearest'))
# col_1    2
# col_2    4
# Name: 0.21, dtype: int64

source: pandas_quantile.py

'midpoint' results in the average of the surrounding values.

print(df.quantile(0.21, interpolation='midpoint'))
# col_1    2.5
# col_2    6.5
# Name: 0.21, dtype: float64

source: pandas_quantile.py

Differences in data type (`dtype`) by interpolation method

The default is linear interpolation, so if the original data type (dtype) is an integer (int), it will be converted to a floating point number (float). Note that the data type may change, even if the resulting value equals the original.

print(df.quantile(0.2))
# col_1    2.0
# col_2    4.0
# Name: 0.2, dtype: float64

source: pandas_quantile.py

For 'lower', 'higher', and 'nearest', the original value is used directly, so the data type remains unchanged.

print(df.quantile(0.2, interpolation='lower'))
# col_1    2
# col_2    4
# Name: 0.2, dtype: int64

source: pandas_quantile.py

Specify column-wise or row-wise: `axis`

By default, processing is done column-wise, but setting the axis argument to 1 or 'columns' processes it row-wise.

print(df.quantile(axis=1))
# 0      0.0
# 1      1.0
# 2      3.0
# 3      6.0
# 4     10.0
# 5     15.0
# 6     21.0
# 7     28.0
# 8     36.0
# 9     45.0
# 10    55.0
# Name: 0.5, dtype: float64

source: pandas_quantile.py

Specify whether to process non-numeric data: `numeric_only`

The numeric_only argument specifies whether to process non-numeric columns. Setting numeric_only to True targets only numeric columns; setting it to False includes all column types.

Since pandas version 2.0, the default value of numeric_only is False. Before that, it was True. Be aware of the version differences.

`quantile()` for strings (`str`)

Consider the following DataFrame with a string column.

df_str = df.copy()
df_str['col_3'] = list('abcdefghijk')
print(df_str)
#     col_1  col_2 col_3
# 0       0      0     a
# 1       1      1     b
# 2       2      4     c
# 3       3      9     d
# 4       4     16     e
# 5       5     25     f
# 6       6     36     g
# 7       7     49     h
# 8       8     64     i
# 9       9     81     j
# 10     10    100     k

print(df_str.dtypes)
# col_1     int64
# col_2     int64
# col_3    object
# dtype: object

source: pandas_quantile.py

If numeric_only is set to True, only numeric columns are targeted, and string columns are excluded.

print(df_str.quantile(numeric_only=True))
# col_1     5.0
# col_2    25.0
# Name: 0.5, dtype: float64

source: pandas_quantile.py

If numeric_only is set to False (which is the default from version 2.0), and string columns are included, setting interpolation to 'linear' (default) or 'midpoint' will result in an error. For 'lower', 'higher', and 'nearest', the result is based on the lexicographical order of the values.

# print(df_str.quantile())
# TypeError: unsupported operand type(s) for -: 'str' and 'str'

# print(df_str.quantile(interpolation='midpoint'))
# TypeError: unsupported operand type(s) for -: 'str' and 'str'

print(df_str.quantile([0.2, 0.21, 0.3], interpolation='lower'))
#       col_1  col_2 col_3
# 0.20      2      4     c
# 0.21      2      4     c
# 0.30      3      9     d

print(df_str.quantile([0.2, 0.21, 0.3], interpolation='higher'))
#       col_1  col_2 col_3
# 0.20      2      4     c
# 0.21      3      9     d
# 0.30      3      9     d

print(df_str.quantile([0.2, 0.21, 0.3], interpolation='nearest'))
#       col_1  col_2 col_3
# 0.20      2      4     c
# 0.21      2      4     c
# 0.30      3      9     d

source: pandas_quantile.py

`quantile()` for `datetime`

Consider the following DataFrame with a datetime column.

df_dt = df.copy()
df_dt['col_3'] = pd.date_range('2023-01-01', '2023-01-11')
print(df_dt)
#     col_1  col_2      col_3
# 0       0      0 2023-01-01
# 1       1      1 2023-01-02
# 2       2      4 2023-01-03
# 3       3      9 2023-01-04
# 4       4     16 2023-01-05
# 5       5     25 2023-01-06
# 6       6     36 2023-01-07
# 7       7     49 2023-01-08
# 8       8     64 2023-01-09
# 9       9     81 2023-01-10
# 10     10    100 2023-01-11

print(df_dt.dtypes)
# col_1             int64
# col_2             int64
# col_3    datetime64[ns]
# dtype: object

source: pandas_quantile.py

If numeric_only is set to True, only numeric columns are targeted, and datetime columns are excluded.

print(df_dt.quantile(numeric_only=True))
# col_1     5.0
# col_2    25.0
# Name: 0.5, dtype: float64

source: pandas_quantile.py

datetime columns are correctly interpolated even with interpolation set to 'linear' (default) or 'midpoint'. Of course, 'lower', 'higher', and 'nearest' are also acceptable.

print(df_dt.quantile([0.2, 0.21, 0.3]))
#       col_1  col_2               col_3
# 0.20    2.0    4.0 2023-01-03 00:00:00
# 0.21    2.1    4.5 2023-01-03 02:24:00
# 0.30    3.0    9.0 2023-01-04 00:00:00

print(df_dt.quantile([0.2, 0.21, 0.3], interpolation='midpoint'))
#       col_1  col_2               col_3
# 0.20    2.0    4.0 2023-01-03 00:00:00
# 0.21    2.5    6.5 2023-01-03 12:00:00
# 0.30    3.0    9.0 2023-01-04 00:00:00

print(df_dt.quantile([0.2, 0.21, 0.3], interpolation='lower'))
#       col_1  col_2      col_3
# 0.20      2      4 2023-01-03
# 0.21      2      4 2023-01-03
# 0.30      3      9 2023-01-04

print(df_dt.quantile([0.2, 0.21, 0.3], interpolation='higher'))
#       col_1  col_2      col_3
# 0.20      2      4 2023-01-03
# 0.21      3      9 2023-01-04
# 0.30      3      9 2023-01-04

print(df_dt.quantile([0.2, 0.21, 0.3], interpolation='nearest'))
#       col_1  col_2      col_3
# 0.20      2      4 2023-01-03
# 0.21      2      4 2023-01-03
# 0.30      3      9 2023-01-04

source: pandas_quantile.py

`quantile()` for `bool`

Consider the following DataFrame with a bool column.

df_bool = df.copy()
df_bool['col_3'] = [True, False, True, False, True, False, True, False, True, False, True]
print(df_bool)
#     col_1  col_2  col_3
# 0       0      0   True
# 1       1      1  False
# 2       2      4   True
# 3       3      9  False
# 4       4     16   True
# 5       5     25  False
# 6       6     36   True
# 7       7     49  False
# 8       8     64   True
# 9       9     81  False
# 10     10    100   True

print(df_bool.dtypes)
# col_1    int64
# col_2    int64
# col_3     bool
# dtype: object

source: pandas_quantile.py

Since bool is a subclass of int and treated numerically, it is included in the processing regardless of the numeric_only setting. However, as of pandas 2.1.4, an error occurs if there is a bool column.

BUG: Broken bool supports in pandas' quantile by NumPy's percentile behaviour change · Issue #41792 · pandas-dev/pandas

# print(df_bool.quantile())
# TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.

# print(df_bool.quantile(numeric_only=True))
# TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.

source: pandas_quantile.py

Use select_dtypes() to exclude bool columns or astype() to convert them to integers (int).

print(df_bool.select_dtypes(exclude=bool))
#     col_1  col_2
# 0       0      0
# 1       1      1
# 2       2      4
# 3       3      9
# 4       4     16
# 5       5     25
# 6       6     36
# 7       7     49
# 8       8     64
# 9       9     81
# 10     10    100

print(df_bool.select_dtypes(exclude=bool).quantile())
# col_1     5.0
# col_2    25.0
# Name: 0.5, dtype: float64

print(df_bool.astype({'col_3': int}))
#     col_1  col_2  col_3
# 0       0      0      1
# 1       1      1      0
# 2       2      4      1
# 3       3      9      0
# 4       4     16      1
# 5       5     25      0
# 6       6     36      1
# 7       7     49      0
# 8       8     64      1
# 9       9     81      0
# 10     10    100      1

print(df_bool.astype({'col_3': int}).quantile())
# col_1     5.0
# col_2    25.0
# col_3     1.0
# Name: 0.5, dtype: float64

source: pandas_quantile.py

pandas: Find the quantile with quantile()

Basic usage of `quantile()`

Specify the quantile(s) to compute: `q`

Specify interpolation method: `interpolation`

Differences in data type (`dtype`) by interpolation method

Specify column-wise or row-wise: `axis`

Specify whether to process non-numeric data: `numeric_only`

`quantile()` for strings (`str`)

`quantile()` for `datetime`

`quantile()` for `bool`

Related Categories

Related Articles

pandas: Find the quantile with quantile()

Basic usage of quantile()

Specify the quantile(s) to compute: q

Specify interpolation method: interpolation

Differences in data type (dtype) by interpolation method

Specify column-wise or row-wise: axis

Specify whether to process non-numeric data: numeric_only

quantile() for strings (str)

quantile() for datetime

quantile() for bool

Related Categories

Related Articles

Basic usage of `quantile()`

Specify the quantile(s) to compute: `q`

Specify interpolation method: `interpolation`

Differences in data type (`dtype`) by interpolation method

Specify column-wise or row-wise: `axis`

Specify whether to process non-numeric data: `numeric_only`

`quantile()` for strings (`str`)

`quantile()` for `datetime`

`quantile()` for `bool`