pandas: Find the quantile with quantile()

Posted: | Tags: Python, pandas

In pandas, the quantile() method allows you to find the quantiles for columns or rows in a DataFrame. This method is also available on Series.

Quantiles are defined as follows:

In statistics and probability, quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way.
...
Common quantiles have special names, such as quartiles (four groups), deciles (ten groups), and percentiles (100 groups). Quantile - Wikipedia

The describe() method is useful to compute summary statistics including quartiles.

The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.

import pandas as pd

print(pd.__version__)
# 2.1.4

df = pd.DataFrame({'col_1': range(11), 'col_2': [i**2 for i in range(11)]})
print(df)
#     col_1  col_2
# 0       0      0
# 1       1      1
# 2       2      4
# 3       3      9
# 4       4     16
# 5       5     25
# 6       6     36
# 7       7     49
# 8       8     64
# 9       9     81
# 10     10    100

Basic usage of quantile()

By default, the quantile() method on a DataFrame returns the median (the second quartile or 50th percentile) for each column. This result is presented as a Series. How it handles columns containing non-numeric data is described later.

print(df.quantile())
# col_1     5.0
# col_2    25.0
# Name: 0.5, dtype: float64

print(type(df.quantile()))
# <class 'pandas.core.series.Series'>

When calling quantile() on a Series, the median is returned as a scalar value.

print(df['col_1'].quantile())
# 5.0

print(type(df['col_1'].quantile()))
# <class 'numpy.float64'>

The type of the values depends on the original data type and the interpolation argument described later.

Specify the quantile(s) to compute: q

Specify the desired quantile using the first argument, q, which ranges from 0.0 to 1.0.

print(df.quantile(0.2))
# col_1    2.0
# col_2    4.0
# Name: 0.2, dtype: float64

Multiple values can also be specified as a list. In this case, the return value is a DataFrame.

print(df.quantile([0, 0.25, 0.5, 0.75, 1.0]))
#       col_1  col_2
# 0.00    0.0    0.0
# 0.25    2.5    6.5
# 0.50    5.0   25.0
# 0.75    7.5   56.5
# 1.00   10.0  100.0

print(type(df.quantile([0, 0.25, 0.5, 0.75, 1.0])))
# <class 'pandas.core.frame.DataFrame'>

When specifying a list for quantile() on a Series, the return value is a Series.

print(df['col_1'].quantile([0, 0.25, 0.5, 0.75, 1.0]))
# 0.00     0.0
# 0.25     2.5
# 0.50     5.0
# 0.75     7.5
# 1.00    10.0
# Name: col_1, dtype: float64

print(type(df['col_1'].quantile([0, 0.25, 0.5, 0.75, 1.0])))
# <class 'pandas.core.series.Series'>

Specify interpolation method: interpolation

The interpolation method can be specified using the interpolation argument. The default is 'linear', where the value is linearly interpolated between the surrounding values.

print(df.quantile(0.21))
# col_1    2.1
# col_2    4.5
# Name: 0.21, dtype: float64

print(df.quantile(0.21, interpolation='linear'))
# col_1    2.1
# col_2    4.5
# Name: 0.21, dtype: float64

For 'lower', the smaller value is used; for 'higher', the larger value; and for 'nearest', the closest value is used.

print(df.quantile(0.21, interpolation='lower'))
# col_1    2
# col_2    4
# Name: 0.21, dtype: int64

print(df.quantile(0.21, interpolation='higher'))
# col_1    3
# col_2    9
# Name: 0.21, dtype: int64

print(df.quantile(0.21, interpolation='nearest'))
# col_1    2
# col_2    4
# Name: 0.21, dtype: int64

'midpoint' results in the average of the surrounding values.

print(df.quantile(0.21, interpolation='midpoint'))
# col_1    2.5
# col_2    6.5
# Name: 0.21, dtype: float64

Differences in data type (dtype) by interpolation method

The default is linear interpolation, so if the original data type (dtype) is an integer (int), it will be converted to a floating point number (float). Note that the data type may change, even if the resulting value equals the original.

print(df.quantile(0.2))
# col_1    2.0
# col_2    4.0
# Name: 0.2, dtype: float64

For 'lower', 'higher', and 'nearest', the original value is used directly, so the data type remains unchanged.

print(df.quantile(0.2, interpolation='lower'))
# col_1    2
# col_2    4
# Name: 0.2, dtype: int64

Specify column-wise or row-wise: axis

By default, processing is done column-wise, but setting the axis argument to 1 or 'columns' processes it row-wise.

print(df.quantile(axis=1))
# 0      0.0
# 1      1.0
# 2      3.0
# 3      6.0
# 4     10.0
# 5     15.0
# 6     21.0
# 7     28.0
# 8     36.0
# 9     45.0
# 10    55.0
# Name: 0.5, dtype: float64

Specify whether to process non-numeric data: numeric_only

The numeric_only argument specifies whether to process non-numeric columns. Setting numeric_only to True targets only numeric columns; setting it to False includes all column types.

Since pandas version 2.0, the default value of numeric_only is False. Before that, it was True. Be aware of the version differences.

quantile() for strings (str)

Consider the following DataFrame with a string column.

df_str = df.copy()
df_str['col_3'] = list('abcdefghijk')
print(df_str)
#     col_1  col_2 col_3
# 0       0      0     a
# 1       1      1     b
# 2       2      4     c
# 3       3      9     d
# 4       4     16     e
# 5       5     25     f
# 6       6     36     g
# 7       7     49     h
# 8       8     64     i
# 9       9     81     j
# 10     10    100     k

print(df_str.dtypes)
# col_1     int64
# col_2     int64
# col_3    object
# dtype: object

If numeric_only is set to True, only numeric columns are targeted, and string columns are excluded.

print(df_str.quantile(numeric_only=True))
# col_1     5.0
# col_2    25.0
# Name: 0.5, dtype: float64

If numeric_only is set to False (which is the default from version 2.0), and string columns are included, setting interpolation to 'linear' (default) or 'midpoint' will result in an error. For 'lower', 'higher', and 'nearest', the result is based on the lexicographical order of the values.

# print(df_str.quantile())
# TypeError: unsupported operand type(s) for -: 'str' and 'str'

# print(df_str.quantile(interpolation='midpoint'))
# TypeError: unsupported operand type(s) for -: 'str' and 'str'

print(df_str.quantile([0.2, 0.21, 0.3], interpolation='lower'))
#       col_1  col_2 col_3
# 0.20      2      4     c
# 0.21      2      4     c
# 0.30      3      9     d

print(df_str.quantile([0.2, 0.21, 0.3], interpolation='higher'))
#       col_1  col_2 col_3
# 0.20      2      4     c
# 0.21      3      9     d
# 0.30      3      9     d

print(df_str.quantile([0.2, 0.21, 0.3], interpolation='nearest'))
#       col_1  col_2 col_3
# 0.20      2      4     c
# 0.21      2      4     c
# 0.30      3      9     d

quantile() for datetime

Consider the following DataFrame with a datetime column.

df_dt = df.copy()
df_dt['col_3'] = pd.date_range('2023-01-01', '2023-01-11')
print(df_dt)
#     col_1  col_2      col_3
# 0       0      0 2023-01-01
# 1       1      1 2023-01-02
# 2       2      4 2023-01-03
# 3       3      9 2023-01-04
# 4       4     16 2023-01-05
# 5       5     25 2023-01-06
# 6       6     36 2023-01-07
# 7       7     49 2023-01-08
# 8       8     64 2023-01-09
# 9       9     81 2023-01-10
# 10     10    100 2023-01-11

print(df_dt.dtypes)
# col_1             int64
# col_2             int64
# col_3    datetime64[ns]
# dtype: object

If numeric_only is set to True, only numeric columns are targeted, and datetime columns are excluded.

print(df_dt.quantile(numeric_only=True))
# col_1     5.0
# col_2    25.0
# Name: 0.5, dtype: float64

datetime columns are correctly interpolated even with interpolation set to 'linear' (default) or 'midpoint'. Of course, 'lower', 'higher', and 'nearest' are also acceptable.

print(df_dt.quantile([0.2, 0.21, 0.3]))
#       col_1  col_2               col_3
# 0.20    2.0    4.0 2023-01-03 00:00:00
# 0.21    2.1    4.5 2023-01-03 02:24:00
# 0.30    3.0    9.0 2023-01-04 00:00:00

print(df_dt.quantile([0.2, 0.21, 0.3], interpolation='midpoint'))
#       col_1  col_2               col_3
# 0.20    2.0    4.0 2023-01-03 00:00:00
# 0.21    2.5    6.5 2023-01-03 12:00:00
# 0.30    3.0    9.0 2023-01-04 00:00:00

print(df_dt.quantile([0.2, 0.21, 0.3], interpolation='lower'))
#       col_1  col_2      col_3
# 0.20      2      4 2023-01-03
# 0.21      2      4 2023-01-03
# 0.30      3      9 2023-01-04

print(df_dt.quantile([0.2, 0.21, 0.3], interpolation='higher'))
#       col_1  col_2      col_3
# 0.20      2      4 2023-01-03
# 0.21      3      9 2023-01-04
# 0.30      3      9 2023-01-04

print(df_dt.quantile([0.2, 0.21, 0.3], interpolation='nearest'))
#       col_1  col_2      col_3
# 0.20      2      4 2023-01-03
# 0.21      2      4 2023-01-03
# 0.30      3      9 2023-01-04

quantile() for bool

Consider the following DataFrame with a bool column.

df_bool = df.copy()
df_bool['col_3'] = [True, False, True, False, True, False, True, False, True, False, True]
print(df_bool)
#     col_1  col_2  col_3
# 0       0      0   True
# 1       1      1  False
# 2       2      4   True
# 3       3      9  False
# 4       4     16   True
# 5       5     25  False
# 6       6     36   True
# 7       7     49  False
# 8       8     64   True
# 9       9     81  False
# 10     10    100   True

print(df_bool.dtypes)
# col_1    int64
# col_2    int64
# col_3     bool
# dtype: object

Since bool is a subclass of int and treated numerically, it is included in the processing regardless of the numeric_only setting. However, as of pandas 2.1.4, an error occurs if there is a bool column.

# print(df_bool.quantile())
# TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.

# print(df_bool.quantile(numeric_only=True))
# TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.

Use select_dtypes() to exclude bool columns or astype() to convert them to integers (int).

print(df_bool.select_dtypes(exclude=bool))
#     col_1  col_2
# 0       0      0
# 1       1      1
# 2       2      4
# 3       3      9
# 4       4     16
# 5       5     25
# 6       6     36
# 7       7     49
# 8       8     64
# 9       9     81
# 10     10    100

print(df_bool.select_dtypes(exclude=bool).quantile())
# col_1     5.0
# col_2    25.0
# Name: 0.5, dtype: float64

print(df_bool.astype({'col_3': int}))
#     col_1  col_2  col_3
# 0       0      0      1
# 1       1      1      0
# 2       2      4      1
# 3       3      9      0
# 4       4     16      1
# 5       5     25      0
# 6       6     36      1
# 7       7     49      0
# 8       8     64      1
# 9       9     81      0
# 10     10    100      1

print(df_bool.astype({'col_3': int}).quantile())
# col_1     5.0
# col_2    25.0
# col_3     1.0
# Name: 0.5, dtype: float64

Related Categories

Related Articles