pandas: Data binning with cut() and qcut()
In pandas, you can bin data with pandas.cut()
and pandas.qcut()
.
This article describes how to use pandas.cut()
and pandas.qcut()
.
- Binning with equal intervals or given boundary values:
pd.cut()
- Specify the number of equal-width bins
- Specify the bin edges (boundary values)
- Get the computed or specified bins:
retbins
- Specify whether the rightmost edge is included or not:
right
- Specify labels:
labels
- Specify the precision of boundary values:
precision
- Binning to make the number of elements equal:
pd.qcut()
- Specify the number of bins
- For duplicate values
- Count the number of elements in the bin:
value_counts()
- For Python list and NumPy array
- Example: Titanic data
Use the following pandas.Series
as an example.
import pandas as pd
s = pd.Series(data=[x**2 for x in range(11)],
index=list('abcdefghijk'))
print(s)
# a 0
# b 1
# c 4
# d 9
# e 16
# f 25
# g 36
# h 49
# i 64
# j 81
# k 100
# dtype: int64
Binning with equal intervals or given boundary values: pd.cut()
In pandas.cut()
, the first parameter x
is a one-dimensional array (Python list or numpy.ndarray
, pandas.Series
) as the source data, and the second parameter bins
is the bin division setting.
Specify the number of equal-width bins
You can specify the number of equal-width bins by specifying an integer value for bins
.
s_cut = pd.cut(s, 4)
print(s_cut)
# a (-0.1, 25.0]
# b (-0.1, 25.0]
# c (-0.1, 25.0]
# d (-0.1, 25.0]
# e (-0.1, 25.0]
# f (-0.1, 25.0]
# g (25.0, 50.0]
# h (25.0, 50.0]
# i (50.0, 75.0]
# j (75.0, 100.0]
# k (75.0, 100.0]
# dtype: category
# Categories (4, interval[float64]): [(-0.1, 25.0] < (25.0, 50.0] < (50.0, 75.0] < (75.0, 100.0]]
print(type(s_cut))
# <class 'pandas.core.series.Series'>
(a, b]
means a < x <= b
.
Specify the bin edges (boundary values)
You can specify the bin edges, boundary values, by specifying a list for bins
. Elements outside the range are considered NaN
.
print(pd.cut(s, [0, 10, 50, 100]))
# a NaN
# b (0, 10]
# c (0, 10]
# d (0, 10]
# e (10, 50]
# f (10, 50]
# g (10, 50]
# h (10, 50]
# i (50, 100]
# j (50, 100]
# k (50, 100]
# dtype: category
# Categories (3, interval[int64]): [(0, 10] < (10, 50] < (50, 100]]
Get the computed or specified bins: retbins
With retbins=True
, you can get bins, i.e., a list of boundary values in addition to the binned data.
s_cut, bins = pd.cut(s, 4, retbins=True)
print(s_cut)
# a (-0.1, 25.0]
# b (-0.1, 25.0]
# c (-0.1, 25.0]
# d (-0.1, 25.0]
# e (-0.1, 25.0]
# f (-0.1, 25.0]
# g (25.0, 50.0]
# h (25.0, 50.0]
# i (50.0, 75.0]
# j (75.0, 100.0]
# k (75.0, 100.0]
# dtype: category
# Categories (4, interval[float64]): [(-0.1, 25.0] < (25.0, 50.0] < (50.0, 75.0] < (75.0, 100.0]]
print(bins)
print(type(bins))
# [ -0.1 25. 50. 75. 100. ]
# <class 'numpy.ndarray'>
Specify whether the rightmost edge is included or not: right
By default, the rightmost edge is included and the leftmost edge is not. If right=False
, on the contrary, the rightmost edge is not included.
print(pd.cut(s, 4, right=False))
# a [0.0, 25.0)
# b [0.0, 25.0)
# c [0.0, 25.0)
# d [0.0, 25.0)
# e [0.0, 25.0)
# f [25.0, 50.0)
# g [25.0, 50.0)
# h [25.0, 50.0)
# i [50.0, 75.0)
# j [75.0, 100.1)
# k [75.0, 100.1)
# dtype: category
# Categories (4, interval[float64]): [[0.0, 25.0) < [25.0, 50.0) < [50.0, 75.0) < [75.0, 100.1)]
Specify labels: labels
You can specify labels with the labels
parameter. The default labels are like (a, b]
, as in the previous examples.
If labels=False
, integer indexes (sequential numbers starting from 0) are used as labels.
print(pd.cut(s, 4, labels=False))
# a 0
# b 0
# c 0
# d 0
# e 0
# f 0
# g 1
# h 1
# i 2
# j 3
# k 3
# dtype: int64
You can also specify a list of any labels. Note that an error is raised if the number of bins does not match the number of elements in the list.
print(pd.cut(s, 4, labels=['small', 'medium', 'large', 'x-large']))
# a small
# b small
# c small
# d small
# e small
# f small
# g medium
# h medium
# i large
# j x-large
# k x-large
# dtype: category
# Categories (4, object): [small < medium < large < x-large]
Specify the precision of boundary values: precision
You can specify the precision (number of decimal places) of boundary values with the precision
parameter.
print(pd.cut(s, 3))
# a (-0.1, 33.333]
# b (-0.1, 33.333]
# c (-0.1, 33.333]
# d (-0.1, 33.333]
# e (-0.1, 33.333]
# f (-0.1, 33.333]
# g (33.333, 66.667]
# h (33.333, 66.667]
# i (33.333, 66.667]
# j (66.667, 100.0]
# k (66.667, 100.0]
# dtype: category
# Categories (3, interval[float64]): [(-0.1, 33.333] < (33.333, 66.667] < (66.667, 100.0]]
print(pd.cut(s, 3, precision=1))
# a (-0.1, 33.3]
# b (-0.1, 33.3]
# c (-0.1, 33.3]
# d (-0.1, 33.3]
# e (-0.1, 33.3]
# f (-0.1, 33.3]
# g (33.3, 66.7]
# h (33.3, 66.7]
# i (33.3, 66.7]
# j (66.7, 100.0]
# k (66.7, 100.0]
# dtype: category
# Categories (3, interval[float64]): [(-0.1, 33.3] < (33.3, 66.7] < (66.7, 100.0]]
Binning to make the number of elements equal: pd.qcut()
qcut()
divides data so that the number of elements in each bin is as equal as possible.
The first parameter x
is a one-dimensional array (Python list or numpy.ndarray
, pandas.Series
) as the source data, and the second parameter q
is the number of bins.
You can specify the same parameters as in cut()
, labels
and retbins
.
Specify the number of bins
You can specify the number of divisions in q
.
If q=2
, the data is divided by 2-quantile (median).
print(pd.qcut(s, 2))
# a (-0.001, 25.0]
# b (-0.001, 25.0]
# c (-0.001, 25.0]
# d (-0.001, 25.0]
# e (-0.001, 25.0]
# f (-0.001, 25.0]
# g (25.0, 100.0]
# h (25.0, 100.0]
# i (25.0, 100.0]
# j (25.0, 100.0]
# k (25.0, 100.0]
# dtype: category
# Categories (2, interval[float64]): [(-0.001, 25.0] < (25.0, 100.0]]
If q=4
, the data is divided by 4-quantiles (quartiles).
s_qcut, bins = pd.qcut(s, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'], retbins=True)
print(s_qcut)
# a Q1
# b Q1
# c Q1
# d Q2
# e Q2
# f Q2
# g Q3
# h Q3
# i Q4
# j Q4
# k Q4
# dtype: category
# Categories (4, object): [Q1 < Q2 < Q3 < Q4]
print(bins)
# [ 0. 6.5 25. 56.5 100. ]
For duplicate values
Be careful when there are duplicate elements in the original data.
For example, if the values are duplicated up to the median:
s_duplicate = pd.Series(data=[0, 0, 0, 0, 0, 1, 2, 3, 4, 5, 6],
index=list('abcdefghijk'))
print(s_duplicate)
# a 0
# b 0
# c 0
# d 0
# e 0
# f 1
# g 2
# h 3
# i 4
# j 5
# k 6
# dtype: int64
It is possible to divide by 2 at the median as q=2
, but a larger number of divisions will result in an error.
print(pd.qcut(s_duplicate, 2))
# a (-0.001, 1.0]
# b (-0.001, 1.0]
# c (-0.001, 1.0]
# d (-0.001, 1.0]
# e (-0.001, 1.0]
# f (-0.001, 1.0]
# g (1.0, 6.0]
# h (1.0, 6.0]
# i (1.0, 6.0]
# j (1.0, 6.0]
# k (1.0, 6.0]
# dtype: category
# Categories (2, interval[float64]): [(-0.001, 1.0] < (1.0, 6.0]]
# print(pd.qcut(s_duplicate, 4))
# ValueError: Bin edges must be unique: array([0. , 0. , 1. , 3.5, 6. ]).
# You can drop duplicate edges by setting the 'duplicates' kwarg
For example, in the case of 4 divisions, the minimum, 25%, 50%, 75%, and maximum values are set as boundary values, but if there are many overlapping elements as in the example, the minimum and 25% are the same value, which is the cause of the error.
If duplicates='drop'
, duplicate boundary values are excluded.
print(pd.qcut(s_duplicate, 4, duplicates='drop'))
# a (-0.001, 1.0]
# b (-0.001, 1.0]
# c (-0.001, 1.0]
# d (-0.001, 1.0]
# e (-0.001, 1.0]
# f (-0.001, 1.0]
# g (1.0, 3.5]
# h (1.0, 3.5]
# i (3.5, 6.0]
# j (3.5, 6.0]
# k (3.5, 6.0]
# dtype: category
# Categories (3, interval[float64]): [(-0.001, 1.0] < (1.0, 3.5] < (3.5, 6.0]]
Count the number of elements in the bin: value_counts()
You can get the number of elements in a bin by calling the value_counts()
method from the pandas.Series
returned by cut()
or qcut()
.
counts = pd.cut(s, 3, labels=['S', 'M', 'L']).value_counts()
print(counts)
# S 6
# M 3
# L 2
# dtype: int64
print(type(counts))
# <class 'pandas.core.series.Series'>
print(counts['M'])
# 3
value_counts()
is also provided as function pandas.value_counts()
.
print(pd.value_counts(pd.cut(s, 3, labels=['S', 'M', 'L'])))
# S 6
# M 3
# L 2
# dtype: int64
For Python list and NumPy array
The previous examples used pandas.Series
as the source data, but the first parameter x
of cut()
or qcut()
can be a Python list or NumPy array ndarray
if it is one-dimensional.
l = [x**2 for x in range(11)]
print(l)
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
l_cut = pd.cut(l, 3, labels=['S', 'M', 'L'])
print(l_cut)
# [S, S, S, S, S, ..., M, M, M, L, L]
# Length: 11
# Categories (3, object): [S < M < L]
print(type(l_cut))
# <class 'pandas.core.categorical.Categorical'>
You can get elements by index and convert them to Python list with list()
.
print(l_cut[0])
# S
print(list(l_cut))
# ['S', 'S', 'S', 'S', 'S', 'S', 'M', 'M', 'M', 'L', 'L']
If you want to count the number of elements in a bin, use pandas.value_counts()
.
print(pd.value_counts(l_cut))
# S 6
# M 3
# L 2
# dtype: int64
Example: Titanic data
Use Titanic data as an example. You can download it from Kaggle.
It is also available here.
Some columns are excluded.
df_titanic = pd.read_csv('data/src/titanic_train.csv').drop(['Name', 'Ticket', 'Cabin', 'Embarked'], axis=1)
print(df_titanic.head())
# PassengerId Survived Pclass Sex Age SibSp Parch Fare
# 0 1 0 3 male 22.0 1 0 7.2500
# 1 2 1 1 female 38.0 1 0 71.2833
# 2 3 1 3 female 26.0 0 0 7.9250
# 3 4 1 1 female 35.0 1 0 53.1000
# 4 5 0 3 male 35.0 0 0 8.0500
Bin the 'Age'
column with cut()
.
print(df_titanic['Age'].describe())
# count 714.000000
# mean 29.699118
# std 14.526497
# min 0.420000
# 25% 20.125000
# 50% 28.000000
# 75% 38.000000
# max 80.000000
# Name: Age, dtype: float64
print(pd.cut(df_titanic['Age'], 5, precision=0).value_counts(sort=False, dropna=False))
# (0.0, 16.0] 100
# (16.0, 32.0] 346
# (32.0, 48.0] 188
# (48.0, 64.0] 69
# (64.0, 80.0] 11
# NaN 177
# Name: Age, dtype: int64
To add the result as a new column to the original DataFrame
, do the following. To overwrite an existing column, simply name the left-hand column as the existing column name.
df_titanic['Age_bin'] = pd.cut(df_titanic['Age'], 5, labels=False)
print(df_titanic.head())
# PassengerId Survived Pclass Sex Age SibSp Parch Fare Age_bin
# 0 1 0 3 male 22.0 1 0 7.2500 1.0
# 1 2 1 1 female 38.0 1 0 71.2833 2.0
# 2 3 1 3 female 26.0 0 0 7.9250 1.0
# 3 4 1 1 female 35.0 1 0 53.1000 2.0
# 4 5 0 3 male 35.0 0 0 8.0500 2.0
Note that, in this example, the binning process is performed immediately for convenience, but ideally, the missing value NaN
should be complemented first before binning.