pandas: Grouping data with groupby()

Posted: | Tags: Python, pandas

In pandas, the groupby() method allows grouping data in DataFrame and Series.

This method enables aggregating data per group to compute statistical measures such as averages, minimums, maximums, and totals, or to apply any functions.

The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.

import pandas as pd

print(pd.__version__)
# 2.1.2

df = pd.DataFrame(
    {'c_0': ['A', 'A', 'B', 'B', 'B', 'B'],
     'c_1': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
     'c_2': [0, 1, 4, 9, 16, 25],
     'c_3': [125, 64, 27, 16, 1, 0]},
    index=['r_0', 'r_1', 'r_2', 'r_3', 'r_4', 'r_5']
)
print(df)
#     c_0 c_1  c_2  c_3
# r_0   A   X    0  125
# r_1   A   Y    1   64
# r_2   B   X    4   27
# r_3   B   Y    9   16
# r_4   B   X   16    1
# r_5   B   Y   25    0

Basic usage of groupby()

You can group data using the groupby() method, which is provided in both DataFrame and Series.

When you specify column names in the first argument, by, the data is grouped by the values in that column. A GroupBy object is returned, which does not display its contents when printed.

grouped = df.groupby('c_0')
print(grouped)
# <pandas.core.groupby.generic.DataFrameGroupBy object at 0x1272139d0>

print(type(grouped))
# <class 'pandas.core.groupby.generic.DataFrameGroupBy'>

You can process each group by executing methods on the GroupBy object.

For example, the mean() method calculates the average for each group. If the numeric_only argument is set to True, non-numeric columns are ignored. A DataFrame is returned.

df_mean = grouped.mean(numeric_only=True)
print(df_mean)
#       c_2   c_3
# c_0            
# A     0.5  94.5
# B    13.5  11.0

print(type(df_mean))
# <class 'pandas.core.frame.DataFrame'>

It's also possible to write groupby() and its methods consecutively.

print(df.groupby('c_0').mean(numeric_only=True))
#       c_2   c_3
# c_0            
# A     0.5  94.5
# B    13.5  11.0

print(df.groupby('c_1').mean(numeric_only=True))
#            c_2        c_3
# c_1                      
# X     6.666667  51.000000
# Y    11.666667  26.666667

Applying [column_name] or [list_of_column_names] to the GroupBy object targets only those columns for processing. This is useful when there are unnecessary columns.

print(df.groupby('c_0')['c_2'].mean())
# c_0
# A     0.5
# B    13.5
# Name: c_2, dtype: float64

print(df.groupby('c_0')[['c_2', 'c_3']].mean())
#       c_2   c_3
# c_0            
# A     0.5  94.5
# B    13.5  11.0

In addition to mean(), there are various methods such as sum() for calculating totals and count() for counting non-missing values.

print(df.groupby('c_0').sum(numeric_only=True))
#      c_2  c_3
# c_0          
# A      1  189
# B     54   44

print(df.groupby('c_0').count())
#      c_1  c_2  c_3
# c_0               
# A      2    2    2
# B      4    4    4

For a list of available methods, refer to the official documentation.

The agg() method for applying multiple processes, describe() for calculating multiple statistics at once, and apply() for applying any function to each group are described later.

Group by multiple columns

When a list of column names is specified as the first argument in groupby(), grouping can be done using multiple columns.

print(df.groupby(['c_0', 'c_1']).mean())
#           c_2    c_3
# c_0 c_1             
# A   X     0.0  125.0
#     Y     1.0   64.0
# B   X    10.0   14.0
#     Y    17.0    8.0

A multi-index DataFrame is returned.

Specify whether to use column names as index: as_index

By default, the column names specified as the first argument in groupby() become the result's index. Setting the as_index argument to False prevents this.

print(df.groupby('c_0', as_index=False).mean(numeric_only=True))
#   c_0   c_2   c_3
# 0   A   0.5  94.5
# 1   B  13.5  11.0

print(df.groupby(['c_0', 'c_1'], as_index=False).mean())
#   c_0 c_1   c_2    c_3
# 0   A   X   0.0  125.0
# 1   A   Y   1.0   64.0
# 2   B   X  10.0   14.0
# 3   B   Y  17.0    8.0

Handle missing values NaN: dropna

If the column specified as the first argument in groupby() contains missing values NaN, these rows are ignored by default. Setting the dropna argument to False treats NaN as a key.

df_nan = df.copy()
df_nan.iloc[0, 1] = float('nan')
df_nan.iloc[5, 1] = float('nan')
print(df_nan)
#     c_0  c_1  c_2  c_3
# r_0   A  NaN    0  125
# r_1   A    Y    1   64
# r_2   B    X    4   27
# r_3   B    Y    9   16
# r_4   B    X   16    1
# r_5   B  NaN   25    0

print(df_nan.groupby(['c_0', 'c_1']).mean())
#           c_2   c_3
# c_0 c_1            
# A   Y     1.0  64.0
# B   X    10.0  14.0
#     Y     9.0  16.0

print(df_nan.groupby(['c_0', 'c_1'], dropna=False).mean())
#           c_2    c_3
# c_0 c_1             
# A   Y     1.0   64.0
#     NaN   0.0  125.0
# B   X    10.0   14.0
#     Y     9.0   16.0
#     NaN  25.0    0.0

See the following article for handling missing values in pandas.

Get data in each group: get_group()

You can get data from each group using the get_group() method of the GroupBy object.

Specify the column name as the argument. If the group is based on multiple columns, use a tuple containing those column names. This method returns a DataFrame that includes the specified columns as keys.

print(df.groupby('c_0').get_group('B'))
#     c_0 c_1  c_2  c_3
# r_2   B   X    4   27
# r_3   B   Y    9   16
# r_4   B   X   16    1
# r_5   B   Y   25    0

print(df.groupby(['c_0', 'c_1']).get_group(('B', 'X')))
#     c_0 c_1  c_2  c_3
# r_2   B   X    4   27
# r_4   B   X   16    1

The number of data points in each group can be obtained using the size() method.

print(df.groupby('c_0').size())
# c_0
# A    2
# B    4
# dtype: int64

print(df.groupby(['c_0', 'c_1']).size())
# c_0  c_1
# A    X      1
#      Y      1
# B    X      2
#      Y      2
# dtype: int64

Apply multiple processes: agg()

Use the agg() method of the GroupBy object to apply multiple processes simultaneously.

You can specify the method name of the GroupBy object as a string. If specified as a list, multiple processes will be applied. It is also possible to apply different processes to each column using a dictionary (dict) with the column names as keys.

print(df.groupby(['c_0', 'c_1']).agg('mean'))
#           c_2    c_3
# c_0 c_1             
# A   X     0.0  125.0
#     Y     1.0   64.0
# B   X    10.0   14.0
#     Y    17.0    8.0

print(df.groupby(['c_0', 'c_1']).agg(['mean', 'min', 'max']))
#           c_2            c_3          
#          mean min max   mean  min  max
# c_0 c_1                               
# A   X     0.0   0   0  125.0  125  125
#     Y     1.0   1   1   64.0   64   64
# B   X    10.0   4  16   14.0    1   27
#     Y    17.0   9  25    8.0    0   16

print(df.groupby(['c_0', 'c_1']).agg({'c_2': 'sum', 'c_3': ['min', 'max']}))
#         c_2  c_3     
#         sum  min  max
# c_0 c_1              
# A   X     0  125  125
#     Y     1   64   64
# B   X    20    1   27
#     Y    34    0   16

Specifying a non-existent method name results in an error.

# print(df.groupby(['row_0', 'row_1']).agg('xxx'))
# AttributeError: 'xxx' is not a valid function for 'DataFrameGroupBy' object

# print(df.groupby(['row_0', 'row_1']).agg(['xxx']))
# AttributeError: 'SeriesGroupBy' object has no attribute 'xxx'

As you can see from the error message above, when a single string is specified, the method of DataFrameGroupBy is used; when a list is specified, the method of SeriesGroupBy is used.

Callable objects such as functions defined with def or lambda expressions can also be specified.

def my_func(x):
    return x.max() + x.min()

print(df.groupby(['c_0', 'c_1']).agg([my_func, lambda x: x.sum() - x.mean()]))
#             c_2                c_3           
#         my_func <lambda_0> my_func <lambda_0>
# c_0 c_1                                      
# A   X         0        0.0     250        0.0
#     Y         2        0.0     128        0.0
# B   X        20       10.0      28       14.0
#     Y        34       17.0      16        8.0

Regardless of whether it is specified as a single item or within a list, Series is passed to the callable objects.

print(df.groupby(['c_0', 'c_1']).agg(lambda x: str(type(x))).iloc[0, 0])
# <class 'pandas.core.series.Series'>

print(df.groupby(['c_0', 'c_1']).agg([lambda x: str(type(x))]).iloc[0, 0])
# <class 'pandas.core.series.Series'>

print(df.groupby(['c_0', 'c_1']).agg(lambda x: str(x.values)))
#              c_2      c_3
# c_0 c_1                  
# A   X        [0]    [125]
#     Y        [1]     [64]
# B   X    [ 4 16]  [27  1]
#     Y    [ 9 25]  [16  0]

Calculate multiple statistics: describe()

Use the describe() method of the GroupBy object to calculate key statistics for each group at once. This is more convenient than specifying each statistic individually using agg().

The following example only outputs results for the column c_2.

print(df.groupby(['c_0', 'c_1']).describe()['c_2'])
#          count  mean        std  min   25%   50%   75%   max
# c_0 c_1                                                     
# A   X      1.0   0.0        NaN  0.0   0.0   0.0   0.0   0.0
#     Y      1.0   1.0        NaN  1.0   1.0   1.0   1.0   1.0
# B   X      2.0  10.0   8.485281  4.0   7.0  10.0  13.0  16.0
#     Y      2.0  17.0  11.313708  9.0  13.0  17.0  21.0  25.0

See the following article for the meaning of each statistic.

Apply any function to each group: apply()

Use the apply() method of the GroupBy object to apply any function to each group.

Each group is passed as a DataFrame to the function (callable object) specified in the first argument. Note that the DataFrame passed includes the columns specified as keys.

print(df.groupby(['c_0', 'c_1']).apply(lambda x: type(x)))
# c_0  c_1
# A    X      <class 'pandas.core.frame.DataFrame'>
#      Y      <class 'pandas.core.frame.DataFrame'>
# B    X      <class 'pandas.core.frame.DataFrame'>
#      Y      <class 'pandas.core.frame.DataFrame'>
# dtype: object

dfs = []
df.groupby(['c_0', 'c_1']).apply(lambda x: dfs.append(x))
print(dfs[0])
#     c_0 c_1  c_2  c_3
# r_0   A   X    0  125

print(dfs[1])
#     c_0 c_1  c_2  c_3
# r_1   A   Y    1   64

print(dfs[2])
#     c_0 c_1  c_2  c_3
# r_2   B   X    4   27
# r_4   B   X   16    1

print(dfs[3])
#     c_0 c_1  c_2  c_3
# r_3   B   Y    9   16
# r_5   B   Y   25    0

The result's form changes depending on the type of object returned by the function specified in apply() and the arguments of groupby().

Some patterns are shown below. Due to its complexity, it is not necessary to remember all the detailed patterns. Instead, simply be aware that various patterns exist. When you actually use it, it's a good idea to experiment with your expected inputs to see the results.

Specifying a function that returns a scalar value results in a Series. However, with as_index=False, it returns a DataFrame.

print(df.groupby(['c_0', 'c_1']).apply(lambda x: x['c_2'].max()))
# c_0  c_1
# A    X       0
#      Y       1
# B    X      16
#      Y      25
# dtype: int64

print(df.groupby(['c_0', 'c_1'], as_index=False).apply(lambda x: x['c_2'].max()))
#   c_0 c_1  None
# 0   A   X     0
# 1   A   Y     1
# 2   B   X    16
# 3   B   Y    25

When you specify a function that returns a Series, a DataFrame is returned if the index of the Series matches the original column name, or a Series is returned if it differs.

print(dfs[0][['c_2', 'c_3']].max())
# c_2      0
# c_3    125
# dtype: int64

print(dfs[0][['c_2', 'c_3']].max(axis=1))
# r_0    125
# dtype: int64

print(df.groupby(['c_0', 'c_1']).apply(lambda x: x[['c_2', 'c_3']].max()))
#          c_2  c_3
# c_0 c_1          
# A   X      0  125
#     Y      1   64
# B   X     16   27
#     Y     25   16

print(df.groupby(['c_0', 'c_1']).apply(lambda x: x[['c_2', 'c_3']].max(axis=1)))
# c_0  c_1     
# A    X    r_0    125
#      Y    r_1     64
# B    X    r_2     27
#           r_4     16
#      Y    r_3     16
#           r_5     25
# dtype: int64

Furthermore, when a Series is returned, the index changes depending on the as_index and group_keys arguments of groupby().

print(
    df.groupby(['c_0', 'c_1'], as_index=False).apply(
        lambda x: x[['c_2', 'c_3']].max(axis=1)
    )
)
# 0  r_0    125
# 1  r_1     64
# 2  r_2     27
#    r_4     16
# 3  r_3     16
#    r_5     25
# dtype: int64

print(
    df.groupby(['c_0', 'c_1'], group_keys=False).apply(
        lambda x: x[['c_2', 'c_3']].max(axis=1)
    )
)
# r_0    125
# r_1     64
# r_2     27
# r_4     16
# r_3     16
# r_5     25
# dtype: int64

Specifying a function that returns a DataFrame results in a DataFrame. The index changes depending on the as_index and group_keys arguments of groupby().

print(df.groupby(['c_0', 'c_1']).apply(lambda x: x[['c_2', 'c_3']] * 10))
#              c_2   c_3
# c_0 c_1               
# A   X   r_0    0  1250
#     Y   r_1   10   640
# B   X   r_2   40   270
#         r_4  160    10
#     Y   r_3   90   160
#         r_5  250     0

print(
    df.groupby(['c_0', 'c_1'], as_index=False).apply(lambda x: x[['c_2', 'c_3']] * 10)
)
#        c_2   c_3
# 0 r_0    0  1250
# 1 r_1   10   640
# 2 r_2   40   270
#   r_4  160    10
# 3 r_3   90   160
#   r_5  250     0

print(
    df.groupby(['c_0', 'c_1'], group_keys=False).apply(lambda x: x[['c_2', 'c_3']] * 10)
)
#      c_2   c_3
# r_0    0  1250
# r_1   10   640
# r_2   40   270
# r_3   90   160
# r_4  160    10
# r_5  250     0

Related Categories

Related Articles