pandas: Aggregate data with agg(), aggregate()
In pandas, you can apply multiple operations to rows or columns in a DataFrame
and aggregate them using the agg()
and aggregate()
methods. agg()
is an alias for aggregate()
, and both return the same result. These methods are also available on Series
.
To obtain the summary statistics (such as mean or standard deviation) for each column at once, you can use the describe()
method.
agg()
is also available as a method for objects returned by methods like groupby()
, resample()
, and rolling()
. The basic usage and underlying concepts are consistent with those explained in this article. For specific examples using groupby()
, refer to the following article.
The pandas and NumPy versions used in this article are as follows. Note that functionality may vary between versions.
import pandas as pd
import numpy as np
print(pd.__version__)
# 2.1.2
print(np.__version__)
# 1.26.1
agg()
is an alias for aggregate()
As noted in the introduction, agg()
is an alias for aggregate()
. They are interchangeable.
print(pd.DataFrame.agg is pd.DataFrame.aggregate)
# True
print(pd.Series.agg is pd.Series.aggregate)
# True
The following sample code uses agg()
.
Basic usage of agg()
In the case of DataFrame
Consider the following DataFrame
.
df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
print(df)
# A B
# 0 0 3
# 1 1 4
# 2 2 5
The first argument of agg()
can be a string of function/method names, a callable object, or a list of these. The following example uses a string. Further details are discussed later.
Specifying a list returns a DataFrame
, whereas a single string or a callable object returns a Series
. Note that a DataFrame
is returned even if the list contains only one element.
print(df.agg(['sum', 'mean', 'min', 'max']))
# A B
# sum 3.0 12.0
# mean 1.0 4.0
# min 0.0 3.0
# max 2.0 5.0
print(type(df.agg(['sum', 'mean', 'min', 'max'])))
# <class 'pandas.core.frame.DataFrame'>
print(df.agg(['sum']))
# A B
# sum 3 12
print(type(df.agg(['sum'])))
# <class 'pandas.core.frame.DataFrame'>
print(df.agg('sum'))
# A 3
# B 12
# dtype: int64
print(type(df.agg('sum')))
# <class 'pandas.core.series.Series'>
Specifying a dictionary (dict
) with column names as keys and operations to apply as values allows for different operations on each column.
print(df.agg({'A': ['sum', 'min', 'max'], 'B': ['mean', 'min', 'max']}))
# A B
# sum 3.0 NaN
# min 0.0 3.0
# max 2.0 5.0
# mean NaN 4.0
Specifying the operation individually, not as a list, returns a Series
. Specifying a list for any column returns a DataFrame
.
print(df.agg({'A': 'sum', 'B': 'mean'}))
# A 3.0
# B 4.0
# dtype: float64
print(df.agg({'A': ['sum'], 'B': 'mean'}))
# A B
# sum 3.0 NaN
# mean NaN 4.0
print(df.agg({'A': ['min', 'max'], 'B': 'mean'}))
# A B
# min 0.0 NaN
# max 2.0 NaN
# mean NaN 4.0
By default, the operation is applied to each column (column-wise), but by setting the axis
argument to 1
or 'columns'
, it is applied to each row (row-wise).
print(df.agg(['sum', 'mean', 'min', 'max'], axis=1))
# sum mean min max
# 0 3.0 1.5 0.0 3.0
# 1 5.0 2.5 1.0 4.0
# 2 7.0 3.5 2.0 5.0
In the case of Series
Consider the following Series
.
s = pd.Series([0, 1, 2])
print(s)
# 0 0
# 1 1
# 2 2
# dtype: int64
Specifying a list as the first argument of agg()
returns a Series
, whereas specifying a single function or method returns a scalar value. Note that a Series
is returned even if the list contains only one element.
print(s.agg(['sum', 'mean', 'min', 'max']))
# sum 3.0
# mean 1.0
# min 0.0
# max 2.0
# dtype: float64
print(type(s.agg(['sum', 'mean', 'min', 'max'])))
# <class 'pandas.core.series.Series'>
print(s.agg(['sum']))
# sum 3
# dtype: int64
print(type(s.agg(['sum'])))
# <class 'pandas.core.series.Series'>
print(s.agg('sum'))
# 3
print(type(s.agg('sum')))
# <class 'numpy.int64'>
When specified with a list, the labels of the results become the names of the operations, but when specified with a dictionary, the keys become the label names.
print(s.agg({'Total': 'sum', 'Average': 'mean', 'Min': 'min', 'Max': 'max'}))
# Total 3.0
# Average 1.0
# Min 0.0
# Max 2.0
# dtype: float64
You cannot specify a list as the value in the dictionary.
# print(s.agg({'NewLabel_1': ['sum', 'max'], 'NewLabel_2': ['mean', 'min']}))
# SpecificationError: nested renamer is not supported
Operations that can be specified in agg()
Strings of function/method names
The strings specified as the first argument in agg()
are checked by a function called _apply_str()
. Previously, it was named _try_aggregate_string_function()
.
def _apply_str(self, obj, func: str, *args, **kwargs):
"""
if arg is a string, then try to operate on it:
- try to find a function (or attribute) on obj
- try to find a numpy function
- raise
"""
A string that matches either a method/attribute name of obj
(in this case, Series
or DataFrame
) or a function name from NumPy is considered valid.
For example, 'count'
is a method in Series
and DataFrame
but not a function in NumPy, and 'amax'
is a function in NumPy but not a method in Series
and DataFrame
. Both can be specified as strings in agg()
.
df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
print(df)
# A B
# 0 0 3
# 1 1 4
# 2 2 5
print(df.agg(['count', 'amax']))
# A B
# count 3 3
# amax 2 5
print(df['A'].count())
# 3
# print(np.count(df['A']))
# AttributeError: module 'numpy' has no attribute 'count'
print(np.amax(df['A']))
# 2
# print(df['A'].amax())
# AttributeError: 'Series' object has no attribute 'amax'
Strings that don't match either will result in an error.
# print(df.agg(['xxx']))
# AttributeError: 'xxx' is not a valid function for 'Series' object
# print(df.agg('xxx'))
# AttributeError: 'xxx' is not a valid function for 'DataFrame' object
As you can see from the error message above, using a list invokes Series
methods/attributes, while a single string invokes DataFrame
methods/attributes.
As seen in the source code of _apply_str()
, NumPy functions are valid only if obj
has the __array__
attribute. DataFrame
and Series
have the __array__
attribute, but objects returned by methods like groupby()
, resample()
, and rolling()
do not.
print(hasattr(pd.DataFrame, '__array__'))
# True
print(hasattr(pd.core.groupby.GroupBy, '__array__'))
# False
Therefore, in the agg()
method of objects returned by methods like groupby()
, strings denoting NumPy function names are not recognized. However, it is possible to specify them as callable objects like np.xxx
.
Note that this is the specification as of pandas version 2.1.2. Be aware that this may change in different versions.
Callable objects
You can also specify callable objects such as functions defined with def
or lambda expressions in the first argument of agg()
.
df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
print(df)
# A B
# 0 0 3
# 1 1 4
# 2 2 5
def my_func(x):
return x.min() + x.max()
print(df.agg([my_func, lambda x: x.min() - x.max()]))
# A B
# my_func 2 8
# <lambda> -2 -2
Specify arguments for functions or methods
Keyword arguments specified in agg()
are passed to the function or method to be applied.
df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
print(df)
# A B
# 0 0 3
# 1 1 4
# 2 2 5
print(df.agg('std'))
# A 1.0
# B 1.0
# dtype: float64
print(df.agg('std', ddof=0))
# A 0.816497
# B 0.816497
# dtype: float64
print(df.agg(['std'], ddof=0))
# A B
# std 0.816497 0.816497
When multiple functions or methods are specified, all of them receive the keyword arguments. If they cannot accept them, an error will occur.
# print(df.agg(['max', 'std'], ddof=0))
# TypeError: max() got an unexpected keyword argument 'ddof'
To specify arguments individually, use lambda expressions.
print(df.agg(['max', lambda x: x.std(ddof=0)]))
# A B
# max 2.000000 5.000000
# <lambda> 0.816497 0.816497
Handle unsupported data types (dtype
)
Consider the following DataFrame
including columns with string elements.
df_str = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5], 'C': ['X', 'Y', 'Z']})
print(df_str)
# A B C
# 0 0 3 X
# 1 1 4 Y
# 2 2 5 Z
For example, using mean()
on a Series
with string elements results in an error, so specifying it in agg()
also results in an error.
# df_str['C'].mean()
# TypeError: Could not convert XYZ to numeric
# print(df_str.agg(['mean']))
# TypeError: Could not convert string 'XYZ' to numeric
Note that this is the specification as of pandas version 2.1.2. In 1.0.4, it did not result in an error but returned NaN
instead.
mean()
in DataFrame
has the numeric_only
argument, but mean()
in Series
does not. Therefore, numeric_only
cannot be used when operations are specified with a list, which means they are processed as a Series
.
print(df_str.mean(numeric_only=True))
# A 1.0
# B 4.0
# dtype: float64
print(df_str.agg('mean', numeric_only=True))
# A 1.0
# B 4.0
# dtype: float64
# df_str['C'].mean(numeric_only=True)
# TypeError: Series.mean does not allow numeric_only=True with non-numeric dtypes.
# print(df_str.agg(['mean'], numeric_only=True))
# TypeError: Series.mean does not allow numeric_only=True with non-numeric dtypes.
To apply agg()
to only numerical columns, first use select_dtypes()
.
print(df_str.select_dtypes(include='number').agg(['sum', 'mean']))
# A B
# sum 3.0 12.0
# mean 1.0 4.0