pandas: Aggregate data with agg(), aggregate()

Posted: 2024-01-18 | Tags: Python, pandas

In pandas, you can apply multiple operations to rows or columns in a DataFrame and aggregate them using the agg() and aggregate() methods. agg() is an alias for aggregate(), and both return the same result. These methods are also available on Series.

Contents

agg() is an alias for aggregate()
Basic usage of agg()
- In the case of DataFrame
- In the case of Series
Operations that can be specified in agg()
Handle unsupported data types (dtype)

To obtain the summary statistics (such as mean or standard deviation) for each column at once, you can use the describe() method.

pandas: Get summary statistics for each column with describe()

agg() is also available as a method for objects returned by methods like groupby(), resample(), and rolling(). The basic usage and underlying concepts are consistent with those explained in this article. For specific examples using groupby(), refer to the following article.

pandas: Grouping data with groupby()

The pandas and NumPy versions used in this article are as follows. Note that functionality may vary between versions.

import pandas as pd
import numpy as np

print(pd.__version__)
# 2.1.2

print(np.__version__)
# 1.26.1

source: pandas_agg.py

`agg()` is an alias for `aggregate()`

As noted in the introduction, agg() is an alias for aggregate(). They are interchangeable.

print(pd.DataFrame.agg is pd.DataFrame.aggregate)
# True

print(pd.Series.agg is pd.Series.aggregate)
# True

source: pandas_agg.py

The following sample code uses agg().

Basic usage of `agg()`

In the case of `DataFrame`

Consider the following DataFrame.

df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
print(df)
#    A  B
# 0  0  3
# 1  1  4
# 2  2  5

source: pandas_agg.py

The first argument of agg() can be a string of function/method names, a callable object, or a list of these. The following example uses a string. Further details are discussed later.

Specifying a list returns a DataFrame, whereas a single string or a callable object returns a Series. Note that a DataFrame is returned even if the list contains only one element.

print(df.agg(['sum', 'mean', 'min', 'max']))
#         A     B
# sum   3.0  12.0
# mean  1.0   4.0
# min   0.0   3.0
# max   2.0   5.0

print(type(df.agg(['sum', 'mean', 'min', 'max'])))
# <class 'pandas.core.frame.DataFrame'>

print(df.agg(['sum']))
#      A   B
# sum  3  12

print(type(df.agg(['sum'])))
# <class 'pandas.core.frame.DataFrame'>

print(df.agg('sum'))
# A     3
# B    12
# dtype: int64

print(type(df.agg('sum')))
# <class 'pandas.core.series.Series'>

source: pandas_agg.py

Specifying a dictionary (dict) with column names as keys and operations to apply as values allows for different operations on each column.

print(df.agg({'A': ['sum', 'min', 'max'], 'B': ['mean', 'min', 'max']}))
#         A    B
# sum   3.0  NaN
# min   0.0  3.0
# max   2.0  5.0
# mean  NaN  4.0

source: pandas_agg.py

Specifying the operation individually, not as a list, returns a Series. Specifying a list for any column returns a DataFrame.

print(df.agg({'A': 'sum', 'B': 'mean'}))
# A    3.0
# B    4.0
# dtype: float64

print(df.agg({'A': ['sum'], 'B': 'mean'}))
#         A    B
# sum   3.0  NaN
# mean  NaN  4.0

print(df.agg({'A': ['min', 'max'], 'B': 'mean'}))
#         A    B
# min   0.0  NaN
# max   2.0  NaN
# mean  NaN  4.0

source: pandas_agg.py

By default, the operation is applied to each column (column-wise), but by setting the axis argument to 1 or 'columns', it is applied to each row (row-wise).

print(df.agg(['sum', 'mean', 'min', 'max'], axis=1))
#    sum  mean  min  max
# 0  3.0   1.5  0.0  3.0
# 1  5.0   2.5  1.0  4.0
# 2  7.0   3.5  2.0  5.0

source: pandas_agg.py

In the case of `Series`

Consider the following Series.

s = pd.Series([0, 1, 2])
print(s)
# 0    0
# 1    1
# 2    2
# dtype: int64

source: pandas_agg.py

Specifying a list as the first argument of agg() returns a Series, whereas specifying a single function or method returns a scalar value. Note that a Series is returned even if the list contains only one element.

print(s.agg(['sum', 'mean', 'min', 'max']))
# sum     3.0
# mean    1.0
# min     0.0
# max     2.0
# dtype: float64

print(type(s.agg(['sum', 'mean', 'min', 'max'])))
# <class 'pandas.core.series.Series'>

print(s.agg(['sum']))
# sum    3
# dtype: int64

print(type(s.agg(['sum'])))
# <class 'pandas.core.series.Series'>

print(s.agg('sum'))
# 3

print(type(s.agg('sum')))
# <class 'numpy.int64'>

source: pandas_agg.py

When specified with a list, the labels of the results become the names of the operations, but when specified with a dictionary, the keys become the label names.

print(s.agg({'Total': 'sum', 'Average': 'mean', 'Min': 'min', 'Max': 'max'}))
# Total      3.0
# Average    1.0
# Min        0.0
# Max        2.0
# dtype: float64

source: pandas_agg.py

You cannot specify a list as the value in the dictionary.

# print(s.agg({'NewLabel_1': ['sum', 'max'], 'NewLabel_2': ['mean', 'min']}))
# SpecificationError: nested renamer is not supported

source: pandas_agg.py

Operations that can be specified in `agg()`

Strings of function/method names

The strings specified as the first argument in agg() are checked by a function called _apply_str(). Previously, it was named _try_aggregate_string_function().

    def _apply_str(self, obj, func: str, *args, **kwargs):
        """
        if arg is a string, then try to operate on it:
        - try to find a function (or attribute) on obj
        - try to find a numpy function
        - raise
        """

source: apply.py

A string that matches either a method/attribute name of obj (in this case, Series or DataFrame) or a function name from NumPy is considered valid.

For example, 'count' is a method in Series and DataFrame but not a function in NumPy, and 'amax' is a function in NumPy but not a method in Series and DataFrame. Both can be specified as strings in agg().

df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
print(df)
#    A  B
# 0  0  3
# 1  1  4
# 2  2  5

print(df.agg(['count', 'amax']))
#        A  B
# count  3  3
# amax   2  5

print(df['A'].count())
# 3

# print(np.count(df['A']))
# AttributeError: module 'numpy' has no attribute 'count'

print(np.amax(df['A']))
# 2

# print(df['A'].amax())
# AttributeError: 'Series' object has no attribute 'amax'

source: pandas_agg.py

Strings that don't match either will result in an error.

# print(df.agg(['xxx']))
# AttributeError: 'xxx' is not a valid function for 'Series' object

# print(df.agg('xxx'))
# AttributeError: 'xxx' is not a valid function for 'DataFrame' object

source: pandas_agg.py

As you can see from the error message above, using a list invokes Series methods/attributes, while a single string invokes DataFrame methods/attributes.

As seen in the source code of _apply_str(), NumPy functions are valid only if obj has the __array__ attribute. DataFrame and Series have the __array__ attribute, but objects returned by methods like groupby(), resample(), and rolling() do not.

print(hasattr(pd.DataFrame, '__array__'))
# True

print(hasattr(pd.core.groupby.GroupBy, '__array__'))
# False

source: pandas_agg.py

Therefore, in the agg() method of objects returned by methods like groupby(), strings denoting NumPy function names are not recognized. However, it is possible to specify them as callable objects like np.xxx.

Note that this is the specification as of pandas version 2.1.2. Be aware that this may change in different versions.

Callable objects

You can also specify callable objects such as functions defined with def or lambda expressions in the first argument of agg().

df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
print(df)
#    A  B
# 0  0  3
# 1  1  4
# 2  2  5

def my_func(x):
    return x.min() + x.max()

print(df.agg([my_func, lambda x: x.min() - x.max()]))
#           A  B
# my_func   2  8
# <lambda> -2 -2

source: pandas_agg.py

Specify arguments for functions or methods

Keyword arguments specified in agg() are passed to the function or method to be applied.

df = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
print(df)
#    A  B
# 0  0  3
# 1  1  4
# 2  2  5

print(df.agg('std'))
# A    1.0
# B    1.0
# dtype: float64

print(df.agg('std', ddof=0))
# A    0.816497
# B    0.816497
# dtype: float64

print(df.agg(['std'], ddof=0))
#             A         B
# std  0.816497  0.816497

source: pandas_agg.py

When multiple functions or methods are specified, all of them receive the keyword arguments. If they cannot accept them, an error will occur.

# print(df.agg(['max', 'std'], ddof=0))
# TypeError: max() got an unexpected keyword argument 'ddof'

source: pandas_agg.py

To specify arguments individually, use lambda expressions.

print(df.agg(['max', lambda x: x.std(ddof=0)]))
#                  A         B
# max       2.000000  5.000000
# <lambda>  0.816497  0.816497

source: pandas_agg.py

Handle unsupported data types (`dtype`)

Consider the following DataFrame including columns with string elements.

df_str = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5], 'C': ['X', 'Y', 'Z']})
print(df_str)
#    A  B  C
# 0  0  3  X
# 1  1  4  Y
# 2  2  5  Z

source: pandas_agg.py

For example, using mean() on a Series with string elements results in an error, so specifying it in agg() also results in an error.

# df_str['C'].mean()
# TypeError: Could not convert XYZ to numeric

# print(df_str.agg(['mean']))
# TypeError: Could not convert string 'XYZ' to numeric

source: pandas_agg.py

Note that this is the specification as of pandas version 2.1.2. In 1.0.4, it did not result in an error but returned NaN instead.

mean() in DataFrame has the numeric_only argument, but mean() in Series does not. Therefore, numeric_only cannot be used when operations are specified with a list, which means they are processed as a Series.

pandas.DataFrame.mean — pandas 2.1.3 documentation

print(df_str.mean(numeric_only=True))
# A    1.0
# B    4.0
# dtype: float64

print(df_str.agg('mean', numeric_only=True))
# A    1.0
# B    4.0
# dtype: float64

# df_str['C'].mean(numeric_only=True)
# TypeError: Series.mean does not allow numeric_only=True with non-numeric dtypes.

# print(df_str.agg(['mean'], numeric_only=True))
# TypeError: Series.mean does not allow numeric_only=True with non-numeric dtypes.

source: pandas_agg.py

To apply agg() to only numerical columns, first use select_dtypes().

pandas: Select columns by dtype with select_dtypes()

print(df_str.select_dtypes(include='number').agg(['sum', 'mean']))
#         A     B
# sum   3.0  12.0
# mean  1.0   4.0

source: pandas_agg.py

Related Categories

Related Articles