pandas: Get summary statistics for each column with describe()
In pandas, the describe() method on DataFrame and Series allows you to get summary statistics such as the mean, standard deviation, maximum, minimum, and mode for each column.
- pandas.DataFrame.describe — pandas 2.1.4 documentation
- pandas.Series.describe — pandas 2.1.4 documentation
- Basic usage of describe()
- Specify target types: include,exclude
- Calculate top,freq, and more for numerical columns
- Calculate mean,std, and more for strings of numbers
- Details of items by describe()and corresponding methods
- Specify percentiles to calculate in describe():percentiles
- For datetime64[ns]type
The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.
import pandas as pd
print(pd.__version__)
# 2.1.4
df = pd.DataFrame({'int': [1, 2, 3, 1],
                   'float': [0.1, 0.2, 0.3, float('nan')],
                   'str': ['X', 'Y', 'X', 'Z'],
                   'str_num': ['1', '2', '3', '1'],
                   'bool': [True, True, False, True]})
print(df)
#    int  float str str_num   bool
# 0    1    0.1   X       1   True
# 1    2    0.2   Y       2   True
# 2    3    0.3   X       3  False
# 3    1    NaN   Z       1   True
print(df.dtypes)
# int          int64
# float      float64
# str         object
# str_num     object
# bool          bool
# dtype: object
Basic usage of describe()
The describe() method on Series returns a Series.
print(df['float'].describe())
# count    3.00
# mean     0.20
# std      0.10
# min      0.10
# 25%      0.15
# 50%      0.20
# 75%      0.25
# max      0.30
# Name: float, dtype: float64
print(df['str'].describe())
# count     4
# unique    3
# top       X
# freq      2
# Name: str, dtype: object
print(type(df['float'].describe()))
# <class 'pandas.core.series.Series'>
The items calculated by describe() vary depending on the data type (dtype). Details about these items will be provided later.
- For numerical columns: count,mean,std,min,25%,50%,75%,max
- For object columns (such as strings): count,unique,top,freq
The describe() method on DataFrame returns a DataFrame.
print(df.describe())
#             int  float
# count  4.000000   3.00
# mean   1.750000   0.20
# std    0.957427   0.10
# min    1.000000   0.10
# 25%    1.000000   0.15
# 50%    1.500000   0.20
# 75%    2.250000   0.25
# max    3.000000   0.30
print(type(df.describe()))
# <class 'pandas.core.frame.DataFrame'>
Rows, columns, and values can be accessed using loc and at.
print(df.describe().loc['std'])
# int      0.957427
# float    0.100000
# Name: std, dtype: float64
print(df.describe().at['std', 'int'])
# 0.9574271077563381
In a DataFrame containing mixed column types, calculations are applied only to numerical columns by default. The include and exclude arguments, which will be explained next, allow you to specify the data types to include and exclude.
Specify target types: include, exclude
To target specific data types with the describe() method in a DataFrame, use the include and exclude arguments. Remember to specify the data type, not the column name.
Target non-numerical columns
Numerical types can be represented as 'number'. Setting exclude='number' will calculate results for non-numerical columns, such as those containing strings.
print(df.describe(exclude='number'))
#        str str_num  bool
# count    4       4     4
# unique   3       3     2
# top      X       1  True
# freq     2       2     3
In the case of a DataFrame without numerical columns, calculations are applied to the available data types by default.
print(df[['str', 'str_num', 'bool']])
#   str str_num   bool
# 0   X       1   True
# 1   Y       2   True
# 2   X       3  False
# 3   Z       1   True
print(df[['str', 'str_num', 'bool']].describe())
#        str str_num  bool
# count    4       4     4
# unique   3       3     2
# top      X       1  True
# freq     2       2     3
Target all types of columns
Setting include='all' includes all types of columns. Since the calculated items vary between numerical and other column types, the values for items not calculated will be missing values (NaN).
print(df.describe(include='all'))
#              int  float  str str_num  bool
# count   4.000000   3.00    4       4     4
# unique       NaN    NaN    3       3     2
# top          NaN    NaN    X       1  True
# freq         NaN    NaN    2       2     3
# mean    1.750000   0.20  NaN     NaN   NaN
# std     0.957427   0.10  NaN     NaN   NaN
# min     1.000000   0.10  NaN     NaN   NaN
# 25%     1.000000   0.15  NaN     NaN   NaN
# 50%     1.500000   0.20  NaN     NaN   NaN
# 75%     2.250000   0.25  NaN     NaN   NaN
# max     3.000000   0.30  NaN     NaN   NaN
Include and exclude specific types
Any data type can be specified for the include and exclude arguments. A DataFrame is returned as the result, even when it contains only a single column.
print(df.describe(include=int))
#             int
# count  4.000000
# mean   1.750000
# std    0.957427
# min    1.000000
# 25%    1.000000
# 50%    1.500000
# 75%    2.250000
# max    3.000000
print(type(df.describe(include=int)))
# <class 'pandas.core.frame.DataFrame'>
Multiple types can be specified in a list. The items to be calculated are automatically determined based on the selected types.
print(df.describe(include=[object, bool]))
#        str str_num  bool
# count    4       4     4
# unique   3       3     2
# top      X       1  True
# freq     2       2     3
print(df.describe(exclude=['f8', 'object']))
#              int  bool
# count   4.000000     4
# unique       NaN     2
# top          NaN  True
# freq         NaN     3
# mean    1.750000   NaN
# std     0.957427   NaN
# min     1.000000   NaN
# 25%     1.000000   NaN
# 50%     1.500000   NaN
# 75%     2.250000   NaN
# max     3.000000   NaN
You can specify data types using type objects (like int, bool, or np.float64), type names in strings (like 'int', 'bool', or 'float64'), or type codes in strings (for example, 'f8' for 'float64'). For more details, refer to the following article.
Calculate top, freq, and more for numerical columns
For example, in categorical data represented numerically, such as when males are coded as 0, females as 1, or geographical names are assigned numbers, you might prefer to check the mode and its frequency instead of the mean or standard deviation.
In such cases, you can use the astype() method to convert the type to object, either for all columns or for specific columns.
print(df.astype(object).describe())
#         int  float str str_num  bool
# count     4    3.0   4       4     4
# unique    3    3.0   3       3     2
# top       1    0.1   X       1  True
# freq      2    1.0   2       2     3
print(df.astype({'int': object}).describe(exclude='number'))
#         int str str_num  bool
# count     4   4       4     4
# unique    3   3       3     2
# top       1   X       1  True
# freq      2   2       2     3
Calculate mean, std, and more for strings of numbers
To calculate the mean and standard deviation for strings of numbers, first convert them to a numerical data type using the astype() method.
print(df.astype({'str_num': int, 'bool': int}).describe())
#             int  float   str_num  bool
# count  4.000000   3.00  4.000000  4.00
# mean   1.750000   0.20  1.750000  0.75
# std    0.957427   0.10  0.957427  0.50
# min    1.000000   0.10  1.000000  0.00
# 25%    1.000000   0.15  1.000000  0.75
# 50%    1.500000   0.20  1.500000  1.00
# 75%    2.250000   0.25  2.250000  1.00
# max    3.000000   0.30  3.000000  1.00
Details of items by describe() and corresponding methods
This section details the items calculated by describe() and the methods for obtaining them individually. While you can access the values for each item by selecting the rows in the DataFrame returned by describe(), using individual methods is more efficient when you do not require other items.
Note that describe() excludes missing values (NaN) in its calculations, but some individual methods provide an option to include or exclude NaN.
count: Number of non-NaN-values
count represents the number of non-NaN-values, which can be obtained by the count() method.
print(df.count())
# int        4
# float      3
# str        4
# str_num    4
# bool       4
# dtype: int64
unique: Number of unique values
unique represents the number of unique values, which can be obtained by the nunique() method.
print(df.nunique())
# int        3
# float      3
# str        3
# str_num    3
# bool       2
# dtype: int64
top: Mode
top represents the mode (the most frequent value), which can be obtained by the mode() method.
mode() on DataFrame returns a DataFrame. If there are multiple modes, any unfilled positions in the result are represented as missing values (NaN).
print(df.mode())
#    int  float  str str_num  bool
# 0  1.0    0.1    X       1  True
# 1  NaN    0.2  NaN     NaN   NaN
# 2  NaN    0.3  NaN     NaN   NaN
Selecting the first row with iloc[0] allows you to obtain at least one mode for each column.
print(df.mode().iloc[0])
# int         1.0
# float       0.1
# str           X
# str_num       1
# bool       True
# Name: 0, dtype: object
Note that top in describe() returns only one of the modes if there are multiple, and it does not necessarily match the first row of mode().
freq: Frequency of the mode
freq represents the frequency of the mode, which can be obtained by the value_counts() method on Series.
print(df['str'].value_counts())
# str
# X    2
# Y    1
# Z    1
# Name: count, dtype: int64
print(df['str'].value_counts().iat[0])
# 2
For more details on obtaining the frequency of the mode for each column, refer to the following article.
mean: Arithmetic mean
mean represents the arithmetic mean, which can be obtained by the mean() method.
Setting the numeric_only argument to True targets only numerical columns. The same applies to methods described in the following subsections.
print(df.mean(numeric_only=True))
# int      1.75
# float    0.20
# bool     0.75
# dtype: float64
In describe(), bool type columns are excluded, whereas in mean(), True is treated as 1 and False as 0. The same applies to methods described in the following subsections.
std: Sample standard deviation
std represents the sample standard deviation, which can be obtained by the std() method.
print(df.std(numeric_only=True))
# int      0.957427
# float    0.100000
# bool     0.500000
# dtype: float64
min: Minimum Value
min represents the minimum value, which can be obtained by the min() method.
print(df.min(numeric_only=True))
# int          1
# float      0.1
# bool     False
# dtype: object
max: Maximum Value
max represents the maximum value, which can be obtained by the max() method.
print(df.max(numeric_only=True))
# int         3
# float     0.3
# bool     True
# dtype: object
50%: Median (50th percentile)
50% represents the median (50th percentile), which can be obtained by the median() method.
pandas.DataFrame.median — pandas 2.1.4 documentation
print(df.median(numeric_only=True))
# int      1.5
# float    0.2
# bool     1.0
# dtype: float64
25%, 75%: 25th and 75th percentiles
25% and 75% represent the 25th and 75th percentiles, which can be obtained by the quantile() method.
As of pandas version 2.1.4, even when setting the numeric_only argument to True, an error occurs if there are bool columns.
# print(df.quantile(q=[0.25, 0.75], numeric_only=True))
# TypeError: numpy boolean subtract, the `-` operator, is not supported, use the bitwise_xor, the `^` operator, or the logical_xor function instead.
The following example uses a DataFrame that excludes the bool column. quantile() can calculate the minimum value, 25th percentile, median, 75th percentile, and maximum value all at once.
print(df.iloc[:, :-1])
#    int  float str str_num
# 0    1    0.1   X       1
# 1    2    0.2   Y       2
# 2    3    0.3   X       3
# 3    1    NaN   Z       1
print(df.iloc[:, :-1].quantile(q=[0, 0.25, 0.5, 0.75, 1], numeric_only=True))
#        int  float
# 0.00  1.00   0.10
# 0.25  1.00   0.15
# 0.50  1.50   0.20
# 0.75  2.25   0.25
# 1.00  3.00   0.30
You can specify the percentiles to calculate in describe() with the percentiles argument.
Specify percentiles to calculate in describe(): percentiles
By default, describe() calculates the minimum value (0th percentile), median (50th percentile), and maximum value (100th percentile), along with the 25th and 75th percentiles.
The minimum, median, and maximum values are always calculated, but others can be specified with the percentiles argument. Specify a list of values ranging from 0.0 to 1.0.
print(df.describe(percentiles=[0.2, 0.4, 0.6, 0.8]))
#             int  float
# count  4.000000   3.00
# mean   1.750000   0.20
# std    0.957427   0.10
# min    1.000000   0.10
# 20%    1.000000   0.14
# 40%    1.200000   0.18
# 50%    1.500000   0.20
# 60%    1.800000   0.22
# 80%    2.400000   0.26
# max    3.000000   0.30
For datetime64[ns] type
Pandas has datetime64[ns] for representing dates and times.
df['dt'] = pd.to_datetime(['2023-12-01', '2023-12-08', '2023-12-15', '2023-12-22'])
print(df)
#    int  float str str_num   bool         dt
# 0    1    0.1   X       1   True 2023-12-01
# 1    2    0.2   Y       2   True 2023-12-08
# 2    3    0.3   X       3  False 2023-12-15
# 3    1    NaN   Z       1   True 2023-12-22
print(df.dtypes)
# int                 int64
# float             float64
# str                object
# str_num            object
# bool                 bool
# dt         datetime64[ns]
# dtype: object
In previous versions, for columns of type datetime64[ns], additional calculations like first (the earliest datetime) and last (the latest datetime) were performed along with count, unique, top, and freq.
As of pandas version 2.1.4, datetime64[ns] columns are treated similarly to numeric columns. min and max correspond to first and last, respectively. To calculate count or unique for datetime64[ns] columns, you can convert their type using astype().
print(df.describe())
#             int  float                   dt
# count  4.000000   3.00                    4
# mean   1.750000   0.20  2023-12-11 12:00:00
# min    1.000000   0.10  2023-12-01 00:00:00
# 25%    1.000000   0.15  2023-12-06 06:00:00
# 50%    1.500000   0.20  2023-12-11 12:00:00
# 75%    2.250000   0.25  2023-12-16 18:00:00
# max    3.000000   0.30  2023-12-22 00:00:00
# std    0.957427   0.10                  NaN
print(df.astype(object).describe())
#         int  float str str_num  bool                   dt
# count     4    3.0   4       4     4                    4
# unique    3    3.0   3       3     2                    4
# top       1    0.1   X       1  True  2023-12-01 00:00:00
# freq      2    1.0   2       2     3                    1
For information on calculating percentiles for datetime, refer to the following article.