pandas: Count values in DataFrame/Series with conditions
This article explains how to count values in a pandas.DataFrame or pandas.Series that meet specific conditions, by column, by row, and in total.
The count() method of DataFrame and Series, which is explained later, counts the number of non-NaN values.
For methods on extracting rows that meet conditions and counting the number of unique values in each column, refer to the following articles.
The pandas version used in this article is shown below. Note that functionality may vary between versions. The following DataFrame is used as an example.
import pandas as pd
print(pd.__version__)
# 2.1.4
df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
How to count values with conditions
The process for counting values that meet specific conditions is as follows:
- Evaluate each value to produce a Boolean DataFrame or Series, using methods such as comparison operators or string accessor methods.
- Use the sum() method to count the True values.
  - DataFrame
    - Count by column: sum()
    - Count by row: sum(axis=1)
    - Count in total: sum().sum() or values.sum()
  - Series
    - Count in total: sum()
DataFrame
When you apply a comparison operator to a DataFrame or Series, each value is evaluated, resulting in a Boolean DataFrame or Series of the same size. The parentheses () on the right side of the assignment below are optional.
df_bool = (df == 'CA')
print(df_bool)
# name age state point
# 0 False False False False
# 1 False False True False
# 2 False False True False
# 3 False False False False
# 4 False False True False
# 5 False False False False
Since True is treated as 1 and False as 0 in Python, you can count the values that meet the condition with the sum() method. By default, it counts by column; setting axis=1 counts by row.
print(df_bool.sum())
# name 0
# age 0
# state 3
# point 0
# dtype: int64
print(df_bool.sum(axis=1))
# 0 0
# 1 1
# 2 1
# 3 0
# 4 1
# 5 0
# dtype: int64
The sum() method of DataFrame returns a Series. Calling sum() on this Series yields the total count.
print(df_bool.sum().sum())
# 3
A DataFrame can be converted to a NumPy array (ndarray) using the values attribute. The sum() method of ndarray calculates the sum across the entire array by default, so calling sum() on the values attribute (an ndarray) gives the total number of values that meet the condition.
print(df_bool.values)
# [[False False False False]
# [False False True False]
# [False False True False]
# [False False False False]
# [False False True False]
# [False False False False]]
print(type(df_bool.values))
# <class 'numpy.ndarray'>
print(df_bool.values.sum())
# 3
You can also write everything as a single expression without the intermediate variable:
print((df == 'CA').sum())
# name 0
# age 0
# state 3
# point 0
# dtype: int64
print((df == 'CA').sum(axis=1))
# 0 0
# 1 1
# 2 1
# 3 0
# 4 1
# 5 0
# dtype: int64
print((df == 'CA').sum().sum())
# 3
print((df == 'CA').values.sum())
# 3
Note that performing numerical comparison operations on a DataFrame with mixed numeric and string columns will result in an error. Details are discussed later.
Series
Consider the following Series.
s = df['age']
print(s)
# 0 24
# 1 42
# 2 18
# 3 68
# 4 24
# 5 30
# Name: age, dtype: int64
The procedure is the same as with DataFrame. Since a Series is one-dimensional, the sum() method returns the total count.
s_bool = (s < 25)
print(s_bool)
# 0 True
# 1 False
# 2 True
# 3 False
# 4 True
# 5 False
# Name: age, dtype: bool
print(s_bool.sum())
# 3
print((s < 25).sum())
# 3
To count values that meet a condition in a specific row or column of a DataFrame, select that row or column with [], loc[], or iloc[], and apply the same process, as in the sketch after the following articles.
- pandas: Select rows/columns by index (numbers and names)
- pandas: Get/Set values with loc, iloc, at, iat
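As a minimal sketch using the sample df from above: selecting a single column with [] (or a single row with loc[]) returns a Series, so sum() gives the count directly.
# Count values >= 70 in the 'point' column
print((df['point'] >= 70).sum())
# 4
# Count values equal to 'CA' in the row with index label 2
print((df.loc[2] == 'CA').sum())
# 1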
Multiple conditions (AND, OR, NOT)
To combine multiple conditions, enclose each condition in parentheses () and connect them with the & operator for AND or the | operator for OR. The ~ operator (NOT) can also be used.
print((df == 'CA') | (df == 70))
# name age state point
# 0 False False False False
# 1 False False True False
# 2 False False True True
# 3 False False False True
# 4 False False True False
# 5 False False False False
print(~(df == 'CA'))
# name age state point
# 0 True True True True
# 1 True True False True
# 2 True True False True
# 3 True True True True
# 4 True True False True
# 5 True True True True
print((df['state'] == 'CA') & (df['age'] < 30))
# 0 False
# 1 False
# 2 True
# 3 False
# 4 True
# 5 False
# dtype: bool
Note that using and and or instead of & and |, or omitting the parentheses, will result in an error.
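For example, using and instead of & raises an error because the truth value of a Series is ambiguous (shown commented out, as in the later examples; the exact message may vary by pandas version).
# print((df['state'] == 'CA') and (df['age'] < 30))
# ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().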
Once a Boolean DataFrame or Series is obtained, you can count the values that meet the conditions with the sum() method, as described above.
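As a short follow-up sketch using the same df, applying sum() to the combined conditions shown above gives the counts directly.
# Rows where state is 'CA' and age is under 30
print(((df['state'] == 'CA') & (df['age'] < 30)).sum())
# 2
# Total number of values equal to 'CA' or 70 in the whole DataFrame
print(((df == 'CA') | (df == 70)).sum().sum())
# 5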
Specify conditions for numerical values
As seen in the previous examples, comparison operators such as <, <=, >, >=, ==, and != can be used for numerical values. However, be aware that applying these operators to a DataFrame that includes string columns, when comparing against numeric values, will result in an error. To work with only the numeric columns, use the select_dtypes() method.
# print(df < 65)
# TypeError: '<' not supported between instances of 'str' and 'int'
df_num = df.select_dtypes('number')
print(df_num)
# age point
# 0 24 64
# 1 42 92
# 2 18 70
# 3 68 70
# 4 24 88
# 5 30 57
print((df_num < 65).sum())
# age 5
# point 2
# dtype: int64
print(((df_num > 35) & (df_num < 65)).sum())
# age 1
# point 2
# dtype: int64
Specify conditions for strings
For string conditions on a Series, in addition to == and !=, the following methods are available through the string accessor (str):
- str.contains(): Checks if each string contains a specific substring
- str.endswith(): Checks if each string ends with a specific substring
- str.startswith(): Checks if each string starts with a specific substring
- str.match(): Checks if each string matches a regular expression (regex) pattern
Note that the str accessor is specific to Series and is not available for DataFrame.
df_str = df[['name', 'state']]
print(df_str)
# name state
# 0 Alice NY
# 1 Bob CA
# 2 Charlie CA
# 3 Dave TX
# 4 Ellen CA
# 5 Frank NY
print((df_str == 'NY').sum())
# name 0
# state 2
# dtype: int64
print(df_str['name'].str.endswith('e'))
# 0 True
# 1 False
# 2 True
# 3 True
# 4 False
# 5 False
# Name: name, dtype: bool
print(df_str['name'].str.endswith('e').sum())
# 3
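The other str methods can be used in the same way. A brief sketch with the same df_str follows; note that str.contains() and str.match() interpret their arguments as regular expressions and are case-sensitive by default.
# Names containing a lowercase 'a'
print(df_str['name'].str.contains('a').sum())
# 3
# Names whose first character is in the range A-C
print(df_str['name'].str.match('[A-C]').sum())
# 3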
See the following article on how to extract rows using the str accessor.
Count NaN and non-NaN values
As an example, consider the data on Titanic survivors.
df_titanic = pd.read_csv('data/src/titanic_train.csv')
print(df_titanic.head())
# PassengerId Survived Pclass \
# 0 1 0 3
# 1 2 1 1
# 2 3 1 3
# 3 4 1 1
# 4 5 0 3
#
# Name Sex Age SibSp \
# 0 Braund, Mr. Owen Harris male 22.0 1
# 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
# 2 Heikkinen, Miss. Laina female 26.0 0
# 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
# 4 Allen, Mr. William Henry male 35.0 0
#
# Parch Ticket Fare Cabin Embarked
# 0 0 A/5 21171 7.2500 NaN S
# 1 0 PC 17599 71.2833 C85 C
# 2 0 STON/O2. 3101282 7.9250 NaN S
# 3 0 113803 53.1000 C123 S
# 4 0 373450 8.0500 NaN S
See the following articles for information on removing, replacing, and detecting NaN.
- pandas: Remove NaN (missing values) with dropna()
- pandas: Replace NaN (missing values) with fillna()
- pandas: Detect and count NaN (missing values) with isnull(), isna()
Count NaN values
To count NaN values, use the isnull() method (isna() is an alias), which checks whether each value is NaN. You can then apply sum() as in the previous examples.
print(df_titanic.isnull().head())
# PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket \
# 0 False False False False False False False False False
# 1 False False False False False False False False False
# 2 False False False False False False False False False
# 3 False False False False False False False False False
# 4 False False False False False False False False False
#
# Fare Cabin Embarked
# 0 False True False
# 1 False False False
# 2 False True False
# 3 False False False
# 4 False True False
print(df_titanic.isnull().sum())
# PassengerId 0
# Survived 0
# Pclass 0
# Name 0
# Sex 0
# Age 177
# SibSp 0
# Parch 0
# Ticket 0
# Fare 0
# Cabin 687
# Embarked 2
# dtype: int64
print(df_titanic.isnull().sum(axis=1).head())
# 0 1
# 1 0
# 2 1
# 3 0
# 4 1
# dtype: int64
print(df_titanic.isnull().values.sum())
# 866
Count non-NaN values: count()
To count non-NaN values, use the count() method. Like sum(), it counts by column by default, and setting axis=1 counts by row.
- pandas.DataFrame.count — pandas 2.1.4 documentation
- pandas.Series.count — pandas 2.1.4 documentation
print(df_titanic.count())
# PassengerId 891
# Survived 891
# Pclass 891
# Name 891
# Sex 891
# Age 714
# SibSp 891
# Parch 891
# Ticket 891
# Fare 891
# Cabin 204
# Embarked 889
# dtype: int64
print(df_titanic.count(axis=1).head())
# 0 11
# 1 12
# 2 11
# 3 12
# 4 11
# dtype: int64
print(df_titanic.count().sum())
# 9826
print(df_titanic['Age'].count())
# 714
To simply check the number of non-NaN values in each column, you can also use the info() method.
df_titanic.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 891 entries, 0 to 890
# Data columns (total 12 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 PassengerId 891 non-null int64
# 1 Survived 891 non-null int64
# 2 Pclass 891 non-null int64
# 3 Name 891 non-null object
# 4 Sex 891 non-null object
# 5 Age 714 non-null float64
# 6 SibSp 891 non-null int64
# 7 Parch 891 non-null int64
# 8 Ticket 891 non-null object
# 9 Fare 891 non-null float64
# 10 Cabin 204 non-null object
# 11 Embarked 889 non-null object
# dtypes: float64(2), int64(5), object(5)
# memory usage: 83.7+ KB