pandas: Count values in DataFrame/Series with conditions
This article explains how to count values in a pandas.DataFrame or pandas.Series that meet specific conditions by column, by row, and in total.
The count() method of DataFrame and Series, which will be explained later, counts the number of non-NaN values.
For methods on extracting rows that meet conditions and counting the number of unique values in each column, refer to the following articles.
The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.
import pandas as pd
print(pd.__version__)
# 2.1.4
df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
How to count values with conditions
The process for counting values that meet specific conditions is as follows:
- Evaluate each value to produce a Boolean
DataFrameorSeries, using methods such as comparison operators or string accessors. - Use the
sum()method to countTruevaluesDataFrame- Count by column:
sum() - Count by row:
sum(axis=1) - Count in total:
sum().sum()orvalues.sum()
- Count by column:
Series- Count in total:
sum()
- Count in total:
DataFrame
When you apply comparison operators to a DataFrame or Series, it evaluates each value, resulting in a Boolean DataFrame or Series of the same size.
The parentheses () on the right side are optional.
df_bool = (df == 'CA')
print(df_bool)
# name age state point
# 0 False False False False
# 1 False False True False
# 2 False False True False
# 3 False False False False
# 4 False False True False
# 5 False False False False
Since True is treated as 1 and False as 0 in Python, you can count values that meet the conditions using the sum() method. By default, it counts by column, and setting axis=1 counts by row.
print(df_bool.sum())
# name 0
# age 0
# state 3
# point 0
# dtype: int64
print(df_bool.sum(axis=1))
# 0 0
# 1 1
# 2 1
# 3 0
# 4 1
# 5 0
# dtype: int64
The sum() method of DataFrame returns a Series. Calling sum() on this Series yields the total count.
print(df_bool.sum().sum())
# 3
A DataFrame can be converted to a NumPy array (ndarray) using the values attribute.
The sum() method of ndarray calculates the sum across the entire array by default, so calling sum() on the values attribute (ndarray) retrieves the total number of values that meet the condition.
print(df_bool.values)
# [[False False False False]
# [False False True False]
# [False False True False]
# [False False False False]
# [False False True False]
# [False False False False]]
print(type(df_bool.values))
# <class 'numpy.ndarray'>
print(df_bool.values.sum())
# 3
You can write as follows:
print((df == 'CA').sum())
# name 0
# age 0
# state 3
# point 0
# dtype: int64
print((df == 'CA').sum(axis=1))
# 0 0
# 1 1
# 2 1
# 3 0
# 4 1
# 5 0
# dtype: int64
print((df == 'CA').sum().sum())
# 3
print((df == 'CA').values.sum())
# 3
Note that performing numerical comparison operations on a DataFrame with mixed numeric and string columns will result in an error. Details are discussed later.
Series
Consider the following Series.
s = df['age']
print(s)
# 0 24
# 1 42
# 2 18
# 3 68
# 4 24
# 5 30
# Name: age, dtype: int64
The procedure is the same as with DataFrame. Since Series is one-dimensional, the sum() method returns the total count.
s_bool = (s < 25)
print(s_bool)
# 0 True
# 1 False
# 2 True
# 3 False
# 4 True
# 5 False
# Name: age, dtype: bool
print(s_bool.sum())
# 3
print((s < 25).sum())
# 3
To count values that meet a condition in any row or column of a DataFrame, specify the row or column using [], loc[], iloc[], and perform the same process.
- pandas: Select rows/columns by index (numbers and names)
- pandas: Get/Set values with loc, iloc, at, iat
Multiple conditions (AND, OR, NOT)
To combine multiple conditions, enclose each in parentheses () and connect them using the & operator for AND or the | operator for OR. The ~ operator (NOT) can also be used.
print((df == 'CA') | (df == 70))
# name age state point
# 0 False False False False
# 1 False False True False
# 2 False False True True
# 3 False False False True
# 4 False False True False
# 5 False False False False
print(~(df == 'CA'))
# name age state point
# 0 True True True True
# 1 True True False True
# 2 True True False True
# 3 True True True True
# 4 True True False True
# 5 True True True True
print((df['state'] == 'CA') & (df['age'] < 30))
# 0 False
# 1 False
# 2 True
# 3 False
# 4 True
# 5 False
# dtype: bool
Note that using and and or instead of & and |, or omitting parentheses, will result in an error.
Once a Boolean DataFrame or Series is obtained, you can count values that meet the conditions using the sum() method as mentioned above.
Specify conditions for numerical values
As seen in previous examples, comparison operators, such as <, <=, >, >=, ==, !=, can be used for numerical values.
However, be aware that applying these operators to a DataFrame that includes string columns, when comparing against numeric values, will result in an error. To specifically work with numeric columns, use the select_dtypes() method.
# print(df < 65)
# TypeError: '<' not supported between instances of 'str' and 'int'
df_num = df.select_dtypes('number')
print(df_num)
# age point
# 0 24 64
# 1 42 92
# 2 18 70
# 3 68 70
# 4 24 88
# 5 30 57
print((df_num < 65).sum())
# age 5
# point 2
# dtype: int64
print(((df_num > 35) & (df_num < 65)).sum())
# age 1
# point 2
# dtype: int64
Specify conditions for strings
For string operations in Series, in addition to == and !=, there are additional methods available through the string accessor (str):
str.contains(): Checks if each string contains a specific substringstr.endswith(): Checks if each string ends with a specific substringstr.startswith(): Checks if each string starts with a specific substring.str.match(): Checks each string against a regular expression (regex) pattern
Note that the str accessor is specific to Series and unavailable for DataFrame.
df_str = df[['name', 'state']]
print(df_str)
# name state
# 0 Alice NY
# 1 Bob CA
# 2 Charlie CA
# 3 Dave TX
# 4 Ellen CA
# 5 Frank NY
print((df_str == 'NY').sum())
# name 0
# state 2
# dtype: int64
print(df_str['name'].str.endswith('e'))
# 0 True
# 1 False
# 2 True
# 3 True
# 4 False
# 5 False
# Name: name, dtype: bool
print(df_str['name'].str.endswith('e').sum())
# 3
See the following article on how to extract rows using the str accessor.
Count NaN and non-NaN values
As an example, consider the data on Titanic survivors.
df_titanic = pd.read_csv('data/src/titanic_train.csv')
print(df_titanic.head())
# PassengerId Survived Pclass \
# 0 1 0 3
# 1 2 1 1
# 2 3 1 3
# 3 4 1 1
# 4 5 0 3
#
# Name Sex Age SibSp \
# 0 Braund, Mr. Owen Harris male 22.0 1
# 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
# 2 Heikkinen, Miss. Laina female 26.0 0
# 3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
# 4 Allen, Mr. William Henry male 35.0 0
#
# Parch Ticket Fare Cabin Embarked
# 0 0 A/5 21171 7.2500 NaN S
# 1 0 PC 17599 71.2833 C85 C
# 2 0 STON/O2. 3101282 7.9250 NaN S
# 3 0 113803 53.1000 C123 S
# 4 0 373450 8.0500 NaN S
See the following articles for information on removing, replacing, and detecting NaN.
- pandas: Remove NaN (missing values) with dropna()
- pandas: Replace NaN (missing values) with fillna()
- pandas: Detect and count NaN (missing values) with isnull(), isna()
Count NaN values
To count NaN values, use the isnull() method, which checks if each value is NaN. You can use sum() as in the previous examples.
print(df_titanic.isnull().head())
# PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket \
# 0 False False False False False False False False False
# 1 False False False False False False False False False
# 2 False False False False False False False False False
# 3 False False False False False False False False False
# 4 False False False False False False False False False
#
# Fare Cabin Embarked
# 0 False True False
# 1 False False False
# 2 False True False
# 3 False False False
# 4 False True False
print(df_titanic.isnull().sum())
# PassengerId 0
# Survived 0
# Pclass 0
# Name 0
# Sex 0
# Age 177
# SibSp 0
# Parch 0
# Ticket 0
# Fare 0
# Cabin 687
# Embarked 2
# dtype: int64
print(df_titanic.isnull().sum(axis=1).head())
# 0 1
# 1 0
# 2 1
# 3 0
# 4 1
# dtype: int64
print(df_titanic.isnull().values.sum())
# 866
Count non-NaN values: count()
To count non-NaN values, use the count() method. Similar to sum(), it counts column-wise by default, and setting axis=1 counts row-wise.
- pandas.DataFrame.count — pandas 2.1.4 documentation
- pandas.Series.count — pandas 2.1.4 documentation
print(df_titanic.count())
# PassengerId 891
# Survived 891
# Pclass 891
# Name 891
# Sex 891
# Age 714
# SibSp 891
# Parch 891
# Ticket 891
# Fare 891
# Cabin 204
# Embarked 889
# dtype: int64
print(df_titanic.count(axis=1).head())
# 0 11
# 1 12
# 2 11
# 3 12
# 4 11
# dtype: int64
print(df_titanic.count().sum())
# 9826
print(df_titanic['Age'].count())
# 714
To simply check the count of non-NaN values, it can be displayed using the info() method.
df_titanic.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 891 entries, 0 to 890
# Data columns (total 12 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 PassengerId 891 non-null int64
# 1 Survived 891 non-null int64
# 2 Pclass 891 non-null int64
# 3 Name 891 non-null object
# 4 Sex 891 non-null object
# 5 Age 714 non-null float64
# 6 SibSp 891 non-null int64
# 7 Parch 891 non-null int64
# 8 Ticket 891 non-null object
# 9 Fare 891 non-null float64
# 10 Cabin 204 non-null object
# 11 Embarked 889 non-null object
# dtypes: float64(2), int64(5), object(5)
# memory usage: 83.7+ KB