pandas: Count values in DataFrame/Series with conditions

Modified: | Tags: Python, pandas

This article explains how to count values in a pandas.DataFrame or pandas.Series that meet specific conditions by column, by row, and in total.

The count() method of DataFrame and Series, which will be explained later, counts the number of non-NaN values.

For methods on extracting rows that meet conditions and counting the number of unique values in each column, refer to the following articles.

The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.

import pandas as pd

print(pd.__version__)
# 2.1.4

df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

How to count values with conditions

The process for counting values that meet specific conditions is as follows:

  1. Evaluate each value to produce a Boolean DataFrame or Series, using methods such as comparison operators or string accessors.
  2. Use the sum() method to count True values
    • DataFrame
      • Count by column: sum()
      • Count by row: sum(axis=1)
      • Count in total: sum().sum() or values.sum()
    • Series
      • Count in total: sum()

DataFrame

When you apply comparison operators to a DataFrame or Series, it evaluates each value, resulting in a Boolean DataFrame or Series of the same size.

The parentheses () on the right side are optional.

df_bool = (df == 'CA')
print(df_bool)
#     name    age  state  point
# 0  False  False  False  False
# 1  False  False   True  False
# 2  False  False   True  False
# 3  False  False  False  False
# 4  False  False   True  False
# 5  False  False  False  False

Since True is treated as 1 and False as 0 in Python, you can count values that meet the conditions using the sum() method. By default, it counts by column, and setting axis=1 counts by row.

print(df_bool.sum())
# name     0
# age      0
# state    3
# point    0
# dtype: int64

print(df_bool.sum(axis=1))
# 0    0
# 1    1
# 2    1
# 3    0
# 4    1
# 5    0
# dtype: int64

The sum() method of DataFrame returns a Series. Calling sum() on this Series yields the total count.

print(df_bool.sum().sum())
# 3

A DataFrame can be converted to a NumPy array (ndarray) using the values attribute.

The sum() method of ndarray calculates the sum across the entire array by default, so calling sum() on the values attribute (ndarray) retrieves the total number of values that meet the condition.

print(df_bool.values)
# [[False False False False]
#  [False False  True False]
#  [False False  True False]
#  [False False False False]
#  [False False  True False]
#  [False False False False]]

print(type(df_bool.values))
# <class 'numpy.ndarray'>

print(df_bool.values.sum())
# 3

You can write as follows:

print((df == 'CA').sum())
# name     0
# age      0
# state    3
# point    0
# dtype: int64

print((df == 'CA').sum(axis=1))
# 0    0
# 1    1
# 2    1
# 3    0
# 4    1
# 5    0
# dtype: int64

print((df == 'CA').sum().sum())
# 3

print((df == 'CA').values.sum())
# 3

Note that performing numerical comparison operations on a DataFrame with mixed numeric and string columns will result in an error. Details are discussed later.

Series

Consider the following Series.

s = df['age']
print(s)
# 0    24
# 1    42
# 2    18
# 3    68
# 4    24
# 5    30
# Name: age, dtype: int64

The procedure is the same as with DataFrame. Since Series is one-dimensional, the sum() method returns the total count.

s_bool = (s < 25)
print(s_bool)
# 0     True
# 1    False
# 2     True
# 3    False
# 4     True
# 5    False
# Name: age, dtype: bool

print(s_bool.sum())
# 3

print((s < 25).sum())
# 3

To count values that meet a condition in any row or column of a DataFrame, specify the row or column using [], loc[], iloc[], and perform the same process.

Multiple conditions (AND, OR, NOT)

To combine multiple conditions, enclose each in parentheses () and connect them using the & operator for AND or the | operator for OR. The ~ operator (NOT) can also be used.

print((df == 'CA') | (df == 70))
#     name    age  state  point
# 0  False  False  False  False
# 1  False  False   True  False
# 2  False  False   True   True
# 3  False  False  False   True
# 4  False  False   True  False
# 5  False  False  False  False
print(~(df == 'CA'))
#    name   age  state  point
# 0  True  True   True   True
# 1  True  True  False   True
# 2  True  True  False   True
# 3  True  True   True   True
# 4  True  True  False   True
# 5  True  True   True   True
print((df['state'] == 'CA') & (df['age'] < 30))
# 0    False
# 1    False
# 2     True
# 3    False
# 4     True
# 5    False
# dtype: bool

Note that using and and or instead of & and |, or omitting parentheses, will result in an error.

Once a Boolean DataFrame or Series is obtained, you can count values that meet the conditions using the sum() method as mentioned above.

Specify conditions for numerical values

As seen in previous examples, comparison operators, such as <, <=, >, >=, ==, !=, can be used for numerical values.

However, be aware that applying these operators to a DataFrame that includes string columns, when comparing against numeric values, will result in an error. To specifically work with numeric columns, use the select_dtypes() method.

# print(df < 65)
# TypeError: '<' not supported between instances of 'str' and 'int'

df_num = df.select_dtypes('number')
print(df_num)
#    age  point
# 0   24     64
# 1   42     92
# 2   18     70
# 3   68     70
# 4   24     88
# 5   30     57

print((df_num < 65).sum())
# age      5
# point    2
# dtype: int64

print(((df_num > 35) & (df_num < 65)).sum())
# age      1
# point    2
# dtype: int64

Specify conditions for strings

For string operations in Series, in addition to == and !=, there are additional methods available through the string accessor (str):

  • str.contains(): Checks if each string contains a specific substring
  • str.endswith(): Checks if each string ends with a specific substring
  • str.startswith(): Checks if each string starts with a specific substring.
  • str.match(): Checks each string against a regular expression (regex) pattern

Note that the str accessor is specific to Series and unavailable for DataFrame.

df_str = df[['name', 'state']]
print(df_str)
#       name state
# 0    Alice    NY
# 1      Bob    CA
# 2  Charlie    CA
# 3     Dave    TX
# 4    Ellen    CA
# 5    Frank    NY

print((df_str == 'NY').sum())
# name     0
# state    2
# dtype: int64

print(df_str['name'].str.endswith('e'))
# 0     True
# 1    False
# 2     True
# 3     True
# 4    False
# 5    False
# Name: name, dtype: bool

print(df_str['name'].str.endswith('e').sum())
# 3

See the following article on how to extract rows using the str accessor.

Count NaN and non-NaN values

As an example, consider the data on Titanic survivors.

df_titanic = pd.read_csv('data/src/titanic_train.csv')
print(df_titanic.head())
#    PassengerId  Survived  Pclass  \
# 0            1         0       3   
# 1            2         1       1   
# 2            3         1       3   
# 3            4         1       1   
# 4            5         0       3   
# 
#                                                 Name     Sex   Age  SibSp  \
# 0                            Braund, Mr. Owen Harris    male  22.0      1   
# 1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
# 2                             Heikkinen, Miss. Laina  female  26.0      0   
# 3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
# 4                           Allen, Mr. William Henry    male  35.0      0   
# 
#    Parch            Ticket     Fare Cabin Embarked  
# 0      0         A/5 21171   7.2500   NaN        S  
# 1      0          PC 17599  71.2833   C85        C  
# 2      0  STON/O2. 3101282   7.9250   NaN        S  
# 3      0            113803  53.1000  C123        S  
# 4      0            373450   8.0500   NaN        S  

See the following articles for information on removing, replacing, and detecting NaN.

Count NaN values

To count NaN values, use the isnull() method, which checks if each value is NaN. You can use sum() as in the previous examples.

print(df_titanic.isnull().head())
#    PassengerId  Survived  Pclass   Name    Sex    Age  SibSp  Parch  Ticket  \
# 0        False     False   False  False  False  False  False  False   False   
# 1        False     False   False  False  False  False  False  False   False   
# 2        False     False   False  False  False  False  False  False   False   
# 3        False     False   False  False  False  False  False  False   False   
# 4        False     False   False  False  False  False  False  False   False   
# 
#     Fare  Cabin  Embarked  
# 0  False   True     False  
# 1  False  False     False  
# 2  False   True     False  
# 3  False  False     False  
# 4  False   True     False  

print(df_titanic.isnull().sum())
# PassengerId      0
# Survived         0
# Pclass           0
# Name             0
# Sex              0
# Age            177
# SibSp            0
# Parch            0
# Ticket           0
# Fare             0
# Cabin          687
# Embarked         2
# dtype: int64

print(df_titanic.isnull().sum(axis=1).head())
# 0    1
# 1    0
# 2    1
# 3    0
# 4    1
# dtype: int64

print(df_titanic.isnull().values.sum())
# 866

Count non-NaN values: count()

To count non-NaN values, use the count() method. Similar to sum(), it counts column-wise by default, and setting axis=1 counts row-wise.

print(df_titanic.count())
# PassengerId    891
# Survived       891
# Pclass         891
# Name           891
# Sex            891
# Age            714
# SibSp          891
# Parch          891
# Ticket         891
# Fare           891
# Cabin          204
# Embarked       889
# dtype: int64

print(df_titanic.count(axis=1).head())
# 0    11
# 1    12
# 2    11
# 3    12
# 4    11
# dtype: int64

print(df_titanic.count().sum())
# 9826

print(df_titanic['Age'].count())
# 714

To simply check the count of non-NaN values, it can be displayed using the info() method.

df_titanic.info()
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 891 entries, 0 to 890
# Data columns (total 12 columns):
#  #   Column       Non-Null Count  Dtype  
# ---  ------       --------------  -----  
#  0   PassengerId  891 non-null    int64  
#  1   Survived     891 non-null    int64  
#  2   Pclass       891 non-null    int64  
#  3   Name         891 non-null    object 
#  4   Sex          891 non-null    object 
#  5   Age          714 non-null    float64
#  6   SibSp        891 non-null    int64  
#  7   Parch        891 non-null    int64  
#  8   Ticket       891 non-null    object 
#  9   Fare         891 non-null    float64
#  10  Cabin        204 non-null    object 
#  11  Embarked     889 non-null    object 
# dtypes: float64(2), int64(5), object(5)
# memory usage: 83.7+ KB

Related Categories

Related Articles