pandas: Get unique values and their counts in a column

Tags: Python, pandas

This article explains how to get unique values and their counts in a column (= Series) of a DataFrame in pandas.

Use the unique(), value_counts(), and nunique() methods on Series. nunique() is also available as a method on DataFrame.

  • pandas.Series.unique() returns unique values as a NumPy array (ndarray)
  • pandas.Series.value_counts() returns unique values and their counts as a Series
  • pandas.Series.nunique() and pandas.DataFrame.nunique() return the number of unique values as either an int or a Series

This article begins by explaining the basic usage of each method, then shows how to get unique values and their counts, and more.

To count values that meet certain conditions, refer to the following article.
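
A minimal self-contained sketch of that idea (using an ad-hoc Series here, not the sample data introduced below): a comparison yields a boolean Series, and summing it counts the elements that satisfy the condition.

import pandas as pd

s = pd.Series([10, 25, 30, 5])

print((s >= 20).sum())
# 2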

The describe() method is useful to compute summary statistics including the mode and its frequency.

The pandas version used in this article is as follows. Note that functionality may vary between versions. The following data is used for the examples. Missing values (NaN) are inserted for explanation purposes.

import pandas as pd

print(pd.__version__)
# 2.1.4

df = pd.read_csv('data/src/sample_pandas_normal.csv')
df.iloc[1] = float('nan')
print(df)
#       name   age state  point
# 0    Alice  24.0    NY   64.0
# 1      NaN   NaN   NaN    NaN
# 2  Charlie  18.0    CA   70.0
# 3     Dave  68.0    TX   70.0
# 4    Ellen  24.0    CA   88.0
# 5    Frank  30.0    NY   57.0

pandas.Series.unique()

unique() returns unique values as a one-dimensional NumPy array (ndarray). Missing values (NaN) are included. The values are arranged in the order of appearance.

print(df['state'].unique())
# ['NY' nan 'CA' 'TX']

print(type(df['state'].unique()))
# <class 'numpy.ndarray'>

pandas.Series.value_counts()

value_counts() returns a Series where the unique values are the index (labels) and their counts are the values.

print(df['state'].value_counts())
# state
# NY    2
# CA    2
# TX    1
# Name: count, dtype: int64

print(type(df['state'].value_counts()))
# <class 'pandas.core.series.Series'>

By default, missing values (NaN) are excluded, but if the dropna argument is set to False, they are also counted.

print(df['state'].value_counts(dropna=False))
# state
# NY     2
# CA     2
# NaN    1
# TX     1
# Name: count, dtype: int64

By default, the values are sorted in descending order of frequency. If the ascending argument is set to True, they are sorted in ascending order. Alternatively, setting sort to False leaves them unsorted, arranged in their original order of appearance.

print(df['state'].value_counts(dropna=False, ascending=True))
# state
# NaN    1
# TX     1
# NY     2
# CA     2
# Name: count, dtype: int64

print(df['state'].value_counts(dropna=False, sort=False))
# state
# NY     2
# NaN    1
# CA     2
# TX     1
# Name: count, dtype: int64

If the normalize argument is set to True, the values are normalized so that their total is 1. Note that if NaN is included, the resulting proportions change depending on the dropna setting.

print(df['state'].value_counts(normalize=True))
# state
# NY    0.4
# CA    0.4
# TX    0.2
# Name: proportion, dtype: float64

print(df['state'].value_counts(dropna=False, normalize=True))
# state
# NY     0.333333
# CA     0.333333
# NaN    0.166667
# TX     0.166667
# Name: proportion, dtype: float64

pandas.Series.nunique(), pandas.DataFrame.nunique()

nunique() on Series returns the number of unique values as an integer (int).

By default, missing values (NaN) are excluded; however, setting the dropna argument to False includes them in the count.

print(df['state'].nunique())
# 3

print(type(df['state'].nunique()))
# <class 'int'>

print(df['state'].nunique(dropna=False))
# 4

nunique() on DataFrame returns the number of unique values for each column as a Series.

Similar to Series, the nunique() method on DataFrame also has the dropna argument. Additionally, while the default counting is column-wise, changing the axis argument to 1 or 'columns' switches the count to row-wise.

print(df.nunique())
# name     5
# age      4
# state    3
# point    4
# dtype: int64

print(type(df.nunique()))
# <class 'pandas.core.series.Series'>

print(df.nunique(dropna=False))
# name     6
# age      5
# state    4
# point    5
# dtype: int64

print(df.nunique(dropna=False, axis='columns'))
# 0    4
# 1    1
# 2    4
# 3    4
# 4    4
# 5    4
# dtype: int64

Get the number of unique values

The number of unique values can be counted using nunique() on Series and DataFrame.

print(df['state'].nunique())
# 3

print(df.nunique())
# name     5
# age      4
# state    3
# point    4
# dtype: int64

Get the list of unique values

unique() returns unique values as a NumPy array (ndarray). ndarray can be converted to a Python built-in list (list) using the tolist() method.

print(df['state'].unique().tolist())
# ['NY', nan, 'CA', 'TX']

print(type(df['state'].unique().tolist()))
# <class 'list'>

You can call tolist() on the index attribute of the Series returned by value_counts(), or use the values attribute to obtain the data as a NumPy array (ndarray).

print(df['state'].value_counts().index.tolist())
# ['NY', 'CA', 'TX']

print(type(df['state'].value_counts().index.tolist()))
# <class 'list'>

print(df['state'].value_counts().index.values)
# ['NY' 'CA' 'TX']

print(type(df['state'].value_counts().index.values))
# <class 'numpy.ndarray'>

unique() always counts NaN as a unique value, but value_counts() allows you to specify whether to count NaN with the dropna argument.

print(df['state'].value_counts(dropna=False).index.tolist())
# ['NY', 'CA', nan, 'TX']

Get the counts of each unique value

To get the counts (frequency, number of occurrences) of each unique value, access the values of the Series returned by value_counts().

vc = df['state'].value_counts()
print(vc)
# state
# NY    2
# CA    2
# TX    1
# Name: count, dtype: int64

print(vc['NY'])
# 2

print(vc['TX'])
# 1

To iterate over the unique values and their counts in a for loop, use the items() method.

for index, value in df['state'].value_counts().items():
    print(index, value)
# NY 2
# CA 2
# TX 1

This method was previously named iteritems(), but it has been renamed to items(); iteritems() was removed in pandas 2.0.

Get the dictionary of unique values and their counts

You can call the to_dict() method on the Series returned by value_counts() to convert it into a dictionary (dict).

d = df['state'].value_counts().to_dict()
print(d)
# {'NY': 2, 'CA': 2, 'TX': 1}

print(type(d))
# <class 'dict'>

print(d['NY'])
# 2

print(d['TX'])
# 1

To iterate over the unique values and their counts in a for loop, use the dictionary's items() method.

for key, value in d.items():
    print(key, value)
# NY 2
# CA 2
# TX 1

Get the mode (most frequent value) and its frequency

value_counts()

By default, value_counts() returns a Series sorted in order of frequency, so the first element represents the mode (most frequent value) and its frequency.

print(df['state'].value_counts())
# state
# NY    2
# CA    2
# TX    1
# Name: count, dtype: int64

print(df['state'].value_counts().index[0])
# NY

print(df['state'].value_counts().iat[0])
# 2

The values of the original Series become the index of the resulting Series. If this index is numeric, accessing it with [number] is interpreted as a label-based lookup and can raise a KeyError; use iat[number] to access elements by position instead.

# print(df['age'].value_counts()[0])
# KeyError: 0

print(df['age'].value_counts().iat[0])
# 2

You can apply it to each column of a DataFrame using the apply() method.

print(df.apply(lambda x: x.value_counts().index[0]))
# name     Alice
# age       24.0
# state       NY
# point     70.0
# dtype: object

print(df.apply(lambda x: x.value_counts().iat[0]))
# name     1
# age      2
# state    2
# point    2
# dtype: int64

As mentioned above, by default, missing values (NaN) are excluded. If the dropna argument is set to False, they are also counted.
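
For example, passing dropna=False before taking the first element works the same way; with the sample data, the result is unchanged because NaN appears only once.

print(df['state'].value_counts(dropna=False).index[0])
# NY

print(df['state'].value_counts(dropna=False).iat[0])
# 2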

Be aware that if there are multiple modes, this method returns only one of them.

mode()

The mode() method on Series returns the modes as a Series, which can be converted to a Python built-in list (list) with tolist(). Even if there is only one mode, the result is still a list.

print(df['state'].mode())
# 0    CA
# 1    NY
# Name: state, dtype: object

print(df['state'].mode().tolist())
# ['CA', 'NY']

print(df['age'].mode().tolist())
# [24.0]

Applying it with apply() to each column results in a Series with lists of modes as values.

s_list = df.apply(lambda x: x.mode().tolist())
print(s_list)
# name     [Alice, Charlie, Dave, Ellen, Frank]
# age                                    [24.0]
# state                                [CA, NY]
# point                                  [70.0]
# dtype: object

print(type(s_list))
# <class 'pandas.core.series.Series'>

print(s_list['name'])
# ['Alice', 'Charlie', 'Dave', 'Ellen', 'Frank']

print(type(s_list['name']))
# <class 'list'>

mode() is also available as a method of DataFrame. It returns a DataFrame. If the number of modes differs for each column, the empty parts are filled with missing values (NaN).

print(df.mode())
#       name   age state  point
# 0    Alice  24.0    CA   70.0
# 1  Charlie   NaN    NY    NaN
# 2     Dave   NaN   NaN    NaN
# 3    Ellen   NaN   NaN    NaN
# 4    Frank   NaN   NaN    NaN

For more details about mode(), refer to the following article. By default, missing values (NaN) are excluded; if the dropna argument is set to False, they are also counted.
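
As a quick illustration of dropna=False in mode() (using a small ad-hoc Series rather than the sample DataFrame, since NaN is never the most frequent value there):

s = pd.Series([1.0, 1.0, float('nan'), float('nan'), float('nan')])

print(s.mode())
# 0    1.0
# dtype: float64

print(s.mode(dropna=False))
# 0   NaN
# dtype: float64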

describe()

The describe() method can calculate the number of unique values, the mode, and its frequency for each column together. top represents the mode, and freq represents its frequency. Each item can be obtained with loc[].

print(df.astype('object').describe())
#          name   age state  point
# count       5   5.0     5    5.0
# unique      5   4.0     3    4.0
# top     Alice  24.0    NY   70.0
# freq        1   2.0     2    2.0

print(df.astype('object').describe().loc['top'])
# name     Alice
# age       24.0
# state       NY
# point     70.0
# Name: top, dtype: object

In describe(), the listed items depend on the data type (dtype) of the column, so astype() is used for type conversion.
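
For comparison, here is a rough sketch of what happens without the cast (output abbreviated to a descriptive comment):

print(df.describe())
# Without astype('object'), only the numeric columns (age and point) are
# summarized, and the rows are count, mean, std, min, 25%, 50%, 75%, max
# instead of count, unique, top, freq.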

describe() excludes missing values (NaN), and unlike other methods, it does not have a dropna argument. Note that even if there are several modes, this method returns only one.

Get the normalized frequencies

When the normalize argument of value_counts() is set to True, the returned values are normalized so that their total is 1. Be aware that the values may differ depending on the dropna setting if missing values (NaN) are included.

print(df['state'].value_counts(normalize=True))
# state
# NY    0.4
# CA    0.4
# TX    0.2
# Name: proportion, dtype: float64

print(df['state'].value_counts(dropna=False, normalize=True))
# state
# NY     0.333333
# CA     0.333333
# NaN    0.166667
# TX     0.166667
# Name: proportion, dtype: float64
