pandas: Extract rows that contain specific strings from a DataFrame

Posted: | Tags: Python, pandas

This article explains how to extract rows that contain specific strings from a pandas.DataFrame, accounting for exact, partial, forward, and backward matches.

This article uses boolean indexing, but it's also possible to use the query() method.

Use the filter() method to extract rows/columns where the row/column names contain specific strings. See the following article for details.

The sample code in this article uses pandas version 2.0.3. The following pandas.DataFrame is used as an example.

import pandas as pd

print(pd.__version__)
# 2.0.3

df = pd.read_csv('data/src/sample_pandas_normal.csv').head(3)
print(df)
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70

How to extract rows that meet the condition (Boolean indexing)

When a list or pandas.Series of boolean elements (True or False) is specified with [] on a pandas.DataFrame, rows corresponding to True are extracted. This process is known as boolean indexing.

mask = [True, False, True]
print(df[mask])
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70

Therefore, if you can generate a boolean Series that matches the desired conditions, you can use it to filter the rows accordingly.

To filter rows based on multiple conditions, you can use operators such as & (AND), | (OR), and ~ (NOT). For more detailed information, refer to the following article.

Exact match with specific string: ==, isin()

By using ==, you can generate a Series where elements that exactly match a given string are True.

print(df['state'] == 'CA')
# 0    False
# 1     True
# 2     True
# Name: state, dtype: bool

print(df[df['state'] == 'CA'])
#       name  age state  point
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70

The isin() method of Series returns True for elements that exactly match any element in the specified list.

print(df['state'].isin(['NY', 'CA']))
# 0    True
# 1    True
# 2    True
# Name: state, dtype: bool

print(df[df['state'].isin(['NY', 'CA'])])
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70

Partial match (Contains specific string): str.contains()

By using str.contains(), you can generate a Series where elements that contain a given substring are True.

print(df['name'].str.contains('li'))
# 0     True
# 1    False
# 2     True
# Name: name, dtype: bool

print(df[df['name'].str.contains('li')])
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70

Note that the first argument string is treated as a regular expression pattern by default.

Handle missing values NaN: the na argument

If an element is a missing value NaN, str.contains() returns NaN by default. Because of this, trying to extract rows using this Series will result in an error.

df_nan = df.copy()
df_nan.iloc[2, 0] = float('nan')
print(df_nan)
#     name  age state  point
# 0  Alice   24    NY     64
# 1    Bob   42    CA     92
# 2    NaN   18    CA     70

print(df_nan['name'].str.contains('li'))
# 0     True
# 1    False
# 2      NaN
# Name: name, dtype: object

# print(df_nan[df_nan['name'].str.contains('li')])
# ValueError: Cannot mask with non-boolean array containing NA / NaN values

You can specify a replacement value for NaN using the na argument.

print(df_nan['name'].str.contains('li', na=False))
# 0     True
# 1    False
# 2    False
# Name: name, dtype: bool

print(df_nan['name'].str.contains('li', na=True))
# 0     True
# 1    False
# 2     True
# Name: name, dtype: bool

When using the result as a condition, setting na=True will include rows with NaN, while setting na=False will exclude them.

Case sensitivity: the case argument

By default, str.contains() distinguishes between uppercase and lowercase characters. To ignore case distinction, set the case argument to False.

print(df['name'].str.contains('LI'))
# 0    False
# 1    False
# 2    False
# Name: name, dtype: bool

print(df['name'].str.contains('LI', case=False))
# 0     True
# 1    False
# 2     True
# Name: name, dtype: bool

Use regular expression patterns: the regex and flags arguments

As previously mentioned, str.contains() treats the first argument as a regular expression pattern by default.

print(df['name'].str.contains('i.*e'))
# 0     True
# 1    False
# 2     True
# Name: name, dtype: bool

You can set the regex argument to False to treat the first argument as a literal string instead of as a regular expression pattern.

print(df['name'].str.contains('i.*e', regex=False))
# 0    False
# 1    False
# 2    False
# Name: name, dtype: bool

For example, when you want to check whether a string contains characters such as ?, ., *, etc., which are interpreted as special characters in regular expressions, you need to set regex=False. Alternatively, you can use a regular expression pattern that escapes special characters, e.g., \?.

df_q = df.copy()
df_q.iloc[2, 0] += '?'
print(df_q)
#        name  age state  point
# 0     Alice   24    NY     64
# 1       Bob   42    CA     92
# 2  Charlie?   18    CA     70

# print(df_q['name'].str.contains('?'))
# error: nothing to repeat at position 0

print(df_q['name'].str.contains('?', regex=False))
# 0    False
# 1    False
# 2     True
# Name: name, dtype: bool

print(df_q['name'].str.contains(r'\?'))
# 0    False
# 1    False
# 2     True
# Name: name, dtype: bool

You can also specify regular expression flags, like re.IGNORECASE, using the flags argument. Also, str.contains() corresponds to re.search(), while str.match() corresponds to re.match() as described later.

Forward match (Starts with a specific string): str.startswith()

By using str.startswith(), you can generate a Series where elements that start with a given string are True.

print(df['name'].str.startswith('B'))
# 0    False
# 1     True
# 2    False
# Name: name, dtype: bool

print(df[df['name'].str.startswith('B')])
#   name  age state  point
# 1  Bob   42    CA     92

Like str.contains(), str.startswith() also has the na argument. If you want to select rows with missing values NaN, set na=True. If you don't want to select them, set na=False.

Unlike str.contains(), there is no case argument in str.startswith(); therefore, uppercase and lowercase characters are always distinguished. Also, the first argument string is not interpreted as a regular expression pattern.

Backward match (Ends with a specific string): str.endswith()

By using str.endswith(), you can generate a Series where elements that end with a given string are True.

print(df['name'].str.endswith('e'))
# 0     True
# 1    False
# 2     True
# Name: name, dtype: bool

print(df[df['name'].str.endswith('e')])
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70

str.endswith() also has the na argument. If you want to select rows with missing values NaN, set na=True. If you don't want to select them, set na=False.

There is no case argument, and uppercase and lowercase characters are always distinguished. Also, the first argument string is not interpreted as a regular expression pattern.

String starts with a match of a regular expression: str.match()

By using str.match(), you can generate a Series where elements that start with a match of a regular expression pattern are True.

print(df['name'].str.match('.*i'))
# 0     True
# 1    False
# 2     True
# Name: name, dtype: bool

print(df[df['name'].str.match('.*i')])
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70

As mentioned above, str.match() corresponds to re.match(), checking whether the beginning of the string matches the pattern. If you want to check if any part of the string matches the pattern, use str.contains(), which corresponds to re.search() by default (regex=True).

print(df['name'].str.match('i.*e'))
# 0    False
# 1    False
# 2    False
# Name: name, dtype: bool

print(df['name'].str.contains('i.*e'))
# 0     True
# 1    False
# 2     True
# Name: name, dtype: bool

For more details on re.match() and re.search(), refer to the following article.

Similar to str.contains(), str.match() also has na, case, and flags arguments.

Related Categories

Related Articles