pandas: Extract rows that contain specific strings from a DataFrame

Posted: 2023-07-30 | Tags: Python, pandas

This article explains how to extract rows that contain specific strings from a pandas.DataFrame, accounting for exact, partial, forward, and backward matches.

Contents

How to extract rows that meet the condition (Boolean indexing)
Exact match with specific string: ==, isin()
Partial match (Contains specific string): str.contains()
Forward match (Starts with a specific string): str.startswith()
Backward match (Ends with a specific string): str.endswith()
String starts with a match of a regular expression: str.match()

This article uses boolean indexing, but it's also possible to use the query() method.

pandas: Query DataFrame and extract rows with query()

Use the filter() method to extract rows/columns where the row/column names contain specific strings. See the following article for details.

pandas: Filter rows/columns by labels with filter()

The sample code in this article uses pandas version 2.0.3. The following pandas.DataFrame is used as an example.

sample_pandas_normal.csv

import pandas as pd

print(pd.__version__)
# 2.0.3

df = pd.read_csv('data/src/sample_pandas_normal.csv').head(3)
print(df)
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70

source: pandas_str_select.py

How to extract rows that meet the condition (Boolean indexing)

When a list or pandas.Series of boolean elements (True or False) is specified with [] on a pandas.DataFrame, rows corresponding to True are extracted. This process is known as boolean indexing.

mask = [True, False, True]
print(df[mask])
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70

source: pandas_str_select.py

Therefore, if you can generate a boolean Series that matches the desired conditions, you can use it to filter the rows accordingly.

To filter rows based on multiple conditions, you can use operators such as & (AND), | (OR), and ~ (NOT). For more detailed information, refer to the following article.

pandas: Select rows by multiple conditions

Exact match with specific string: `==`, `isin()`

By using ==, you can generate a Series where elements that exactly match a given string are True.

print(df['state'] == 'CA')
# 0    False
# 1     True
# 2     True
# Name: state, dtype: bool

print(df[df['state'] == 'CA'])
#       name  age state  point
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70

source: pandas_str_select.py

The isin() method of Series returns True for elements that exactly match any element in the specified list.

pandas.Series.isin — pandas 2.0.3 documentation

print(df['state'].isin(['NY', 'CA']))
# 0    True
# 1    True
# 2    True
# Name: state, dtype: bool

print(df[df['state'].isin(['NY', 'CA'])])
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70

source: pandas_str_select.py

Partial match (Contains specific string): `str.contains()`

By using str.contains(), you can generate a Series where elements that contain a given substring are True.

pandas.Series.str.contains — pandas 2.0.3 documentation

print(df['name'].str.contains('li'))
# 0     True
# 1    False
# 2     True
# Name: name, dtype: bool

print(df[df['name'].str.contains('li')])
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70

source: pandas_str_select.py

Note that the first argument string is treated as a regular expression pattern by default.

Handle missing values `NaN`: the `na` argument

If an element is a missing value NaN, str.contains() returns NaN by default. Because of this, trying to extract rows using this Series will result in an error.

df_nan = df.copy()
df_nan.iloc[2, 0] = float('nan')
print(df_nan)
#     name  age state  point
# 0  Alice   24    NY     64
# 1    Bob   42    CA     92
# 2    NaN   18    CA     70

print(df_nan['name'].str.contains('li'))
# 0     True
# 1    False
# 2      NaN
# Name: name, dtype: object

# print(df_nan[df_nan['name'].str.contains('li')])
# ValueError: Cannot mask with non-boolean array containing NA / NaN values

source: pandas_str_select.py

You can specify a replacement value for NaN using the na argument.

print(df_nan['name'].str.contains('li', na=False))
# 0     True
# 1    False
# 2    False
# Name: name, dtype: bool

print(df_nan['name'].str.contains('li', na=True))
# 0     True
# 1    False
# 2     True
# Name: name, dtype: bool

source: pandas_str_select.py

When using the result as a condition, setting na=True will include rows with NaN, while setting na=False will exclude them.

Case sensitivity: the `case` argument

By default, str.contains() distinguishes between uppercase and lowercase characters. To ignore case distinction, set the case argument to False.

print(df['name'].str.contains('LI'))
# 0    False
# 1    False
# 2    False
# Name: name, dtype: bool

print(df['name'].str.contains('LI', case=False))
# 0     True
# 1    False
# 2     True
# Name: name, dtype: bool

source: pandas_str_select.py

Use regular expression patterns: the `regex` and `flags` arguments

As previously mentioned, str.contains() treats the first argument as a regular expression pattern by default.

print(df['name'].str.contains('i.*e'))
# 0     True
# 1    False
# 2     True
# Name: name, dtype: bool

source: pandas_str_select.py

You can set the regex argument to False to treat the first argument as a literal string instead of as a regular expression pattern.

print(df['name'].str.contains('i.*e', regex=False))
# 0    False
# 1    False
# 2    False
# Name: name, dtype: bool

source: pandas_str_select.py

For example, when you want to check whether a string contains characters such as ?, ., *, etc., which are interpreted as special characters in regular expressions, you need to set regex=False. Alternatively, you can use a regular expression pattern that escapes special characters, e.g., \?.

df_q = df.copy()
df_q.iloc[2, 0] += '?'
print(df_q)
#        name  age state  point
# 0     Alice   24    NY     64
# 1       Bob   42    CA     92
# 2  Charlie?   18    CA     70

# print(df_q['name'].str.contains('?'))
# error: nothing to repeat at position 0

print(df_q['name'].str.contains('?', regex=False))
# 0    False
# 1    False
# 2     True
# Name: name, dtype: bool

print(df_q['name'].str.contains(r'\?'))
# 0    False
# 1    False
# 2     True
# Name: name, dtype: bool

source: pandas_str_select.py

You can also specify regular expression flags, like re.IGNORECASE, using the flags argument. Also, str.contains() corresponds to re.search(), while str.match() corresponds to re.match() as described later.

Regular expressions with the re module in Python

Forward match (Starts with a specific string): `str.startswith()`

By using str.startswith(), you can generate a Series where elements that start with a given string are True.

pandas.Series.str.startswith — pandas 2.0.3 documentation

print(df['name'].str.startswith('B'))
# 0    False
# 1     True
# 2    False
# Name: name, dtype: bool

print(df[df['name'].str.startswith('B')])
#   name  age state  point
# 1  Bob   42    CA     92

source: pandas_str_select.py

Like str.contains(), str.startswith() also has the na argument. If you want to select rows with missing values NaN, set na=True. If you don't want to select them, set na=False.

Unlike str.contains(), there is no case argument in str.startswith(); therefore, uppercase and lowercase characters are always distinguished. Also, the first argument string is not interpreted as a regular expression pattern.

Backward match (Ends with a specific string): `str.endswith()`

By using str.endswith(), you can generate a Series where elements that end with a given string are True.

pandas.Series.str.endswith — pandas 2.0.3 documentation

print(df['name'].str.endswith('e'))
# 0     True
# 1    False
# 2     True
# Name: name, dtype: bool

print(df[df['name'].str.endswith('e')])
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70

source: pandas_str_select.py

str.endswith() also has the na argument. If you want to select rows with missing values NaN, set na=True. If you don't want to select them, set na=False.

There is no case argument, and uppercase and lowercase characters are always distinguished. Also, the first argument string is not interpreted as a regular expression pattern.

String starts with a match of a regular expression: `str.match()`

By using str.match(), you can generate a Series where elements that start with a match of a regular expression pattern are True.

pandas.Series.str.match — pandas 2.0.3 documentation

print(df['name'].str.match('.*i'))
# 0     True
# 1    False
# 2     True
# Name: name, dtype: bool

print(df[df['name'].str.match('.*i')])
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70

source: pandas_str_select.py

As mentioned above, str.match() corresponds to re.match(), checking whether the beginning of the string matches the pattern. If you want to check if any part of the string matches the pattern, use str.contains(), which corresponds to re.search() by default (regex=True).

print(df['name'].str.match('i.*e'))
# 0    False
# 1    False
# 2    False
# Name: name, dtype: bool

print(df['name'].str.contains('i.*e'))
# 0     True
# 1    False
# 2     True
# Name: name, dtype: bool

source: pandas_str_select.py

For more details on re.match() and re.search(), refer to the following article.

Regular expressions with the re module in Python

Similar to str.contains(), str.match() also has na, case, and flags arguments.

pandas: Extract rows that contain specific strings from a DataFrame

How to extract rows that meet the condition (Boolean indexing)

Exact match with specific string: `==`, `isin()`

Partial match (Contains specific string): `str.contains()`

Handle missing values `NaN`: the `na` argument

Case sensitivity: the `case` argument

Use regular expression patterns: the `regex` and `flags` arguments

Forward match (Starts with a specific string): `str.startswith()`

Backward match (Ends with a specific string): `str.endswith()`

String starts with a match of a regular expression: `str.match()`

Related Categories

Related Articles

pandas: Extract rows that contain specific strings from a DataFrame

How to extract rows that meet the condition (Boolean indexing)

Exact match with specific string: ==, isin()

Partial match (Contains specific string): str.contains()

Handle missing values NaN: the na argument

Case sensitivity: the case argument

Use regular expression patterns: the regex and flags arguments

Forward match (Starts with a specific string): str.startswith()

Backward match (Ends with a specific string): str.endswith()

String starts with a match of a regular expression: str.match()

Related Categories

Related Articles

Exact match with specific string: `==`, `isin()`

Partial match (Contains specific string): `str.contains()`

Handle missing values `NaN`: the `na` argument

Case sensitivity: the `case` argument

Use regular expression patterns: the `regex` and `flags` arguments

Forward match (Starts with a specific string): `str.startswith()`

Backward match (Ends with a specific string): `str.endswith()`

String starts with a match of a regular expression: `str.match()`