pandas: Extract rows that contain specific strings from a DataFrame
This article explains how to extract rows that contain specific strings from a pandas.DataFrame, accounting for exact, partial, forward, and backward matches.
- How to extract rows that meet the condition (Boolean indexing)
- Exact match with specific string:
==,isin() - Partial match (Contains specific string):
str.contains() - Forward match (Starts with a specific string):
str.startswith() - Backward match (Ends with a specific string):
str.endswith() - String starts with a match of a regular expression:
str.match()
This article uses boolean indexing, but it's also possible to use the query() method.
Use the filter() method to extract rows/columns where the row/column names contain specific strings. See the following article for details.
The sample code in this article uses pandas version 2.0.3. The following pandas.DataFrame is used as an example.
import pandas as pd
print(pd.__version__)
# 2.0.3
df = pd.read_csv('data/src/sample_pandas_normal.csv').head(3)
print(df)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
How to extract rows that meet the condition (Boolean indexing)
When a list or pandas.Series of boolean elements (True or False) is specified with [] on a pandas.DataFrame, rows corresponding to True are extracted. This process is known as boolean indexing.
mask = [True, False, True]
print(df[mask])
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
Therefore, if you can generate a boolean Series that matches the desired conditions, you can use it to filter the rows accordingly.
To filter rows based on multiple conditions, you can use operators such as & (AND), | (OR), and ~ (NOT). For more detailed information, refer to the following article.
Exact match with specific string: ==, isin()
By using ==, you can generate a Series where elements that exactly match a given string are True.
print(df['state'] == 'CA')
# 0 False
# 1 True
# 2 True
# Name: state, dtype: bool
print(df[df['state'] == 'CA'])
# name age state point
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
The isin() method of Series returns True for elements that exactly match any element in the specified list.
print(df['state'].isin(['NY', 'CA']))
# 0 True
# 1 True
# 2 True
# Name: state, dtype: bool
print(df[df['state'].isin(['NY', 'CA'])])
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
Partial match (Contains specific string): str.contains()
By using str.contains(), you can generate a Series where elements that contain a given substring are True.
print(df['name'].str.contains('li'))
# 0 True
# 1 False
# 2 True
# Name: name, dtype: bool
print(df[df['name'].str.contains('li')])
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
Note that the first argument string is treated as a regular expression pattern by default.
Handle missing values NaN: the na argument
If an element is a missing value NaN, str.contains() returns NaN by default. Because of this, trying to extract rows using this Series will result in an error.
df_nan = df.copy()
df_nan.iloc[2, 0] = float('nan')
print(df_nan)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 NaN 18 CA 70
print(df_nan['name'].str.contains('li'))
# 0 True
# 1 False
# 2 NaN
# Name: name, dtype: object
# print(df_nan[df_nan['name'].str.contains('li')])
# ValueError: Cannot mask with non-boolean array containing NA / NaN values
You can specify a replacement value for NaN using the na argument.
print(df_nan['name'].str.contains('li', na=False))
# 0 True
# 1 False
# 2 False
# Name: name, dtype: bool
print(df_nan['name'].str.contains('li', na=True))
# 0 True
# 1 False
# 2 True
# Name: name, dtype: bool
When using the result as a condition, setting na=True will include rows with NaN, while setting na=False will exclude them.
Case sensitivity: the case argument
By default, str.contains() distinguishes between uppercase and lowercase characters. To ignore case distinction, set the case argument to False.
print(df['name'].str.contains('LI'))
# 0 False
# 1 False
# 2 False
# Name: name, dtype: bool
print(df['name'].str.contains('LI', case=False))
# 0 True
# 1 False
# 2 True
# Name: name, dtype: bool
Use regular expression patterns: the regex and flags arguments
As previously mentioned, str.contains() treats the first argument as a regular expression pattern by default.
print(df['name'].str.contains('i.*e'))
# 0 True
# 1 False
# 2 True
# Name: name, dtype: bool
You can set the regex argument to False to treat the first argument as a literal string instead of as a regular expression pattern.
print(df['name'].str.contains('i.*e', regex=False))
# 0 False
# 1 False
# 2 False
# Name: name, dtype: bool
For example, when you want to check whether a string contains characters such as ?, ., *, etc., which are interpreted as special characters in regular expressions, you need to set regex=False. Alternatively, you can use a regular expression pattern that escapes special characters, e.g., \?.
df_q = df.copy()
df_q.iloc[2, 0] += '?'
print(df_q)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie? 18 CA 70
# print(df_q['name'].str.contains('?'))
# error: nothing to repeat at position 0
print(df_q['name'].str.contains('?', regex=False))
# 0 False
# 1 False
# 2 True
# Name: name, dtype: bool
print(df_q['name'].str.contains(r'\?'))
# 0 False
# 1 False
# 2 True
# Name: name, dtype: bool
You can also specify regular expression flags, like re.IGNORECASE, using the flags argument. Also, str.contains() corresponds to re.search(), while str.match() corresponds to re.match() as described later.
Forward match (Starts with a specific string): str.startswith()
By using str.startswith(), you can generate a Series where elements that start with a given string are True.
print(df['name'].str.startswith('B'))
# 0 False
# 1 True
# 2 False
# Name: name, dtype: bool
print(df[df['name'].str.startswith('B')])
# name age state point
# 1 Bob 42 CA 92
Like str.contains(), str.startswith() also has the na argument. If you want to select rows with missing values NaN, set na=True. If you don't want to select them, set na=False.
Unlike str.contains(), there is no case argument in str.startswith(); therefore, uppercase and lowercase characters are always distinguished. Also, the first argument string is not interpreted as a regular expression pattern.
Backward match (Ends with a specific string): str.endswith()
By using str.endswith(), you can generate a Series where elements that end with a given string are True.
print(df['name'].str.endswith('e'))
# 0 True
# 1 False
# 2 True
# Name: name, dtype: bool
print(df[df['name'].str.endswith('e')])
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
str.endswith() also has the na argument. If you want to select rows with missing values NaN, set na=True. If you don't want to select them, set na=False.
There is no case argument, and uppercase and lowercase characters are always distinguished. Also, the first argument string is not interpreted as a regular expression pattern.
String starts with a match of a regular expression: str.match()
By using str.match(), you can generate a Series where elements that start with a match of a regular expression pattern are True.
print(df['name'].str.match('.*i'))
# 0 True
# 1 False
# 2 True
# Name: name, dtype: bool
print(df[df['name'].str.match('.*i')])
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
As mentioned above, str.match() corresponds to re.match(), checking whether the beginning of the string matches the pattern. If you want to check if any part of the string matches the pattern, use str.contains(), which corresponds to re.search() by default (regex=True).
print(df['name'].str.match('i.*e'))
# 0 False
# 1 False
# 2 False
# Name: name, dtype: bool
print(df['name'].str.contains('i.*e'))
# 0 True
# 1 False
# 2 True
# Name: name, dtype: bool
For more details on re.match() and re.search(), refer to the following article.
Similar to str.contains(), str.match() also has na, case, and flags arguments.