pandas: Extract rows that contain specific strings from a DataFrame
This article explains how to extract rows that contain specific strings from a pandas.DataFrame
, accounting for exact, partial, forward, and backward matches.
- How to extract rows that meet the condition (Boolean indexing)
- Exact match with specific string:
==
,isin()
- Partial match (Contains specific string):
str.contains()
- Forward match (Starts with a specific string):
str.startswith()
- Backward match (Ends with a specific string):
str.endswith()
- String starts with a match of a regular expression:
str.match()
This article uses boolean indexing, but it's also possible to use the query()
method.
Use the filter()
method to extract rows/columns where the row/column names contain specific strings. See the following article for details.
The sample code in this article uses pandas version 2.0.3
. The following pandas.DataFrame
is used as an example.
import pandas as pd
print(pd.__version__)
# 2.0.3
df = pd.read_csv('data/src/sample_pandas_normal.csv').head(3)
print(df)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
How to extract rows that meet the condition (Boolean indexing)
When a list or pandas.Series
of boolean elements (True
or False
) is specified with []
on a pandas.DataFrame
, rows corresponding to True
are extracted. This process is known as boolean indexing.
mask = [True, False, True]
print(df[mask])
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
Therefore, if you can generate a boolean Series
that matches the desired conditions, you can use it to filter the rows accordingly.
To filter rows based on multiple conditions, you can use operators such as &
(AND), |
(OR), and ~
(NOT). For more detailed information, refer to the following article.
Exact match with specific string: ==
, isin()
By using ==
, you can generate a Series
where elements that exactly match a given string are True
.
print(df['state'] == 'CA')
# 0 False
# 1 True
# 2 True
# Name: state, dtype: bool
print(df[df['state'] == 'CA'])
# name age state point
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
The isin()
method of Series
returns True
for elements that exactly match any element in the specified list.
print(df['state'].isin(['NY', 'CA']))
# 0 True
# 1 True
# 2 True
# Name: state, dtype: bool
print(df[df['state'].isin(['NY', 'CA'])])
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
Partial match (Contains specific string): str.contains()
By using str.contains()
, you can generate a Series
where elements that contain a given substring are True
.
print(df['name'].str.contains('li'))
# 0 True
# 1 False
# 2 True
# Name: name, dtype: bool
print(df[df['name'].str.contains('li')])
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
Note that the first argument string is treated as a regular expression pattern by default.
Handle missing values NaN
: the na
argument
If an element is a missing value NaN
, str.contains()
returns NaN
by default. Because of this, trying to extract rows using this Series
will result in an error.
df_nan = df.copy()
df_nan.iloc[2, 0] = float('nan')
print(df_nan)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 NaN 18 CA 70
print(df_nan['name'].str.contains('li'))
# 0 True
# 1 False
# 2 NaN
# Name: name, dtype: object
# print(df_nan[df_nan['name'].str.contains('li')])
# ValueError: Cannot mask with non-boolean array containing NA / NaN values
You can specify a replacement value for NaN
using the na
argument.
print(df_nan['name'].str.contains('li', na=False))
# 0 True
# 1 False
# 2 False
# Name: name, dtype: bool
print(df_nan['name'].str.contains('li', na=True))
# 0 True
# 1 False
# 2 True
# Name: name, dtype: bool
When using the result as a condition, setting na=True
will include rows with NaN
, while setting na=False
will exclude them.
Case sensitivity: the case
argument
By default, str.contains()
distinguishes between uppercase and lowercase characters. To ignore case distinction, set the case
argument to False
.
print(df['name'].str.contains('LI'))
# 0 False
# 1 False
# 2 False
# Name: name, dtype: bool
print(df['name'].str.contains('LI', case=False))
# 0 True
# 1 False
# 2 True
# Name: name, dtype: bool
Use regular expression patterns: the regex
and flags
arguments
As previously mentioned, str.contains()
treats the first argument as a regular expression pattern by default.
print(df['name'].str.contains('i.*e'))
# 0 True
# 1 False
# 2 True
# Name: name, dtype: bool
You can set the regex
argument to False
to treat the first argument as a literal string instead of as a regular expression pattern.
print(df['name'].str.contains('i.*e', regex=False))
# 0 False
# 1 False
# 2 False
# Name: name, dtype: bool
For example, when you want to check whether a string contains characters such as ?
, .
, *
, etc., which are interpreted as special characters in regular expressions, you need to set regex=False
. Alternatively, you can use a regular expression pattern that escapes special characters, e.g., \?
.
df_q = df.copy()
df_q.iloc[2, 0] += '?'
print(df_q)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie? 18 CA 70
# print(df_q['name'].str.contains('?'))
# error: nothing to repeat at position 0
print(df_q['name'].str.contains('?', regex=False))
# 0 False
# 1 False
# 2 True
# Name: name, dtype: bool
print(df_q['name'].str.contains(r'\?'))
# 0 False
# 1 False
# 2 True
# Name: name, dtype: bool
You can also specify regular expression flags, like re.IGNORECASE
, using the flags
argument. Also, str.contains()
corresponds to re.search()
, while str.match()
corresponds to re.match()
as described later.
Forward match (Starts with a specific string): str.startswith()
By using str.startswith()
, you can generate a Series
where elements that start with a given string are True
.
print(df['name'].str.startswith('B'))
# 0 False
# 1 True
# 2 False
# Name: name, dtype: bool
print(df[df['name'].str.startswith('B')])
# name age state point
# 1 Bob 42 CA 92
Like str.contains()
, str.startswith()
also has the na
argument. If you want to select rows with missing values NaN
, set na=True
. If you don't want to select them, set na=False
.
Unlike str.contains()
, there is no case
argument in str.startswith()
; therefore, uppercase and lowercase characters are always distinguished. Also, the first argument string is not interpreted as a regular expression pattern.
Backward match (Ends with a specific string): str.endswith()
By using str.endswith()
, you can generate a Series
where elements that end with a given string are True
.
print(df['name'].str.endswith('e'))
# 0 True
# 1 False
# 2 True
# Name: name, dtype: bool
print(df[df['name'].str.endswith('e')])
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
str.endswith()
also has the na
argument. If you want to select rows with missing values NaN
, set na=True
. If you don't want to select them, set na=False
.
There is no case
argument, and uppercase and lowercase characters are always distinguished. Also, the first argument string is not interpreted as a regular expression pattern.
String starts with a match of a regular expression: str.match()
By using str.match()
, you can generate a Series
where elements that start with a match of a regular expression pattern are True
.
print(df['name'].str.match('.*i'))
# 0 True
# 1 False
# 2 True
# Name: name, dtype: bool
print(df[df['name'].str.match('.*i')])
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
As mentioned above, str.match()
corresponds to re.match()
, checking whether the beginning of the string matches the pattern. If you want to check if any part of the string matches the pattern, use str.contains()
, which corresponds to re.search()
by default (regex=True
).
print(df['name'].str.match('i.*e'))
# 0 False
# 1 False
# 2 False
# Name: name, dtype: bool
print(df['name'].str.contains('i.*e'))
# 0 True
# 1 False
# 2 True
# Name: name, dtype: bool
For more details on re.match()
and re.search()
, refer to the following article.
Similar to str.contains()
, str.match()
also has na
, case
, and flags
arguments.