pandas: Query DataFrame with query()
In pandas, the query()
method allows you to extract DataFrame
rows by specifying conditions through a query string, using comparison operators, string methods, logical combinations, and more.
- pandas.DataFrame.query — pandas 2.1.4 documentation
- Indexing and selecting data - The query() Method — pandas 2.1.4 documentation
- Comparison operators in
query()
- The
in
and==
operators inquery()
(equivalent toisin()
) - String methods in
query()
- Conditions for
index
inquery()
- Multiple conditions in
query()
- Column names with spaces or dots in
query()
- Update the original object with the
inplace
argument - Parsers and engines in
query()
For details on extracting rows and columns using Boolean indexing, see the following article.
It is also possible to extract columns by data types, or extract rows or columns by their names (labels).
- pandas: Select columns by dtype with select_dtypes()
- pandas: Filter rows/columns by labels with filter()
The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame
is used as an example.
import pandas as pd
print(pd.__version__)
# 2.1.4
df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
Comparison operators in query()
In pandas, rows can be extracted using comparison operators as follows.
print(df[df['age'] < 25])
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
The same conditions can be specified as a string with the query()
method.
print(df.query('age < 25'))
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
To use variables in a query string, prefix the variable name with @
.
val = 25
print(df.query('age < @val'))
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
It is also possible to generate a string embedded with variables using f-strings.
print(f'age < {val}')
# age < 25
print(df.query(f'age < {val}'))
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
You can specify a range using two comparison operators.
print(df.query('30 <= age < 50'))
# name age state point
# 1 Bob 42 CA 92
# 5 Frank 30 NY 57
Comparisons can be made between columns or using arithmetic operators for calculations.
print(df.query('age < point / 3'))
# name age state point
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
For equality or inequality, use ==
or !=
.
Note that strings within a query string must be enclosed in quotes. Double quotes "
can be used within single-quoted strings '...'
and vice versa. The same symbol can be used by escaping with a backslash \
.
print(df.query('state == "CA"'))
# name age state point
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
When using variables with @
, quotes are not needed, but f-strings are still strings, so quotes are necessary.
s = 'CA'
print(df.query('state != @s'))
# name age state point
# 0 Alice 24 NY 64
# 3 Dave 68 TX 70
# 5 Frank 30 NY 57
# print(df.query(f'state != {s}'))
# UndefinedVariableError: name 'CA' is not defined
print(df.query(f'state != "{s}"'))
# name age state point
# 0 Alice 24 NY 64
# 3 Dave 68 TX 70
# 5 Frank 30 NY 57
For partial matching conditions on strings, use the string methods mentioned below.
For more on the syntax available in a query string, refer to the official documentation.
The in
and ==
operators in query()
(equivalent to isin()
)
The isin()
method in a Series
determines whether each element is present in a specified list, returning a Boolean Series
as a result. This method can be used for extracting rows where column values exactly match those specified in the list.
print(df[df['state'].isin(['NY', 'TX'])])
# name age state point
# 0 Alice 24 NY 64
# 3 Dave 68 TX 70
# 5 Frank 30 NY 57
With the query()
method, the same result can be achieved using the in
operator.
print(df.query('state in ["NY", "TX"]'))
# name age state point
# 0 Alice 24 NY 64
# 3 Dave 68 TX 70
# 5 Frank 30 NY 57
An equivalent outcome can also be obtained using ==
with a list in the query string.
print(df.query('state == ["NY", "TX"]'))
# name age state point
# 0 Alice 24 NY 64
# 3 Dave 68 TX 70
# 5 Frank 30 NY 57
You can also use list variables with @
or f-strings.
l = ['NY', 'TX']
print(df.query('state in @l'))
# name age state point
# 0 Alice 24 NY 64
# 3 Dave 68 TX 70
# 5 Frank 30 NY 57
print(df.query(f'state in {l}'))
# name age state point
# 0 Alice 24 NY 64
# 3 Dave 68 TX 70
# 5 Frank 30 NY 57
String methods in query()
You can specify conditions for exact string matches using ==
or in
, but for partial matches, string methods (str.xxx()
) can be used.
Some of these methods include:
str.contains()
: Checks if a string contains a specific substringstr.endswith()
: Checks if a string ends with a specific substringstr.startswith()
: Checks if a string starts with a specific substringstr.match()
: Checks if a string matches a regular expression (regex) pattern
These string methods can be used within the query()
method. However, note that using these methods in query()
may not always simplify the code compared to Boolean indexing.
print(df[df['name'].str.endswith('e')])
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
print(df.query('name.str.endswith("e")'))
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
Non-string type columns can be converted to strings with astype()
to use string methods. This conversion can also be specified in query()
.
print(df.query('age.astype("str").str.endswith("8")'))
# name age state point
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
Handle missing values (NaN
or None
)
Using string methods as conditions on columns with missing values (NaN
or None
) results in an error.
df_nan = df.copy()
df_nan.at[0, 'name'] = float('NaN')
print(df_nan)
# name age state point
# 0 NaN 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
# print(df_nan.query('name.str.endswith("e")'))
# ValueError: unknown type object
Most string methods offer the na
argument to specify a replacement value for missing values. Setting na
to True
includes rows with missing values in the extraction, whereas setting it to False
excludes them.
print(df_nan[df_nan['name'].str.endswith('e', na=True)])
# name age state point
# 0 NaN 24 NY 64
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
print(df_nan[df_nan['name'].str.endswith('e', na=False)])
# name age state point
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
You can specify the na
argument for string methods within query()
.
print(df_nan.query('name.str.endswith("e", na=False)'))
# name age state point
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
Conditions for index
in query()
In the query string, conditions for the index
(row names) can be specified using the keyword index
.
print(df.query('index % 2 == 0'))
# name age state point
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
If the index
has a name, you can specify conditions using either that name or the keyword index
.
Sorry for the confusion, but in the following example, the index
is named name
.
df_name = df.set_index('name')
print(df_name)
# age state point
# name
# Alice 24 NY 64
# Bob 42 CA 92
# Charlie 18 CA 70
# Dave 68 TX 70
# Ellen 24 CA 88
# Frank 30 NY 57
print(df_name.index.name)
# name
print(df_name.query('name.str.endswith("e")'))
# age state point
# name
# Alice 24 NY 64
# Charlie 18 CA 70
# Dave 68 TX 70
print(df_name.query('index.str.endswith("e")'))
# age state point
# name
# Alice 24 NY 64
# Charlie 18 CA 70
# Dave 68 TX 70
To extract columns based on the columns
(column names), refer to the following article.
Multiple conditions in query()
You can specify multiple conditions using Boolean indexing as follows.
print(df[(df['age'] < 25) & (df['point'] > 65)])
# name age state point
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
Using the query()
method, you can express conditions as follows. Parentheses for each condition are unnecessary, and the logical AND can be represented as either &
or and
.
print(df.query('age < 25 & point > 65'))
# name age state point
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
print(df.query('age < 25 and point > 65'))
# name age state point
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
OR can be represented as either |
or or
.
print(df.query('age < 20 | point > 80'))
# name age state point
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
print(df.query('age < 20 or point > 80'))
# name age state point
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
NOT is represented as not
.
print(df.query('not age < 25 and not point > 65'))
# name age state point
# 5 Frank 30 NY 57
In the case of three or more conditions, it is safer to explicitly enclose the part you want to process first in parentheses, because the result may vary depending on the order. For example, &
has higher precedence than |
.
print(df.query('age == 24 | point > 80 & state == "CA"'))
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 4 Ellen 24 CA 88
print(df.query('(age == 24 | point > 80) & state == "CA"'))
# name age state point
# 1 Bob 42 CA 92
# 4 Ellen 24 CA 88
Column names with spaces or dots in query()
When using the query()
method, pay attention to column names.
For example, change the column names as follows.
df_rename = df.set_axis(['0name', 'age.year', 'state name', 3], axis=1)
print(df_rename)
# 0name age.year state name 3
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
An error will occur if the column names are not valid Python variable names, such as those starting with numbers, or containing .
or spaces.
# print(df_rename.query('0name.str.endswith("e")'))
# SyntaxError: invalid syntax
# print(df_rename.query('age.year < 25'))
# UndefinedVariableError: name 'age' is not defined
# print(df_rename.query('state name == "CA"'))
# SyntaxError: invalid syntax
They need to be enclosed in backticks "`".
print(df_rename.query('`0name`.str.endswith("e")'))
# 0name age.year state name 3
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
print(df_rename.query('`age.year` < 25'))
# 0name age.year state name 3
# 0 Alice 24 NY 64
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
print(df_rename.query('`state name` == "CA"'))
# 0name age.year state name 3
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 4 Ellen 24 CA 88
However, certain special cases, such as numeric column names, might still pose issues and require alternative approaches, like using Boolean indexing.
# print(df_rename.query('3 > 75'))
# KeyError: False
# print(df_rename.query('`3` > 75'))
# UndefinedVariableError: name 'BACKTICK_QUOTED_STRING_3' is not defined
print(df_rename[df_rename[3] > 75])
# 0name age.year state name 3
# 1 Bob 42 CA 92
# 4 Ellen 24 CA 88
Update the original object with the inplace
argument
By default, query()
returns a new DataFrame
, and the original object remains unchanged. Setting the inplace
argument to True
modifies the original object.
df.query('age > 25', inplace=True)
print(df)
# name age state point
# 1 Bob 42 CA 92
# 3 Dave 68 TX 70
# 5 Frank 30 NY 57
Parsers and engines in query()
The query()
method internally uses pd.eval()
.
Parser
In pd.eval()
, you can select 'pandas'
or 'python'
as the parser. You can also specify this with the parser
argument in query()
. By default, it is set to 'pandas'
, which offers a more intuitive usage, such as not requiring each condition expression to be enclosed in parentheses when connected with &
or and
.
Engine
pd.eval()
can use numexpr as an engine to evaluate expressions. For large-scale data with over 10,000 rows, using numexpr can accelerate operations.
You can install numexpr using pip
(or pip3
, depending on your environment).
$ pip install numexpr
In query()
, you can select 'python'
or 'numexpr'
with the engine
argument. The default is None
, which uses 'numexpr'
if available, otherwise 'python'
.
Be aware that the available features may differ depending on the version. For example, in earlier versions (numexpr 2.6.5, pandas 0.23.0), string methods were unavailable with numexpr, requiring engine='python'
.