pandas: Query DataFrame with query()

Modified: 2024-01-25 | Tags: Python, pandas

In pandas, the query() method allows you to extract DataFrame rows by specifying conditions through a query string, using comparison operators, string methods, logical combinations, and more.

Contents

Comparison operators in query()
The in and == operators in query() (equivalent to isin())
String methods in query()
- Handle missing values (NaN or None)
Conditions for index in query()
Multiple conditions in query()
Column names with spaces or dots in query()
Update the original object with the inplace argument
Parsers and engines in query()
- Parser
- Engine

For details on extracting rows and columns using Boolean indexing, see the following article.

pandas: Select rows by multiple conditions

It is also possible to extract columns by data types, or extract rows or columns by their names (labels).

The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.

sample_pandas_normal.csv

import pandas as pd

print(pd.__version__)
# 2.1.4

df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

source: pandas_query.py

Comparison operators in `query()`

In pandas, rows can be extracted using comparison operators as follows.

print(df[df['age'] < 25])
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

source: pandas_query.py

The same conditions can be specified as a string with the query() method.

print(df.query('age < 25'))
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

source: pandas_query.py

To use variables in a query string, prefix the variable name with @.

val = 25
print(df.query('age < @val'))
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

source: pandas_query.py

It is also possible to generate a string embedded with variables using f-strings.

How to use f-strings in Python

print(f'age < {val}')
# age < 25

print(df.query(f'age < {val}'))
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

source: pandas_query.py

You can specify a range using two comparison operators.

Chained comparison (a < x < b) in Python

print(df.query('30 <= age < 50'))
#     name  age state  point
# 1    Bob   42    CA     92
# 5  Frank   30    NY     57

source: pandas_query.py

Comparisons can be made between columns or using arithmetic operators for calculations.

print(df.query('age < point / 3'))
#       name  age state  point
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

source: pandas_query.py

For equality or inequality, use == or !=.

Note that strings within a query string must be enclosed in quotes. Double quotes " can be used within single-quoted strings '...' and vice versa. The same symbol can be used by escaping with a backslash \.

Create a string in Python (single/double/triple quotes, str())

print(df.query('state == "CA"'))
#       name  age state  point
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

source: pandas_query.py

When using variables with @, quotes are not needed, but f-strings are still strings, so quotes are necessary.

s = 'CA'
print(df.query('state != @s'))
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

# print(df.query(f'state != {s}'))
# UndefinedVariableError: name 'CA' is not defined

print(df.query(f'state != "{s}"'))
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

source: pandas_query.py

For partial matching conditions on strings, use the string methods mentioned below.

For more on the syntax available in a query string, refer to the official documentation.

Enhancing performance - Supported syntax — pandas 2.1.4 documentation

The `in` and `==` operators in `query()` (equivalent to `isin()`)

The isin() method in a Series determines whether each element is present in a specified list, returning a Boolean Series as a result. This method can be used for extracting rows where column values exactly match those specified in the list.

pandas.Series.isin — pandas 2.1.4 documentation

print(df[df['state'].isin(['NY', 'TX'])])
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

source: pandas_query.py

With the query() method, the same result can be achieved using the in operator.

print(df.query('state in ["NY", "TX"]'))
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

source: pandas_query.py

An equivalent outcome can also be obtained using == with a list in the query string.

print(df.query('state == ["NY", "TX"]'))
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

source: pandas_query.py

You can also use list variables with @ or f-strings.

l = ['NY', 'TX']
print(df.query('state in @l'))
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

print(df.query(f'state in {l}'))
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

source: pandas_query.py

String methods in `query()`

You can specify conditions for exact string matches using == or in, but for partial matches, string methods (str.xxx()) can be used.

pandas: Extract rows that contain specific strings from a DataFrame

Some of these methods include:

str.contains(): Checks if a string contains a specific substring
str.endswith(): Checks if a string ends with a specific substring
str.startswith(): Checks if a string starts with a specific substring
str.match(): Checks if a string matches a regular expression (regex) pattern

These string methods can be used within the query() method. However, note that using these methods in query() may not always simplify the code compared to Boolean indexing.

print(df[df['name'].str.endswith('e')])
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70

print(df.query('name.str.endswith("e")'))
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70

source: pandas_query.py

Non-string type columns can be converted to strings with astype() to use string methods. This conversion can also be specified in query().

pandas: How to use astype() to cast dtype of DataFrame

print(df.query('age.astype("str").str.endswith("8")'))
#       name  age state  point
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70

source: pandas_query.py

Handle missing values (`NaN` or `None`)

Using string methods as conditions on columns with missing values (NaN or None) results in an error.

Missing values in pandas (nan, None, pd.NA)

df_nan = df.copy()
df_nan.at[0, 'name'] = float('NaN')
print(df_nan)
#       name  age state  point
# 0      NaN   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

# print(df_nan.query('name.str.endswith("e")'))
# ValueError: unknown type object

source: pandas_query.py

Most string methods offer the na argument to specify a replacement value for missing values. Setting na to True includes rows with missing values in the extraction, whereas setting it to False excludes them.

print(df_nan[df_nan['name'].str.endswith('e', na=True)])
#       name  age state  point
# 0      NaN   24    NY     64
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70

print(df_nan[df_nan['name'].str.endswith('e', na=False)])
#       name  age state  point
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70

source: pandas_query.py

You can specify the na argument for string methods within query().

print(df_nan.query('name.str.endswith("e", na=False)'))
#       name  age state  point
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70

source: pandas_query.py

Conditions for `index` in `query()`

In the query string, conditions for the index (row names) can be specified using the keyword index.

print(df.query('index % 2 == 0'))
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

source: pandas_query.py

If the index has a name, you can specify conditions using either that name or the keyword index.

pandas: Assign existing column to the DataFrame index with set_index()

Sorry for the confusion, but in the following example, the index is named name.

df_name = df.set_index('name')
print(df_name)
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70
# Dave      68    TX     70
# Ellen     24    CA     88
# Frank     30    NY     57

print(df_name.index.name)
# name

print(df_name.query('name.str.endswith("e")'))
#          age state  point
# name                     
# Alice     24    NY     64
# Charlie   18    CA     70
# Dave      68    TX     70

print(df_name.query('index.str.endswith("e")'))
#          age state  point
# name                     
# Alice     24    NY     64
# Charlie   18    CA     70
# Dave      68    TX     70

source: pandas_query.py

To extract columns based on the columns (column names), refer to the following article.

pandas: Filter rows/columns by labels with filter()

Multiple conditions in `query()`

You can specify multiple conditions using Boolean indexing as follows.

pandas: Select rows by multiple conditions

print(df[(df['age'] < 25) & (df['point'] > 65)])
#       name  age state  point
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

source: pandas_query.py

Using the query() method, you can express conditions as follows. Parentheses for each condition are unnecessary, and the logical AND can be represented as either & or and.

print(df.query('age < 25 & point > 65'))
#       name  age state  point
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

print(df.query('age < 25 and point > 65'))
#       name  age state  point
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

source: pandas_query.py

OR can be represented as either | or or.

print(df.query('age < 20 | point > 80'))
#       name  age state  point
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

print(df.query('age < 20 or point > 80'))
#       name  age state  point
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

source: pandas_query.py

NOT is represented as not.

print(df.query('not age < 25 and not point > 65'))
#     name  age state  point
# 5  Frank   30    NY     57

source: pandas_query.py

In the case of three or more conditions, it is safer to explicitly enclose the part you want to process first in parentheses, because the result may vary depending on the order. For example, & has higher precedence than |.

print(df.query('age == 24 | point > 80 & state == "CA"'))
#     name  age state  point
# 0  Alice   24    NY     64
# 1    Bob   42    CA     92
# 4  Ellen   24    CA     88

print(df.query('(age == 24 | point > 80) & state == "CA"'))
#     name  age state  point
# 1    Bob   42    CA     92
# 4  Ellen   24    CA     88

source: pandas_query.py

Column names with spaces or dots in `query()`

When using the query() method, pay attention to column names.

For example, change the column names as follows.

pandas: Rename column/index names of DataFrame

df_rename = df.set_axis(['0name', 'age.year', 'state name', 3], axis=1)
print(df_rename)
#      0name  age.year state name   3
# 0    Alice        24         NY  64
# 1      Bob        42         CA  92
# 2  Charlie        18         CA  70
# 3     Dave        68         TX  70
# 4    Ellen        24         CA  88
# 5    Frank        30         NY  57

source: pandas_query.py

An error will occur if the column names are not valid Python variable names, such as those starting with numbers, or containing . or spaces.

Valid variable names and naming rules in Python

# print(df_rename.query('0name.str.endswith("e")'))
# SyntaxError: invalid syntax

# print(df_rename.query('age.year < 25'))
# UndefinedVariableError: name 'age' is not defined

# print(df_rename.query('state name == "CA"'))
# SyntaxError: invalid syntax

source: pandas_query.py

They need to be enclosed in backticks "`".

print(df_rename.query('`0name`.str.endswith("e")'))
#      0name  age.year state name   3
# 0    Alice        24         NY  64
# 2  Charlie        18         CA  70
# 3     Dave        68         TX  70

print(df_rename.query('`age.year` < 25'))
#      0name  age.year state name   3
# 0    Alice        24         NY  64
# 2  Charlie        18         CA  70
# 4    Ellen        24         CA  88

print(df_rename.query('`state name` == "CA"'))
#      0name  age.year state name   3
# 1      Bob        42         CA  92
# 2  Charlie        18         CA  70
# 4    Ellen        24         CA  88

source: pandas_query.py

However, certain special cases, such as numeric column names, might still pose issues and require alternative approaches, like using Boolean indexing.

# print(df_rename.query('3 > 75'))
# KeyError: False

# print(df_rename.query('`3` > 75'))
# UndefinedVariableError: name 'BACKTICK_QUOTED_STRING_3' is not defined

print(df_rename[df_rename[3] > 75])
#    0name  age.year state name   3
# 1    Bob        42         CA  92
# 4  Ellen        24         CA  88

source: pandas_query.py

Update the original object with the `inplace` argument

By default, query() returns a new DataFrame, and the original object remains unchanged. Setting the inplace argument to True modifies the original object.

df.query('age > 25', inplace=True)
print(df)
#     name  age state  point
# 1    Bob   42    CA     92
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

source: pandas_query.py

Parsers and engines in `query()`

The query() method internally uses pd.eval().

pandas.eval — pandas 2.1.4 documentation

Parser

In pd.eval(), you can select 'pandas' or 'python' as the parser. You can also specify this with the parser argument in query(). By default, it is set to 'pandas', which offers a more intuitive usage, such as not requiring each condition expression to be enclosed in parentheses when connected with & or and.

Enhancing performance - pandas.eval() parsers — pandas 2.1.4 documentation

Engine

pd.eval() can use numexpr as an engine to evaluate expressions. For large-scale data with over 10,000 rows, using numexpr can accelerate operations.

Enhancing performance - Expression evaluation via eval() — pandas 2.1.4 documentation

You can install numexpr using pip (or pip3, depending on your environment).

$ pip install numexpr

In query(), you can select 'python' or 'numexpr' with the engine argument. The default is None, which uses 'numexpr' if available, otherwise 'python'.

Be aware that the available features may differ depending on the version. For example, in earlier versions (numexpr 2.6.5, pandas 0.23.0), string methods were unavailable with numexpr, requiring engine='python'.

pandas: Query DataFrame with query()

Comparison operators in `query()`

The `in` and `==` operators in `query()` (equivalent to `isin()`)

String methods in `query()`

Handle missing values (`NaN` or `None`)

Conditions for `index` in `query()`

Multiple conditions in `query()`

Column names with spaces or dots in `query()`

Update the original object with the `inplace` argument

Parsers and engines in `query()`

Parser

Engine

Related Categories

Related Articles

pandas: Query DataFrame with query()

Comparison operators in query()

The in and == operators in query() (equivalent to isin())

String methods in query()

Handle missing values (NaN or None)

Conditions for index in query()

Multiple conditions in query()

Column names with spaces or dots in query()

Update the original object with the inplace argument

Parsers and engines in query()

Parser

Engine

Related Categories

Related Articles

Comparison operators in `query()`

The `in` and `==` operators in `query()` (equivalent to `isin()`)

String methods in `query()`

Handle missing values (`NaN` or `None`)

Conditions for `index` in `query()`

Multiple conditions in `query()`

Column names with spaces or dots in `query()`

Update the original object with the `inplace` argument

Parsers and engines in `query()`