pandas: Query DataFrame with query()

Modified: | Tags: Python, pandas

In pandas, the query() method allows you to extract DataFrame rows by specifying conditions through a query string, using comparison operators, string methods, logical combinations, and more.

For details on extracting rows and columns using Boolean indexing, see the following article.

It is also possible to extract columns by data types, or extract rows or columns by their names (labels).

The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.

import pandas as pd

print(pd.__version__)
# 2.1.4

df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

Comparison operators in query()

In pandas, rows can be extracted using comparison operators as follows.

print(df[df['age'] < 25])
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

The same conditions can be specified as a string with the query() method.

print(df.query('age < 25'))
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

To use variables in a query string, prefix the variable name with @.

val = 25
print(df.query('age < @val'))
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

It is also possible to generate a string embedded with variables using f-strings.

print(f'age < {val}')
# age < 25

print(df.query(f'age < {val}'))
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

You can specify a range using two comparison operators.

print(df.query('30 <= age < 50'))
#     name  age state  point
# 1    Bob   42    CA     92
# 5  Frank   30    NY     57

Comparisons can be made between columns or using arithmetic operators for calculations.

print(df.query('age < point / 3'))
#       name  age state  point
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

For equality or inequality, use == or !=.

Note that strings within a query string must be enclosed in quotes. Double quotes " can be used within single-quoted strings '...' and vice versa. The same symbol can be used by escaping with a backslash \.

print(df.query('state == "CA"'))
#       name  age state  point
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

When using variables with @, quotes are not needed, but f-strings are still strings, so quotes are necessary.

s = 'CA'
print(df.query('state != @s'))
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

# print(df.query(f'state != {s}'))
# UndefinedVariableError: name 'CA' is not defined

print(df.query(f'state != "{s}"'))
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

For partial matching conditions on strings, use the string methods mentioned below.

For more on the syntax available in a query string, refer to the official documentation.

The in and == operators in query() (equivalent to isin())

The isin() method in a Series determines whether each element is present in a specified list, returning a Boolean Series as a result. This method can be used for extracting rows where column values exactly match those specified in the list.

print(df[df['state'].isin(['NY', 'TX'])])
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

With the query() method, the same result can be achieved using the in operator.

print(df.query('state in ["NY", "TX"]'))
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

An equivalent outcome can also be obtained using == with a list in the query string.

print(df.query('state == ["NY", "TX"]'))
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

You can also use list variables with @ or f-strings.

l = ['NY', 'TX']
print(df.query('state in @l'))
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

print(df.query(f'state in {l}'))
#     name  age state  point
# 0  Alice   24    NY     64
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

String methods in query()

You can specify conditions for exact string matches using == or in, but for partial matches, string methods (str.xxx()) can be used.

Some of these methods include:

  • str.contains(): Checks if a string contains a specific substring
  • str.endswith(): Checks if a string ends with a specific substring
  • str.startswith(): Checks if a string starts with a specific substring
  • str.match(): Checks if a string matches a regular expression (regex) pattern

These string methods can be used within the query() method. However, note that using these methods in query() may not always simplify the code compared to Boolean indexing.

print(df[df['name'].str.endswith('e')])
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70

print(df.query('name.str.endswith("e")'))
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70

Non-string type columns can be converted to strings with astype() to use string methods. This conversion can also be specified in query().

print(df.query('age.astype("str").str.endswith("8")'))
#       name  age state  point
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70

Handle missing values (NaN or None)

Using string methods as conditions on columns with missing values (NaN or None) results in an error.

df_nan = df.copy()
df_nan.at[0, 'name'] = float('NaN')
print(df_nan)
#       name  age state  point
# 0      NaN   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

# print(df_nan.query('name.str.endswith("e")'))
# ValueError: unknown type object

Most string methods offer the na argument to specify a replacement value for missing values. Setting na to True includes rows with missing values in the extraction, whereas setting it to False excludes them.

print(df_nan[df_nan['name'].str.endswith('e', na=True)])
#       name  age state  point
# 0      NaN   24    NY     64
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70

print(df_nan[df_nan['name'].str.endswith('e', na=False)])
#       name  age state  point
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70

You can specify the na argument for string methods within query().

print(df_nan.query('name.str.endswith("e", na=False)'))
#       name  age state  point
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70

Conditions for index in query()

In the query string, conditions for the index (row names) can be specified using the keyword index.

print(df.query('index % 2 == 0'))
#       name  age state  point
# 0    Alice   24    NY     64
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

If the index has a name, you can specify conditions using either that name or the keyword index.

Sorry for the confusion, but in the following example, the index is named name.

df_name = df.set_index('name')
print(df_name)
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70
# Dave      68    TX     70
# Ellen     24    CA     88
# Frank     30    NY     57

print(df_name.index.name)
# name

print(df_name.query('name.str.endswith("e")'))
#          age state  point
# name                     
# Alice     24    NY     64
# Charlie   18    CA     70
# Dave      68    TX     70

print(df_name.query('index.str.endswith("e")'))
#          age state  point
# name                     
# Alice     24    NY     64
# Charlie   18    CA     70
# Dave      68    TX     70

To extract columns based on the columns (column names), refer to the following article.

Multiple conditions in query()

You can specify multiple conditions using Boolean indexing as follows.

print(df[(df['age'] < 25) & (df['point'] > 65)])
#       name  age state  point
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

Using the query() method, you can express conditions as follows. Parentheses for each condition are unnecessary, and the logical AND can be represented as either & or and.

print(df.query('age < 25 & point > 65'))
#       name  age state  point
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

print(df.query('age < 25 and point > 65'))
#       name  age state  point
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

OR can be represented as either | or or.

print(df.query('age < 20 | point > 80'))
#       name  age state  point
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

print(df.query('age < 20 or point > 80'))
#       name  age state  point
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88

NOT is represented as not.

print(df.query('not age < 25 and not point > 65'))
#     name  age state  point
# 5  Frank   30    NY     57

In the case of three or more conditions, it is safer to explicitly enclose the part you want to process first in parentheses, because the result may vary depending on the order. For example, & has higher precedence than |.

print(df.query('age == 24 | point > 80 & state == "CA"'))
#     name  age state  point
# 0  Alice   24    NY     64
# 1    Bob   42    CA     92
# 4  Ellen   24    CA     88

print(df.query('(age == 24 | point > 80) & state == "CA"'))
#     name  age state  point
# 1    Bob   42    CA     92
# 4  Ellen   24    CA     88

Column names with spaces or dots in query()

When using the query() method, pay attention to column names.

For example, change the column names as follows.

df_rename = df.set_axis(['0name', 'age.year', 'state name', 3], axis=1)
print(df_rename)
#      0name  age.year state name   3
# 0    Alice        24         NY  64
# 1      Bob        42         CA  92
# 2  Charlie        18         CA  70
# 3     Dave        68         TX  70
# 4    Ellen        24         CA  88
# 5    Frank        30         NY  57

An error will occur if the column names are not valid Python variable names, such as those starting with numbers, or containing . or spaces.

# print(df_rename.query('0name.str.endswith("e")'))
# SyntaxError: invalid syntax

# print(df_rename.query('age.year < 25'))
# UndefinedVariableError: name 'age' is not defined

# print(df_rename.query('state name == "CA"'))
# SyntaxError: invalid syntax

They need to be enclosed in backticks "`".

print(df_rename.query('`0name`.str.endswith("e")'))
#      0name  age.year state name   3
# 0    Alice        24         NY  64
# 2  Charlie        18         CA  70
# 3     Dave        68         TX  70

print(df_rename.query('`age.year` < 25'))
#      0name  age.year state name   3
# 0    Alice        24         NY  64
# 2  Charlie        18         CA  70
# 4    Ellen        24         CA  88

print(df_rename.query('`state name` == "CA"'))
#      0name  age.year state name   3
# 1      Bob        42         CA  92
# 2  Charlie        18         CA  70
# 4    Ellen        24         CA  88

However, certain special cases, such as numeric column names, might still pose issues and require alternative approaches, like using Boolean indexing.

# print(df_rename.query('3 > 75'))
# KeyError: False

# print(df_rename.query('`3` > 75'))
# UndefinedVariableError: name 'BACKTICK_QUOTED_STRING_3' is not defined

print(df_rename[df_rename[3] > 75])
#    0name  age.year state name   3
# 1    Bob        42         CA  92
# 4  Ellen        24         CA  88

Update the original object with the inplace argument

By default, query() returns a new DataFrame, and the original object remains unchanged. Setting the inplace argument to True modifies the original object.

df.query('age > 25', inplace=True)
print(df)
#     name  age state  point
# 1    Bob   42    CA     92
# 3   Dave   68    TX     70
# 5  Frank   30    NY     57

Parsers and engines in query()

The query() method internally uses pd.eval().

Parser

In pd.eval(), you can select 'pandas' or 'python' as the parser. You can also specify this with the parser argument in query(). By default, it is set to 'pandas', which offers a more intuitive usage, such as not requiring each condition expression to be enclosed in parentheses when connected with & or and.

Engine

pd.eval() can use numexpr as an engine to evaluate expressions. For large-scale data with over 10,000 rows, using numexpr can accelerate operations.

You can install numexpr using pip (or pip3, depending on your environment).

$ pip install numexpr

In query(), you can select 'python' or 'numexpr' with the engine argument. The default is None, which uses 'numexpr' if available, otherwise 'python'.

Be aware that the available features may differ depending on the version. For example, in earlier versions (numexpr 2.6.5, pandas 0.23.0), string methods were unavailable with numexpr, requiring engine='python'.

Related Categories

Related Articles