pandas: Set a column as the DataFrame index with set_index()
The set_index()
method of pandas.DataFrame
allows you to set an existing column as the index
(row labels).
Conversely, to turn the index
into a data column, use reset_index()
.
For methods to rename the index
, refer to the following article.
The pandas version used in this article is as follows. Note that functionality may vary between versions. The following data is used as an example.
import pandas as pd
print(pd.__version__)
# 2.1.4
df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
How to use set_index()
Basic usage
Specify the column label (name) to be used as the index
in the first argument, keys
.
print(df.set_index('name'))
# age state point
# name
# Alice 24 NY 64
# Bob 42 CA 92
# Charlie 18 CA 70
# Dave 68 TX 70
# Ellen 24 CA 88
# Frank 30 NY 57
Keep the specified column: drop
By default, the column specified as the index
is removed from the data columns. Setting drop=False
keeps the specified column in the data columns.
print(df.set_index('name', drop=False))
# name age state point
# name
# Alice Alice 24 NY 64
# Bob Bob 42 CA 92
# Charlie Charlie 18 CA 70
# Dave Dave 68 TX 70
# Ellen Ellen 24 CA 88
# Frank Frank 30 NY 57
Set a MultiIndex
In pandas, you can set a MultiIndex
, which enables hierarchical, multi-level indexing.
Specify multiple columns in a list
By specifying a list of column labels in the first argument (keys
), multiple columns are assigned as a MultiIndex
.
print(df.set_index(['state', 'name']))
# age point
# state name
# NY Alice 24 64
# CA Bob 42 92
# Charlie 18 70
# TX Dave 68 70
# CA Ellen 24 88
# NY Frank 30 57
Sorting with sort_index()
makes it display neatly.
print(df.set_index(['state', 'name']).sort_index())
# age point
# state name
# CA Bob 42 92
# Charlie 18 70
# Ellen 24 88
# NY Alice 24 64
# Frank 30 57
# TX Dave 68 70
Add a column to the index
: append
By default, specifying a new column with set_index()
removes the original index.
df_name = df.set_index('name')
print(df_name)
# age state point
# name
# Alice 24 NY 64
# Bob 42 CA 92
# Charlie 18 CA 70
# Dave 68 TX 70
# Ellen 24 CA 88
# Frank 30 NY 57
print(df_name.set_index('state'))
# age point
# state
# NY 24 64
# CA 42 92
# CA 18 70
# TX 68 70
# CA 24 88
# NY 30 57
Setting append=True
adds the specified column to the index
as a new level.
df_mi = df_name.set_index('state', append=True)
print(df_mi)
# age point
# name state
# Alice NY 24 64
# Bob CA 42 92
# Charlie CA 18 70
# Dave TX 68 70
# Ellen CA 24 88
# Frank NY 30 57
The added column becomes the lowest level in the MultiIndex
. To rearrange these levels, swaplevel()
can be used.
print(df_mi.swaplevel(0, 1))
# age point
# state name
# NY Alice 24 64
# CA Bob 42 92
# Charlie 18 70
# TX Dave 68 70
# CA Ellen 24 88
# NY Frank 30 57
Keep the original index
as a column
When a new index
is set with set_index()
, the original index
is removed.
To keep the original index
as a column, use reset_index()
to renumber the index
starting from 0
.
df_name = df.set_index('name')
print(df_name)
# age state point
# name
# Alice 24 NY 64
# Bob 42 CA 92
# Charlie 18 CA 70
# Dave 68 TX 70
# Ellen 24 CA 88
# Frank 30 NY 57
print(df_name.reset_index())
# name age state point
# 0 Alice 24 NY 64
# 1 Bob 42 CA 92
# 2 Charlie 18 CA 70
# 3 Dave 68 TX 70
# 4 Ellen 24 CA 88
# 5 Frank 30 NY 57
To change the index
to another column, use set_index()
after reset_index()
.
print(df_name.reset_index().set_index('state'))
# name age point
# state
# NY Alice 24 64
# CA Bob 42 92
# CA Charlie 18 70
# TX Dave 68 70
# CA Ellen 24 88
# NY Frank 30 57
Check for duplicates in the new index
: verify_integrity
By default, if there are duplicates in the new index
, they are used as is. Setting verify_integrity=True
results in an error if duplicates exist.
print(df.set_index('state'))
# name age point
# state
# NY Alice 24 64
# CA Bob 42 92
# CA Charlie 18 70
# TX Dave 68 70
# CA Ellen 24 88
# NY Frank 30 57
# print(df.set_index('state', verify_integrity=True))
# ValueError: Index has duplicate keys: Index(['CA', 'NY'], dtype='object', name='state')
Change original object: inplace
By default, set_index()
returns a new object without changing the original. Setting inplace=True
modifies the original object.
df.set_index('name', inplace=True)
print(df)
# age state point
# name
# Alice 24 NY 64
# Bob 42 CA 92
# Charlie 18 CA 70
# Dave 68 TX 70
# Ellen 24 CA 88
# Frank 30 NY 57
Specify the index
when reading CSV files
When reading a CSV file into a DataFrame
with pd.read_csv()
, you can directly set a column as the index
by specifying its column number in the index_col
argument.
df_name = pd.read_csv('data/src/sample_pandas_normal.csv', index_col=0)
print(df_name)
# age state point
# name
# Alice 24 NY 64
# Bob 42 CA 92
# Charlie 18 CA 70
# Dave 68 TX 70
# Ellen 24 CA 88
# Frank 30 NY 57
Select and extract rows and values using the index
Unique index
labels allow you to select and extract rows and values by their labels.
df_name = pd.read_csv('data/src/sample_pandas_normal.csv', index_col=0)
print(df_name)
# age state point
# name
# Alice 24 NY 64
# Bob 42 CA 92
# Charlie 18 CA 70
# Dave 68 TX 70
# Ellen 24 CA 88
# Frank 30 NY 57
print(df_name.loc['Bob'])
# age 42
# state CA
# point 92
# Name: Bob, dtype: object
print(df_name.at['Bob', 'age'])
# 42
For more on using loc
and at
for selecting values, rows, and columns, and for index-based row/column selection, refer to the following articles.