pandas: Set a column as the DataFrame index with set_index()

Modified: | Tags: Python, pandas

The set_index() method of pandas.DataFrame allows you to set an existing column as the index (row labels).

Conversely, to turn the index into a data column, use reset_index().

For methods to rename the index, refer to the following article.

The pandas version used in this article is as follows. Note that functionality may vary between versions. The following data is used as an example.

import pandas as pd

print(pd.__version__)
# 2.1.4

df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

How to use set_index()

Basic usage

Specify the column label (name) to be used as the index in the first argument, keys.

print(df.set_index('name'))
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70
# Dave      68    TX     70
# Ellen     24    CA     88
# Frank     30    NY     57

Keep the specified column: drop

By default, the column specified as the index is removed from the data columns. Setting drop=False keeps the specified column in the data columns.

print(df.set_index('name', drop=False))
#             name  age state  point
# name                              
# Alice      Alice   24    NY     64
# Bob          Bob   42    CA     92
# Charlie  Charlie   18    CA     70
# Dave        Dave   68    TX     70
# Ellen      Ellen   24    CA     88
# Frank      Frank   30    NY     57

Set a MultiIndex

In pandas, you can set a MultiIndex, which enables hierarchical, multi-level indexing.

Specify multiple columns in a list

By specifying a list of column labels in the first argument (keys), multiple columns are assigned as a MultiIndex.

print(df.set_index(['state', 'name']))
#                age  point
# state name               
# NY    Alice     24     64
# CA    Bob       42     92
#       Charlie   18     70
# TX    Dave      68     70
# CA    Ellen     24     88
# NY    Frank     30     57

Sorting with sort_index() makes it display neatly.

print(df.set_index(['state', 'name']).sort_index())
#                age  point
# state name               
# CA    Bob       42     92
#       Charlie   18     70
#       Ellen     24     88
# NY    Alice     24     64
#       Frank     30     57
# TX    Dave      68     70

Add a column to the index: append

By default, specifying a new column with set_index() removes the original index.

df_name = df.set_index('name')
print(df_name)
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70
# Dave      68    TX     70
# Ellen     24    CA     88
# Frank     30    NY     57

print(df_name.set_index('state'))
#        age  point
# state            
# NY      24     64
# CA      42     92
# CA      18     70
# TX      68     70
# CA      24     88
# NY      30     57

Setting append=True adds the specified column to the index as a new level.

df_mi = df_name.set_index('state', append=True)
print(df_mi)
#                age  point
# name    state            
# Alice   NY      24     64
# Bob     CA      42     92
# Charlie CA      18     70
# Dave    TX      68     70
# Ellen   CA      24     88
# Frank   NY      30     57

The added column becomes the lowest level in the MultiIndex. To rearrange these levels, swaplevel() can be used.

print(df_mi.swaplevel(0, 1))
#                age  point
# state name               
# NY    Alice     24     64
# CA    Bob       42     92
#       Charlie   18     70
# TX    Dave      68     70
# CA    Ellen     24     88
# NY    Frank     30     57

Keep the original index as a column

When a new index is set with set_index(), the original index is removed.

To keep the original index as a column, use reset_index() to renumber the index starting from 0.

df_name = df.set_index('name')
print(df_name)
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70
# Dave      68    TX     70
# Ellen     24    CA     88
# Frank     30    NY     57

print(df_name.reset_index())
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

To change the index to another column, use set_index() after reset_index().

print(df_name.reset_index().set_index('state'))
#           name  age  point
# state                     
# NY       Alice   24     64
# CA         Bob   42     92
# CA     Charlie   18     70
# TX        Dave   68     70
# CA       Ellen   24     88
# NY       Frank   30     57

Check for duplicates in the new index: verify_integrity

By default, if there are duplicates in the new index, they are used as is. Setting verify_integrity=True results in an error if duplicates exist.

print(df.set_index('state'))
#           name  age  point
# state                     
# NY       Alice   24     64
# CA         Bob   42     92
# CA     Charlie   18     70
# TX        Dave   68     70
# CA       Ellen   24     88
# NY       Frank   30     57

# print(df.set_index('state', verify_integrity=True))
# ValueError: Index has duplicate keys: Index(['CA', 'NY'], dtype='object', name='state')

Change original object: inplace

By default, set_index() returns a new object without changing the original. Setting inplace=True modifies the original object.

df.set_index('name', inplace=True)
print(df)
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70
# Dave      68    TX     70
# Ellen     24    CA     88
# Frank     30    NY     57

Specify the index when reading CSV files

When reading a CSV file into a DataFrame with pd.read_csv(), you can directly set a column as the index by specifying its column number in the index_col argument.

df_name = pd.read_csv('data/src/sample_pandas_normal.csv', index_col=0)
print(df_name)
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70
# Dave      68    TX     70
# Ellen     24    CA     88
# Frank     30    NY     57

Select and extract rows and values using the index

Unique index labels allow you to select and extract rows and values by their labels.

df_name = pd.read_csv('data/src/sample_pandas_normal.csv', index_col=0)
print(df_name)
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70
# Dave      68    TX     70
# Ellen     24    CA     88
# Frank     30    NY     57

print(df_name.loc['Bob'])
# age      42
# state    CA
# point    92
# Name: Bob, dtype: object

print(df_name.at['Bob', 'age'])
# 42

For more on using loc and at for selecting values, rows, and columns, and for index-based row/column selection, refer to the following articles.

Related Categories

Related Articles