pandas: Set a column as the DataFrame index with set_index()

Modified: 2024-01-26 | Tags: Python, pandas

The set_index() method of pandas.DataFrame allows you to set an existing column as the index (row labels).

pandas.DataFrame.set_index — pandas 2.1.4 documentation

Contents

How to use set_index()
Specify the index when reading CSV files
Select and extract rows and values using the index

Conversely, to turn the index into a data column, use reset_index().

pandas: Reset index of DataFrame/Series with reset_index()

For methods to rename the index, refer to the following article.

pandas: Rename column/index names of DataFrame

The pandas version used in this article is as follows. Note that functionality may vary between versions. The following data is used as an example.

sample_pandas_normal.csv

import pandas as pd

print(pd.__version__)
# 2.1.4

df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

source: pandas_set_index.py

How to use `set_index()`

Basic usage

Specify the column label (name) to be used as the index in the first argument, keys.

print(df.set_index('name'))
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70
# Dave      68    TX     70
# Ellen     24    CA     88
# Frank     30    NY     57

source: pandas_set_index.py

Keep the specified column: `drop`

By default, the column specified as the index is removed from the data columns. Setting drop=False keeps the specified column in the data columns.

print(df.set_index('name', drop=False))
#             name  age state  point
# name                              
# Alice      Alice   24    NY     64
# Bob          Bob   42    CA     92
# Charlie  Charlie   18    CA     70
# Dave        Dave   68    TX     70
# Ellen      Ellen   24    CA     88
# Frank      Frank   30    NY     57

source: pandas_set_index.py

Set a `MultiIndex`

In pandas, you can set a MultiIndex, which enables hierarchical, multi-level indexing.

MultiIndex / advanced indexing — pandas 2.2.0 documentation

Specify multiple columns in a list

By specifying a list of column labels in the first argument (keys), multiple columns are assigned as a MultiIndex.

print(df.set_index(['state', 'name']))
#                age  point
# state name               
# NY    Alice     24     64
# CA    Bob       42     92
#       Charlie   18     70
# TX    Dave      68     70
# CA    Ellen     24     88
# NY    Frank     30     57

source: pandas_set_index.py

Sorting with sort_index() makes it display neatly.

pandas: Sort DataFrame/Series with sort_values(), sort_index()

print(df.set_index(['state', 'name']).sort_index())
#                age  point
# state name               
# CA    Bob       42     92
#       Charlie   18     70
#       Ellen     24     88
# NY    Alice     24     64
#       Frank     30     57
# TX    Dave      68     70

source: pandas_set_index.py

Add a column to the `index`: `append`

By default, specifying a new column with set_index() removes the original index.

df_name = df.set_index('name')
print(df_name)
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70
# Dave      68    TX     70
# Ellen     24    CA     88
# Frank     30    NY     57

print(df_name.set_index('state'))
#        age  point
# state            
# NY      24     64
# CA      42     92
# CA      18     70
# TX      68     70
# CA      24     88
# NY      30     57

source: pandas_set_index.py

Setting append=True adds the specified column to the index as a new level.

df_mi = df_name.set_index('state', append=True)
print(df_mi)
#                age  point
# name    state            
# Alice   NY      24     64
# Bob     CA      42     92
# Charlie CA      18     70
# Dave    TX      68     70
# Ellen   CA      24     88
# Frank   NY      30     57

source: pandas_set_index.py

The added column becomes the lowest level in the MultiIndex. To rearrange these levels, swaplevel() can be used.

pandas.DataFrame.swaplevel — pandas 2.2.0 documentation

print(df_mi.swaplevel(0, 1))
#                age  point
# state name               
# NY    Alice     24     64
# CA    Bob       42     92
#       Charlie   18     70
# TX    Dave      68     70
# CA    Ellen     24     88
# NY    Frank     30     57

source: pandas_set_index.py

Keep the original `index` as a column

When a new index is set with set_index(), the original index is removed.

To keep the original index as a column, use reset_index() to renumber the index starting from 0.

pandas: Reset index of DataFrame/Series with reset_index()

df_name = df.set_index('name')
print(df_name)
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70
# Dave      68    TX     70
# Ellen     24    CA     88
# Frank     30    NY     57

print(df_name.reset_index())
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

source: pandas_set_index.py

To change the index to another column, use set_index() after reset_index().

print(df_name.reset_index().set_index('state'))
#           name  age  point
# state                     
# NY       Alice   24     64
# CA         Bob   42     92
# CA     Charlie   18     70
# TX        Dave   68     70
# CA       Ellen   24     88
# NY       Frank   30     57

source: pandas_set_index.py

Check for duplicates in the new `index`: `verify_integrity`

By default, if there are duplicates in the new index, they are used as is. Setting verify_integrity=True results in an error if duplicates exist.

print(df.set_index('state'))
#           name  age  point
# state                     
# NY       Alice   24     64
# CA         Bob   42     92
# CA     Charlie   18     70
# TX        Dave   68     70
# CA       Ellen   24     88
# NY       Frank   30     57

# print(df.set_index('state', verify_integrity=True))
# ValueError: Index has duplicate keys: Index(['CA', 'NY'], dtype='object', name='state')

source: pandas_set_index.py

Change original object: `inplace`

By default, set_index() returns a new object without changing the original. Setting inplace=True modifies the original object.

df.set_index('name', inplace=True)
print(df)
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70
# Dave      68    TX     70
# Ellen     24    CA     88
# Frank     30    NY     57

source: pandas_set_index.py

Specify the `index` when reading CSV files

When reading a CSV file into a DataFrame with pd.read_csv(), you can directly set a column as the index by specifying its column number in the index_col argument.

pandas: Read CSV into DataFrame with read_csv()

df_name = pd.read_csv('data/src/sample_pandas_normal.csv', index_col=0)
print(df_name)
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70
# Dave      68    TX     70
# Ellen     24    CA     88
# Frank     30    NY     57

source: pandas_set_index.py

Select and extract rows and values using the `index`

Unique index labels allow you to select and extract rows and values by their labels.

df_name = pd.read_csv('data/src/sample_pandas_normal.csv', index_col=0)
print(df_name)
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70
# Dave      68    TX     70
# Ellen     24    CA     88
# Frank     30    NY     57

print(df_name.loc['Bob'])
# age      42
# state    CA
# point    92
# Name: Bob, dtype: object

print(df_name.at['Bob', 'age'])
# 42

source: pandas_set_index.py

For more on using loc and at for selecting values, rows, and columns, and for index-based row/column selection, refer to the following articles.

pandas: Set a column as the DataFrame index with set_index()

How to use `set_index()`

Basic usage

Keep the specified column: `drop`

Set a `MultiIndex`

Specify multiple columns in a list

Add a column to the `index`: `append`

Keep the original `index` as a column

Check for duplicates in the new `index`: `verify_integrity`

Change original object: `inplace`

Specify the `index` when reading CSV files

Select and extract rows and values using the `index`

Related Categories

Related Articles

pandas: Set a column as the DataFrame index with set_index()

How to use set_index()

Basic usage

Keep the specified column: drop

Set a MultiIndex

Specify multiple columns in a list

Add a column to the index: append

Keep the original index as a column

Check for duplicates in the new index: verify_integrity

Change original object: inplace

Specify the index when reading CSV files

Select and extract rows and values using the index

Related Categories

Related Articles

How to use `set_index()`

Keep the specified column: `drop`

Set a `MultiIndex`

Add a column to the `index`: `append`

Keep the original `index` as a column

Check for duplicates in the new `index`: `verify_integrity`

Change original object: `inplace`

Specify the `index` when reading CSV files

Select and extract rows and values using the `index`