pandas: Sort DataFrame/Series with sort_values(), sort_index()

Modified: | Tags: Python, pandas

In pandas, the sort_values() and sort_index() methods allow you to sort DataFrame and Series. You can sort in ascending or descending order, or sort by multiple columns.

Note that the older sort() method has been deprecated.

Use reindex() to reorder rows and columns in any specific order.

The pandas version used in this article is as follows. Note that functionality may vary between versions. The following data is used as an example.

import pandas as pd

print(pd.__version__)
# 2.1.4

df = pd.read_csv('data/src/sample_pandas_normal.csv')
print(df)
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

The following examples use DataFrame, but Series also supports the sort_values() and sort_index() methods in the same manner.

Sort by values: sort_values()

To sort by values, use the sort_values() method.

Specify columns to sort by: by

Specify the column label (name) you want to sort by in the first argument, by.

print(df.sort_values('age'))
#       name  age state  point
# 2  Charlie   18    CA     70
# 0    Alice   24    NY     64
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57
# 1      Bob   42    CA     92
# 3     Dave   68    TX     70

print(df.sort_values('state'))
#       name  age state  point
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88
# 0    Alice   24    NY     64
# 5    Frank   30    NY     57
# 3     Dave   68    TX     70

Sort by multiple columns

When the first argument (by) is specified as a list, sorting is based on multiple columns. Sorting is performed in reverse order of the listed columns, with the final sort done by the first specified column.

print(df.sort_values(['age', 'state']))
#       name  age state  point
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88
# 0    Alice   24    NY     64
# 5    Frank   30    NY     57
# 1      Bob   42    CA     92
# 3     Dave   68    TX     70

print(df.sort_values(['state', 'age']))
#       name  age state  point
# 2  Charlie   18    CA     70
# 4    Ellen   24    CA     88
# 1      Bob   42    CA     92
# 0    Alice   24    NY     64
# 5    Frank   30    NY     57
# 3     Dave   68    TX     70

Specify ascending or descending: ascending

By default, sorting is in ascending order, but you can set the ascending argument to False for descending order.

print(df.sort_values('age', ascending=False))
#       name  age state  point
# 3     Dave   68    TX     70
# 1      Bob   42    CA     92
# 5    Frank   30    NY     57
# 0    Alice   24    NY     64
# 4    Ellen   24    CA     88
# 2  Charlie   18    CA     70

When sorting based on multiple columns, specifying the ascending argument as a list allows for selecting ascending or descending order for each column.

print(df.sort_values(['age', 'state'], ascending=False))
#       name  age state  point
# 3     Dave   68    TX     70
# 1      Bob   42    CA     92
# 5    Frank   30    NY     57
# 0    Alice   24    NY     64
# 4    Ellen   24    CA     88
# 2  Charlie   18    CA     70

print(df.sort_values(['age', 'state'], ascending=[True, False]))
#       name  age state  point
# 2  Charlie   18    CA     70
# 0    Alice   24    NY     64
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57
# 1      Bob   42    CA     92
# 3     Dave   68    TX     70

Reset the index: ignore_index

By default, sorting the values also rearranges the index (row labels). However, setting the ignore_index argument to True resets it to a 0-based sequence.

print(df.sort_values('age', ignore_index=True))
#       name  age state  point
# 0  Charlie   18    CA     70
# 1    Alice   24    NY     64
# 2    Ellen   24    CA     88
# 3    Frank   30    NY     57
# 4      Bob   42    CA     92
# 5     Dave   68    TX     70

Handle NaN: na_position

By default, NaN values are placed at the end.

df_nan = df.copy()
df_nan.iloc[0, 1] = float('nan')
df_nan.iloc[4, 1] = float('nan')
print(df_nan)
#       name   age state  point
# 0    Alice   NaN    NY     64
# 1      Bob  42.0    CA     92
# 2  Charlie  18.0    CA     70
# 3     Dave  68.0    TX     70
# 4    Ellen   NaN    CA     88
# 5    Frank  30.0    NY     57

print(df_nan.sort_values('age'))
#       name   age state  point
# 2  Charlie  18.0    CA     70
# 5    Frank  30.0    NY     57
# 1      Bob  42.0    CA     92
# 3     Dave  68.0    TX     70
# 0    Alice   NaN    NY     64
# 4    Ellen   NaN    CA     88

Setting the na_position argument to 'first' places them at the beginning.

print(df_nan.sort_values('age', na_position='first'))
#       name   age state  point
# 0    Alice   NaN    NY     64
# 4    Ellen   NaN    CA     88
# 2  Charlie  18.0    CA     70
# 5    Frank  30.0    NY     57
# 1      Bob  42.0    CA     92
# 3     Dave  68.0    TX     70

See the following articles about how to remove and replace missing values.

Sort based on a function: key

Like Python's built-in sorted() function, the sort_values() method allows you to specify the key argument. The function specified in key is applied before sorting, and sorting is based on its results.

For example, you can sort by string length (number of characters) using a lambda expression to apply string methods to the Series.

print(df.sort_values('name', key=lambda s: s.str.len()))
#       name  age state  point
# 1      Bob   42    CA     92
# 3     Dave   68    TX     70
# 0    Alice   24    NY     64
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57
# 2  Charlie   18    CA     70

Note that the key argument of sort_values() requires a function that takes a Series and returns an array-like object such as a Series or list.

To apply any function to each value, use the map() method. The following example uses map() to apply the built-in len() function for explanatory purposes, instead of using Series string methods.

# print(df.sort_values('name', key=len))
# TypeError: object of type 'int' has no len()

print(df.sort_values('name', key=lambda s: s.map(len)))
#       name  age state  point
# 1      Bob   42    CA     92
# 3     Dave   68    TX     70
# 0    Alice   24    NY     64
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57
# 2  Charlie   18    CA     70

Specify sorting algorithm: kind

The kind argument specifies the sorting algorithm. Possible options are 'quicksort' (default), 'mergesort', 'heapsort', and 'stable'. Refer to the following NumPy official documentation for details on each algorithm.

Note that this argument is ignored when sorting by multiple columns.

Sort columns based on rows: axis

By default, as in the examples so far, rows are sorted based on columns. To sort columns based on rows, set the axis argument to 1 or 'columns'.

To avoid errors caused by mixing numbers and strings, this example extracts numerical columns using select_dtypes().

df_num = df.select_dtypes('number')
print(df_num)
#    age  point
# 0   24     64
# 1   42     92
# 2   18     70
# 3   68     70
# 4   24     88
# 5   30     57

print(df_num.sort_values(1, axis=1, ascending=False))
#    point  age
# 0     64   24
# 1     92   42
# 2     70   18
# 3     70   68
# 4     88   24
# 5     57   30

Modify the original object: inplace

By default, sorting creates a new object. However, setting the inplace argument to True modifies the original object.

df_copy = df.copy()
df_copy.sort_values('age', inplace=True)
print(df_copy)
#       name  age state  point
# 2  Charlie   18    CA     70
# 0    Alice   24    NY     64
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57
# 1      Bob   42    CA     92
# 3     Dave   68    TX     70

Sort by index/columns (row/column labels): sort_index()

To sort by the index or columns (row or column labels), use the sort_index() method.

Sort by index (row labels)

By default, rows are sorted by the index (row labels).

df_sorted = df.sort_values('age')
print(df_sorted)
#       name  age state  point
# 2  Charlie   18    CA     70
# 0    Alice   24    NY     64
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57
# 1      Bob   42    CA     92
# 3     Dave   68    TX     70

print(df_sorted.sort_index())
#       name  age state  point
# 0    Alice   24    NY     64
# 1      Bob   42    CA     92
# 2  Charlie   18    CA     70
# 3     Dave   68    TX     70
# 4    Ellen   24    CA     88
# 5    Frank   30    NY     57

Sort by columns (column labels): axis

Setting the axis argument to 1 or 'columns' sorts columns by the columns (column labels).

print(df.sort_index(axis=1))
#    age     name  point state
# 0   24    Alice     64    NY
# 1   42      Bob     92    CA
# 2   18  Charlie     70    CA
# 3   68     Dave     70    TX
# 4   24    Ellen     88    CA
# 5   30    Frank     57    NY

The same arguments as sort_values() can be specified

sort_index() also has arguments such as ascending, ignore_index, na_position, key, kind, and inplace, with usage identical to sort_values().

print(df.sort_index(axis=1, ascending=False, key=lambda s: s.str.len()))
#   state  point     name  age
# 0    NY     64    Alice   24
# 1    CA     92      Bob   42
# 2    CA     70  Charlie   18
# 3    TX     70     Dave   68
# 4    CA     88    Ellen   24
# 5    NY     57    Frank   30

Related Categories

Related Articles