Ranking pandas.DataFrame and Series with rank()

Posted: 2018-07-22 | Tags: Python, pandas

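To rank the rows or columns of a pandas.DataFrame, or the elements of a pandas.Series, use the rank() method.

sort_values() sorts the columns of a pandas.DataFrame or a pandas.Series in ascending or descending order, whereas rank() leaves the data in place and returns the rank of each element.

See the following article for sort_values().

Related article: Sorting pandas.DataFrame and Series with sort_values, sort_index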
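The difference between the two methods can be seen in a minimal sketch. The Series `s` and the variable names are illustrative and are not part of the article's example data.

```python
import pandas as pd

# Illustrative data (hypothetical, chosen to contain one tie)
s = pd.Series([50, 80, 100, 80], index=['a', 'b', 'c', 'd'])

# sort_values() returns the values reordered from smallest to largest
sorted_s = s.sort_values()
print(sorted_s.tolist())
# [50, 80, 80, 100]

# rank() keeps the original order and returns the rank of each element
ranked_s = s.rank()
print(ranked_s)
# a    1.0
# b    2.5
# c    4.0
# d    2.5
# dtype: float64
```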

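This article covers the following topics.

Basic usage of rank()
Specifying rows or columns: the axis argument
Ranking numeric data only: the numeric_only argument
Ascending or descending order: the ascending argument
Handling ties (duplicate values): the method argument
Handling missing values (NaN): the na_option argument
Getting percentile ranks: the pct argument
For pandas.Series

The following pandas.DataFrame is used as an example.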

import numpy as np
import pandas as pd

# pd.np was removed from pandas, so NumPy is imported directly for np.nan
df = pd.DataFrame({'col1': [50, 80, 100, 80],
                   'col2': [0.3, np.nan, 0.1, np.nan],
                   'col3': ['h', 'j', 'i', 'k']},
                  index=['a', 'b', 'c', 'd'])

print(df)
#    col1  col2 col3
# a    50   0.3    h
# b    80   NaN    j
# c   100   0.1    i
# d    80   NaN    k

source: pandas_rank.py

Basic usage of rank()

By default, rank() ranks each column in ascending order. Tied (duplicate) values receive the average of their ranks, and strings are compared in alphabetical order.

print(df.rank())
#    col1  col2  col3
# a   1.0   2.0   1.0
# b   2.5   NaN   3.0
# c   4.0   1.0   2.0
# d   2.5   NaN   4.0

source: pandas_rank.py

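Specifying rows or columns: the axis argument

By default, ranking is done column by column.

To rank within each row, set the axis argument to 1. Each row of this example mixes numbers and strings, which cannot be compared with each other, so numeric_only=True is also passed to rank only the numeric columns (see the next section).

print(df.rank(axis=1, numeric_only=True))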
#    col1  col2
# a   2.0   1.0
# b   1.0   NaN
# c   2.0   1.0
# d   1.0   NaN

source: pandas_rank.py
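The effect of axis is easier to see on an all-numeric DataFrame. The following is a minimal sketch; `df_num` and its values are hypothetical and chosen only so that the two directions give different results.

```python
import pandas as pd

# Hypothetical all-numeric data
df_num = pd.DataFrame({'x': [10, 20, 30],
                       'y': [25, 15, 5]},
                      index=['a', 'b', 'c'])

# Default (axis=0): values are compared within each column
by_col = df_num.rank()
print(by_col)
#      x    y
# a  1.0  3.0
# b  2.0  2.0
# c  3.0  1.0

# axis=1: values are compared within each row
by_row = df_num.rank(axis=1)
print(by_row)
#      x    y
# a  1.0  2.0
# b  2.0  1.0
# c  2.0  1.0
```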

Ranking numeric data only: the numeric_only argument

By default, string columns are ranked as well.

To rank numeric columns only, set the numeric_only argument to True.

print(df.rank(numeric_only=True))
#    col1  col2
# a   1.0   2.0
# b   2.5   NaN
# c   4.0   1.0
# d   2.5   NaN

source: pandas_rank.py

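In the pandas version used when this article was written, the default was numeric_only=None: rows or columns containing only strings were ranked, but where numbers and strings were mixed, as when ranking the example pandas.DataFrame along rows, the strings were silently ignored. The output below comes from that version. Since pandas 2.0 the default is numeric_only=False, so the same call is expected to raise TypeError instead.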

print(df.rank(axis=1))
#    col1  col2
# a   2.0   1.0
# b   1.0   NaN
# c   2.0   1.0
# d   1.0   NaN

source: pandas_rank.py

When numbers and strings are mixed, explicitly setting numeric_only=False raises TypeError.

# print(df.rank(axis=1, numeric_only=False))
# TypeError: '<' not supported between instances of 'str' and 'int'

source: pandas_rank.py
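If the behavior should not depend on the default of numeric_only, one option is to select the numeric columns explicitly before ranking. The sketch below uses select_dtypes() for this; it is an alternative shown for illustration, not something the original example relies on, and `numeric_df` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [50, 80, 100, 80],
                   'col2': [0.3, np.nan, 0.1, np.nan],
                   'col3': ['h', 'j', 'i', 'k']},
                  index=['a', 'b', 'c', 'd'])

# Keep only the numeric columns (col1 and col2), dropping the string column
numeric_df = df.select_dtypes(include='number')

# Rank within each row; no strings remain, so no comparison error can occur
ranked = numeric_df.rank(axis=1)
print(ranked)
#    col1  col2
# a   2.0   1.0
# b   1.0   NaN
# c   2.0   1.0
# d   1.0   NaN
```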

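Ascending or descending order: the ascending argument

By default, elements are ranked in ascending order.

To rank in descending order, set the ascending argument to False.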

print(df.rank(ascending=False))
#    col1  col2  col3
# a   4.0   1.0   4.0
# b   2.5   NaN   2.0
# c   1.0   2.0   3.0
# d   2.5   NaN   1.0

source: pandas_rank.py
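A common use of descending ranks is picking the top entries by score. The following is a minimal sketch under assumed data; `scores`, `rank_desc`, and the cutoff of 2 are illustrative. It uses method='min', described in the next section, so that tied scores share the better rank.

```python
import pandas as pd

# Hypothetical scores with a tie at 80
scores = pd.Series([50, 80, 100, 80], index=['a', 'b', 'c', 'd'])

# Rank 1 is the highest score; tied scores share the lowest rank of their group
rank_desc = scores.rank(ascending=False, method='min')

# Keep every element ranked 2nd or better (both tied elements are kept)
top2 = scores[rank_desc <= 2]
print(top2)
# b     80
# c    100
# d     80
# dtype: int64
```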

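Handling ties (duplicate values): the method argument

By default, tied (duplicate) values are assigned the average of their ranks.

The method argument specifies how ties are handled.

The default is method='average': each tied value receives the average of the ranks that the tied group occupies.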

print(df.rank(method='average'))
#    col1  col2  col3
# a   1.0   2.0   1.0
# b   2.5   NaN   3.0
# c   4.0   1.0   2.0
# d   2.5   NaN   4.0

source: pandas_rank.py

With method='min', tied values receive the lowest rank in the group. This gives results such as 1st, tied 2nd, tied 2nd, 4th, as is familiar from sports.

print(df.rank(method='min'))
#    col1  col2  col3
# a   1.0   2.0   1.0
# b   2.0   NaN   3.0
# c   4.0   1.0   2.0
# d   2.0   NaN   4.0

source: pandas_rank.py

With method='max', tied values receive the highest rank in the group.

print(df.rank(method='max'))
#    col1  col2  col3
# a   1.0   2.0   1.0
# b   3.0   NaN   3.0
# c   4.0   1.0   2.0
# d   3.0   NaN   4.0

source: pandas_rank.py

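With method='first', tied values are ranked in the order they appear. Note that in the pandas version used for this article, method='first' works only on numeric data; if the data contains strings, set numeric_only=True.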

# print(df.rank(method='first'))
# ValueError: first not supported for non-numeric data

print(df.rank(method='first', numeric_only=True))
#    col1  col2
# a   1.0   2.0
# b   2.0   NaN
# c   4.0   1.0
# d   3.0   NaN

source: pandas_rank.py

With method='dense', tied values receive the lowest rank in the group, as with 'min', but the ranks that follow are not skipped: 1st, tied 2nd, tied 2nd, 3rd.

print(df.rank(method='dense'))
#    col1  col2  col3
# a   1.0   2.0   1.0
# b   2.0   NaN   3.0
# c   3.0   1.0   2.0
# d   2.0   NaN   4.0

source: pandas_rank.py
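To compare the five methods at a glance, the ranks of a single numeric Series can be collected side by side. This is a small sketch; `s`, `methods`, and `comparison` are illustrative names, and the data matches col1 of the example.

```python
import pandas as pd

# Same values as col1 in the example: one tie at 80
s = pd.Series([50, 80, 100, 80], index=['a', 'b', 'c', 'd'])

methods = ['average', 'min', 'max', 'first', 'dense']

# One column per tie-handling method
comparison = pd.DataFrame({m: s.rank(method=m) for m in methods})
print(comparison)
#    average  min  max  first  dense
# a      1.0  1.0  1.0    1.0    1.0
# b      2.5  2.0  3.0    2.0    2.0
# c      4.0  4.0  4.0    4.0    3.0
# d      2.5  2.0  3.0    3.0    2.0
```

Only the tied rows (b and d) and, for 'dense', the row after the tie (c) differ between methods.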

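Handling missing values (NaN): the na_option argument

By default, missing values (NaN) are not ranked and remain NaN.

The na_option argument specifies how NaN is handled.

The default is na_option='keep': NaN stays NaN.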

print(df.rank(na_option='keep'))
#    col1  col2  col3
# a   1.0   2.0   1.0
# b   2.5   NaN   3.0
# c   4.0   1.0   2.0
# d   2.5   NaN   4.0

source: pandas_rank.py

With na_option='top', NaN is ranked first. When there are multiple NaN values, they are treated as ties and handled according to the method argument.

print(df.rank(na_option='top'))
#    col1  col2  col3
# a   1.0   4.0   1.0
# b   2.5   1.5   3.0
# c   4.0   3.0   2.0
# d   2.5   1.5   4.0

print(df.rank(na_option='top', method='min'))
#    col1  col2  col3
# a   1.0   4.0   1.0
# b   2.0   1.0   3.0
# c   4.0   3.0   2.0
# d   2.0   1.0   4.0

source: pandas_rank.py

With na_option='bottom', NaN is ranked last. When there are multiple NaN values, they are treated as ties and handled according to the method argument.

print(df.rank(na_option='bottom'))
#    col1  col2  col3
# a   1.0   2.0   1.0
# b   2.5   3.5   3.0
# c   4.0   1.0   2.0
# d   2.5   3.5   4.0

print(df.rank(na_option='bottom', method='min'))
#    col1  col2  col3
# a   1.0   2.0   1.0
# b   2.0   3.0   3.0
# c   4.0   1.0   2.0
# d   2.0   3.0   4.0

source: pandas_rank.py
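Note that 'top' and 'bottom' refer to the rank numbers, not to the size of the values. The sketch below, using the same values as col2 and illustrative variable names, suggests that with na_option='bottom' NaN receives the largest rank numbers whether the ranking is ascending or descending.

```python
import numpy as np
import pandas as pd

# Same values as col2 in the example
s = pd.Series([0.3, np.nan, 0.1, np.nan], index=['a', 'b', 'c', 'd'])

# Ascending: 0.1 is 1st, 0.3 is 2nd, the two NaN share ranks 3 and 4
asc = s.rank(na_option='bottom')
print(asc.tolist())
# [2.0, 3.5, 1.0, 3.5]

# Descending: 0.3 is 1st, 0.1 is 2nd, the NaN are still ranked last
desc = s.rank(ascending=False, na_option='bottom')
print(desc.tolist())
# [1.0, 3.5, 2.0, 3.5]
```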

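Getting percentile ranks: the pct argument

Setting the pct argument to True returns each element's rank divided by the number of ranked elements, a value between 0 and 1, instead of the rank itself. It can be combined with the other arguments.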

print(df.rank(pct=True))
#     col1  col2  col3
# a  0.250   1.0  0.25
# b  0.625   NaN  0.75
# c  1.000   0.5  0.50
# d  0.625   NaN  1.00

print(df.rank(pct=True, method='min', ascending=False, na_option='bottom'))
#    col1  col2  col3
# a  1.00  0.25  1.00
# b  0.50  0.75  0.50
# c  0.25  0.50  0.75
# d  0.50  0.75  0.25

source: pandas_rank.py
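With the default na_option='keep', the values returned by pct=True correspond to the ordinary rank divided by the number of non-NaN elements. The following sketch checks this by hand on the values of col2; `manual` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Same values as col2 in the example
s = pd.Series([0.3, np.nan, 0.1, np.nan], index=['a', 'b', 'c', 'd'])

pct = s.rank(pct=True)
print(pct.tolist())
# [1.0, nan, 0.5, nan]

# Manual equivalent: rank divided by the count of non-NaN elements (2 here)
manual = s.rank() / s.count()
print(manual.tolist())
# [1.0, nan, 0.5, nan]
```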

For pandas.Series

The examples so far use pandas.DataFrame, but rank() works the same way on pandas.Series.

print(df['col1'].rank(method='min', ascending=False))
# a    4.0
# b    2.0
# c    1.0
# d    2.0
# Name: col1, dtype: float64

source: pandas_rank.py
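Because rank() on a Series returns a Series with the same index, the result can be stored as a new column. A minimal sketch follows; the column name `col1_rank` is hypothetical, and the conversion to integers assumes the column contains no NaN.

```python
import pandas as pd

df = pd.DataFrame({'col1': [50, 80, 100, 80]},
                  index=['a', 'b', 'c', 'd'])

# Ranks are returned as float; with method='min' and no NaN they are whole
# numbers, so they can be converted to int (astype(int) would fail on NaN)
df['col1_rank'] = df['col1'].rank(method='min', ascending=False).astype(int)
print(df)
#    col1  col1_rank
# a    50          4
# b    80          2
# c   100          1
# d    80          2
```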
