pandas: Get dummy variables with pd.get_dummies()

Tags: Python, pandas

In pandas, the pd.get_dummies() function converts categorical variables to dummy variables.

This function can convert data categorized by strings, such as gender, to a format like 0 for male and 1 for female. It can also transform multi-class features into a one-hot representation, a common practice in preprocessing for machine learning.

The pandas version used in this article is as follows. Note that functionality may vary between versions. The following data is used as an example. Columns have been added for explanation purposes.

import pandas as pd

print(pd.__version__)
# 2.1.2

df = pd.read_csv('data/src/sample_pandas_normal.csv', index_col=0)

df['sex'] = ['female', float('nan'), 'male', 'male', 'female', 'male']
df['rank'] = [2, 1, 1, 0, 2, 0]

print(df)
#          age state  point     sex  rank
# name                                   
# Alice     24    NY     64  female     2
# Bob       42    CA     92     NaN     1
# Charlie   18    CA     70    male     1
# Dave      68    TX     70    male     0
# Ellen     24    CA     88  female     2
# Frank     30    NY     57    male     0
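
If the CSV file is not at hand, roughly the same DataFrame can be built inline. The sketch below assumes the file contains exactly the values shown in the printed output above; it is a stand-in for illustration, not the original data source.

```python
import pandas as pd

# Rebuild the example data inline. Values are copied from the printed
# output above (assumption: the CSV holds nothing beyond these columns).
df = pd.DataFrame(
    {
        'age': [24, 42, 18, 68, 24, 30],
        'state': ['NY', 'CA', 'CA', 'TX', 'CA', 'NY'],
        'point': [64, 92, 70, 70, 88, 57],
        'sex': ['female', float('nan'), 'male', 'male', 'female', 'male'],
        'rank': [2, 1, 1, 0, 2, 0],
    },
    index=pd.Index(
        ['Alice', 'Bob', 'Charlie', 'Dave', 'Ellen', 'Frank'], name='name'
    ),
)

print(df)
```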

Basic usage of pd.get_dummies()

The first argument, data, of pd.get_dummies() can be a Series, an array-like object (such as a list or a NumPy ndarray), or a DataFrame. In all cases, a new DataFrame is returned.

Specify Series or array-like object as the first argument

When a Series or array-like object (such as a list or a NumPy ndarray) is specified as the first argument, the category names are used as column names.

print(pd.get_dummies(df['sex']))
#          female   male
# name                  
# Alice      True  False
# Bob       False  False
# Charlie   False   True
# Dave      False   True
# Ellen      True  False
# Frank     False   True

print(pd.get_dummies(['female', float('nan'), 'male', 'male', 'female', 'male']))
#    female   male
# 0    True  False
# 1   False  False
# 2   False   True
# 3   False   True
# 4    True  False
# 5   False   True
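
As a small additional sketch, the same call works on a NumPy ndarray. The array below is made up for illustration; it shows that the result is a DataFrame and that the columns are the sorted unique values.

```python
import numpy as np
import pandas as pd

# Hypothetical input array, not part of the example data.
arr = np.array(['b', 'a', 'c', 'a'])

dummies = pd.get_dummies(arr)

# The result is a DataFrame even though the input was an ndarray.
print(isinstance(dummies, pd.DataFrame))
# True

# Column names are the unique values in sorted order.
print(dummies.columns.tolist())
# ['a', 'b', 'c']
```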

Specify DataFrame as the first argument

When a DataFrame is specified as the first argument, by default, all columns whose data type (dtype) is object, string, or category are converted to dummy variables. Columns of other types, such as numbers, are left unchanged; settings for converting them are discussed later.

In this case, the resulting column names follow the format <ORIGINAL_COLUMN_NAME>_<CATEGORY_NAME>. Settings to change this are discussed later.

print(pd.get_dummies(df))
#          age  point  rank  state_CA  state_NY  state_TX  sex_female  sex_male
# name                                                                         
# Alice     24     64     2     False      True     False        True     False
# Bob       42     92     1      True     False     False       False     False
# Charlie   18     70     1      True     False     False       False      True
# Dave      68     70     0     False     False      True       False      True
# Ellen     24     88     2      True     False     False        True     False
# Frank     30     57     0     False      True     False       False      True
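
The column layout of the result can be checked with a minimal sketch like the following, which uses a small made-up DataFrame: columns that are not converted keep their position and dtype, and the dummy variable columns are appended after them.

```python
import pandas as pd

# Small illustrative frame: one numeric column, one string column.
df_small = pd.DataFrame({'age': [24, 42, 18], 'state': ['NY', 'CA', 'CA']})

result = pd.get_dummies(df_small)

# The numeric column is passed through; the string column is replaced
# by one dummy column per category.
print(result.columns.tolist())
# ['age', 'state_CA', 'state_NY']
```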

Specify data type for dummy variables: dtype

By default, dummy variables are represented as bool (True and False). This has been the default since pandas 2.0; earlier versions used uint8.

You can specify the data type with the dtype argument. For example, since True and False correspond to 1 and 0 respectively, specifying int represents the dummy variables as 1 and 0.

print(pd.get_dummies(df, dtype=int))
#          age  point  rank  state_CA  state_NY  state_TX  sex_female  sex_male
# name                                                                         
# Alice     24     64     2         0         1         0           1         0
# Bob       42     92     1         1         0         0           0         0
# Charlie   18     70     1         1         0         0           0         1
# Dave      68     70     0         0         0         1           0         1
# Ellen     24     88     2         1         0         0           1         0
# Frank     30     57     0         0         1         0           0         1
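
Other numeric types can be passed in the same way. The following sketch, on a made-up Series, uses 'int8' (a compact integer type) and float to show that only the dummy columns' dtype changes.

```python
import pandas as pd

# Hypothetical Series used only for this illustration.
s = pd.Series(['a', 'b', 'a'])

as_int8 = pd.get_dummies(s, dtype='int8')
as_float = pd.get_dummies(s, dtype=float)

print(as_int8)
#    a  b
# 0  1  0
# 1  0  1
# 2  1  0

print(as_float.dtypes.tolist())
# [dtype('float64'), dtype('float64')]
```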

Exclude the first category: drop_first

When converting k categories to dummy variables, only k-1 dummy variables are necessary, but by default, pd.get_dummies() converts them to k dummy variables.

Setting the drop_first argument to True excludes the first category, converting it to k-1 dummy variables.

print(pd.get_dummies(df, drop_first=True))
#          age  point  rank  state_NY  state_TX  sex_male
# name                                                   
# Alice     24     64     2      True     False     False
# Bob       42     92     1     False     False     False
# Charlie   18     70     1     False     False      True
# Dave      68     70     0     False      True      True
# Ellen     24     88     2     False     False     False
# Frank     30     57     0      True     False      True

In the example data, Bob's sex is a missing value NaN, and when dummy variables are created, both sex_female and sex_male become False. Note that with drop_first=True, only sex_male remains, so Bob's row (NaN) can no longer be distinguished from the rows of the dropped category female. To convert NaN to dummy variables, use the dummy_na argument introduced next.
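
The loss of information can be seen in isolation with a short sketch on a made-up Series: after dropping the first category, the female row and the NaN row look identical.

```python
import pandas as pd

# Illustrative Series with one missing value.
s = pd.Series(['female', float('nan'), 'male'])

dropped = pd.get_dummies(s, drop_first=True)

# 'female' (the first category) is dropped, leaving only 'male'.
# Row 0 (female) and row 1 (NaN) are both False.
print(dropped)
#     male
# 0  False
# 1  False
# 2   True
```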

Convert missing values NaN to dummy variables: dummy_na

By default, NaN is not treated as a category, so all dummy variable columns are False for rows with missing values. If you want to treat NaN as a separate category, set the dummy_na argument to True.

For columns that do not contain NaN, a dummy variable column for NaN will still be added, and all its elements will be False.

print(pd.get_dummies(df, drop_first=True, dummy_na=True))
#          age  point  rank  state_NY  state_TX  state_nan  sex_male  sex_nan
# name                                                                       
# Alice     24     64     2      True     False      False     False    False
# Bob       42     92     1     False     False      False     False     True
# Charlie   18     70     1     False     False      False      True    False
# Dave      68     70     0     False      True      False      True    False
# Ellen     24     88     2     False     False      False     False    False
# Frank     30     57     0      True     False      False      True    False
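
Without drop_first, the effect of dummy_na alone is easier to see. The sketch below uses a one-column DataFrame made up for illustration.

```python
import pandas as pd

# Illustrative single-column frame with one missing value.
df_sex = pd.DataFrame({'sex': ['female', float('nan'), 'male']})

with_na = pd.get_dummies(df_sex, dummy_na=True)

# A dedicated column marks the rows where the value was missing.
print(with_na.columns.tolist())
# ['sex_female', 'sex_male', 'sex_nan']

print(with_na['sex_nan'].tolist())
# [False, True, False]
```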

Specify column names for dummy variables: prefix, prefix_sep

For a DataFrame, the default column names for the generated dummy variables are <ORIGINAL_COLUMN_NAME>_<CATEGORY_NAME>. You can change this by specifying the prefix and prefix_sep arguments.

The prefix argument can be a string, list, or dictionary.

If prefix is specified as a string, the same prefix is applied to every converted column, giving names of the form <prefix>_<CATEGORY_NAME>. If you want the dummy variable column names to be just the category names, set both prefix and prefix_sep to an empty string ''.

print(pd.get_dummies(df, prefix='', prefix_sep=''))
#          age  point  rank     CA     NY     TX  female   male
# name                                                         
# Alice     24     64     2  False   True  False    True  False
# Bob       42     92     1   True  False  False   False  False
# Charlie   18     70     1   True  False  False   False   True
# Dave      68     70     0  False  False   True   False   True
# Ellen     24     88     2   True  False  False    True  False
# Frank     30     57     0  False   True  False   False   True

To use a different prefix for each column, specify the prefixes as a list, in the order of the columns to be converted. When using a dictionary for prefix, map the original column names to the prefixes using the format {original_column_name: prefix}.

An error occurs if the number of elements in the list or dictionary does not match the number of columns to be converted. Ensure each column to be converted is accounted for, even if you wish to retain its original name.

print(pd.get_dummies(df, prefix=['ST', 'sex'], prefix_sep='-'))
#          age  point  rank  ST-CA  ST-NY  ST-TX  sex-female  sex-male
# name                                                                
# Alice     24     64     2  False   True  False        True     False
# Bob       42     92     1   True  False  False       False     False
# Charlie   18     70     1   True  False  False       False      True
# Dave      68     70     0  False  False   True       False      True
# Ellen     24     88     2   True  False  False        True     False
# Frank     30     57     0  False   True  False       False      True

print(pd.get_dummies(df, prefix={'state': 'ST', 'sex': 'sex'}, prefix_sep='-'))
#          age  point  rank  ST-CA  ST-NY  ST-TX  sex-female  sex-male
# name                                                                
# Alice     24     64     2  False   True  False        True     False
# Bob       42     92     1   True  False  False       False     False
# Charlie   18     70     1   True  False  False       False      True
# Dave      68     70     0  False  False   True       False      True
# Ellen     24     88     2   True  False  False        True     False
# Frank     30     57     0  False   True  False       False      True
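
A minimal sketch of the length check, on a made-up two-column DataFrame: a list with too few prefixes is rejected, while a dictionary covering every converted column works. The exact wording of the error message may differ between pandas versions.

```python
import pandas as pd

# Illustrative frame with two string columns to be converted.
df_two = pd.DataFrame({'state': ['NY', 'CA'], 'sex': ['female', 'male']})

# One prefix for two columns: pandas raises ValueError.
try:
    pd.get_dummies(df_two, prefix=['ST'])
    error_message = None
except ValueError as e:
    error_message = str(e)
    print(error_message)

# Every converted column is covered, so this succeeds.
renamed = pd.get_dummies(df_two, prefix={'state': 'ST', 'sex': 'sex'})
print(renamed.columns.tolist())
# ['ST_CA', 'ST_NY', 'sex_female', 'sex_male']
```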

Specify columns to be converted to dummy variables: columns

By default, in the case of a DataFrame, columns whose data type (dtype) is object, string, or category are converted to dummy variables.

You can also convert numerical and boolean columns to dummy variables by specifying the column names as a list in the columns argument. Columns not specified in columns are not converted.

print(pd.get_dummies(df, columns=['sex', 'rank']))
#          age state  point  sex_female  sex_male  rank_0  rank_1  rank_2
# name                                                                   
# Alice     24    NY     64        True     False   False   False    True
# Bob       42    CA     92       False     False   False    True   False
# Charlie   18    CA     70       False      True   False    True   False
# Dave      68    TX     70       False      True    True   False   False
# Ellen     24    CA     88        True     False   False   False    True
# Frank     30    NY     57       False      True    True   False   False
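
The following sketch, on a small made-up DataFrame, converts only a numeric column. It illustrates that a string column not listed in columns is left as it is.

```python
import pandas as pd

# Illustrative frame: one string column and one integer column.
df_rank = pd.DataFrame({'state': ['NY', 'CA', 'TX'], 'rank': [2, 1, 0]})

only_rank = pd.get_dummies(df_rank, columns=['rank'])

# 'state' is not in `columns`, so it is kept unchanged;
# 'rank' is expanded into one dummy column per distinct value.
print(only_rank.columns.tolist())
# ['state', 'rank_0', 'rank_1', 'rank_2']
```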

Cautions when converting multiple datasets with pd.get_dummies()

Be careful when converting multiple datasets separately with pd.get_dummies().

Consider the following two DataFrames.

df = pd.read_csv('data/src/sample_pandas_normal.csv', index_col=0)
df_A, df_B = df[:3].copy(), df[3:].copy()

print(df_A)
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70

print(df_B)
#        age state  point
# name                   
# Dave    68    TX     70
# Ellen   24    CA     88
# Frank   30    NY     57

Converting each of them with pd.get_dummies() results in the following. Since each DataFrame contains different categories, the resulting columns differ.

print(pd.get_dummies(df_A))
#          age  point  state_CA  state_NY
# name                                   
# Alice     24     64     False      True
# Bob       42     92      True     False
# Charlie   18     70      True     False

print(pd.get_dummies(df_B))
#        age  point  state_CA  state_NY  state_TX
# name                                           
# Dave    68     70     False     False      True
# Ellen   24     88      True     False     False
# Frank   30     57     False      True     False

To make the dummy variable columns consistent across datasets, use pandas' categorical type. Convert the target columns to categorical type using pd.Categorical() with a shared list of categories.

categories = set(df_A['state'].tolist() + df_B['state'].tolist())
print(categories)
# {'NY', 'TX', 'CA'}

df_A['state'] = pd.Categorical(df_A['state'], categories)
df_B['state'] = pd.Categorical(df_B['state'], categories)

print(df_A['state'].dtypes)
# category

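Here, the categories are generated by converting each column to a list with tolist(), concatenating these lists, and then removing duplicates with set(). Because a set is unordered, the order of the categories, and therefore of the dummy variable columns, may differ between runs; wrap the set in sorted() if you need a fixed column order.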

When pd.get_dummies() is executed on them, dummy variables are generated according to the specified categories. For example, the state column in df_A does not contain TX, but a state_TX column is generated.

print(pd.get_dummies(df_A))
#          age  point  state_NY  state_TX  state_CA
# name                                             
# Alice     24     64      True     False     False
# Bob       42     92     False     False      True
# Charlie   18     70     False     False      True

print(pd.get_dummies(df_B))
#        age  point  state_NY  state_TX  state_CA
# name                                           
# Dave    68     70     False      True     False
# Ellen   24     88     False     False      True
# Frank   30     57      True     False     False
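
The column order produced from a set is not guaranteed. As a variation on the approach above, the sketch below sorts the combined categories so that both results always share the same columns in the same order. The two small DataFrames are stand-ins for df_A and df_B that contain only the state column.

```python
import pandas as pd

# Stand-ins for df_A and df_B (state column only).
df_A = pd.DataFrame({'state': ['NY', 'CA', 'CA']})
df_B = pd.DataFrame({'state': ['TX', 'CA', 'NY']})

# Union of the observed values, sorted for a deterministic order.
categories = sorted(set(df_A['state']) | set(df_B['state']))

df_A['state'] = pd.Categorical(df_A['state'], categories=categories)
df_B['state'] = pd.Categorical(df_B['state'], categories=categories)

dummies_A = pd.get_dummies(df_A)
dummies_B = pd.get_dummies(df_B)

print(dummies_A.columns.tolist())
# ['state_CA', 'state_NY', 'state_TX']
print(dummies_B.columns.tolist())
# ['state_CA', 'state_NY', 'state_TX']
```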

In the above example, the categories are the union of the values found in the two datasets, but you can also define your own categories, which may include values not present in the data or omit values that are. Values that do not correspond to any category are treated as NaN.

categories = ['CA', 'NY']

df_A['state'] = pd.Categorical(df_A['state'], categories)
df_B['state'] = pd.Categorical(df_B['state'], categories)

print(df_A)
#          age state  point
# name                     
# Alice     24    NY     64
# Bob       42    CA     92
# Charlie   18    CA     70

print(df_B)
#        age state  point
# name                   
# Dave    68   NaN     70
# Ellen   24    CA     88
# Frank   30    NY     57

print(pd.get_dummies(df_A))
#          age  point  state_CA  state_NY
# name                                   
# Alice     24     64     False      True
# Bob       42     92      True     False
# Charlie   18     70      True     False

print(pd.get_dummies(df_B))
#        age  point  state_CA  state_NY
# name                                 
# Dave    68     70     False     False
# Ellen   24     88      True     False
# Frank   30     57     False      True
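
The conversion of out-of-category values to NaN can be checked directly on the Categorical object. This is a minimal sketch with made-up values:

```python
import pandas as pd

# 'TX' is not among the declared categories, so it becomes NaN.
cat = pd.Categorical(['TX', 'CA', 'NY'], categories=['CA', 'NY'])

print(cat.isna().tolist())
# [True, False, False]

dummies = pd.get_dummies(pd.Series(cat))

# The row that held 'TX' is False in every dummy column.
print(dummies)
#       CA     NY
# 0  False  False
# 1   True  False
# 2  False   True
```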
