pandas: Get dummy variables with pd.get_dummies()
In pandas, the pd.get_dummies() function converts categorical variables to dummy variables.
This function can convert data categorized by strings, such as gender, to a numeric format like 0 for male and 1 for female. It can also transform multi-class features into a one-hot representation, a common practice in preprocessing for machine learning.
- Basic usage of pd.get_dummies()
- Specify data type for dummy variables: dtype
- Exclude the first category: drop_first
- Convert missing values NaN to dummy variables: dummy_na
- Specify column names for dummy variables: prefix, prefix_sep
- Specify columns to be converted to dummy variables: columns
- Cautions when converting multiple data with pd.get_dummies()
The pandas version used in this article is as follows. Note that functionality may vary between versions. The following data is used as an example. Columns have been added for explanation purposes.
import pandas as pd
print(pd.__version__)
# 2.1.2
df = pd.read_csv('data/src/sample_pandas_normal.csv', index_col=0)
df['sex'] = ['female', float('nan'), 'male', 'male', 'female', 'male']
df['rank'] = [2, 1, 1, 0, 2, 0]
print(df)
# age state point sex rank
# name
# Alice 24 NY 64 female 2
# Bob 42 CA 92 NaN 1
# Charlie 18 CA 70 male 1
# Dave 68 TX 70 male 0
# Ellen 24 CA 88 female 2
# Frank 30 NY 57 male 0
Basic usage of pd.get_dummies()
The first argument, data, of pd.get_dummies() can be a Series, an array-like object (such as a list or a NumPy array ndarray), or a DataFrame. In all cases, a new DataFrame is returned.
Specify Series or array-like object as the first argument
When a Series or array-like object (such as a list or a NumPy array ndarray) is specified as the first argument, the category names are used as column names.
print(pd.get_dummies(df['sex']))
# female male
# name
# Alice True False
# Bob False False
# Charlie False True
# Dave False True
# Ellen True False
# Frank False True
print(pd.get_dummies(['female', float('nan'), 'male', 'male', 'female', 'male']))
# female male
# 0 True False
# 1 False False
# 2 False True
# 3 False True
# 4 True False
# 5 False True
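The text above mentions NumPy arrays, but only a Series and a list are shown. The following is a minimal sketch with an ndarray; the sample values are hypothetical and not taken from the article's CSV file.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not from the article's CSV file)
arr = np.array(['female', 'male', 'male', 'female'])

dummies = pd.get_dummies(arr)

# A DataFrame is returned, with the category names as column names
print(type(dummies).__name__)
# DataFrame
print(dummies.columns.tolist())
# ['female', 'male']
print(dummies['female'].tolist())
# [True, False, False, True]
```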
Specify DataFrame as the first argument
When a DataFrame is specified as the first argument, by default, all columns whose data type (dtype) is object (mainly strings) or category are converted to dummy variables. Settings for converting columns of other types, such as numbers, to dummy variables are discussed later.
In this case, the resulting column names follow the format <ORIGINAL_COLUMN_NAME>_<CATEGORY_NAME>. Settings to change this are discussed later.
print(pd.get_dummies(df))
# age point rank state_CA state_NY state_TX sex_female sex_male
# name
# Alice 24 64 2 False True False True False
# Bob 42 92 1 True False False False False
# Charlie 18 70 1 True False False False True
# Dave 68 70 0 False False True False True
# Ellen 24 88 2 True False False True False
# Frank 30 57 0 False True False False True
Specify data type for dummy variables: dtype
By default, dummy variables are represented as bool (True and False).
You can specify the data type with the dtype argument. Since True and False are treated as 1 and 0 respectively, specifying int, for example, represents them as 1 and 0.
print(pd.get_dummies(df, dtype=int))
# age point rank state_CA state_NY state_TX sex_female sex_male
# name
# Alice 24 64 2 0 1 0 1 0
# Bob 42 92 1 1 0 0 0 0
# Charlie 18 70 1 1 0 0 0 1
# Dave 68 70 0 0 0 1 0 1
# Ellen 24 88 2 1 0 0 1 0
# Frank 30 57 0 0 1 0 0 1
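As a small sketch of the same idea with another type, the example below passes float to dtype and also shows that converting the default bool result with astype() afterwards gives the same values. The Series used here is hypothetical sample data.

```python
import pandas as pd

# Hypothetical sample data
s = pd.Series(['a', 'b', 'a'])

# dtype accepts types other than int, for example float
as_float = pd.get_dummies(s, dtype=float)
print(as_float['a'].tolist())
# [1.0, 0.0, 1.0]

# Converting the default bool result afterwards is equivalent
converted = pd.get_dummies(s).astype(int)
print(converted.equals(pd.get_dummies(s, dtype=int)))
# True
```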
Exclude the first category: drop_first
When converting k categories to dummy variables, only k-1 dummy variables are necessary, because the remaining category is implied when all the others are False. By default, however, pd.get_dummies() creates k dummy variables.
Setting the drop_first argument to True excludes the first category, producing k-1 dummy variables.
print(pd.get_dummies(df, drop_first=True))
# age point rank state_NY state_TX sex_male
# name
# Alice 24 64 2 True False False
# Bob 42 92 1 False False False
# Charlie 18 70 1 False False True
# Dave 68 70 0 False True True
# Ellen 24 88 2 False False False
# Frank 30 57 0 True False True
In the example data, Bob's sex is a missing value NaN, and when dummy variables are created, both sex_female and sex_male become False. Note that setting drop_first to True in such cases loses the information that the value is NaN, because Bob's row becomes indistinguishable from the rows of the dropped category female. To convert NaN to dummy variables, use the dummy_na argument introduced next.
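The loss of information can be seen in a minimal sketch like the following, which uses a hypothetical three-element Series rather than the article's data. After dropping the first category, the row with NaN and the row with the dropped category look the same.

```python
import pandas as pd

# Hypothetical sample data: one 'female', one missing value, one 'male'
s = pd.Series(['female', float('nan'), 'male'])

dropped = pd.get_dummies(s, drop_first=True)

# Only the 'male' column remains ('female' is the first category)
print(dropped.columns.tolist())
# ['male']

# The 'female' row and the NaN row are both False
print(dropped['male'].tolist())
# [False, False, True]
```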
Convert missing values NaN to dummy variables: dummy_na
By default, missing values NaN are ignored and all dummy variable columns become False. If you want to treat NaN as a separate category for dummy variables, set the dummy_na argument to True.
For columns that do not contain NaN, a dummy variable column for NaN will still be added, and all its elements will be False.
print(pd.get_dummies(df, drop_first=True, dummy_na=True))
# age point rank state_NY state_TX state_nan sex_male sex_nan
# name
# Alice 24 64 2 True False False False False
# Bob 42 92 1 False False False False True
# Charlie 18 70 1 False False False True False
# Dave 68 70 0 False True False True False
# Ellen 24 88 2 False False False False False
# Frank 30 57 0 True False False True False
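As a rough check of this behavior, the sketch below applies dummy_na=True to a hypothetical Series. With the extra NaN column, every row has exactly one True value, so missing values are no longer confused with any category.

```python
import pandas as pd

# Hypothetical sample data: one 'female', one missing value, one 'male'
s = pd.Series(['female', float('nan'), 'male'])

with_na = pd.get_dummies(s, dummy_na=True)

# Three columns: 'female', 'male', and one for NaN
print(with_na.shape)
# (3, 3)

# The last column marks the missing value
print(with_na.iloc[:, -1].tolist())
# [False, True, False]

# Each row now has exactly one True
print(with_na.sum(axis=1).tolist())
# [1, 1, 1]
```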
Specify column names for dummy variables: prefix, prefix_sep
For a DataFrame, the default column names for the generated dummy variables are <ORIGINAL_COLUMN_NAME>_<CATEGORY_NAME>. You can change this by specifying the prefix and prefix_sep arguments.
The prefix argument can be a string, list, or dictionary.
If specified as a string, all columns share the same prefix, as in <prefix>_<CATEGORY_NAME>. If you want the dummy variable column names to be just the category names, set both prefix and prefix_sep to an empty string ''.
print(pd.get_dummies(df, prefix='', prefix_sep=''))
# age point rank CA NY TX female male
# name
# Alice 24 64 2 False True False True False
# Bob 42 92 1 True False False False False
# Charlie 18 70 1 True False False False True
# Dave 68 70 0 False False True False True
# Ellen 24 88 2 True False False True False
# Frank 30 57 0 False True False False True
You can specify a prefix for each converted column as a list. When using a dictionary for prefix, map the original column names to their prefixes using the format {original_column_name: prefix}.
An error occurs if the number of elements in the list or dictionary does not match the number of columns to be converted. Ensure each column to be converted is accounted for, even if you wish to retain its original name.
print(pd.get_dummies(df, prefix=['ST', 'sex'], prefix_sep='-'))
# age point rank ST-CA ST-NY ST-TX sex-female sex-male
# name
# Alice 24 64 2 False True False True False
# Bob 42 92 1 True False False False False
# Charlie 18 70 1 True False False False True
# Dave 68 70 0 False False True False True
# Ellen 24 88 2 True False False True False
# Frank 30 57 0 False True False False True
print(pd.get_dummies(df, prefix={'state': 'ST', 'sex': 'sex'}, prefix_sep='-'))
# age point rank ST-CA ST-NY ST-TX sex-female sex-male
# name
# Alice 24 64 2 False True False True False
# Bob 42 92 1 True False False False False
# Charlie 18 70 1 True False False False True
# Dave 68 70 0 False False True False True
# Ellen 24 88 2 True False False True False
# Frank 30 57 0 False True False False True
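The length requirement can be illustrated with a short sketch. The DataFrame df_demo below is hypothetical; it has two string columns, so a prefix list with a single element is rejected with a ValueError, while a dictionary covering both columns works.

```python
import pandas as pd

# Hypothetical sample data with two string columns and one numeric column
df_demo = pd.DataFrame({
    'state': ['NY', 'CA'],
    'sex': ['female', 'male'],
    'age': [24, 42],
})

# Two columns are converted, but only one prefix is given
error_message = None
try:
    pd.get_dummies(df_demo, prefix=['ST'])
except ValueError as e:
    error_message = str(e)
    print(error_message)

# A dictionary that covers every converted column works
ok = pd.get_dummies(df_demo, prefix={'state': 'ST', 'sex': 'sex'})
print(ok.columns.tolist())
# ['age', 'ST_CA', 'ST_NY', 'sex_female', 'sex_male']
```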
Specify columns to be converted to dummy variables: columns
By default, in the case of a DataFrame, columns whose data type (dtype) is object (mainly strings) or category are converted to dummy variables.
You can also convert numerical and boolean columns to dummy variables by specifying the column names as a list in the columns argument. Columns not specified in columns are not converted.
print(pd.get_dummies(df, columns=['sex', 'rank']))
# age state point sex_female sex_male rank_0 rank_1 rank_2
# name
# Alice 24 NY 64 True False False False True
# Bob 42 CA 92 False False False True False
# Charlie 18 CA 70 False True False True False
# Dave 68 TX 70 False True True False False
# Ellen 24 CA 88 True False False False True
# Frank 30 NY 57 False True True False False
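To contrast the default with an explicit columns list, here is a small sketch on a hypothetical DataFrame. By default the numeric rank column is left as-is; listing it in columns turns its values into dummy variables.

```python
import pandas as pd

# Hypothetical sample data
df_demo = pd.DataFrame({
    'state': ['NY', 'CA', 'CA'],
    'rank': [2, 1, 1],
    'age': [24, 42, 18],
})

# Default: only the string column 'state' is converted
default = pd.get_dummies(df_demo)
print(default.columns.tolist())
# ['rank', 'age', 'state_CA', 'state_NY']

# Explicit list: the numeric column 'rank' is converted as well
explicit = pd.get_dummies(df_demo, columns=['state', 'rank'])
print(explicit.columns.tolist())
# ['age', 'state_CA', 'state_NY', 'rank_1', 'rank_2']
```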
Cautions when converting multiple data with pd.get_dummies()
Be careful when converting multiple datasets separately with pd.get_dummies().
Consider the following two DataFrames.
df = pd.read_csv('data/src/sample_pandas_normal.csv', index_col=0)
df_A, df_B = df[:3].copy(), df[3:].copy()
print(df_A)
# age state point
# name
# Alice 24 NY 64
# Bob 42 CA 92
# Charlie 18 CA 70
print(df_B)
# age state point
# name
# Dave 68 TX 70
# Ellen 24 CA 88
# Frank 30 NY 57
Converting each of them with pd.get_dummies() results in the following. Since each DataFrame contains different categories, the resulting columns differ.
print(pd.get_dummies(df_A))
# age point state_CA state_NY
# name
# Alice 24 64 False True
# Bob 42 92 True False
# Charlie 18 70 True False
print(pd.get_dummies(df_B))
# age point state_CA state_NY state_TX
# name
# Dave 68 70 False False True
# Ellen 24 88 True False False
# Frank 30 57 False True False
To make the dummy variable columns common, use pandas' categorical type. Convert the target columns to the categorical type using pd.Categorical(), passing the same categories for every DataFrame.
- Categorical data — pandas 2.1.3 documentation
- pandas.Categorical — pandas 2.1.3 documentation
- Feature Request: allow user defined categories in get_dummies · Issue #22078 · pandas-dev/pandas
categories = set(df_A['state'].tolist() + df_B['state'].tolist())
print(categories)
# {'NY', 'TX', 'CA'}
df_A['state'] = pd.Categorical(df_A['state'], categories)
df_B['state'] = pd.Categorical(df_B['state'], categories)
print(df_A['state'].dtypes)
# category
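Here, the categories are generated by converting each column to a list with tolist(), concatenating these lists, and then removing duplicates with set(). Note that a set is unordered, so the order of the categories, and therefore of the resulting dummy variable columns, may differ from the order shown here.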
- Convert pandas.DataFrame, Series and list to each other
- Remove/extract duplicate elements from list in Python
When pd.get_dummies() is executed on them, dummy variables are generated according to the specified categories. For example, the state column in df_A does not contain TX, but a state_TX column is generated.
print(pd.get_dummies(df_A))
# age point state_NY state_TX state_CA
# name
# Alice 24 64 True False False
# Bob 42 92 False False True
# Charlie 18 70 False False True
print(pd.get_dummies(df_B))
# age point state_NY state_TX state_CA
# name
# Dave 68 70 False True False
# Ellen 24 88 False False True
# Frank 30 57 True False False
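Because a set has no defined order, the column order in the output above depends on how the set happens to be iterated. If a reproducible column order matters, one option is to sort the categories first. The following is a minimal sketch under that assumption; df_a and df_b are hypothetical stand-ins for the two DataFrames.

```python
import pandas as pd

# Hypothetical stand-ins for the two DataFrames
df_a = pd.DataFrame({'state': ['NY', 'CA', 'CA']})
df_b = pd.DataFrame({'state': ['TX', 'CA', 'NY']})

# Union of the values in both columns, sorted for a stable order
categories = sorted(set(df_a['state']) | set(df_b['state']))
print(categories)
# ['CA', 'NY', 'TX']

df_a['state'] = pd.Categorical(df_a['state'], categories=categories)
df_b['state'] = pd.Categorical(df_b['state'], categories=categories)

# Both results now have the same columns in the same order
cols_a = pd.get_dummies(df_a).columns.tolist()
cols_b = pd.get_dummies(df_b).columns.tolist()
print(cols_a)
# ['state_CA', 'state_NY', 'state_TX']
print(cols_a == cols_b)
# True
```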
In the above example, the categories are collected from the values in the datasets, but you can also define your own categories, regardless of which values actually appear in the data. Values that do not correspond to any category are treated as NaN.
categories = ['CA', 'NY']
df_A['state'] = pd.Categorical(df_A['state'], categories)
df_B['state'] = pd.Categorical(df_B['state'], categories)
print(df_A)
# age state point
# name
# Alice 24 NY 64
# Bob 42 CA 92
# Charlie 18 CA 70
print(df_B)
# age state point
# name
# Dave 68 NaN 70
# Ellen 24 CA 88
# Frank 30 NY 57
print(pd.get_dummies(df_A))
# age point state_CA state_NY
# name
# Alice 24 64 False True
# Bob 42 92 True False
# Charlie 18 70 True False
print(pd.get_dummies(df_B))
# age point state_CA state_NY
# name
# Dave 68 70 False False
# Ellen 24 88 True False
# Frank 30 57 False True
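The treatment of values outside the categories can be confirmed in isolation with a short sketch like this one, using hypothetical values. The value TX is not among the categories, so it becomes NaN and its dummy variable row contains no True.

```python
import pandas as pd

# Hypothetical values; 'TX' is not among the specified categories
cat = pd.Categorical(['TX', 'CA', 'NY'], categories=['CA', 'NY'])

# 'TX' is treated as a missing value
print(cat.isna().tolist())
# [True, False, False]

# Its row has no True in any dummy variable column
dummies = pd.get_dummies(pd.Series(cat))
print(dummies.sum(axis=1).tolist())
# [0, 1, 1]
```

If such values need to stay distinguishable, they have to be included in the categories before the conversion.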