pandas: Add rows/columns to DataFrame with assign(), insert()
This article explains how to add new rows/columns to a pandas.DataFrame
.
Note that the append()
method was deprecated in version 1.4.0
and removed in 2.0.0
.
The sample code in this article uses pandas version 2.0.3
.
import pandas as pd
print(pd.__version__)
# 2.0.3
Add a column to a pandas.DataFrame
Add a column using bracket notation []
You can select a column using [column_name]
and assign values to it.
df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
'B': ['B1', 'B2', 'B3'],
'C': ['C1', 'C2', 'C3']},
index=['ONE', 'TWO', 'THREE'])
print(df)
# A B C
# ONE A1 B1 C1
# TWO A2 B2 C2
# THREE A3 B3 C3
df['A'] = 0
print(df)
# A B C
# ONE 0 B1 C1
# TWO 0 B2 C2
# THREE 0 B3 C3
If you specify a non-existent column name, a new column will be added with the assigned value.
Assign a scalar value
When a scalar value is assigned, all elements in the column are set to that value.
df['D'] = 0
print(df)
# A B C D
# ONE 0 B1 C1 0
# TWO 0 B2 C2 0
# THREE 0 B3 C3 0
Assign an array-like object
If an array-like object such as a list or a NumPy array ndarray
is assigned, each element is assigned directly. Note that a mismatch between the number of elements in the list and the number of rows will result in an error.
df['E'] = [0, 1, 2]
print(df)
# A B C D E
# ONE 0 B1 C1 0 0
# TWO 0 B2 C2 0 1
# THREE 0 B3 C3 0 2
# df['F'] = [0, 1, 2, 3]
# ValueError: Length of values does not match length of index
Assign a pandas.Series
You can also assign a Series
.
Since each column of a DataFrame
is treated as a Series
, you can add new columns based on the results of operations or the processed results of these methods.
df['F'] = df['B'] + df['C']
df['G'] = df['B'].str.lower()
print(df)
# A B C D E F G
# ONE 0 B1 C1 0 0 B1C1 b1
# TWO 0 B2 C2 0 1 B2C2 b2
# THREE 0 B3 C3 0 2 B3C3 b3
If the index label of the Series
does not correspond to the column name of the DataFrame
, a missing value NaN
is assigned.
s = pd.Series(['X2', 'X3', 'X4'], index=['TWO', 'THREE', 'FOUR'], name='X')
print(s)
# TWO X2
# THREE X3
# FOUR X4
# Name: X, dtype: object
df['H'] = s
print(df)
# A B C D E F G H
# ONE 0 B1 C1 0 0 B1C1 b1 NaN
# TWO 0 B2 C2 0 1 B2C2 b2 X2
# THREE 0 B3 C3 0 2 B3C3 b3 X3
The values
attribute of a Series
returns a NumPy array ndarray
, treated as an array-like object. Elements are assigned in order, regardless of the index
. Note that an error will occur if the number of elements does not match the number of rows.
print(s.values)
# ['X2' 'X3' 'X4']
df['I'] = s.values
print(df)
# A B C D E F G H I
# ONE 0 B1 C1 0 0 B1C1 b1 NaN X2
# TWO 0 B2 C2 0 1 B2C2 b2 X2 X3
# THREE 0 B3 C3 0 2 B3C3 b3 X3 X4
The pandas.DataFrame.assign()
method
The assign()
method either appends a new column or assigns new values to an existing column.
You can specify the column name and its value using the keyword argument structure, column_name=value
.
If the column name exists, the method assigns the value to it. If the column name is new, it adds a new column. This method returns a new object, while the original object remains unchanged.
df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
'B': ['B1', 'B2', 'B3'],
'C': ['C1', 'C2', 'C3']},
index=['ONE', 'TWO', 'THREE'])
print(df.assign(A=0))
# A B C
# ONE 0 B1 C1
# TWO 0 B2 C2
# THREE 0 B3 C3
print(df.assign(D=0))
# A B C D
# ONE A1 B1 C1 0
# TWO A2 B2 C2 0
# THREE A3 B3 C3 0
print(df)
# A B C
# ONE A1 B1 C1
# TWO A2 B2 C2
# THREE A3 B3 C3
Just like when adding a column with [column_name]
, you can specify lists or Series
with the assign()
method. You can also add/assign multiple columns simultaneously by specifying multiple keyword arguments.
s = pd.Series(['X2', 'X3', 'X4'], index=['TWO', 'THREE', 'FOUR'], name='X')
print(s)
# TWO X2
# THREE X3
# FOUR X4
# Name: X, dtype: object
df_new = df.assign(C='XXX',
D=0, E=[0, 1, 2],
F=s, G=s.values,
H=df['A'] + df['B'])
print(df_new)
# A B C D E F G H
# ONE A1 B1 XXX 0 0 NaN X2 A1B1
# TWO A2 B2 XXX 0 1 X2 X3 A2B2
# THREE A3 B3 XXX 0 2 X3 X4 A3B3
Note that in the assign()
method, you specify the column name as a keyword argument. Therefore, names that are not valid as argument names, such as those with symbols other than underscores _
, and reserved words, will result in an error. For information on acceptable argument names in Python, refer to the following article.
The pandas.DataFrame.insert()
method
The insert()
method allows you to add a column at any position in a DataFrame
.
Specify the position as the first argument, the column name as the second, and the value to be assigned as the third.
The third argument can accept a scalar value, an array-like object such as a list, or a Series
. The concept is similar to the previous examples.
The original DataFrame
is directly updated.
df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
'B': ['B1', 'B2', 'B3'],
'C': ['C1', 'C2', 'C3']},
index=['ONE', 'TWO', 'THREE'])
s = pd.Series(['X2', 'X3', 'X4'], index=['TWO', 'THREE', 'FOUR'], name='X')
df.insert(2, 'X', 0)
print(df)
# A B X C
# ONE A1 B1 0 C1
# TWO A2 B2 0 C2
# THREE A3 B3 0 C3
df.insert(0, 'Y', s)
print(df)
# Y A B X C
# ONE NaN A1 B1 0 C1
# TWO X2 A2 B2 0 C2
# THREE X3 A3 B3 0 C3
Note that specifying a value exceeding the number of rows as the first argument will cause an error. Using a negative value to specify the position from the end is not allowed. To specify the end as the position for the new column, use len(df.columns)
or df.shape[1]
to get the number of existing columns.
# df.insert(10, 'Z', 10)
# IndexError: index 10 is out of bounds for axis 0 with size 5
# df.insert(-1, 'Z', 10)
# ValueError: unbounded slice
Also, assigning an existing column name as the second argument will lead to an error. Although it's possible to allow duplicates by setting the allow_duplicates
argument to True
, it's not recommended due to potential confusion caused by duplicated column names.
# df.insert(0, 'Y', 10)
# ValueError: cannot insert Y, already exists
df.insert(0, 'Y', 10, allow_duplicates=True)
print(df)
# Y Y A B X C
# ONE 10 NaN A1 B1 0 C1
# TWO 10 X2 A2 B2 0 C2
# THREE 10 X3 A3 B3 0 C3
The pandas.concat()
function
You can concatenate multiple DataFrame
and Series
objects using the concat()
function.
By concatenating a Series
to a DataFrame
, you can add a new column.
In the previous examples, when adding a Series
, its name
attribute was ignored. However, when concatenating horizontally with the concat()
function with axis=1
, the name
of the Series
is used as the column name.
Specify a list or tuple of objects you want to concatenate as the first argument to concat()
.
df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
'B': ['B1', 'B2', 'B3'],
'C': ['C1', 'C2', 'C3']},
index=['ONE', 'TWO', 'THREE'])
s = pd.Series(['X2', 'X3', 'X4'], index=['TWO', 'THREE', 'FOUR'], name='X')
print(pd.concat([df, s], axis=1))
# A B C X
# ONE A1 B1 C1 NaN
# TWO A2 B2 C2 X2
# THREE A3 B3 C3 X3
# FOUR NaN NaN NaN X4
To keep only the rows sharing common indices, specify join='inner'
.
print(pd.concat([df, s], axis=1, join='inner'))
# A B C X
# TWO A2 B2 C2 X2
# THREE A3 B3 C3 X3
The function allows you to concatenate multiple Series
and DataFrame
objects.
s1 = pd.Series(['X1', 'X2', 'X3'], index=df.index, name='X')
s2 = pd.Series(['Y1', 'Y2', 'Y3'], index=df.index, name='Y')
df2 = pd.DataFrame({'df_col1': 0, 'df_col2': range(3)}, index=df.index)
print(pd.concat([df, s1, s2, df2], axis=1))
# A B C X Y df_col1 df_col2
# ONE A1 B1 C1 X1 Y1 0 0
# TWO A2 B2 C2 X2 Y2 0 1
# THREE A3 B3 C3 X3 Y3 0 2
Add a row to a pandas.DataFrame
Add a row using .loc[]
You can select a row using loc[row_name]
and assign values to it.
df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
'B': ['B1', 'B2', 'B3'],
'C': ['C1', 'C2', 'C3']},
index=['ONE', 'TWO', 'THREE'])
print(df)
# A B C
# ONE A1 B1 C1
# TWO A2 B2 C2
# THREE A3 B3 C3
df.loc['ONE'] = 0
print(df)
# A B C
# ONE 0 0 0
# TWO A2 B2 C2
# THREE A3 B3 C3
As with columns, by specifying a non-existent row name, you can add the row and assign values to it.
The approach is the same as for columns. You can assign a scalar value or an array-like object.
df.loc['FOUR'] = 0
df.loc['FIVE'] = ['A5', 'B5', 'C5']
print(df)
# A B C
# ONE 0 0 0
# TWO A2 B2 C2
# THREE A3 B3 C3
# FOUR 0 0 0
# FIVE A5 B5 C5
For array-like objects, ensure that the number of elements matches the number of columns; otherwise, it will cause an error.
# df.loc['SIX'] = ['A6', 'B6']
# ValueError: cannot set a row with mismatched columns
Like columns, Series
can also be assigned to rows. If the labels do not match, missing values NaN
are assigned. If you want to ignore the labels, you can use values
to convert to NumPy array ndarray
.
s = pd.Series(['B6', 'C6', 'D6'], index=['B', 'C', 'D'], name='SIX')
print(s)
# B B6
# C C6
# D D6
# Name: SIX, dtype: object
df.loc['XXX'] = df.loc['TWO'] + df.loc['THREE']
df.loc['YYY'] = s
df.loc['ZZZ'] = s.values
print(df)
# A B C
# ONE 0 0 0
# TWO A2 B2 C2
# THREE A3 B3 C3
# FOUR 0 0 0
# FIVE A5 B5 C5
# XXX A2A3 B2B3 C2C3
# YYY NaN B6 C6
# ZZZ B6 C6 D6
The pandas.DataFrame.append()
method (deprecated in version 1.4.0
)
The append()
method was formerly used to add new rows to DataFrame
. However, this method was deprecated in version 1.4.0
and removed in version 2.0.0
.
- pandas.DataFrame.append — pandas 1.4.4 documentation
- What’s new in 1.4.0 (January 22, 2022) — pandas 2.0.3 documentation
In the release notes, it is recommended to use the pandas.concat()
function instead.
The pandas.concat()
function
You can concatenate multiple DataFrame
and Series
objects using the concat()
function.
Specify a list or tuple of objects you want to concatenate as the first argument to concat()
. By default, they are concatenated vertically.
df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
'B': ['B1', 'B2', 'B3'],
'C': ['C1', 'C2', 'C3']},
index=['ONE', 'TWO', 'THREE'])
print(df1)
# A B C
# ONE A1 B1 C1
# TWO A2 B2 C2
# THREE A3 B3 C3
df2 = pd.DataFrame({'B': ['B4', 'B5'], 'C': ['C4', 'C5'], 'D': ['D4', 'D5']},
index=['FOUR', 'FIVE'])
print(df2)
# B C D
# FOUR B4 C4 D4
# FIVE B5 C5 D5
print(pd.concat([df1, df2]))
# A B C D
# ONE A1 B1 C1 NaN
# TWO A2 B2 C2 NaN
# THREE A3 B3 C3 NaN
# FOUR NaN B4 C4 D4
# FIVE NaN B5 C5 D5
To retain only the columns that share common names, specify join='inner'
.
print(pd.concat([df1, df2], join='inner'))
# B C
# ONE B1 C1
# TWO B2 C2
# THREE B3 C3
# FOUR B4 C4
# FIVE B5 C5
You need to exercise caution when concatenating DataFrame
and Series
vertically.
By default, it looks like this.
s = pd.Series(['A4', 'B4', 'C4'], index=['A', 'B', 'C'], name='FOUR')
print(s)
# A A4
# B B4
# C C4
# Name: FOUR, dtype: object
print(pd.concat([df1, s]))
# A B C 0
# ONE A1 B1 C1 NaN
# TWO A2 B2 C2 NaN
# THREE A3 B3 C3 NaN
# A NaN NaN NaN A4
# B NaN NaN NaN B4
# C NaN NaN NaN C4
By converting the Series
to DataFrame
with the to_frame()
method and transposing it with T
, you get a one-row DataFrame
. You can concatenate this.
print(s.to_frame().T)
# A B C
# FOUR A4 B4 C4
print(pd.concat([df1, s.to_frame().T]))
# A B C
# ONE A1 B1 C1
# TWO A2 B2 C2
# THREE A3 B3 C3
# FOUR A4 B4 C4
Note: Add a large number of rows or columns
It's not recommended to add a large number of rows or columns to a DataFrame
individually due to inefficiency.
For example, when you add one column at a time in a for
loop, a PerformanceWarning
is issued. It seems to be issued when you add more than 100 columns.
df = pd.DataFrame()
for i in range(101):
df[i] = 0
# PerformanceWarning: DataFrame is highly fragmented.
# This is usually the result of calling `frame.insert` many times, which has poor performance.
# Consider joining all columns at once using pd.concat(axis=1) instead.
# To get a de-fragmented frame, use `newframe = frame.copy()`
Unless you need to use the features of DataFrame
every time you add a row or column, it is better to concatenate all at once using concat()
, as the warning message suggests.
A comparison of processing speed between adding one by one and adding all at once will be introduced at the end.
Add multiple rows at once
Take the following DataFrame
as an example.
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30], 'col3': [100, 200, 300]},
index=['row1', 'row2', 'row3'])
print(df)
# col1 col2 col3
# row1 1 10 100
# row2 2 20 200
# row3 3 30 300
Add the data and row names for each row to separate lists. Although the content is simply created here, in the actual code it is created by some data processing.
l_data = []
l_label = []
for i in range(4, 7):
l_data.append([i, i * 10, i * 100])
l_label.append(f'row{i}')
print(l_data)
# [[4, 40, 400], [5, 50, 500], [6, 60, 600]]
print(l_label)
# ['row4', 'row5', 'row6']
Create a DataFrame
from these lists and the column names columns
of the original DataFrame
, and concatenate it with the original DataFrame
.
df_append = pd.DataFrame(l_data, index=l_label, columns=df.columns)
print(df_append)
# col1 col2 col3
# row4 4 40 400
# row5 5 50 500
# row6 6 60 600
df_result = pd.concat([df, df_append])
print(df_result)
# col1 col2 col3
# row1 1 10 100
# row2 2 20 200
# row3 3 30 300
# row4 4 40 400
# row5 5 50 500
# row6 6 60 600
Add multiple columns at once
The concept is the same when adding columns as it was for adding rows as described above.
Take the following DataFrame
as an example.
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30], 'col3': [100, 200, 300]},
index=['row1', 'row2', 'row3'])
print(df)
# col1 col2 col3
# row1 1 10 100
# row2 2 20 200
# row3 3 30 300
Add the data and column names for each column to separate lists.
l_data = []
l_label = []
for i in range(3, 6):
l_data.append([10**i, 2 * 10**i, 3 * 10**i])
l_label.append(f'col{i + 1}')
print(l_data)
# [[1000, 2000, 3000], [10000, 20000, 30000], [100000, 200000, 300000]]
print(l_label)
# ['col4', 'col5', 'col6']
Create a DataFrame
from these lists and the row names index
of the original DataFrame
, and concatenate it with the original DataFrame
. Note that you need to transpose the two-dimensional list holding the data.
df_append = pd.DataFrame(zip(*l_data), index=df.index, columns=l_label)
print(df_append)
# col4 col5 col6
# row1 1000 10000 100000
# row2 2000 20000 200000
# row3 3000 30000 300000
df_result = pd.concat([df, df_append], axis=1)
print(df_result)
# col1 col2 col3 col4 col5 col6
# row1 1 10 100 1000 10000 100000
# row2 2 20 200 2000 20000 200000
# row3 3 30 300 3000 30000 300000
Processing Speed Comparison
Compare the processing speed between adding rows or columns one by one and adding them all at once.
The following examples use the Jupyter Notebook magic command %%timeit
. Note that these will not work if run as Python scripts.
In the case of adding 1000 rows:
%%timeit
df_loc = pd.DataFrame([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
for i in range(3, 1003):
df_loc.loc[i] = [i] * 3
# 150 ms ± 4.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df = pd.DataFrame([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
l_data = []
l_label = []
for i in range(3, 1003):
l_data.append([i] * 3)
l_label.append(i)
df_concat = pd.concat([df, pd.DataFrame(l_data, index=l_label, columns=df.columns)])
# 487 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In the case of adding 1000 columns:
%%timeit
df_index = pd.DataFrame([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
for i in range(3, 1003):
df_index[i] = [0, 1, 2]
# 31.2 ms ± 578 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df = pd.DataFrame([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
l_data = []
l_label = []
for i in range(3, 1003):
l_data.append([0, 1, 2])
l_label.append(i)
df_concat = pd.concat([df, pd.DataFrame(zip(*l_data), index=df.index, columns=l_label)], axis=1)
# 3.56 ms ± 54.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In both cases, adding all rows or columns at once proves to be significantly faster.