pandas: Add rows/columns to DataFrame with assign(), insert()

Posted: | Tags: Python, pandas

This article explains how to add new rows/columns to a pandas.DataFrame.

Note that the append() method was deprecated in version 1.4.0 and removed in 2.0.0.

The sample code in this article uses pandas version 2.0.3.

import pandas as pd

print(pd.__version__)
# 2.0.3

Add a column to a pandas.DataFrame

Add a column using bracket notation []

You can select a column using [column_name] and assign values to it.

df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                   'B': ['B1', 'B2', 'B3'],
                   'C': ['C1', 'C2', 'C3']},
                  index=['ONE', 'TWO', 'THREE'])
print(df)
#         A   B   C
# ONE    A1  B1  C1
# TWO    A2  B2  C2
# THREE  A3  B3  C3

df['A'] = 0
print(df)
#        A   B   C
# ONE    0  B1  C1
# TWO    0  B2  C2
# THREE  0  B3  C3

If you specify a non-existent column name, a new column will be added with the assigned value.

Assign a scalar value

When a scalar value is assigned, all elements in the column are set to that value.

df['D'] = 0
print(df)
#        A   B   C  D
# ONE    0  B1  C1  0
# TWO    0  B2  C2  0
# THREE  0  B3  C3  0

Assign an array-like object

If an array-like object such as a list or a NumPy array ndarray is assigned, each element is assigned directly. Note that a mismatch between the number of elements in the list and the number of rows will result in an error.

df['E'] = [0, 1, 2]
print(df)
#        A   B   C  D  E
# ONE    0  B1  C1  0  0
# TWO    0  B2  C2  0  1
# THREE  0  B3  C3  0  2

# df['F'] = [0, 1, 2, 3]
# ValueError: Length of values does not match length of index

Assign a pandas.Series

You can also assign a Series.

Since each column of a DataFrame is treated as a Series, you can add new columns based on the results of operations or the processed results of these methods.

df['F'] = df['B'] + df['C']
df['G'] = df['B'].str.lower()
print(df)
#        A   B   C  D  E     F   G
# ONE    0  B1  C1  0  0  B1C1  b1
# TWO    0  B2  C2  0  1  B2C2  b2
# THREE  0  B3  C3  0  2  B3C3  b3

If the index label of the Series does not correspond to the column name of the DataFrame, a missing value NaN is assigned.

s = pd.Series(['X2', 'X3', 'X4'], index=['TWO', 'THREE', 'FOUR'], name='X')
print(s)
# TWO      X2
# THREE    X3
# FOUR     X4
# Name: X, dtype: object

df['H'] = s
print(df)
#        A   B   C  D  E     F   G    H
# ONE    0  B1  C1  0  0  B1C1  b1  NaN
# TWO    0  B2  C2  0  1  B2C2  b2   X2
# THREE  0  B3  C3  0  2  B3C3  b3   X3

The values attribute of a Series returns a NumPy array ndarray, treated as an array-like object. Elements are assigned in order, regardless of the index. Note that an error will occur if the number of elements does not match the number of rows.

print(s.values)
# ['X2' 'X3' 'X4']

df['I'] = s.values
print(df)
#        A   B   C  D  E     F   G    H   I
# ONE    0  B1  C1  0  0  B1C1  b1  NaN  X2
# TWO    0  B2  C2  0  1  B2C2  b2   X2  X3
# THREE  0  B3  C3  0  2  B3C3  b3   X3  X4

The pandas.DataFrame.assign() method

The assign() method either appends a new column or assigns new values to an existing column.

You can specify the column name and its value using the keyword argument structure, column_name=value.

If the column name exists, the method assigns the value to it. If the column name is new, it adds a new column. This method returns a new object, while the original object remains unchanged.

df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                   'B': ['B1', 'B2', 'B3'],
                   'C': ['C1', 'C2', 'C3']},
                  index=['ONE', 'TWO', 'THREE'])

print(df.assign(A=0))
#        A   B   C
# ONE    0  B1  C1
# TWO    0  B2  C2
# THREE  0  B3  C3

print(df.assign(D=0))
#         A   B   C  D
# ONE    A1  B1  C1  0
# TWO    A2  B2  C2  0
# THREE  A3  B3  C3  0

print(df)
#         A   B   C
# ONE    A1  B1  C1
# TWO    A2  B2  C2
# THREE  A3  B3  C3

Just like when adding a column with [column_name], you can specify lists or Series with the assign() method. You can also add/assign multiple columns simultaneously by specifying multiple keyword arguments.

s = pd.Series(['X2', 'X3', 'X4'], index=['TWO', 'THREE', 'FOUR'], name='X')
print(s)
# TWO      X2
# THREE    X3
# FOUR     X4
# Name: X, dtype: object

df_new = df.assign(C='XXX',
                   D=0, E=[0, 1, 2],
                   F=s, G=s.values,
                   H=df['A'] + df['B'])
print(df_new)
#         A   B    C  D  E    F   G     H
# ONE    A1  B1  XXX  0  0  NaN  X2  A1B1
# TWO    A2  B2  XXX  0  1   X2  X3  A2B2
# THREE  A3  B3  XXX  0  2   X3  X4  A3B3

Note that in the assign() method, you specify the column name as a keyword argument. Therefore, names that are not valid as argument names, such as those with symbols other than underscores _, and reserved words, will result in an error. For information on acceptable argument names in Python, refer to the following article.

The pandas.DataFrame.insert() method

The insert() method allows you to add a column at any position in a DataFrame.

Specify the position as the first argument, the column name as the second, and the value to be assigned as the third.

The third argument can accept a scalar value, an array-like object such as a list, or a Series. The concept is similar to the previous examples.

The original DataFrame is directly updated.

df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                   'B': ['B1', 'B2', 'B3'],
                   'C': ['C1', 'C2', 'C3']},
                  index=['ONE', 'TWO', 'THREE'])
s = pd.Series(['X2', 'X3', 'X4'], index=['TWO', 'THREE', 'FOUR'], name='X')

df.insert(2, 'X', 0)
print(df)
#         A   B  X   C
# ONE    A1  B1  0  C1
# TWO    A2  B2  0  C2
# THREE  A3  B3  0  C3

df.insert(0, 'Y', s)
print(df)
#          Y   A   B  X   C
# ONE    NaN  A1  B1  0  C1
# TWO     X2  A2  B2  0  C2
# THREE   X3  A3  B3  0  C3

Note that specifying a value exceeding the number of rows as the first argument will cause an error. Using a negative value to specify the position from the end is not allowed. To specify the end as the position for the new column, use len(df.columns) or df.shape[1] to get the number of existing columns.

# df.insert(10, 'Z', 10)
# IndexError: index 10 is out of bounds for axis 0 with size 5

# df.insert(-1, 'Z', 10)
# ValueError: unbounded slice

Also, assigning an existing column name as the second argument will lead to an error. Although it's possible to allow duplicates by setting the allow_duplicates argument to True, it's not recommended due to potential confusion caused by duplicated column names.

# df.insert(0, 'Y', 10)
# ValueError: cannot insert Y, already exists

df.insert(0, 'Y', 10, allow_duplicates=True)
print(df)
#         Y    Y   A   B  X   C
# ONE    10  NaN  A1  B1  0  C1
# TWO    10   X2  A2  B2  0  C2
# THREE  10   X3  A3  B3  0  C3

The pandas.concat() function

You can concatenate multiple DataFrame and Series objects using the concat() function.

By concatenating a Series to a DataFrame, you can add a new column.

In the previous examples, when adding a Series, its name attribute was ignored. However, when concatenating horizontally with the concat() function with axis=1, the name of the Series is used as the column name.

Specify a list or tuple of objects you want to concatenate as the first argument to concat().

df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                   'B': ['B1', 'B2', 'B3'],
                   'C': ['C1', 'C2', 'C3']},
                  index=['ONE', 'TWO', 'THREE'])
s = pd.Series(['X2', 'X3', 'X4'], index=['TWO', 'THREE', 'FOUR'], name='X')

print(pd.concat([df, s], axis=1))
#          A    B    C    X
# ONE     A1   B1   C1  NaN
# TWO     A2   B2   C2   X2
# THREE   A3   B3   C3   X3
# FOUR   NaN  NaN  NaN   X4

To keep only the rows sharing common indices, specify join='inner'.

print(pd.concat([df, s], axis=1, join='inner'))
#         A   B   C   X
# TWO    A2  B2  C2  X2
# THREE  A3  B3  C3  X3

The function allows you to concatenate multiple Series and DataFrame objects.

s1 = pd.Series(['X1', 'X2', 'X3'], index=df.index, name='X')
s2 = pd.Series(['Y1', 'Y2', 'Y3'], index=df.index, name='Y')

df2 = pd.DataFrame({'df_col1': 0, 'df_col2': range(3)}, index=df.index)

print(pd.concat([df, s1, s2, df2], axis=1))
#         A   B   C   X   Y  df_col1  df_col2
# ONE    A1  B1  C1  X1  Y1        0        0
# TWO    A2  B2  C2  X2  Y2        0        1
# THREE  A3  B3  C3  X3  Y3        0        2

Add a row to a pandas.DataFrame

Add a row using .loc[]

You can select a row using loc[row_name] and assign values to it.

df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                   'B': ['B1', 'B2', 'B3'],
                   'C': ['C1', 'C2', 'C3']},
                  index=['ONE', 'TWO', 'THREE'])
print(df)
#         A   B   C
# ONE    A1  B1  C1
# TWO    A2  B2  C2
# THREE  A3  B3  C3

df.loc['ONE'] = 0
print(df)
#         A   B   C
# ONE     0   0   0
# TWO    A2  B2  C2
# THREE  A3  B3  C3

As with columns, by specifying a non-existent row name, you can add the row and assign values to it.

The approach is the same as for columns. You can assign a scalar value or an array-like object.

df.loc['FOUR'] = 0
df.loc['FIVE'] = ['A5', 'B5', 'C5']
print(df)
#         A   B   C
# ONE     0   0   0
# TWO    A2  B2  C2
# THREE  A3  B3  C3
# FOUR    0   0   0
# FIVE   A5  B5  C5

For array-like objects, ensure that the number of elements matches the number of columns; otherwise, it will cause an error.

# df.loc['SIX'] = ['A6', 'B6']
# ValueError: cannot set a row with mismatched columns

Like columns, Series can also be assigned to rows. If the labels do not match, missing values NaN are assigned. If you want to ignore the labels, you can use values to convert to NumPy array ndarray.

s = pd.Series(['B6', 'C6', 'D6'], index=['B', 'C', 'D'], name='SIX')
print(s)
# B    B6
# C    C6
# D    D6
# Name: SIX, dtype: object

df.loc['XXX'] = df.loc['TWO'] + df.loc['THREE']
df.loc['YYY'] = s
df.loc['ZZZ'] = s.values
print(df)
#           A     B     C
# ONE       0     0     0
# TWO      A2    B2    C2
# THREE    A3    B3    C3
# FOUR      0     0     0
# FIVE     A5    B5    C5
# XXX    A2A3  B2B3  C2C3
# YYY     NaN    B6    C6
# ZZZ      B6    C6    D6

The pandas.DataFrame.append() method (deprecated in version 1.4.0)

The append() method was formerly used to add new rows to DataFrame. However, this method was deprecated in version 1.4.0 and removed in version 2.0.0.

In the release notes, it is recommended to use the pandas.concat() function instead.

The pandas.concat() function

You can concatenate multiple DataFrame and Series objects using the concat() function.

Specify a list or tuple of objects you want to concatenate as the first argument to concat(). By default, they are concatenated vertically.

df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3'],
                    'C': ['C1', 'C2', 'C3']},
                   index=['ONE', 'TWO', 'THREE'])
print(df1)
#         A   B   C
# ONE    A1  B1  C1
# TWO    A2  B2  C2
# THREE  A3  B3  C3

df2 = pd.DataFrame({'B': ['B4', 'B5'], 'C': ['C4', 'C5'], 'D': ['D4', 'D5']},
                   index=['FOUR', 'FIVE'])
print(df2)
#        B   C   D
# FOUR  B4  C4  D4
# FIVE  B5  C5  D5

print(pd.concat([df1, df2]))
#          A   B   C    D
# ONE     A1  B1  C1  NaN
# TWO     A2  B2  C2  NaN
# THREE   A3  B3  C3  NaN
# FOUR   NaN  B4  C4   D4
# FIVE   NaN  B5  C5   D5

To retain only the columns that share common names, specify join='inner'.

print(pd.concat([df1, df2], join='inner'))
#         B   C
# ONE    B1  C1
# TWO    B2  C2
# THREE  B3  C3
# FOUR   B4  C4
# FIVE   B5  C5

You need to exercise caution when concatenating DataFrame and Series vertically.

By default, it looks like this.

s = pd.Series(['A4', 'B4', 'C4'], index=['A', 'B', 'C'], name='FOUR')
print(s)
# A    A4
# B    B4
# C    C4
# Name: FOUR, dtype: object

print(pd.concat([df1, s]))
#          A    B    C    0
# ONE     A1   B1   C1  NaN
# TWO     A2   B2   C2  NaN
# THREE   A3   B3   C3  NaN
# A      NaN  NaN  NaN   A4
# B      NaN  NaN  NaN   B4
# C      NaN  NaN  NaN   C4

By converting the Series to DataFrame with the to_frame() method and transposing it with T, you get a one-row DataFrame. You can concatenate this.

print(s.to_frame().T)
#        A   B   C
# FOUR  A4  B4  C4

print(pd.concat([df1, s.to_frame().T]))
#         A   B   C
# ONE    A1  B1  C1
# TWO    A2  B2  C2
# THREE  A3  B3  C3
# FOUR   A4  B4  C4

Note: Add a large number of rows or columns

It's not recommended to add a large number of rows or columns to a DataFrame individually due to inefficiency.

For example, when you add one column at a time in a for loop, a PerformanceWarning is issued. It seems to be issued when you add more than 100 columns.

df = pd.DataFrame()
for i in range(101):
    df[i] = 0
# PerformanceWarning: DataFrame is highly fragmented.
# This is usually the result of calling `frame.insert` many times, which has poor performance.
# Consider joining all columns at once using pd.concat(axis=1) instead.
# To get a de-fragmented frame, use `newframe = frame.copy()`

Unless you need to use the features of DataFrame every time you add a row or column, it is better to concatenate all at once using concat(), as the warning message suggests.

A comparison of processing speed between adding one by one and adding all at once will be introduced at the end.

Add multiple rows at once

Take the following DataFrame as an example.

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30], 'col3': [100, 200, 300]},
                  index=['row1', 'row2', 'row3'])
print(df)
#       col1  col2  col3
# row1     1    10   100
# row2     2    20   200
# row3     3    30   300

Add the data and row names for each row to separate lists. Although the content is simply created here, in the actual code it is created by some data processing.

l_data = []
l_label = []

for i in range(4, 7):
    l_data.append([i, i * 10, i * 100])
    l_label.append(f'row{i}')

print(l_data)
# [[4, 40, 400], [5, 50, 500], [6, 60, 600]]

print(l_label)
# ['row4', 'row5', 'row6']

Create a DataFrame from these lists and the column names columns of the original DataFrame, and concatenate it with the original DataFrame.

df_append = pd.DataFrame(l_data, index=l_label, columns=df.columns)
print(df_append)
#       col1  col2  col3
# row4     4    40   400
# row5     5    50   500
# row6     6    60   600

df_result = pd.concat([df, df_append])
print(df_result)
#       col1  col2  col3
# row1     1    10   100
# row2     2    20   200
# row3     3    30   300
# row4     4    40   400
# row5     5    50   500
# row6     6    60   600

Add multiple columns at once

The concept is the same when adding columns as it was for adding rows as described above.

Take the following DataFrame as an example.

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30], 'col3': [100, 200, 300]},
                  index=['row1', 'row2', 'row3'])
print(df)
#       col1  col2  col3
# row1     1    10   100
# row2     2    20   200
# row3     3    30   300

Add the data and column names for each column to separate lists.

l_data = []
l_label = []

for i in range(3, 6):
    l_data.append([10**i, 2 * 10**i, 3 * 10**i])
    l_label.append(f'col{i + 1}')

print(l_data)
# [[1000, 2000, 3000], [10000, 20000, 30000], [100000, 200000, 300000]]

print(l_label)
# ['col4', 'col5', 'col6']

Create a DataFrame from these lists and the row names index of the original DataFrame, and concatenate it with the original DataFrame. Note that you need to transpose the two-dimensional list holding the data.

df_append = pd.DataFrame(zip(*l_data), index=df.index, columns=l_label)
print(df_append)
#       col4   col5    col6
# row1  1000  10000  100000
# row2  2000  20000  200000
# row3  3000  30000  300000

df_result = pd.concat([df, df_append], axis=1)
print(df_result)
#       col1  col2  col3  col4   col5    col6
# row1     1    10   100  1000  10000  100000
# row2     2    20   200  2000  20000  200000
# row3     3    30   300  3000  30000  300000

Processing Speed Comparison

Compare the processing speed between adding rows or columns one by one and adding them all at once.

The following examples use the Jupyter Notebook magic command %%timeit. Note that these will not work if run as Python scripts.

In the case of adding 1000 rows:

%%timeit
df_loc = pd.DataFrame([[0, 0, 0], [1, 1, 1], [2, 2, 2]])

for i in range(3, 1003):
    df_loc.loc[i] = [i] * 3
# 150 ms ± 4.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df = pd.DataFrame([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
l_data = []
l_label = []

for i in range(3, 1003):
    l_data.append([i] * 3)
    l_label.append(i)

df_concat = pd.concat([df, pd.DataFrame(l_data, index=l_label, columns=df.columns)])
# 487 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In the case of adding 1000 columns:

%%timeit
df_index = pd.DataFrame([[0, 0, 0], [1, 1, 1], [2, 2, 2]])

for i in range(3, 1003):
    df_index[i] = [0, 1, 2]
# 31.2 ms ± 578 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
df = pd.DataFrame([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
l_data = []
l_label = []

for i in range(3, 1003):
    l_data.append([0, 1, 2])
    l_label.append(i)

df_concat = pd.concat([df, pd.DataFrame(zip(*l_data), index=df.index, columns=l_label)], axis=1)
# 3.56 ms ± 54.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In both cases, adding all rows or columns at once proves to be significantly faster.

Related Categories

Related Articles