pandas: Add rows/columns to DataFrame with assign(), insert()

Posted: 2023-08-01 | Tags: Python, pandas

This article explains how to add new rows/columns to a pandas.DataFrame.

Contents

Add a column to a pandas.DataFrame
Add a row to a pandas.DataFrame
Note: Add a large number of rows or columns

Note that the append() method was deprecated in version 1.4.0 and removed in 2.0.0.

What’s new in 1.4.0 (January 22, 2022) — pandas 2.0.3 documentation

The sample code in this article uses pandas version 2.0.3.

import pandas as pd

print(pd.__version__)
# 2.0.3

source: pandas_add_column.py

Add a column to a `pandas.DataFrame`

Add a column using bracket notation `[]`

You can select a column using [column_name] and assign values to it.

pandas: Select rows/columns by index (numbers and names)

df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                   'B': ['B1', 'B2', 'B3'],
                   'C': ['C1', 'C2', 'C3']},
                  index=['ONE', 'TWO', 'THREE'])
print(df)
#         A   B   C
# ONE    A1  B1  C1
# TWO    A2  B2  C2
# THREE  A3  B3  C3

df['A'] = 0
print(df)
#        A   B   C
# ONE    0  B1  C1
# TWO    0  B2  C2
# THREE  0  B3  C3

source: pandas_add_column.py

If you specify a non-existent column name, a new column will be added with the assigned value.

Assign a scalar value

When a scalar value is assigned, all elements in the column are set to that value.

df['D'] = 0
print(df)
#        A   B   C  D
# ONE    0  B1  C1  0
# TWO    0  B2  C2  0
# THREE  0  B3  C3  0

source: pandas_add_column.py

Assign an array-like object

If an array-like object such as a list or a NumPy array ndarray is assigned, each element is assigned directly. Note that a mismatch between the number of elements in the list and the number of rows will result in an error.

df['E'] = [0, 1, 2]
print(df)
#        A   B   C  D  E
# ONE    0  B1  C1  0  0
# TWO    0  B2  C2  0  1
# THREE  0  B3  C3  0  2

# df['F'] = [0, 1, 2, 3]
# ValueError: Length of values does not match length of index

source: pandas_add_column.py

Assign a `pandas.Series`

You can also assign a Series.

Since each column of a DataFrame is treated as a Series, you can add new columns based on the results of operations or the processed results of these methods.

pandas: Handle strings (replace, strip, case conversion, etc.)

df['F'] = df['B'] + df['C']
df['G'] = df['B'].str.lower()
print(df)
#        A   B   C  D  E     F   G
# ONE    0  B1  C1  0  0  B1C1  b1
# TWO    0  B2  C2  0  1  B2C2  b2
# THREE  0  B3  C3  0  2  B3C3  b3

source: pandas_add_column.py

If the index label of the Series does not correspond to the column name of the DataFrame, a missing value NaN is assigned.

Missing values in pandas (nan, None, pd.NA)

s = pd.Series(['X2', 'X3', 'X4'], index=['TWO', 'THREE', 'FOUR'], name='X')
print(s)
# TWO      X2
# THREE    X3
# FOUR     X4
# Name: X, dtype: object

df['H'] = s
print(df)
#        A   B   C  D  E     F   G    H
# ONE    0  B1  C1  0  0  B1C1  b1  NaN
# TWO    0  B2  C2  0  1  B2C2  b2   X2
# THREE  0  B3  C3  0  2  B3C3  b3   X3

source: pandas_add_column.py

The values attribute of a Series returns a NumPy array ndarray, treated as an array-like object. Elements are assigned in order, regardless of the index. Note that an error will occur if the number of elements does not match the number of rows.

Convert pandas.DataFrame, Series and numpy.ndarray to each other

print(s.values)
# ['X2' 'X3' 'X4']

df['I'] = s.values
print(df)
#        A   B   C  D  E     F   G    H   I
# ONE    0  B1  C1  0  0  B1C1  b1  NaN  X2
# TWO    0  B2  C2  0  1  B2C2  b2   X2  X3
# THREE  0  B3  C3  0  2  B3C3  b3   X3  X4

source: pandas_add_column.py

The `pandas.DataFrame.assign()` method

The assign() method either appends a new column or assigns new values to an existing column.

pandas.DataFrame.assign — pandas 2.0.3 documentation

You can specify the column name and its value using the keyword argument structure, column_name=value.

If the column name exists, the method assigns the value to it. If the column name is new, it adds a new column. This method returns a new object, while the original object remains unchanged.

df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                   'B': ['B1', 'B2', 'B3'],
                   'C': ['C1', 'C2', 'C3']},
                  index=['ONE', 'TWO', 'THREE'])

print(df.assign(A=0))
#        A   B   C
# ONE    0  B1  C1
# TWO    0  B2  C2
# THREE  0  B3  C3

print(df.assign(D=0))
#         A   B   C  D
# ONE    A1  B1  C1  0
# TWO    A2  B2  C2  0
# THREE  A3  B3  C3  0

print(df)
#         A   B   C
# ONE    A1  B1  C1
# TWO    A2  B2  C2
# THREE  A3  B3  C3

source: pandas_add_column.py

Just like when adding a column with [column_name], you can specify lists or Series with the assign() method. You can also add/assign multiple columns simultaneously by specifying multiple keyword arguments.

s = pd.Series(['X2', 'X3', 'X4'], index=['TWO', 'THREE', 'FOUR'], name='X')
print(s)
# TWO      X2
# THREE    X3
# FOUR     X4
# Name: X, dtype: object

df_new = df.assign(C='XXX',
                   D=0, E=[0, 1, 2],
                   F=s, G=s.values,
                   H=df['A'] + df['B'])
print(df_new)
#         A   B    C  D  E    F   G     H
# ONE    A1  B1  XXX  0  0  NaN  X2  A1B1
# TWO    A2  B2  XXX  0  1   X2  X3  A2B2
# THREE  A3  B3  XXX  0  2   X3  X4  A3B3

source: pandas_add_column.py

Note that in the assign() method, you specify the column name as a keyword argument. Therefore, names that are not valid as argument names, such as those with symbols other than underscores _, and reserved words, will result in an error. For information on acceptable argument names in Python, refer to the following article.

Valid variable names and naming rules in Python

The `pandas.DataFrame.insert()` method

The insert() method allows you to add a column at any position in a DataFrame.

pandas.DataFrame.insert — pandas 2.0.3 documentation

Specify the position as the first argument, the column name as the second, and the value to be assigned as the third.

The third argument can accept a scalar value, an array-like object such as a list, or a Series. The concept is similar to the previous examples.

The original DataFrame is directly updated.

df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                   'B': ['B1', 'B2', 'B3'],
                   'C': ['C1', 'C2', 'C3']},
                  index=['ONE', 'TWO', 'THREE'])
s = pd.Series(['X2', 'X3', 'X4'], index=['TWO', 'THREE', 'FOUR'], name='X')

df.insert(2, 'X', 0)
print(df)
#         A   B  X   C
# ONE    A1  B1  0  C1
# TWO    A2  B2  0  C2
# THREE  A3  B3  0  C3

df.insert(0, 'Y', s)
print(df)
#          Y   A   B  X   C
# ONE    NaN  A1  B1  0  C1
# TWO     X2  A2  B2  0  C2
# THREE   X3  A3  B3  0  C3

source: pandas_add_column.py

Note that specifying a value exceeding the number of rows as the first argument will cause an error. Using a negative value to specify the position from the end is not allowed. To specify the end as the position for the new column, use len(df.columns) or df.shape[1] to get the number of existing columns.

pandas: Get the number of rows, columns, elements (size) of DataFrame

# df.insert(10, 'Z', 10)
# IndexError: index 10 is out of bounds for axis 0 with size 5

# df.insert(-1, 'Z', 10)
# ValueError: unbounded slice

source: pandas_add_column.py

Also, assigning an existing column name as the second argument will lead to an error. Although it's possible to allow duplicates by setting the allow_duplicates argument to True, it's not recommended due to potential confusion caused by duplicated column names.

# df.insert(0, 'Y', 10)
# ValueError: cannot insert Y, already exists

df.insert(0, 'Y', 10, allow_duplicates=True)
print(df)
#         Y    Y   A   B  X   C
# ONE    10  NaN  A1  B1  0  C1
# TWO    10   X2  A2  B2  0  C2
# THREE  10   X3  A3  B3  0  C3

source: pandas_add_column.py

The `pandas.concat()` function

You can concatenate multiple DataFrame and Series objects using the concat() function.

pandas: Concat multiple DataFrame/Series with concat()

By concatenating a Series to a DataFrame, you can add a new column.

In the previous examples, when adding a Series, its name attribute was ignored. However, when concatenating horizontally with the concat() function with axis=1, the name of the Series is used as the column name.

Specify a list or tuple of objects you want to concatenate as the first argument to concat().

df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                   'B': ['B1', 'B2', 'B3'],
                   'C': ['C1', 'C2', 'C3']},
                  index=['ONE', 'TWO', 'THREE'])
s = pd.Series(['X2', 'X3', 'X4'], index=['TWO', 'THREE', 'FOUR'], name='X')

print(pd.concat([df, s], axis=1))
#          A    B    C    X
# ONE     A1   B1   C1  NaN
# TWO     A2   B2   C2   X2
# THREE   A3   B3   C3   X3
# FOUR   NaN  NaN  NaN   X4

source: pandas_add_column.py

To keep only the rows sharing common indices, specify join='inner'.

print(pd.concat([df, s], axis=1, join='inner'))
#         A   B   C   X
# TWO    A2  B2  C2  X2
# THREE  A3  B3  C3  X3

source: pandas_add_column.py

The function allows you to concatenate multiple Series and DataFrame objects.

s1 = pd.Series(['X1', 'X2', 'X3'], index=df.index, name='X')
s2 = pd.Series(['Y1', 'Y2', 'Y3'], index=df.index, name='Y')

df2 = pd.DataFrame({'df_col1': 0, 'df_col2': range(3)}, index=df.index)

print(pd.concat([df, s1, s2, df2], axis=1))
#         A   B   C   X   Y  df_col1  df_col2
# ONE    A1  B1  C1  X1  Y1        0        0
# TWO    A2  B2  C2  X2  Y2        0        1
# THREE  A3  B3  C3  X3  Y3        0        2

source: pandas_add_column.py

Add a row to a `pandas.DataFrame`

Add a row using `.loc[]`

You can select a row using loc[row_name] and assign values to it.

pandas: Get/Set values with loc, iloc, at, iat

df = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                   'B': ['B1', 'B2', 'B3'],
                   'C': ['C1', 'C2', 'C3']},
                  index=['ONE', 'TWO', 'THREE'])
print(df)
#         A   B   C
# ONE    A1  B1  C1
# TWO    A2  B2  C2
# THREE  A3  B3  C3

df.loc['ONE'] = 0
print(df)
#         A   B   C
# ONE     0   0   0
# TWO    A2  B2  C2
# THREE  A3  B3  C3

source: pandas_add_row.py

As with columns, by specifying a non-existent row name, you can add the row and assign values to it.

The approach is the same as for columns. You can assign a scalar value or an array-like object.

df.loc['FOUR'] = 0
df.loc['FIVE'] = ['A5', 'B5', 'C5']
print(df)
#         A   B   C
# ONE     0   0   0
# TWO    A2  B2  C2
# THREE  A3  B3  C3
# FOUR    0   0   0
# FIVE   A5  B5  C5

source: pandas_add_row.py

For array-like objects, ensure that the number of elements matches the number of columns; otherwise, it will cause an error.

# df.loc['SIX'] = ['A6', 'B6']
# ValueError: cannot set a row with mismatched columns

source: pandas_add_row.py

Like columns, Series can also be assigned to rows. If the labels do not match, missing values NaN are assigned. If you want to ignore the labels, you can use values to convert to NumPy array ndarray.

s = pd.Series(['B6', 'C6', 'D6'], index=['B', 'C', 'D'], name='SIX')
print(s)
# B    B6
# C    C6
# D    D6
# Name: SIX, dtype: object

df.loc['XXX'] = df.loc['TWO'] + df.loc['THREE']
df.loc['YYY'] = s
df.loc['ZZZ'] = s.values
print(df)
#           A     B     C
# ONE       0     0     0
# TWO      A2    B2    C2
# THREE    A3    B3    C3
# FOUR      0     0     0
# FIVE     A5    B5    C5
# XXX    A2A3  B2B3  C2C3
# YYY     NaN    B6    C6
# ZZZ      B6    C6    D6

source: pandas_add_row.py

The `pandas.DataFrame.append()` method (deprecated in version `1.4.0`)

The append() method was formerly used to add new rows to DataFrame. However, this method was deprecated in version 1.4.0 and removed in version 2.0.0.

In the release notes, it is recommended to use the pandas.concat() function instead.

The `pandas.concat()` function

You can concatenate multiple DataFrame and Series objects using the concat() function.

pandas: Concat multiple DataFrame/Series with concat()

Specify a list or tuple of objects you want to concatenate as the first argument to concat(). By default, they are concatenated vertically.

df1 = pd.DataFrame({'A': ['A1', 'A2', 'A3'],
                    'B': ['B1', 'B2', 'B3'],
                    'C': ['C1', 'C2', 'C3']},
                   index=['ONE', 'TWO', 'THREE'])
print(df1)
#         A   B   C
# ONE    A1  B1  C1
# TWO    A2  B2  C2
# THREE  A3  B3  C3

df2 = pd.DataFrame({'B': ['B4', 'B5'], 'C': ['C4', 'C5'], 'D': ['D4', 'D5']},
                   index=['FOUR', 'FIVE'])
print(df2)
#        B   C   D
# FOUR  B4  C4  D4
# FIVE  B5  C5  D5

print(pd.concat([df1, df2]))
#          A   B   C    D
# ONE     A1  B1  C1  NaN
# TWO     A2  B2  C2  NaN
# THREE   A3  B3  C3  NaN
# FOUR   NaN  B4  C4   D4
# FIVE   NaN  B5  C5   D5

source: pandas_add_row.py

To retain only the columns that share common names, specify join='inner'.

print(pd.concat([df1, df2], join='inner'))
#         B   C
# ONE    B1  C1
# TWO    B2  C2
# THREE  B3  C3
# FOUR   B4  C4
# FIVE   B5  C5

source: pandas_add_row.py

You need to exercise caution when concatenating DataFrame and Series vertically.

By default, it looks like this.

s = pd.Series(['A4', 'B4', 'C4'], index=['A', 'B', 'C'], name='FOUR')
print(s)
# A    A4
# B    B4
# C    C4
# Name: FOUR, dtype: object

print(pd.concat([df1, s]))
#          A    B    C    0
# ONE     A1   B1   C1  NaN
# TWO     A2   B2   C2  NaN
# THREE   A3   B3   C3  NaN
# A      NaN  NaN  NaN   A4
# B      NaN  NaN  NaN   B4
# C      NaN  NaN  NaN   C4

source: pandas_add_row.py

By converting the Series to DataFrame with the to_frame() method and transposing it with T, you get a one-row DataFrame. You can concatenate this.

pandas: Transpose DataFrame (swap rows and columns)

print(s.to_frame().T)
#        A   B   C
# FOUR  A4  B4  C4

print(pd.concat([df1, s.to_frame().T]))
#         A   B   C
# ONE    A1  B1  C1
# TWO    A2  B2  C2
# THREE  A3  B3  C3
# FOUR   A4  B4  C4

source: pandas_add_row.py

Note: Add a large number of rows or columns

It's not recommended to add a large number of rows or columns to a DataFrame individually due to inefficiency.

For example, when you add one column at a time in a for loop, a PerformanceWarning is issued. It seems to be issued when you add more than 100 columns.

df = pd.DataFrame()
for i in range(101):
    df[i] = 0
# PerformanceWarning: DataFrame is highly fragmented.
# This is usually the result of calling `frame.insert` many times, which has poor performance.
# Consider joining all columns at once using pd.concat(axis=1) instead.
# To get a de-fragmented frame, use `newframe = frame.copy()`

source: pandas_add_columns_rows.py

Unless you need to use the features of DataFrame every time you add a row or column, it is better to concatenate all at once using concat(), as the warning message suggests.

A comparison of processing speed between adding one by one and adding all at once will be introduced at the end.

Add multiple rows at once

Take the following DataFrame as an example.

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30], 'col3': [100, 200, 300]},
                  index=['row1', 'row2', 'row3'])
print(df)
#       col1  col2  col3
# row1     1    10   100
# row2     2    20   200
# row3     3    30   300

source: pandas_add_columns_rows.py

Add the data and row names for each row to separate lists. Although the content is simply created here, in the actual code it is created by some data processing.

l_data = []
l_label = []

for i in range(4, 7):
    l_data.append([i, i * 10, i * 100])
    l_label.append(f'row{i}')

print(l_data)
# [[4, 40, 400], [5, 50, 500], [6, 60, 600]]

print(l_label)
# ['row4', 'row5', 'row6']

source: pandas_add_columns_rows.py

Create a DataFrame from these lists and the column names columns of the original DataFrame, and concatenate it with the original DataFrame.

df_append = pd.DataFrame(l_data, index=l_label, columns=df.columns)
print(df_append)
#       col1  col2  col3
# row4     4    40   400
# row5     5    50   500
# row6     6    60   600

df_result = pd.concat([df, df_append])
print(df_result)
#       col1  col2  col3
# row1     1    10   100
# row2     2    20   200
# row3     3    30   300
# row4     4    40   400
# row5     5    50   500
# row6     6    60   600

source: pandas_add_columns_rows.py

Add multiple columns at once

The concept is the same when adding columns as it was for adding rows as described above.

Take the following DataFrame as an example.

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30], 'col3': [100, 200, 300]},
                  index=['row1', 'row2', 'row3'])
print(df)
#       col1  col2  col3
# row1     1    10   100
# row2     2    20   200
# row3     3    30   300

source: pandas_add_columns_rows.py

Add the data and column names for each column to separate lists.

l_data = []
l_label = []

for i in range(3, 6):
    l_data.append([10**i, 2 * 10**i, 3 * 10**i])
    l_label.append(f'col{i + 1}')

print(l_data)
# [[1000, 2000, 3000], [10000, 20000, 30000], [100000, 200000, 300000]]

print(l_label)
# ['col4', 'col5', 'col6']

source: pandas_add_columns_rows.py

Create a DataFrame from these lists and the row names index of the original DataFrame, and concatenate it with the original DataFrame. Note that you need to transpose the two-dimensional list holding the data.

Transpose 2D list in Python (swap rows and columns)

df_append = pd.DataFrame(zip(*l_data), index=df.index, columns=l_label)
print(df_append)
#       col4   col5    col6
# row1  1000  10000  100000
# row2  2000  20000  200000
# row3  3000  30000  300000

df_result = pd.concat([df, df_append], axis=1)
print(df_result)
#       col1  col2  col3  col4   col5    col6
# row1     1    10   100  1000  10000  100000
# row2     2    20   200  2000  20000  200000
# row3     3    30   300  3000  30000  300000

source: pandas_add_columns_rows.py

Processing Speed Comparison

Compare the processing speed between adding rows or columns one by one and adding them all at once.

The following examples use the Jupyter Notebook magic command %%timeit. Note that these will not work if run as Python scripts.

Measure execution time with timeit in Python

In the case of adding 1000 rows:

%%timeit
df_loc = pd.DataFrame([[0, 0, 0], [1, 1, 1], [2, 2, 2]])

for i in range(3, 1003):
    df_loc.loc[i] = [i] * 3
# 150 ms ± 4.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

source: pandas_add_columns_rows_timeit.py

%%timeit
df = pd.DataFrame([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
l_data = []
l_label = []

for i in range(3, 1003):
    l_data.append([i] * 3)
    l_label.append(i)

df_concat = pd.concat([df, pd.DataFrame(l_data, index=l_label, columns=df.columns)])
# 487 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

source: pandas_add_columns_rows_timeit.py

In the case of adding 1000 columns:

%%timeit
df_index = pd.DataFrame([[0, 0, 0], [1, 1, 1], [2, 2, 2]])

for i in range(3, 1003):
    df_index[i] = [0, 1, 2]
# 31.2 ms ± 578 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

source: pandas_add_columns_rows_timeit.py

%%timeit
df = pd.DataFrame([[0, 0, 0], [1, 1, 1], [2, 2, 2]])
l_data = []
l_label = []

for i in range(3, 1003):
    l_data.append([0, 1, 2])
    l_label.append(i)

df_concat = pd.concat([df, pd.DataFrame(zip(*l_data), index=df.index, columns=l_label)], axis=1)
# 3.56 ms ± 54.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

source: pandas_add_columns_rows_timeit.py

In both cases, adding all rows or columns at once proves to be significantly faster.

pandas: Add rows/columns to DataFrame with assign(), insert()

Add a column to a `pandas.DataFrame`

Add a column using bracket notation `[]`

Assign a scalar value

Assign an array-like object

Assign a `pandas.Series`

The `pandas.DataFrame.assign()` method

The `pandas.DataFrame.insert()` method

The `pandas.concat()` function

Add a row to a `pandas.DataFrame`

Add a row using `.loc[]`

The `pandas.DataFrame.append()` method (deprecated in version `1.4.0`)

The `pandas.concat()` function

Note: Add a large number of rows or columns

Add multiple rows at once

Add multiple columns at once

Processing Speed Comparison

Related Categories

Related Articles

pandas: Add rows/columns to DataFrame with assign(), insert()

Add a column to a pandas.DataFrame

Add a column using bracket notation []

Assign a scalar value

Assign an array-like object

Assign a pandas.Series

The pandas.DataFrame.assign() method

The pandas.DataFrame.insert() method

The pandas.concat() function

Add a row to a pandas.DataFrame

Add a row using .loc[]

The pandas.DataFrame.append() method (deprecated in version 1.4.0)

The pandas.concat() function

Note: Add a large number of rows or columns

Add multiple rows at once

Add multiple columns at once

Processing Speed Comparison

Related Categories

Related Articles

Add a column to a `pandas.DataFrame`

Add a column using bracket notation `[]`

Assign a `pandas.Series`

The `pandas.DataFrame.assign()` method

The `pandas.DataFrame.insert()` method

The `pandas.concat()` function

Add a row to a `pandas.DataFrame`

Add a row using `.loc[]`

The `pandas.DataFrame.append()` method (deprecated in version `1.4.0`)

The `pandas.concat()` function