pandas: Iterate DataFrame with for loop (iterrows, itertuples, items)

Tags: Python, pandas

This article explains how to iterate over a pandas.DataFrame with a for loop.

Iterating directly over a DataFrame yields only its column names. To iterate over columns or rows together with their values, use methods such as items() (formerly iteritems()), iterrows(), and itertuples().

The latter part of this article also discusses approaches for processing a DataFrame without a for loop.


The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.

import pandas as pd

print(pd.__version__)
# 2.1.4

df = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                  index=['Alice', 'Bob'])
print(df)
#        age state  point
# Alice   24    NY     64
# Bob     42    CA     92

Iterate over a DataFrame

Iterating directly over a DataFrame with a for loop extracts the column names sequentially.

for column_name in df:
    print(column_name)
# age
# state
# point
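
As a small sketch of the same behavior (the name df_cols is hypothetical and used only for this illustration), the labels produced by the loop match the columns attribute, much like iterating over a dict yields its keys.

```python
import pandas as pd

# Hypothetical example frame, mirroring the one used in this article.
df_cols = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                       index=['Alice', 'Bob'])

# Iterating over the DataFrame itself yields the column labels.
names_from_loop = [column_name for column_name in df_cols]
print(names_from_loop)
# ['age', 'state', 'point']

# The same labels are available through the columns attribute.
print(names_from_loop == list(df_cols.columns))
# True
```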

Iterate over columns of a DataFrame: items() (formerly iteritems())

The items() method iterates over the columns of a DataFrame as (column_name, Series) pairs.

You can retrieve individual values from each Series by specifying the row label.

for column_name, item in df.items():
    print(column_name)
    print(type(item))
    print(item['Alice'], item['Bob'])
    print('======')
# age
# <class 'pandas.core.series.Series'>
# 24 42
# ======
# state
# <class 'pandas.core.series.Series'>
# NY CA
# ======
# point
# <class 'pandas.core.series.Series'>
# 64 92
# ======

Note that this method was previously named iteritems(). iteritems() was deprecated in pandas 1.5 and removed in pandas 2.0, so use items() instead.
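
Because each item is a Series, Series methods can be called on it inside the loop. The following is a minimal sketch, assuming a small numeric frame; the names df_items and col_max are hypothetical.

```python
import pandas as pd

# Hypothetical numeric frame for illustration.
df_items = pd.DataFrame({'age': [24, 42], 'point': [64, 92]}, index=['Alice', 'Bob'])

# Build a {column_name: maximum} dict from the (column_name, Series) pairs.
col_max = {column_name: int(item.max()) for column_name, item in df_items.items()}
print(col_max)
# {'age': 42, 'point': 92}
```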

Iterate over rows of a DataFrame: iterrows(), itertuples()

You can use the iterrows() and itertuples() methods to iterate over rows of a DataFrame. itertuples() is faster than iterrows().

If you only need the values of a specific column, it is even faster to iterate over that column individually, as described next. The results of an experiment on processing speed are shown at the end.

iterrows()

The iterrows() method iterates over the rows of a DataFrame as (index, Series) pairs.

for index, row in df.iterrows():
    print(index)
    print(type(row))
    print(row['age'], row['state'], row['point'])
    print('======')
# Alice
# <class 'pandas.core.series.Series'>
# 24 NY 64
# ======
# Bob
# <class 'pandas.core.series.Series'>
# 42 CA 92
# ======
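
One caveat, noted in the pandas documentation, is that iterrows() does not preserve dtypes across a row, because each row is packed into a single Series. The following sketch assumes a frame with one integer and one float column (the names df_mixed, qty, and ratio are hypothetical); the integer values come back as floats from iterrows(), while itertuples() keeps them as integers.

```python
import pandas as pd

# Hypothetical frame: 'qty' is an integer column, 'ratio' is a float column.
df_mixed = pd.DataFrame({'qty': [1, 2], 'ratio': [0.5, 1.5]})

# Each row becomes one Series, so the integer is upcast to float.
for index, row in df_mixed.iterrows():
    print(row['qty'], row['ratio'])
# 1.0 0.5
# 2.0 1.5

# itertuples() takes values column by column, so the integer stays an integer.
for row in df_mixed.itertuples():
    print(row.qty, row.ratio)
# 1 0.5
# 2 1.5
```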

itertuples()

The itertuples() method iterates over the rows of a DataFrame, returning each as a namedtuple.

By default, it returns a namedtuple named Pandas, with the first element representing the index (row name). You can access each value either by position with [] or by field name with . (attribute access).

for row in df.itertuples():
    print(type(row))
    print(row)
    print(row[0], row[1], row[2], row[3])
    print(row.Index, row.age, row.state, row.point)
    print('======')
# <class 'pandas.core.frame.Pandas'>
# Pandas(Index='Alice', age=24, state='NY', point=64)
# Alice 24 NY 64
# Alice 24 NY 64
# ======
# <class 'pandas.core.frame.Pandas'>
# Pandas(Index='Bob', age=42, state='CA', point=92)
# Bob 42 CA 92
# Bob 42 CA 92
# ======

Setting the index argument to False excludes the index from the namedtuple. You can also specify the name of the namedtuple with the name argument.

for row in df.itertuples(index=False, name='Person'):
    print(type(row))
    print(row)
    print(row[0], row[1], row[2])
    print(row.age, row.state, row.point)
    print('======')
# <class 'pandas.core.frame.Person'>
# Person(age=24, state='NY', point=64)
# 24 NY 64
# 24 NY 64
# ======
# <class 'pandas.core.frame.Person'>
# Person(age=42, state='CA', point=92)
# 42 CA 92
# 42 CA 92
# ======

Setting the name argument to None returns a normal tuple.

for row in df.itertuples(name=None):
    print(type(row))
    print(row)
    print(row[0], row[1], row[2], row[3])
    print('======')
# <class 'tuple'>
# ('Alice', 24, 'NY', 64)
# Alice 24 NY 64
# ======
# <class 'tuple'>
# ('Bob', 42, 'CA', 92)
# Bob 42 CA 92
# ======
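
Attribute access on the namedtuple relies on the column names being valid Python identifiers. According to the pandas documentation, names that are invalid identifiers, repeated, or start with an underscore are renamed to positional names. The following sketch assumes a hypothetical column named 'my col' (the frame name df_names is also hypothetical).

```python
import pandas as pd

# Hypothetical frame whose first column name contains a space.
df_names = pd.DataFrame({'my col': [1, 2], 'age': [24, 42]}, index=['Alice', 'Bob'])

# 'my col' is not a valid identifier, so it is renamed to its position: _1.
first_row = next(df_names.itertuples())
print(first_row._fields)
# ('Index', '_1', 'age')

for row in df_names.itertuples():
    print(row._1, row.age)
# 1 24
# 2 42
```

Positional access with [] is unaffected by the renaming.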

Iterate over a specific column (= Series) of a DataFrame

The iterrows() and itertuples() methods yield every value in each row. If you only need the values of a specific column, you can iterate over that column directly.

A column in a DataFrame is a Series.

print(df['age'])
# Alice    24
# Bob      42
# Name: age, dtype: int64

print(type(df['age']))
# <class 'pandas.core.series.Series'>

Since iterating over a Series yields its values, you can sequentially retrieve the values of the DataFrame column by using a for loop.

for age in df['age']:
    print(age)
# 24
# 42

The built-in zip() function can be used to retrieve values from multiple columns together.

for age, point in zip(df['age'], df['point']):
    print(age, point)
# 24 64
# 42 92

To retrieve the row names, use the index attribute. As with the above example, you can retrieve them together with other columns using zip().

print(df.index)
# Index(['Alice', 'Bob'], dtype='object')

print(type(df.index))
# <class 'pandas.core.indexes.base.Index'>

for index in df.index:
    print(index)
# Alice
# Bob

for index, state in zip(df.index, df['state']):
    print(index, state)
# Alice NY
# Bob CA
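
As an alternative to zip() for a single column, a Series also has an items() method that yields (index, value) pairs. The following is a minimal sketch; the name df_pairs is hypothetical.

```python
import pandas as pd

# Hypothetical example frame, mirroring the one used in this article.
df_pairs = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA']}, index=['Alice', 'Bob'])

# Series.items() yields (index, value) pairs for one column.
for index, state in df_pairs['state'].items():
    print(index, state)
# Alice NY
# Bob CA
```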

Update values within a for loop

The Series returned by iterrows() may be a copy, not a view, so modifying it may not update the original data.

print(df)
#        age state  point
# Alice   24    NY     64
# Bob     42    CA     92

for index, row in df.iterrows():
    row['point'] += row['age']

print(df)
#        age state  point
# Alice   24    NY     64
# Bob     42    CA     92

You can update values by selecting an element of the original DataFrame with at[].

for index, row in df.iterrows():
    df.at[index, 'point'] += row['age']

print(df)
#        age state  point
# Alice   24    NY     88
# Bob     42    CA    134
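
The same kind of update can be sketched with itertuples(), using row.Index as the row label for at[]. This assumes the loop writes only to a column that it does not read (here it reads age and writes point); the name df_update is hypothetical.

```python
import pandas as pd

# Hypothetical fresh copy of the example frame.
df_update = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                         index=['Alice', 'Bob'])

# Read 'age' from the namedtuple and write to 'point' in the original frame.
for row in df_update.itertuples():
    df_update.at[row.Index, 'point'] += row.age

print(df_update)
#        age state  point
# Alice   24    NY     88
# Bob     42    CA    134
```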

Although at[] works for updating values inside a loop, a for loop is often unnecessary for such updates. Alternatives that operate on whole columns are usually both simpler and faster. The next section introduces specific examples.

Process a DataFrame without a for loop

The operation demonstrated in the previous section with a for loop can also be achieved without a for loop as follows.

df = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                  index=['Alice', 'Bob'])
print(df)
#        age state  point
# Alice   24    NY     64
# Bob     42    CA     92

df['point'] += df['age']
print(df)
#        age state  point
# Alice   24    NY     88
# Bob     42    CA    134

It is also possible to process existing columns and add them as new columns.

df['new'] = df['point'] + df['age'] * 2 + 1000
print(df)
#        age state  point   new
# Alice   24    NY     88  1136
# Bob     42    CA    134  1218

In addition to arithmetic operations using operators like + and *, you can apply NumPy functions to each element of a column.

import numpy as np

df['age_sqrt'] = np.sqrt(df['age'])
print(df)
#        age state  point   new  age_sqrt
# Alice   24    NY     88  1136  4.898979
# Bob     42    CA    134  1218  6.480741

For string processing, pandas offers specific methods to handle columns (Series) directly.

df['state_0'] = df['state'].str.lower().str[0]
print(df)
#        age state  point   new  age_sqrt state_0
# Alice   24    NY     88  1136  4.898979       n
# Bob     42    CA    134  1218  6.480741       c

Furthermore, you can apply any function to each element or to each row/column using the map() and apply() methods.

df['point_hex'] = df['point'].map(hex)
print(df)
#        age state  point   new  age_sqrt state_0 point_hex
# Alice   24    NY     88  1136  4.898979       n      0x58
# Bob     42    CA    134  1218  6.480741       c      0x86
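
The example above uses map() on a single column. As a minimal sketch of apply() on a DataFrame, the function receives each column as a Series by default (axis=0) and each row as a Series with axis=1. The frame name df_apply and the lambda functions are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical numeric frame for illustration.
df_apply = pd.DataFrame({'age': [24, 42], 'point': [64, 92]}, index=['Alice', 'Bob'])

# axis=0 (default): the function is called once per column.
col_range = df_apply.apply(lambda col: col.max() - col.min())
print(col_range)
# age      18
# point    28
# dtype: int64

# axis=1: the function is called once per row.
row_diff = df_apply.apply(lambda row: row['point'] - row['age'], axis=1)
print(row_diff)
# Alice    40
# Bob      50
# dtype: int64
```

For simple arithmetic like this, operating on the columns directly (df_apply['point'] - df_apply['age']) gives the same result without calling a Python function for each row.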

Processing speed comparison

This section compares the processing speeds of methods such as iterrows(), itertuples(), and column-specific for loops.

Consider the following DataFrame with 100 rows and 10 columns.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(1000).reshape(100, 10))
print(df.shape)
# (100, 10)

print(df.head(3))
#     0   1   2   3   4   5   6   7   8   9
# 0   0   1   2   3   4   5   6   7   8   9
# 1  10  11  12  13  14  15  16  17  18  19
# 2  20  21  22  23  24  25  26  27  28  29

print(df.tail(3))
#       0    1    2    3    4    5    6    7    8    9
# 97  970  971  972  973  974  975  976  977  978  979
# 98  980  981  982  983  984  985  986  987  988  989
# 99  990  991  992  993  994  995  996  997  998  999

The following code was measured using the %%timeit magic command in Jupyter Notebook. Note that %%timeit is an IPython magic command, so the code below does not run as a plain Python script.

%%timeit
for i, row in df.iterrows():
    pass
# 735 µs ± 20.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
for t in df.itertuples():
    pass
# 202 µs ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
for t in df.itertuples(name=None):
    pass
# 148 µs ± 780 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%%timeit
for i in df[0]:
    pass
# 4.27 µs ± 30.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%%timeit
for i, j, k in zip(df[0], df[4], df[9]):
    pass
# 13.5 µs ± 53.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%%timeit
for t in zip(df[0], df[1], df[2], df[3], df[4], df[5], df[6], df[7], df[8], df[9]):
    pass
# 41.3 µs ± 281 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
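
To take similar measurements in a plain Python script, one option is the timeit module from the standard library. The following is a rough sketch rather than a reproduction of the measurements above: the helper function names, the results dict, and the repetition count of 100 are all assumptions, and the printed timings depend on the environment, so no expected output is shown.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape as the frame used in the measurements above.
df_bench = pd.DataFrame(np.arange(1000).reshape(100, 10))


def loop_iterrows():
    for _, row in df_bench.iterrows():
        pass


def loop_itertuples():
    for row in df_bench.itertuples():
        pass


def loop_column():
    for value in df_bench[0]:
        pass


number = 100  # how many times each function is called (arbitrary choice)
results = {}
for func in (loop_iterrows, loop_itertuples, loop_column):
    total = timeit.timeit(func, number=number)
    # Average time for one full pass over the frame, in microseconds.
    results[func.__name__] = total / number * 1e6
    print(f'{func.__name__}: {results[func.__name__]:.1f} µs per loop')
```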

iterrows() tends to be quite slow, as it converts each row into a Series, whereas itertuples() is faster. Specifying columns for iteration, however, is the fastest method. In our example environment, column-specific iteration proved faster than itertuples(), even when extracting all columns.

While the speed difference may not be significant for datasets with around 100 rows, iterrows() slows significantly with larger datasets. In such cases, it is advisable to use itertuples() or column-specific iteration.

As mentioned earlier, the most efficient approach often involves performing operations without for loops.
