pandas: Iterate DataFrame with for loop (iterrows, itertuples, items)
This article explains how to iterate over a pandas.DataFrame with a for loop.
When you simply iterate over a DataFrame, it returns the column names; however, you can iterate over its columns or rows using methods like items() (formerly iteritems()), iterrows(), and itertuples().
The latter part of this article also discusses approaches for processing a DataFrame without a for loop.
- Iterate over a DataFrame
- Iterate over columns of a DataFrame: items() (formerly iteritems())
- Iterate over rows of a DataFrame: iterrows(), itertuples()
- Iterate over a specific column (= Series) of a DataFrame
- Update values within a for loop
- Process a DataFrame without a for loop
- Processing speed comparison
The pandas version used in this article is as follows. Note that functionality may vary between versions. The following DataFrame is used as an example.
import pandas as pd
print(pd.__version__)
# 2.1.4
df = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                  index=['Alice', 'Bob'])
print(df)
#        age state  point
# Alice   24    NY     64
# Bob     42    CA     92
Iterate over a DataFrame
Iterating directly over a DataFrame with a for loop extracts the column names sequentially.
for column_name in df:
    print(column_name)
# age
# state
# point
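Iterating over a DataFrame behaves like iterating over its columns attribute, so each name it yields can be used to select the corresponding column. The following is a minimal sketch that rebuilds the article's example frame so it runs on its own; the variable names names and values are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                  index=['Alice', 'Bob'])

# Iterating over the DataFrame yields the same labels as df.columns.
names = list(df)
print(names)
# ['age', 'state', 'point']
print(names == df.columns.tolist())
# True

# Each yielded name can be used to select that column as a Series.
values = {name: df[name].tolist() for name in df}
print(values)
# {'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]}
```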
Iterate over columns of a DataFrame: items() (formerly iteritems())
The items() method iterates over the columns of a DataFrame as (column_name, Series) pairs.
You can extract each value by specifying the label in the Series.
for column_name, item in df.items():
    print(column_name)
    print(type(item))
    print(item['Alice'], item['Bob'])
    print('======')
# age
# <class 'pandas.core.series.Series'>
# 24 42
# ======
# state
# <class 'pandas.core.series.Series'>
# NY CA
# ======
# point
# <class 'pandas.core.series.Series'>
# 64 92
# ======
Note that this method was previously named iteritems(), but it was renamed to items(). iteritems() was removed in pandas version 2.0.
- What’s new in 2.0.0 (April 3, 2023) — pandas 2.1.4 documentation
- DEPR: Series/DataFrame/HDFStore.iteritems() by mroeschke · Pull Request #45321 · pandas-dev/pandas
Iterate over rows of a DataFrame: iterrows(), itertuples()
You can use the iterrows() and itertuples() methods to iterate over rows of a DataFrame. itertuples() is faster than iterrows().
If you only need the values of a specific column, it is even faster to iterate over that column individually, as described next. The results of an experiment on processing speed are shown at the end.
iterrows()
The iterrows() method iterates over the rows of a DataFrame as (index, Series) pairs.
for index, row in df.iterrows():
    print(index)
    print(type(row))
    print(row['age'], row['state'], row['point'])
    print('======')
# Alice
# <class 'pandas.core.series.Series'>
# 24 NY 64
# ======
# Bob
# <class 'pandas.core.series.Series'>
# 42 CA 92
# ======
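Because iterrows() packs each row into a single Series, the values of a row share one dtype, so the original column dtypes are not necessarily preserved. The sketch below illustrates this with a hypothetical numeric frame (nums, with made-up columns qty and ratio) alongside the article's example frame. It is meant as an illustration of the behavior, not an exhaustive description.

```python
import pandas as pd

# Hypothetical frame with one integer column and one float column.
nums = pd.DataFrame({'qty': [1, 2], 'ratio': [0.5, 1.5]})

# The row Series has a single dtype, so the integer is upcast to float.
_, first_row = next(nums.iterrows())
print(first_row.dtype)
# float64
print(first_row['qty'])
# 1.0

# itertuples() reads each column separately, so the integer stays an integer.
first_tuple = next(nums.itertuples(index=False))
print(first_tuple.qty)
# 1

# With mixed numeric and string columns, the row Series falls back to object.
df = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                  index=['Alice', 'Bob'])
_, row = next(df.iterrows())
print(row.dtype)
# object
```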
itertuples()
The itertuples() method iterates over the rows of a DataFrame, returning each as a namedtuple.
By default, it returns a namedtuple named Pandas, with the first element representing the index (row name). You can access each value either by position with [] or by field name with the dot (.) operator.
for row in df.itertuples():
    print(type(row))
    print(row)
    print(row[0], row[1], row[2], row[3])
    print(row.Index, row.age, row.state, row.point)
    print('======')
# <class 'pandas.core.frame.Pandas'>
# Pandas(Index='Alice', age=24, state='NY', point=64)
# Alice 24 NY 64
# Alice 24 NY 64
# ======
# <class 'pandas.core.frame.Pandas'>
# Pandas(Index='Bob', age=42, state='CA', point=92)
# Bob 42 CA 92
# Bob 42 CA 92
# ======
Setting the index argument to False excludes the index from the namedtuple. You can also specify the name of the namedtuple with the name argument.
for row in df.itertuples(index=False, name='Person'):
    print(type(row))
    print(row)
    print(row[0], row[1], row[2])
    print(row.age, row.state, row.point)
    print('======')
# <class 'pandas.core.frame.Person'>
# Person(age=24, state='NY', point=64)
# 24 NY 64
# 24 NY 64
# ======
# <class 'pandas.core.frame.Person'>
# Person(age=42, state='CA', point=92)
# 42 CA 92
# 42 CA 92
# ======
Setting the name argument to None returns a normal tuple.
for row in df.itertuples(name=None):
    print(type(row))
    print(row)
    print(row[0], row[1], row[2], row[3])
    print('======')
# <class 'tuple'>
# ('Alice', 24, 'NY', 64)
# Alice 24 NY 64
# ======
# <class 'tuple'>
# ('Bob', 42, 'CA', 92)
# Bob 42 CA 92
# ======
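Attribute access on the namedtuple relies on the column names being valid Python identifiers. When a column name is not a valid identifier, itertuples() replaces it with a positional field name such as _1. The sketch below uses a hypothetical frame (items, with a made-up column 'unit price') to show this; positional access with [] keeps working regardless of the column names.

```python
import pandas as pd

# Hypothetical frame: 'unit price' contains a space, so it cannot be a field name.
items = pd.DataFrame({'unit price': [100, 250], 'qty': [3, 1]},
                     index=['pen', 'book'])

first = next(items.itertuples())
print(first._fields)
# ('Index', '_1', 'qty')

# Positional access is unaffected by the renaming.
totals = {}
for row in items.itertuples():
    totals[row.Index] = row[1] * row.qty
print(totals)
# {'pen': 300, 'book': 250}
```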
Iterate over a specific column (= Series) of a DataFrame
The iterrows() and itertuples() methods yield all values of each row. If you only need the values of a specific column, you can iterate over that column directly.
A column in a DataFrame is a Series.
print(df['age'])
# Alice    24
# Bob      42
# Name: age, dtype: int64
print(type(df['age']))
# <class 'pandas.core.series.Series'>
Since iterating over a Series yields its values, you can sequentially retrieve the values of the DataFrame column by using a for loop.
for age in df['age']:
    print(age)
# 24
# 42
The built-in zip() function can be used to retrieve values from multiple columns together.
for age, point in zip(df['age'], df['point']):
    print(age, point)
# 24 64
# 42 92
To retrieve the row names, use the index attribute. As with the above example, you can retrieve them together with other columns using zip().
print(df.index)
# Index(['Alice', 'Bob'], dtype='object')
print(type(df.index))
# <class 'pandas.core.indexes.base.Index'>
for index in df.index:
    print(index)
# Alice
# Bob
for index, state in zip(df.index, df['state']):
    print(index, state)
# Alice NY
# Bob CA
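Combining the index and selected columns with zip() also works inside comprehensions. The sketch below rebuilds the article's example frame; the threshold value and the variable names high_scorers and state_by_name are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                  index=['Alice', 'Bob'])

# Hypothetical threshold: collect the row names whose 'point' exceeds it.
threshold = 70
high_scorers = [name for name, point in zip(df.index, df['point'])
                if point > threshold]
print(high_scorers)
# ['Bob']

# Pair each row name with a single column value.
state_by_name = dict(zip(df.index, df['state']))
print(state_by_name)
# {'Alice': 'NY', 'Bob': 'CA'}
```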
Update values within a for loop
The Series returned by iterrows() may be a copy, not a view, so modifying it may not update the original data.
print(df)
#        age state  point
# Alice   24    NY     64
# Bob     42    CA     92
for index, row in df.iterrows():
    row['point'] += row['age']
print(df)
#        age state  point
# Alice   24    NY     64
# Bob     42    CA     92
You can update values by selecting an element of the original DataFrame with at[].
for index, row in df.iterrows():
    df.at[index, 'point'] += row['age']
print(df)
#        age state  point
# Alice   24    NY     88
# Bob     42    CA    134
Although the previous example uses at[] to update values, in many situations a for loop is unnecessary for such updates. Alternative methods are often not only simpler but also more efficient. The next section introduces specific examples of these alternatives.
Process a DataFrame without a for loop
The operation demonstrated in the previous section with a for loop can also be achieved without a for loop as follows.
df = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                  index=['Alice', 'Bob'])
print(df)
#        age state  point
# Alice   24    NY     64
# Bob     42    CA     92
df['point'] += df['age']
print(df)
#        age state  point
# Alice   24    NY     88
# Bob     42    CA    134
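The same column-wise style can also cover updates that apply only to some rows, which might otherwise be written as an if statement inside a for loop. The following is a minimal sketch using a boolean mask with loc[]. It works on a separate copy named scores so that the article's df is left untouched, and the age threshold of 30 is an arbitrary assumption.

```python
import pandas as pd

# Separate frame with the same initial data as the article's example.
scores = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                      index=['Alice', 'Bob'])

# Boolean mask selecting the rows to update (the threshold is hypothetical).
mask = scores['age'] >= 30

# Add 'age' to 'point' only for the selected rows.
scores.loc[mask, 'point'] += scores.loc[mask, 'age']
print(scores['point'].tolist())
# [64, 134]
```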
It is also possible to process existing columns and add them as new columns.
df['new'] = df['point'] + df['age'] * 2 + 1000
print(df)
#        age state  point   new
# Alice   24    NY     88  1136
# Bob     42    CA    134  1218
In addition to arithmetic operations using operators like + and *, you can apply NumPy functions to each element of a column.
import numpy as np
df['age_sqrt'] = np.sqrt(df['age'])
print(df)
#        age state  point   new  age_sqrt
# Alice   24    NY     88  1136  4.898979
# Bob     42    CA    134  1218  6.480741
For string processing, pandas offers specific methods to handle columns (Series) directly.
- pandas: Handle strings (replace, strip, case conversion, etc.)
- pandas: Slice substrings from each element in columns
df['state_0'] = df['state'].str.lower().str[0]
print(df)
#        age state  point   new  age_sqrt state_0
# Alice   24    NY     88  1136  4.898979       n
# Bob     42    CA    134  1218  6.480741       c
Furthermore, you can apply any function to each element or to each row/column using the map() and apply() methods.
df['point_hex'] = df['point'].map(hex)
print(df)
#        age state  point   new  age_sqrt state_0 point_hex
# Alice   24    NY     88  1136  4.898979       n      0x58
# Bob     42    CA    134  1218  6.480741       c      0x86
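The map() example above works element by element on one column. For a function that needs several values from the same row, apply() with axis=1 passes each row to the function as a Series. The sketch below rebuilds the original example frame, and describe() is a hypothetical helper written for illustration. Since apply() with axis=1 still calls the function once per row, it is mainly a convenience and is not necessarily faster than an explicit loop.

```python
import pandas as pd

df = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                  index=['Alice', 'Bob'])


def describe(row):
    """Hypothetical helper: build a label from one row (passed as a Series)."""
    # row.name holds the index label of the row being processed.
    return f"{row.name} ({row['state']}): {row['point']}"


# axis=1 applies the function to each row instead of each column.
labels = df.apply(describe, axis=1)
print(labels.tolist())
# ['Alice (NY): 64', 'Bob (CA): 92']
```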
Processing speed comparison
This section compares the processing speeds of methods such as iterrows(), itertuples(), and column-specific for loops.
Consider the following DataFrame with 100 rows and 10 columns.
- numpy.arange(), linspace(): Generate ndarray with evenly spaced values
- pandas: Get first/last n rows of DataFrame with head() and tail()
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(1000).reshape(100, 10))
print(df.shape)
# (100, 10)
print(df.head(3))
#     0   1   2   3   4   5   6   7   8   9
# 0   0   1   2   3   4   5   6   7   8   9
# 1  10  11  12  13  14  15  16  17  18  19
# 2  20  21  22  23  24  25  26  27  28  29
print(df.tail(3))
#       0    1    2    3    4    5    6    7    8    9
# 97  970  971  972  973  974  975  976  977  978  979
# 98  980  981  982  983  984  985  986  987  988  989
# 99  990  991  992  993  994  995  996  997  998  999
The following code was measured using the %%timeit magic command in Jupyter Notebook. Note that %%timeit is a Jupyter (IPython) magic command, so no timing is measured if the code is executed as a plain Python script.
%%timeit
for i, row in df.iterrows():
    pass
# 735 µs ± 20.5 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
for t in df.itertuples():
    pass
# 202 µs ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
for t in df.itertuples(name=None):
    pass
# 148 µs ± 780 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
for i in df[0]:
    pass
# 4.27 µs ± 30.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%%timeit
for i, j, k in zip(df[0], df[4], df[9]):
    pass
# 13.5 µs ± 53.4 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%%timeit
for t in zip(df[0], df[1], df[2], df[3], df[4], df[5], df[6], df[7], df[8], df[9]):
    pass
# 41.3 µs ± 281 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
iterrows() tends to be quite slow, as it converts each row into a Series, whereas itertuples() is faster. Specifying columns for iteration, however, is the fastest method. In our example environment, column-specific iteration proved faster than itertuples(), even when extracting all columns.
While the speed difference may not be significant for datasets with around 100 rows, iterrows() slows significantly with larger datasets. In such cases, it is advisable to use itertuples() or column-specific iteration.
As mentioned earlier, the most efficient approach often involves performing operations without for loops.
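To run a similar comparison outside Jupyter Notebook, one option is the timeit module from the Python standard library. The following is a rough sketch, not the setup used for the measurements above: the function names and the repetition count of 100 are arbitrary assumptions, and the absolute numbers depend on the environment, so no expected output is shown.

```python
import timeit

import numpy as np
import pandas as pd

# Same shape as the DataFrame used in the comparison: 100 rows, 10 columns.
df = pd.DataFrame(np.arange(1000).reshape(100, 10))


def loop_iterrows():
    for _ in df.iterrows():
        pass


def loop_itertuples():
    for _ in df.itertuples(name=None):
        pass


def loop_columns():
    # Column-specific iteration over all columns with zip().
    for _ in zip(*(df[c] for c in df.columns)):
        pass


number = 100  # arbitrary repetition count
results = {}
for func in (loop_iterrows, loop_itertuples, loop_columns):
    total = timeit.timeit(func, number=number)
    results[func.__name__] = total / number  # average seconds per loop

for name, seconds in results.items():
    print(f'{name}: {seconds * 1e6:.1f} µs per loop')
```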