pandas: Replace Series values with map()
In pandas, you can replace values in a Series using the map() method with a dictionary. The replace() method can also replace values, but depending on the conditions, map() may be faster.
map() can also apply a function to each value in a Series.
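As a minimal sketch of the function form (the sample values here are illustrative, not taken from the examples that follow), a callable can be passed in place of a dictionary:

```python
import pandas as pd

s = pd.Series(['A', 'B', 'C'])

# Pass a callable instead of a dictionary; it is called once per value.
lowered = s.map(str.lower)
print(lowered.tolist())
# ['a', 'b', 'c']
```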
The pandas version used in this article is as follows. Note that functionality may vary between versions.
import pandas as pd
print(pd.__version__)
# 2.1.2
Differences between map() and replace() in replacement
When a dictionary (dict) is specified in map(), values in the Series matching a dictionary key are replaced with the corresponding dictionary value.
s = pd.Series(['A', 'B', 'C', 'A', 'B'])
print(s)
# 0 A
# 1 B
# 2 C
# 3 A
# 4 B
# dtype: object
print(s.map({'A': 'XX', 'B': 'YY', 'C': 'ZZ'}))
# 0 XX
# 1 YY
# 2 ZZ
# 3 XX
# 4 YY
# dtype: object
You can also specify a dictionary in replace(). If all values in the Series are to be replaced, the result is the same as with map().
print(s.replace({'A': 'XX', 'B': 'YY', 'C': 'ZZ'}))
# 0 XX
# 1 YY
# 2 ZZ
# 3 XX
# 4 YY
# dtype: object
When the dictionary keys do not cover all values in the Series, the results differ. With map(), unmatched values become NaN, whereas with replace(), they remain unchanged.
print(s.map({'A': 'XX'}))
# 0 XX
# 1 NaN
# 2 NaN
# 3 XX
# 4 NaN
# dtype: object
print(s.replace({'A': 'XX'}))
# 0 XX
# 1 B
# 2 C
# 3 XX
# 4 B
# dtype: object
To preserve values in the Series that map() does not match, pass the original Series to the fillna() method so that each NaN is filled with the original value.
print(s.map({'A': 'XX'}).fillna(s))
# 0 XX
# 1 B
# 2 C
# 3 XX
# 4 B
# dtype: object
Note that replace() allows for more complex operations, such as using regular expressions to replace parts of strings, or replacing values differently for each column in a DataFrame.
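The two replace() capabilities mentioned above can be sketched as follows. The sample data and the pattern are illustrative assumptions; the calls use the regex=True argument and the nested-dictionary form of replace().

```python
import pandas as pd

# Replace part of each string with a regular expression.
s = pd.Series(['apple_1', 'banana_22', 'cherry'])
trimmed = s.replace(r'_\d+$', '', regex=True)
print(trimmed.tolist())
# ['apple', 'banana', 'cherry']

# Replace the same value differently in each DataFrame column
# using a nested dictionary: {column: {old: new}}.
df = pd.DataFrame({'a': [1, 2], 'b': [1, 2]})
replaced = df.replace({'a': {1: 100}, 'b': {1: -1}})
print(replaced.to_dict(orient='list'))
# {'a': [100, 2], 'b': [-1, 2]}
```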
Speed comparison
Measure the execution time of map() and replace() using the Jupyter Notebook magic command %%timeit, which does not work in a regular Python script.
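Outside Jupyter, a comparable measurement can be sketched with the standard library's timeit module. The repetition count n is an arbitrary assumption, and the timings will differ from the %%timeit figures in this article.

```python
import timeit

import pandas as pd

s = pd.Series(range(100))
d_100 = {i: i * 10 for i in range(100)}

# Number of calls per measurement (arbitrary choice for this sketch).
n = 200

t_map = timeit.timeit(lambda: s.map(d_100), number=n)
t_replace = timeit.timeit(lambda: s.replace(d_100), number=n)

# Report the average time per call in microseconds.
print(f'map():     {t_map / n * 1e6:.1f} µs per call')
print(f'replace(): {t_replace / n * 1e6:.1f} µs per call')
```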
Consider a Series of 100 values.
s = pd.Series(range(100))
map() is faster than replace() when all values are replaced.
d_100 = {i: i * 10 for i in range(100)}
%%timeit
s.map(d_100)
# 70.7 µs ± 2.08 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
s.replace(d_100)
# 1.31 ms ± 26.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Even when replacing with a dictionary of 50 elements, map() combined with fillna() is faster than replace().
d_50 = {i: i * 10 for i in range(50)}
%%timeit
s.map(d_50).fillna(s)
# 108 µs ± 3.1 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
s.replace(d_50)
# 653 µs ± 3.73 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
However, when replacing with a dictionary of 5 elements, replace() is faster than map() combined with fillna().
d_5 = {i: i * 10 for i in range(5)}
%%timeit
s.map(d_5).fillna(s)
# 104 µs ± 3.85 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
s.replace(d_5)
# 78.5 µs ± 860 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
The execution time of replace() depends heavily on the size of the dictionary, while that of map() combined with fillna() stays roughly constant in the measurements above.
Since results vary with the execution environment and other factors, test both map() and replace() under real-world conditions before deciding, especially when speed is crucial.
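When comparing the two approaches, it may also be worth confirming that they return the same result. In the small sketch below (sample data chosen for illustration), the values match but the dtypes do not, because the intermediate NaN from map() turns an integer Series into floats.

```python
import pandas as pd

s = pd.Series([1, 2, 3])
d = {1: 100}

mapped = s.map(d).fillna(s)   # NaN appears temporarily, so the dtype becomes float
replaced = s.replace(d)       # no NaN is introduced, so the integer dtype is kept

print(mapped.dtype, replaced.dtype)
# float64 int64
print((mapped == replaced).all())
# True

# Cast back to the original dtype if an exact match is required.
restored = mapped.astype(s.dtype)
```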