pandas: Apply functions to values, rows, columns with map(), apply()
In pandas, you can use the map(), apply(), and applymap() methods to apply functions to values (element-wise), rows, or columns in DataFrames and Series.
As mentioned later, DataFrame and Series already include methods for common operations. Additionally, you can apply NumPy functions to DataFrame and Series. Using dedicated methods or NumPy functions is preferable to map() or apply() due to better performance.
The pandas and NumPy versions used in this article are as follows. Note that functionality may vary between versions.
import pandas as pd
import numpy as np
print(pd.__version__)
# 2.1.2
print(np.__version__)
# 1.26.1
Apply functions to values in Series: map(), apply()
To apply a function to each value in a Series (element-wise), use the map() or apply() methods.
How to use map()
Passing a function to map() returns a new Series, with the function applied to each value. For example, apply the built-in hex() function to convert integers to hexadecimal strings.
s = pd.Series([1, 10, 100])
print(s)
# 0 1
# 1 10
# 2 100
# dtype: int64
print(s.map(hex))
# 0 0x1
# 1 0xa
# 2 0x64
# dtype: object
You can also apply functions defined with def or lambda expressions.
def my_func(x):
    return x * 10
print(s.map(my_func))
# 0 10
# 1 100
# 2 1000
# dtype: int64
print(s.map(lambda x: x * 10))
# 0 10
# 1 100
# 2 1000
# dtype: int64
The above example is for illustrative purposes; simple arithmetic operations can be performed directly on a Series.
print(s * 10)
# 0 10
# 1 100
# 2 1000
# dtype: int64
By default, missing values (NaN) are passed to the function. If you set the second argument, na_action, to 'ignore', NaN is not passed to the function and the result remains NaN.
Because the presence of NaN changes the data type (dtype) to floating point (float), the following example converts values to integers (int) with int() before passing them to hex().
s_nan = pd.Series([1, float('nan'), 100])
print(s_nan)
# 0 1.0
# 1 NaN
# 2 100.0
# dtype: float64
# print(s_nan.map(lambda x: hex(int(x))))
# ValueError: cannot convert float NaN to integer
print(s_nan.map(lambda x: hex(int(x)), na_action='ignore'))
# 0 0x1
# 1 NaN
# 2 0x64
# dtype: object
You can also pass a dictionary (dict) to map(). In this case, it replaces values.
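As a small illustration of passing a dict to map(), the sketch below replaces values by looking them up as dictionary keys. The sample data and mapping are made up for this example; values without a matching key become NaN.

```python
import pandas as pd

s = pd.Series([1, 10, 100])

# Each value is looked up as a key in the dict and replaced by the dict value.
# 100 has no matching key, so it becomes NaN.
replaced = s.map({1: 'one', 10: 'ten'})
print(replaced.tolist())
# ['one', 'ten', nan]
```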
How to use apply()
Similar to map(), the function specified as the first argument in apply() is applied to each value. The difference is that apply() allows you to specify arguments to be passed to the function.
With map(), you need to use a lambda expression or similar approach to pass arguments to the function. For example, specify the base argument in the int() function, which converts strings to integers.
s = pd.Series(['11', 'AA', 'FF'])
print(s)
# 0 11
# 1 AA
# 2 FF
# dtype: object
# print(s.map(int, base=16))
# TypeError: Series.map() got an unexpected keyword argument 'base'
print(s.map(lambda x: int(x, 16)))
# 0 17
# 1 170
# 2 255
# dtype: int64
With apply(), any specified keyword arguments are passed directly to the function. It is also possible to specify positional arguments using the args argument.
print(s.apply(int, base=16))
# 0 17
# 1 170
# 2 255
# dtype: int64
print(s.apply(int, args=(16,)))
# 0 17
# 1 170
# 2 255
# dtype: int64
Note that even if there is only one positional argument, it must be specified as a tuple or list in the args argument. A one-element tuple requires a trailing comma.
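The trailing comma matters because parentheses alone do not create a tuple. The following sketch shows the difference in plain Python and then passes a one-element tuple to args.

```python
import pandas as pd

s = pd.Series(['11', 'AA', 'FF'])

# Parentheses alone do not make a tuple: (16) is just the integer 16.
print(type((16)))
# <class 'int'>
print(type((16,)))
# <class 'tuple'>

# The one-element tuple supplies 16 as the second positional argument of int().
result = s.apply(int, args=(16,))
print(result.tolist())
# [17, 170, 255]
```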
As of version 2.1.2, apply() does not have the na_action argument.
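Without na_action, one possible way to skip missing values with apply() is to check for them inside the function. The following is a minimal sketch using pd.isna(); the helper name hex_or_nan is hypothetical and not part of pandas.

```python
import pandas as pd

s_nan = pd.Series([1, float('nan'), 100])

def hex_or_nan(x):
    # Hypothetical helper: return missing values unchanged instead of converting them.
    if pd.isna(x):
        return x
    return hex(int(x))

result = s_nan.apply(hex_or_nan)
print(result.tolist())
# ['0x1', nan, '0x64']
```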
Apply functions to values in DataFrame: map(), applymap()
To apply a function to each value in a DataFrame (element-wise), use the map() or applymap() methods.
As of version 2.1.0, applymap() has been renamed to map() and marked as deprecated.
- What’s new in 2.1.0 (Aug 30, 2023) — pandas 2.1.3 documentation
- pandas.DataFrame.map — pandas 2.1.3 documentation
- pandas.DataFrame.applymap — pandas 2.1.3 documentation
As of version 2.1.2, applymap() is still usable but issues a FutureWarning.
df = pd.DataFrame([[1, 10, 100], [2, 20, 200]])
print(df)
# 0 1 2
# 0 1 10 100
# 1 2 20 200
print(df.map(hex))
# 0 1 2
# 0 0x1 0xa 0x64
# 1 0x2 0x14 0xc8
print(df.applymap(hex))
# 0 1 2
# 0 0x1 0xa 0x64
# 1 0x2 0x14 0xc8
# FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.
The following examples use map(), but applymap() has the same usage and functionality. In versions before 2.1.0, use applymap().
As with map() of Series, the na_action argument can be specified for map() of DataFrame. By default, missing values (NaN) are passed to the function, but if na_action is set to 'ignore', NaN is not passed to the function and the result remains NaN.
df_nan = pd.DataFrame([[1, float('nan'), 100], [2, 20, 200]])
print(df_nan)
# 0 1 2
# 0 1 NaN 100
# 1 2 20.0 200
# print(df_nan.map(lambda x: hex(int(x))))
# ValueError: cannot convert float NaN to integer
print(df_nan.map(lambda x: hex(int(x)), na_action='ignore'))
# 0 1 2
# 0 0x1 NaN 0x64
# 1 0x2 0x14 0xc8
Unlike map() of Series, map() of DataFrame passes any specified keyword arguments to the function.
df = pd.DataFrame([['1', 'A', 'F'], ['11', 'AA', 'FF']])
print(df)
# 0 1 2
# 0 1 A F
# 1 11 AA FF
print(df.map(int, base=16))
# 0 1 2
# 0 1 10 15
# 1 17 170 255
As of version 2.1.2, map() of DataFrame does not have the args argument, which means you cannot specify positional arguments.
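If a function only accepts the extra value positionally, wrapping the call in a lambda expression is one way around the missing args argument. The sketch below also falls back to applymap() when map() is unavailable (pandas before 2.1.0); the variable name elementwise is just for illustration.

```python
import pandas as pd

df = pd.DataFrame([['1', 'A', 'F'], ['11', 'AA', 'FF']])

# DataFrame.map() exists from pandas 2.1.0; fall back to applymap() on older versions.
elementwise = df.map if hasattr(df, 'map') else df.applymap

# The lambda supplies 16 as a positional argument to int().
result = elementwise(lambda x: int(x, 16))
print(result.values.tolist())
# [[1, 10, 15], [17, 170, 255]]
```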
Apply functions to rows and columns in DataFrame: apply()
To apply a function to rows or columns in a DataFrame, use the apply() method.
To apply multiple operations at once, use the agg() method.
Basic usage
Specify the function you want to apply as the first argument.
Note that the built-in sum() function is used for explanation purposes, but if you need to calculate a sum, it is better to use the sum() method mentioned later.
df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
# A B C
# X 10 20 30
# Y 40 50 60
print(df.apply(sum))
# A 50
# B 70
# C 90
# dtype: int64
By default, each column is passed to the function as a Series. If the function cannot accept a Series as an argument, an error will occur.
print(df.apply(lambda x: type(x)))
# A <class 'pandas.core.series.Series'>
# B <class 'pandas.core.series.Series'>
# C <class 'pandas.core.series.Series'>
# dtype: object
# print(hex(df['A']))
# TypeError: 'Series' object cannot be interpreted as an integer
# print(df.apply(hex))
# TypeError: 'Series' object cannot be interpreted as an integer
Specify rows or columns: axis
By default, the function is applied to each column. However, setting the axis argument to 1 or 'columns' applies it to each row.
df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
# A B C
# X 10 20 30
# Y 40 50 60
print(df.apply(sum, axis=1))
# X 60
# Y 150
# dtype: int64
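With axis=1, each row is passed as a Series indexed by the column labels, so the function can refer to individual columns by name. The following sketch illustrates this with a made-up calculation, and also checks that axis='columns' gives the same result as axis=1.

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])

# Each row arrives as a Series, so values can be accessed by column label.
result = df.apply(lambda row: row['C'] - row['A'], axis=1)
print(result.tolist())
# [20, 20]

# axis='columns' is equivalent to axis=1.
same = df.apply(lambda row: row['C'] - row['A'], axis='columns')
print(result.equals(same))
# True
```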
Specify arguments for the function: Keyword arguments, args
Any keyword arguments specified in apply() are passed to the function being applied. You can also specify positional arguments using the args argument.
df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
# A B C
# X 10 20 30
# Y 40 50 60
def my_func(x, y, z):
    return sum(x) + y + z * 2
print(df.apply(my_func, y=100, z=1000))
# A 2150
# B 2170
# C 2190
# dtype: int64
print(df.apply(my_func, args=(100, 1000)))
# A 2150
# B 2170
# C 2190
# dtype: int64
Pass as ndarray instead of Series: raw
By default, each row or column is passed as a Series. If you set the raw argument to True, they are passed as NumPy arrays (ndarray).
df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
# A B C
# X 10 20 30
# Y 40 50 60
print(df.apply(lambda x: type(x), raw=True))
# A <class 'numpy.ndarray'>
# B <class 'numpy.ndarray'>
# C <class 'numpy.ndarray'>
# dtype: object
If there's no need for a Series, using raw=True is faster since the conversion process is omitted. However, if the function requires Series methods or attributes, setting raw=True will raise an error.
print(df.apply(lambda x: x.name * 3))
# A AAA
# B BBB
# C CCC
# dtype: object
# print(df.apply(lambda x: x.name * 3, raw=True))
# AttributeError: 'numpy.ndarray' object has no attribute 'name'
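Conversely, a function that only uses methods available on ndarray works with raw=True. The sketch below computes the range of each column with max() and min(), which exist on both Series and ndarray; the calculation itself is only an illustration.

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])

# max() and min() are ndarray methods, so this works when columns are passed as arrays.
result = df.apply(lambda x: x.max() - x.min(), raw=True)
print(result.tolist())
# [30, 30, 30]
```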
Apply functions to specific rows or columns
To apply a function to a specific row or column, extract the row or column as a Series and use the map() or apply() methods of Series.
df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
# A B C
# X 10 20 30
# Y 40 50 60
print(df['A'].map(lambda x: x**2))
# X 100
# Y 1600
# Name: A, dtype: int64
print(df.loc['Y'].map(hex))
# A 0x28
# B 0x32
# C 0x3c
# Name: Y, dtype: object
You can add them as new rows or columns. If the same row or column names are specified, they will be overwritten.
df['A'] = df['A'].map(lambda x: x**2)
df.loc['Y_hex'] = df.loc['Y'].map(hex)
print(df)
# A B C
# X 100 20 30
# Y 1600 50 60
# Y_hex 0x640 0x32 0x3c
Use methods of DataFrame and Series, and arithmetic operators
In pandas, common operations are provided as methods for DataFrame and Series, so there's no need to use map() or apply() for them.
df = pd.DataFrame([[1, -2, 3], [-4, 5, -6]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
# A B C
# X 1 -2 3
# Y -4 5 -6
print(df.abs())
# A B C
# X 1 2 3
# Y 4 5 6
print(df.sum())
# A -3
# B 3
# C -3
# dtype: int64
print(df.sum(axis=1))
# X 2
# Y -5
# dtype: int64
For a list of available methods, refer to the official documentation.
- DataFrame - Computations / descriptive stats — pandas 2.1.3 documentation
- Series - Computations / descriptive stats — pandas 2.1.3 documentation
You can also process DataFrame and Series directly using arithmetic operators.
print(df * 10)
# A B C
# X 10 -20 30
# Y -40 50 -60
print(df['A'].abs() + df['B'] * 100)
# X -199
# Y 504
# dtype: int64
Methods for string manipulation are also available through the str accessor of Series.
df = pd.DataFrame([['a', 'ab', 'abc'], ['x', 'xy', 'xyz']], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
# A B C
# X a ab abc
# Y x xy xyz
print(df['A'] + '-' + df['B'].str.upper() + '-' + df['C'].str.title())
# X a-AB-Abc
# Y x-XY-Xyz
# dtype: object
Use NumPy functions
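You can process DataFrame and Series by passing them to NumPy functions.
For example, although pandas does not provide a method for rounding down to the nearest integer (floor), you can use np.floor() instead. Note that np.floor() rounds toward negative infinity, so -0.1 becomes -1.0 rather than being truncated toward zero. For DataFrame, a DataFrame is returned; for Series, a Series is returned.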
df = pd.DataFrame([[0.1, 0.5, 0.9], [-0.1, -0.5, -0.9]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
# A B C
# X 0.1 0.5 0.9
# Y -0.1 -0.5 -0.9
print(np.floor(df))
# A B C
# X 0.0 0.0 0.0
# Y -1.0 -1.0 -1.0
print(type(np.floor(df)))
# <class 'pandas.core.frame.DataFrame'>
print(np.floor(df['A']))
# X 0.0
# Y -1.0
# Name: A, dtype: float64
print(type(np.floor(df['A'])))
# <class 'pandas.core.series.Series'>
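If the goal is to drop the fractional part rather than round down, NumPy's np.trunc() is a possible alternative. The following sketch compares the two on the same data; np.trunc() is not used elsewhere in this article and is shown here only for contrast.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0.1, 0.5, 0.9], [-0.1, -0.5, -0.9]], index=['X', 'Y'], columns=['A', 'B', 'C'])

floored = np.floor(df)    # rounds toward negative infinity
truncated = np.trunc(df)  # drops the fractional part (rounds toward zero)

print(floored.loc['Y'].tolist())
# [-1.0, -1.0, -1.0]

# As with np.floor(), a DataFrame is returned.
print(isinstance(truncated, pd.DataFrame))
# True

# Every truncated value is zero here (negative inputs give -0.0, which equals 0.0).
print(bool((truncated == 0).all().all()))
# True
```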
It is also possible to specify the axis argument in the NumPy function.
print(np.sum(df, axis=0))
# A 0.0
# B 0.0
# C 0.0
# dtype: float64
print(np.sum(df, axis=1))
# X 1.5
# Y -1.5
# dtype: float64
print(type(np.sum(df, axis=0)))
# <class 'pandas.core.series.Series'>
Speed comparison
Compare the processing speeds of the map() and apply() methods of DataFrame with other dedicated methods and NumPy functions.
Consider a DataFrame with 100 rows and 100 columns.
df = pd.DataFrame(np.arange(-5000, 5000).reshape(100, 100))
print(df.shape)
# (100, 100)
Note that the following examples use the %%timeit magic command in Jupyter Notebook. They won't work if executed as a Python script.
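In a plain Python script, the standard library's timeit module can take comparable measurements. The following is a rough sketch, not a reproduction of the %%timeit results; the repetition count of 100 is an arbitrary choice, and absolute timings depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(-5000, 5000).reshape(100, 100))

# Total time for 100 calls of each approach, in seconds.
n = 100
t_apply = timeit.timeit(lambda: df.apply(sum), number=n)
t_method = timeit.timeit(lambda: df.sum(), number=n)
t_numpy = timeit.timeit(lambda: np.sum(df, axis=0), number=n)

# Report the average time per call in microseconds.
print(f'df.apply(sum): {t_apply / n * 1e6:.1f} µs per call')
print(f'df.sum():      {t_method / n * 1e6:.1f} µs per call')
print(f'np.sum(df):    {t_numpy / n * 1e6:.1f} µs per call')
```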
The results for using the built-in abs() function with map(), compared to using the abs() method of DataFrame and the np.abs() function, are as follows. map() is far slower: about 2 ms, versus under 10 µs for the other two.
%%timeit
df.map(abs)
# 2.07 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
df.abs()
# 5.06 µs ± 55 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
%%timeit
np.abs(df)
# 7.81 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
The results for using the built-in sum() function with apply(), compared to using the sum() method of DataFrame and the np.sum() function, are as follows. apply() is slower. Setting raw=True roughly halves the time, but it is still more than ten times slower than sum() of DataFrame or np.sum().
%%timeit
df.apply(sum)
# 932 µs ± 95.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
df.apply(sum, raw=True)
# 427 µs ± 4.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
df.sum()
# 35 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
np.sum(df, axis=0)
# 37.3 µs ± 66.9 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
The map() and apply() methods should be reserved for complex operations that cannot be achieved with dedicated methods or NumPy functions. Whenever possible, prefer those alternatives.
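Some logic that looks like it needs apply() can still be expressed with dedicated methods and NumPy functions. The following sketch labels each row by the sign of its sum, first with a row-wise apply() and then with sum() and np.where(); np.where() is not used elsewhere in this article, and the labels are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, -2, 3], [-4, 5, -6]], index=['X', 'Y'], columns=['A', 'B', 'C'])

# Row-wise apply(): the lambda is called once per row.
by_apply = df.apply(lambda row: 'pos' if row.sum() > 0 else 'neg', axis=1)

# Vectorized alternative: compute all row sums at once, then pick labels with np.where().
by_where = pd.Series(np.where(df.sum(axis=1) > 0, 'pos', 'neg'), index=df.index)

print(by_apply.tolist())
# ['pos', 'neg']
print(by_where.tolist())
# ['pos', 'neg']
```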