pandas: Apply functions to values, rows, columns with map(), apply()

Tags: Python, pandas

In pandas, you can use map(), apply(), and applymap() methods to apply functions to values (element-wise), rows, or columns in DataFrames and Series.

As described later, DataFrame and Series already provide methods for common operations, and NumPy functions can also be applied to them. Where such dedicated methods or NumPy functions exist, they are preferable to map() or apply() because they are faster.

The pandas and NumPy versions used in this article are as follows. Note that functionality may vary between versions.

import pandas as pd
import numpy as np

print(pd.__version__)
# 2.1.2

print(np.__version__)
# 1.26.1

Apply functions to values in Series: map(), apply()

To apply a function to each value in a Series (element-wise), use the map() or apply() methods.

How to use map()

Passing a function to map() returns a new Series, with the function applied to each value. For example, apply the built-in hex() function to convert integers to hexadecimal strings.

s = pd.Series([1, 10, 100])
print(s)
# 0      1
# 1     10
# 2    100
# dtype: int64

print(s.map(hex))
# 0     0x1
# 1     0xa
# 2    0x64
# dtype: object

You can also apply functions defined with def or lambda expressions.

def my_func(x):
    return x * 10

print(s.map(my_func))
# 0      10
# 1     100
# 2    1000
# dtype: int64

print(s.map(lambda x: x * 10))
# 0      10
# 1     100
# 2    1000
# dtype: int64

The above example is for illustrative purposes; simple arithmetic operations can be directly performed on a Series.

print(s * 10)
# 0      10
# 1     100
# 2    1000
# dtype: int64

By default, missing values (NaN) are passed to the function, but if you set the second argument na_action to 'ignore', NaN will not be passed to the function and the result will remain as NaN.

Because the presence of NaN changes the data type (dtype) to a floating-point number (float), values are converted to integers (int) using int() before being passed to hex() in the following example.

s_nan = pd.Series([1, float('nan'), 100])
print(s_nan)
# 0      1.0
# 1      NaN
# 2    100.0
# dtype: float64

# print(s_nan.map(lambda x: hex(int(x))))
# ValueError: cannot convert float NaN to integer

print(s_nan.map(lambda x: hex(int(x)), na_action='ignore'))
# 0     0x1
# 1     NaN
# 2    0x64
# dtype: object

You can also pass a dictionary (dict) to map(). In this case, each value that matches a key is replaced with the corresponding dictionary value, and values with no matching key become NaN.
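As a minimal sketch of passing a dict to map(), the following uses a made-up Series and mapping chosen only for illustration. Values without a matching key are expected to become NaN.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'])

# Each value is looked up as a key; 'c' has no key, so it becomes NaN.
mapped = s.map({'a': 'apple', 'b': 'banana'})
print(mapped)

# Position of the missing results
print(mapped.isna().tolist())
# [False, False, True]
```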

How to use apply()

Similar to map(), the function specified as the first argument in apply() is applied to each value. The difference is that apply() allows you to specify arguments to be passed to the function.

With map(), you need to use a lambda expression or similar approach to pass arguments to the function. For example, specify the base argument in the int() function, which converts strings to integers.

s = pd.Series(['11', 'AA', 'FF'])
print(s)
# 0    11
# 1    AA
# 2    FF
# dtype: object

# print(s.map(int, base=16))
# TypeError: Series.map() got an unexpected keyword argument 'base'

print(s.map(lambda x: int(x, 16)))
# 0     17
# 1    170
# 2    255
# dtype: int64

With apply(), any specified keyword arguments are passed directly to the function. It is also possible to specify positional arguments using the args argument.

print(s.apply(int, base=16))
# 0     17
# 1    170
# 2    255
# dtype: int64

print(s.apply(int, args=(16,)))
# 0     17
# 1    170
# 2    255
# dtype: int64

Note that even if there is only one positional argument, it must be specified as a tuple or list in the args argument. A comma is necessary at the end of a one-element tuple.

As of version 2.1.2, apply() of Series does not have the na_action argument.
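Since na_action is not available here, one possible workaround is to skip missing values inside the function itself. The following is only a sketch of that idea, using pd.isna() to leave NaN untouched.

```python
import pandas as pd

s_nan = pd.Series([1, float('nan'), 100])

# Return NaN as-is; convert everything else to a hexadecimal string.
result = s_nan.apply(lambda x: x if pd.isna(x) else hex(int(x)))
print(result.tolist())
# ['0x1', nan, '0x64']
```

Alternatively, map() with na_action='ignore' gives the same result without the explicit check.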

Apply functions to values in DataFrame: map(), applymap()

To apply a function to each value in a DataFrame (element-wise), use the map() or applymap() methods.

As of version 2.1.0, applymap() has been renamed to map() and marked as deprecated.

As of version 2.1.2, applymap() is still usable but issues a FutureWarning.

df = pd.DataFrame([[1, 10, 100], [2, 20, 200]])
print(df)
#    0   1    2
# 0  1  10  100
# 1  2  20  200

print(df.map(hex))
#      0     1     2
# 0  0x1   0xa  0x64
# 1  0x2  0x14  0xc8

print(df.applymap(hex))
#      0     1     2
# 0  0x1   0xa  0x64
# 1  0x2  0x14  0xc8
#
# FutureWarning: DataFrame.applymap has been deprecated. Use DataFrame.map instead.

The following examples use map(), but applymap() has the same usage and functionality. In versions before 2.1.0, use applymap().
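If the same code has to run on pandas versions both before and after 2.1.0, one option is to pick the method at runtime. The sketch below is an assumption about how this could be done, not an official recipe; the name elementwise is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 10, 100], [2, 20, 200]])

# Use DataFrame.map() when it exists (pandas 2.1.0+);
# otherwise fall back to applymap() on older versions.
elementwise = df.map if hasattr(df, 'map') else df.applymap

result = elementwise(hex)
print(result.values.tolist())
# [['0x1', '0xa', '0x64'], ['0x2', '0x14', '0xc8']]
```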

As with map() of Series, the na_action argument can be specified for map() of DataFrame. By default, missing values (NaN) are passed to the function, but if na_action is set to 'ignore', NaN is not passed to the function and the result remains as NaN.

df_nan = pd.DataFrame([[1, float('nan'), 100], [2, 20, 200]])
print(df_nan)
#    0     1    2
# 0  1   NaN  100
# 1  2  20.0  200

# print(df_nan.map(lambda x: hex(int(x))))
# ValueError: cannot convert float NaN to integer

print(df_nan.map(lambda x: hex(int(x)), na_action='ignore'))
#      0     1     2
# 0  0x1   NaN  0x64
# 1  0x2  0x14  0xc8

Unlike map() of Series, map() of DataFrame passes the specified keyword argument to the function.

df = pd.DataFrame([['1', 'A', 'F'], ['11', 'AA', 'FF']])
print(df)
#     0   1   2
# 0   1   A   F
# 1  11  AA  FF

print(df.map(int, base=16))
#     0    1    2
# 0   1   10   15
# 1  17  170  255

As of version 2.1.2, map() of DataFrame does not have the args argument, which means you cannot specify positional arguments.

Apply functions to rows and columns in DataFrame: apply()

To apply a function to rows or columns in a DataFrame, use the apply() method.

To apply multiple operations at once, use the agg() method instead.
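As a brief sketch of the agg() method mentioned above, a list of operation names can be passed to compute several aggregations per column in one call. The choice of 'sum' and 'mean' here is only an example.

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])

# One row per operation, one column per original column.
result = df.agg(['sum', 'mean'])
print(result)

print(result.index.tolist())
# ['sum', 'mean']

print(result.loc['mean'].tolist())
# [25.0, 35.0, 45.0]
```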

Basic usage

Specify the function you want to apply as the first argument.

Note that the built-in sum() function is used for explanation purposes, but if you need to calculate a sum, it is better to use the sum() method mentioned later.

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
#     A   B   C
# X  10  20  30
# Y  40  50  60

print(df.apply(sum))
# A    50
# B    70
# C    90
# dtype: int64

By default, each column is passed to the function as a Series. If the function cannot accept a Series as an argument, an error will occur.

print(df.apply(lambda x: type(x)))
# A    <class 'pandas.core.series.Series'>
# B    <class 'pandas.core.series.Series'>
# C    <class 'pandas.core.series.Series'>
# dtype: object

# print(hex(df['A']))
# TypeError: 'Series' object cannot be interpreted as an integer

# print(df.apply(hex))
# TypeError: 'Series' object cannot be interpreted as an integer

Specify rows or columns: axis

By default, the function is applied to each column. However, setting the axis argument to 1 or 'columns' applies it to each row.

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
#     A   B   C
# X  10  20  30
# Y  40  50  60

print(df.apply(sum, axis=1))
# X     60
# Y    150
# dtype: int64
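With axis=1, each row arrives as a Series indexed by the column names, so individual columns can be referenced by label inside the function. The following sketch illustrates this with an arbitrary calculation; simple cases like this one could also be written as df['C'] - df['A'].

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])

# Each row is a Series, so row['A'] and row['C'] pick values by column label.
result = df.apply(lambda row: row['C'] - row['A'], axis=1)
print(result.tolist())
# [20, 20]

# axis='columns' is equivalent to axis=1.
same = df.apply(lambda row: row['C'] - row['A'], axis='columns')
print(result.equals(same))
# True
```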

Specify arguments for the function: Keyword arguments, args

Any keyword arguments specified in apply() are passed to the function being applied. You can also specify positional arguments using the args argument.

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
#     A   B   C
# X  10  20  30
# Y  40  50  60

def my_func(x, y, z):
    return sum(x) + y + z * 2

print(df.apply(my_func, y=100, z=1000))
# A    2150
# B    2170
# C    2190
# dtype: int64

print(df.apply(my_func, args=(100, 1000)))
# A    2150
# B    2170
# C    2190
# dtype: int64
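Positional and keyword arguments can also be combined, since apply() forwards both to the function. The sketch below reuses the my_func definition from the example above.

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])

def my_func(x, y, z):
    return sum(x) + y + z * 2

# y is passed positionally via args, z as a keyword argument.
result = df.apply(my_func, args=(100,), z=1000)
print(result.tolist())
# [2150, 2170, 2190]
```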

Pass as ndarray instead of Series: raw

By default, each row or column is passed as a Series. If you set the raw argument to True, they are passed as NumPy arrays (ndarray).

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
#     A   B   C
# X  10  20  30
# Y  40  50  60

print(df.apply(lambda x: type(x), raw=True))
# A    <class 'numpy.ndarray'>
# B    <class 'numpy.ndarray'>
# C    <class 'numpy.ndarray'>
# dtype: object

If there's no need for a Series, using raw=True is faster since the conversion process is omitted. However, if the function requires Series methods or attributes, setting raw=True will raise an error.

print(df.apply(lambda x: x.name * 3))
# A    AAA
# B    BBB
# C    CCC
# dtype: object

# print(df.apply(lambda x: x.name * 3, raw=True))
# AttributeError: 'numpy.ndarray' object has no attribute 'name'
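Functions that rely only on operations shared by Series and ndarray, such as max() and min(), should behave the same either way. The following sketch compares both settings for such a function.

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])

def value_range(x):
    # Works for both Series and ndarray because both provide max() and min().
    return x.max() - x.min()

by_series = df.apply(value_range)
by_array = df.apply(value_range, raw=True)

print(by_series.tolist())
# [30, 30, 30]

print(by_array.tolist())
# [30, 30, 30]
```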

Apply functions to specific rows or columns

To apply a function to a specific row or column, extract the row or column as a Series and use the map() or apply() methods of Series.

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
#     A   B   C
# X  10  20  30
# Y  40  50  60

print(df['A'].map(lambda x: x**2))
# X     100
# Y    1600
# Name: A, dtype: int64

print(df.loc['Y'].map(hex))
# A    0x28
# B    0x32
# C    0x3c
# Name: Y, dtype: object

You can add them as new rows or columns. If the same row or column names are specified, they will be overwritten.

df['A'] = df['A'].map(lambda x: x**2)
df.loc['Y_hex'] = df.loc['Y'].map(hex)
print(df)
#            A     B     C
# X        100    20    30
# Y       1600    50    60
# Y_hex  0x640  0x32  0x3c
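If the original DataFrame should stay unchanged, the assign() method is one alternative for adding a column, since it returns a new DataFrame. In the sketch below, the column name A_squared is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30], [40, 50, 60]], index=['X', 'Y'], columns=['A', 'B', 'C'])

# assign() returns a copy with the new column; df itself is not modified.
df_new = df.assign(A_squared=df['A'].map(lambda x: x**2))

print(df_new.columns.tolist())
# ['A', 'B', 'C', 'A_squared']

print(df_new['A_squared'].tolist())
# [100, 1600]

print(df.columns.tolist())
# ['A', 'B', 'C']
```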

Use methods of DataFrame and Series, and arithmetic operators

In pandas, common operations are provided as methods for DataFrame and Series, so there's no need to use map() or apply().

df = pd.DataFrame([[1, -2, 3], [-4, 5, -6]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
#    A  B  C
# X  1 -2  3
# Y -4  5 -6

print(df.abs())
#    A  B  C
# X  1  2  3
# Y  4  5  6

print(df.sum())
# A   -3
# B    3
# C   -3
# dtype: int64

print(df.sum(axis=1))
# X    2
# Y   -5
# dtype: int64

For a list of available methods, refer to the official documentation.

You can also process DataFrame and Series directly using arithmetic operators.

print(df * 10)
#     A   B   C
# X  10 -20  30
# Y -40  50 -60

print(df['A'].abs() + df['B'] * 100)
# X   -199
# Y    504
# dtype: int64

Methods for string manipulation are also available through the str accessor of Series.

df = pd.DataFrame([['a', 'ab', 'abc'], ['x', 'xy', 'xyz']], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
#    A   B    C
# X  a  ab  abc
# Y  x  xy  xyz

print(df['A'] + '-' + df['B'].str.upper() + '-' + df['C'].str.title())
# X    a-AB-Abc
# Y    x-XY-Xyz
# dtype: object
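One practical difference from map() is that the str accessor methods skip missing values on their own, whereas map() needs na_action='ignore' to do the same. The following sketch, using a made-up Series with one NaN, compares the two.

```python
import pandas as pd

s = pd.Series(['ab', float('nan'), 'xy'])

# The str accessor leaves the missing value as missing.
upper_str = s.str.upper()

# With map(), NaN must be skipped explicitly; otherwise str.upper() would
# receive a float and raise a TypeError.
upper_map = s.map(str.upper, na_action='ignore')

print(upper_str.isna().tolist())
# [False, True, False]

print(upper_map.isna().tolist())
# [False, True, False]
```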

Use NumPy functions

You can process DataFrame and Series by passing them to NumPy functions.

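For example, although pandas does not provide a method for rounding down to the nearest integer, you can use np.floor() instead. Note that np.floor() rounds toward negative infinity, so negative values such as -0.1 become -1.0 rather than being truncated to zero. For DataFrame, a DataFrame is returned; for Series, a Series is returned.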

df = pd.DataFrame([[0.1, 0.5, 0.9], [-0.1, -0.5, -0.9]], index=['X', 'Y'], columns=['A', 'B', 'C'])
print(df)
#      A    B    C
# X  0.1  0.5  0.9
# Y -0.1 -0.5 -0.9

print(np.floor(df))
#      A    B    C
# X  0.0  0.0  0.0
# Y -1.0 -1.0 -1.0

print(type(np.floor(df)))
# <class 'pandas.core.frame.DataFrame'>

print(np.floor(df['A']))
# X    0.0
# Y   -1.0
# Name: A, dtype: float64

print(type(np.floor(df['A'])))
# <class 'pandas.core.series.Series'>
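Because np.floor() rounds toward negative infinity, it differs from truncation for negative values. If rounding toward zero is what you need, np.trunc() can be used in the same way. The sketch below compares the two on made-up values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.9, -1.9]], columns=['pos', 'neg'])

# floor: round toward negative infinity
floored = np.floor(df)
print(floored.values.tolist())
# [[1.0, -2.0]]

# trunc: round toward zero (drop the fractional part)
truncated = np.trunc(df)
print(truncated.values.tolist())
# [[1.0, -1.0]]

print(type(truncated))
# <class 'pandas.core.frame.DataFrame'>
```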

It is also possible to specify the axis argument in the NumPy function.

print(np.sum(df, axis=0))
# A    0.0
# B    0.0
# C    0.0
# dtype: float64

print(np.sum(df, axis=1))
# X    1.5
# Y   -1.5
# dtype: float64

print(type(np.sum(df, axis=0)))
# <class 'pandas.core.series.Series'>

Speed comparison

Compare the processing speeds of the map() and apply() methods of DataFrame with other dedicated methods and NumPy functions.

Consider a DataFrame with 100 rows and 100 columns.

df = pd.DataFrame(np.arange(-5000, 5000).reshape(100, 100))

print(df.shape)
# (100, 100)

Note that the following examples use the %%timeit magic command in Jupyter Notebook. They won't work if executed as a Python script.

The results for using the built-in abs() function with map(), compared to using the abs() method of DataFrame and the np.abs() function, are as follows. In this measurement, map() is several hundred times slower than either alternative.

%%timeit
df.map(abs)
# 2.07 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df.abs()
# 5.06 µs ± 55 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%%timeit
np.abs(df)
# 7.81 µs ± 120 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

The results for using the built-in sum() function with apply(), compared to using the sum() method of DataFrame and the np.sum() function, are as follows. In this measurement, apply() is more than 20 times slower. Setting raw=True roughly halves the time, but it is still more than 10 times slower than sum() of DataFrame or np.sum().

%%timeit
df.apply(sum)
# 932 µs ± 95.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
df.apply(sum, raw=True)
# 427 µs ± 4.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

%%timeit
df.sum()
# 35 µs ± 140 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

%%timeit
np.sum(df, axis=0)
# 37.3 µs ± 66.9 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
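Outside Jupyter Notebook, a similar comparison can be made with the timeit module of the standard library. The following is a rough sketch; the loop count of 100 is an arbitrary choice, and the measured times depend on the environment.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(-5000, 5000).reshape(100, 100))

n = 100  # number of repetitions per measurement

# timeit.timeit() returns the total time in seconds for n calls.
t_apply = timeit.timeit(lambda: df.apply(sum), number=n)
t_method = timeit.timeit(lambda: df.sum(), number=n)

print(f'df.apply(sum): {t_apply / n * 1e6:.1f} µs per loop')
print(f'df.sum():      {t_method / n * 1e6:.1f} µs per loop')
```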

The map() and apply() methods should be used primarily for complex operations that cannot be achieved with other methods or NumPy functions. If possible, it is better to use other methods or NumPy functions.
