GROUP BY in Python: itertools.groupby
In Python, you can group consecutive elements of the same value in an iterable object, such as a list, with itertools.groupby().
import itertools
l = [0, 0, 0, 1, 1, 2, 0, 0]
print([(k, list(g)) for k, g in itertools.groupby(l)])
# [(0, [0, 0, 0]), (1, [1, 1]), (2, [2]), (0, [0, 0])]
To count the number of elements of the same value regardless of their order (consecutive or not), you can use collections.Counter.
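As a rough illustration of the difference, the sketch below counts the same sample list with collections.Counter. The variable names are only illustrative.

```python
import collections

values = [0, 0, 0, 1, 1, 2, 0, 0]

# Counter tallies every occurrence, whether or not equal values are adjacent.
counts = collections.Counter(values)
print(counts)
# Counter({0: 5, 1: 2, 2: 1})
print(counts[0])
# 5
```

Unlike groupby(), the two separate runs of 0 are merged into a single count.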
How to use itertools.groupby()
itertools.groupby()
returns an iterator of keys and groups. Note that these values are not displayed when using print()
.
l = [0, 0, 0, 1, 1, 2, 0, 0]
print(itertools.groupby(l))
# <itertools.groupby object at 0x110ab58b0>
Each returned group is also an iterator. The official documentation (itertools.groupby() in the Python 3.11.3 documentation, "Functions creating iterators for efficient looping") describes it as follows:
The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list.
As shown below, printing a group directly displays only the iterator object, so convert it with list() to see its contents:
for k, g in itertools.groupby(l):
    print(k, g)
# 0 <itertools._grouper object at 0x110a26940>
# 1 <itertools._grouper object at 0x110a2c400>
# 2 <itertools._grouper object at 0x110aa8f10>
# 0 <itertools._grouper object at 0x110aa8ee0>
for k, g in itertools.groupby(l):
    print(k, list(g))
# 0 [0, 0, 0]
# 1 [1, 1]
# 2 [2]
# 0 [0, 0]
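The point about a previous group no longer being visible can be seen in a small experiment. This is a minimal sketch that steps through the iterator manually with next(); the variable names are illustrative only.

```python
import itertools

values = [0, 0, 0, 1, 1, 2, 0, 0]

grouped = itertools.groupby(values)
first_key, first_group = next(grouped)
# Advancing groupby() to the next group invalidates first_group.
second_key, second_group = next(grouped)

stale = list(first_group)
current = list(second_group)
print(first_key, stale)
# 0 []
print(second_key, current)
# 1 [1, 1]
```

This is why the examples here call list(g) inside the loop, before the next key and group are requested.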
You can use list comprehensions to get a list of keys only, groups only, or both (tuples of key and group).
print([k for k, g in itertools.groupby(l)])
# [0, 1, 2, 0]
print([list(g) for k, g in itertools.groupby(l)])
# [[0, 0, 0], [1, 1], [2], [0, 0]]
print([(k, list(g)) for k, g in itertools.groupby(l)])
# [(0, [0, 0, 0]), (1, [1, 1]), (2, [2]), (0, [0, 0])]
Specify a function computing a key value for each element: key
You can specify the key parameter for itertools.groupby(). The key parameter is used in the same way as in other functions such as sorted(), max(), and min().
The function (callable object) specified in key is applied to each element, and consecutive elements whose results are equal are placed in the same group. For example, by specifying the built-in len() function, which returns the length of a string, you can group consecutive elements of the same length.
l = ['aaa', 'bbb', 'ccc', 'a', 'b', 'aa', 'bb']
print([(k, list(g)) for k, g in itertools.groupby(l, len)])
# [(3, ['aaa', 'bbb', 'ccc']), (1, ['a', 'b']), (2, ['aa', 'bb'])]
In the following example, a lambda expression is used to group by even or odd numbers.
l = [0, 2, 0, 3, 1, 4, 4, 0]
print([(k, list(g)) for k, g in itertools.groupby(l, lambda x: x % 2)])
# [(0, [0, 2, 0]), (1, [3, 1]), (0, [4, 4, 0])]
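Any callable that accepts one element can serve as key, including a method such as str.isdigit. The following sketch, using a made-up sample string, splits text into runs of digits and non-digits.

```python
import itertools

text = 'abc123de45'

# str.isdigit is called on each character; consecutive characters that give
# the same result (True or False) fall into the same group.
runs = [(is_digit, ''.join(g))
        for is_digit, g in itertools.groupby(text, str.isdigit)]
print(runs)
# [(False, 'abc'), (True, '123'), (False, 'de'), (True, '45')]
```

The key in each tuple is the return value of the key function, not the original element.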
Aggregate like GROUP BY in SQL
For two-dimensional data, such as a list of lists, you can use key to group data based on a given column, similar to GROUP BY in SQL.
In the following example, a lambda expression is used to fetch the element at a desired position in each inner list. operator.itemgetter() can also be used for this purpose.
While a for loop is used here for readability, you can also use list comprehensions, as shown in previous examples.
l = [[0, 'Alice', 0],
     [1, 'Alice', 10],
     [2, 'Bob', 20],
     [3, 'Bob', 30],
     [4, 'Alice', 40]]
for k, g in itertools.groupby(l, lambda x: x[1]):
    print(k, list(g))
# Alice [[0, 'Alice', 0], [1, 'Alice', 10]]
# Bob [[2, 'Bob', 20], [3, 'Bob', 30]]
# Alice [[4, 'Alice', 40]]
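For comparison, here is a sketch of the same grouping written with operator.itemgetter() instead of a lambda expression. The names rows and by_name are illustrative only.

```python
import itertools
import operator

rows = [[0, 'Alice', 0],
        [1, 'Alice', 10],
        [2, 'Bob', 20],
        [3, 'Bob', 30],
        [4, 'Alice', 40]]

# itemgetter(1) builds a callable that behaves like lambda x: x[1].
by_name = operator.itemgetter(1)
grouped = [(k, list(g)) for k, g in itertools.groupby(rows, by_name)]
for k, g in grouped:
    print(k, g)
# Alice [[0, 'Alice', 0], [1, 'Alice', 10]]
# Bob [[2, 'Bob', 20], [3, 'Bob', 30]]
# Alice [[4, 'Alice', 40]]
```

The result matches the lambda version; which form to use is largely a matter of taste.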
itertools.groupby() groups only consecutive elements of the same value. To group elements regardless of their order, sort the original list with sorted() before passing it to groupby().
When sorting a list of lists, the inner lists are compared element by element, starting with the first, by default. To sort by the element at a given position, specify the key parameter of sorted(), using the same key as for groupby().
for k, g in itertools.groupby(sorted(l, key=lambda x: x[1]), lambda x: x[1]):
    print(k, list(g))
# Alice [[0, 'Alice', 0], [1, 'Alice', 10], [4, 'Alice', 40]]
# Bob [[2, 'Bob', 20], [3, 'Bob', 30]]
You can sum the values in a column for each group with a generator expression:
for k, g in itertools.groupby(sorted(l, key=lambda x: x[1]), lambda x: x[1]):
    print(k, sum(x[2] for x in g))
# Alice 50
# Bob 50
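If you need more than one aggregate per group, remember that a group can be iterated only once. One possible approach, sketched below with illustrative names, is to store each group's values in a list and compute the aggregates from it.

```python
import itertools
import operator

rows = [[0, 'Alice', 0],
        [1, 'Alice', 10],
        [2, 'Bob', 20],
        [3, 'Bob', 30],
        [4, 'Alice', 40]]

by_name = operator.itemgetter(1)

summary = {}
for name, group in itertools.groupby(sorted(rows, key=by_name), by_name):
    # Store the third column in a list so it can be reused for each aggregate.
    amounts = [row[2] for row in group]
    summary[name] = {'count': len(amounts),
                     'sum': sum(amounts),
                     'max': max(amounts)}
print(summary)
# {'Alice': {'count': 3, 'sum': 50, 'max': 40}, 'Bob': {'count': 2, 'sum': 50, 'max': 30}}
```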
Note that the pandas library also offers groupby() for grouping and aggregation, which can be more convenient for handling complex data.
For tuples and strings
You can use itertools.groupby() to handle not only lists but also other iterable objects like tuples and strings.
For tuples:
t = (0, 0, 0, 1, 1, 2, 0, 0)
print([(k, list(g)) for k, g in itertools.groupby(t)])
# [(0, [0, 0, 0]), (1, [1, 1]), (2, [2]), (0, [0, 0])]
To convert a group into a tuple instead of a list, use tuple().
print(tuple((k, tuple(g)) for k, g in itertools.groupby(t)))
# ((0, (0, 0, 0)), (1, (1, 1)), (2, (2,)), (0, (0, 0)))
For strings:
s = 'aaabbcaa'
print([(k, list(g)) for k, g in itertools.groupby(s)])
# [('a', ['a', 'a', 'a']), ('b', ['b', 'b']), ('c', ['c']), ('a', ['a', 'a'])]
To convert a group into a string, use join().
print([(k, ''.join(g)) for k, g in itertools.groupby(s)])
# [('a', 'aaa'), ('b', 'bb'), ('c', 'c'), ('a', 'aa')]
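A common application of grouping a string this way is run-length encoding, where each run is replaced by its character and length. The following is a minimal sketch, not a complete encoder.

```python
import itertools

s = 'aaabbcaa'

# sum(1 for _ in g) counts the items in a group without building a list.
run_lengths = [(k, sum(1 for _ in g)) for k, g in itertools.groupby(s)]
print(run_lengths)
# [('a', 3), ('b', 2), ('c', 1), ('a', 2)]
```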
Of course, you can also handle any other iterable object with itertools.groupby().
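For instance, the input can be a generator, which produces its values lazily. The sketch below groups squares by their tens digit; the generator and the key function are made up for illustration.

```python
import itertools

# The generator yields 0, 1, 4, 9, 16, 25, 36, 49 one value at a time.
squares = (x * x for x in range(8))

# x // 10 maps each square to its tens bucket: 0, 0, 0, 0, 1, 2, 3, 4.
buckets = [(k, list(g)) for k, g in itertools.groupby(squares, lambda x: x // 10)]
print(buckets)
# [(0, [0, 1, 4, 9]), (1, [16]), (2, [25]), (3, [36]), (4, [49])]
```

Because a generator can be consumed only once, grouping it a second time would require creating it again.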