Regular Expressions in Python: the re Module

Modified: 2023-05-09 | Tags: Python, String, Regex

In Python, the re module allows you to work with regular expressions (regex) to extract, replace, and split strings based on specific patterns.

This article first explains the functions and methods of the re module, then explains the metacharacters (special characters) and special sequences available in the re module. While the syntax is mostly standard for regular expressions, be cautious with flags, especially re.ASCII.

Contents

Compile a regex pattern: compile()
- Execute with module-level functions
- Execute with regex pattern object methods
Match object
Match string beginning: match()
Match anywhere in the string: search()
Match entire string: fullmatch()
Get all matches in a list: findall()
Get all matches as an iterator: finditer()
Replace matching parts: sub(), subn()
Split a string using a regex pattern: split()
Metacharacters and special sequences in the re module
Flags
Greedy and non-greedy matching

Compile a regex pattern: `compile()`

There are two ways to execute regex processing with the re module.

Execute with module-level functions

The first method uses module-level functions. Functions like re.match() and re.sub() allow string extraction and replacement based on regex patterns.

These functions take a regex pattern string and the target string as arguments. Details will be explained later.

import re

s = 'aaa@xxx.com bbb@yyy.net ccc@zzz.org'

print(re.match(r'([a-z]+)@([a-z]+)\.com', s))
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

print(re.sub(r'([a-z]+)@([a-z]+)\.com', 'NEW_ADDRESS', s))
# NEW_ADDRESS bbb@yyy.net ccc@zzz.org

source: re_compile.py

In this example, the regex pattern [a-z] matches any single character from a to z, and the + symbol means that the preceding pattern is repeated one or more times. So, [a-z]+ matches a run of one or more lowercase letters.

. is a metacharacter (a character with a special meaning), so it must be escaped as \. to match a literal dot.

Regex pattern strings often use a backslash \, so using raw strings, like in this example, can be helpful.

Raw strings in Python

Execute with regex pattern object methods

The second method involves using regex pattern object methods.

You can compile a regex pattern string into a regex pattern object using re.compile().

p = re.compile(r'([a-z]+)@([a-z]+)\.com')

print(p)
# re.compile('([a-z]+)@([a-z]+)\\.com')

print(type(p))
# <class 're.Pattern'>

source: re_compile.py

The same processing done by module-level functions like re.match() and re.sub() can be executed using regex object methods such as match() and sub().

print(p.match(s))
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

print(p.sub('NEW_ADDRESS', s))
# NEW_ADDRESS bbb@yyy.net ccc@zzz.org

source: re_compile.py

All functions such as re.xxx() described below are also provided as methods of regex objects.

It is more efficient to create and reuse a regex object when repeatedly performing the same pattern-based processing.

In the following sample code, the function is used without compiling, but if the same pattern is used repeatedly, it is recommended to pre-compile and execute as a regex object method.
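As a minimal sketch of the reuse recommended above, the pattern can be compiled once outside a loop and its match() method called for each input. The list `lines` and the names `address` and `local_parts` are made up for illustration and are not part of the original sample code.

```python
import re

# Hypothetical sample data (not from the article's source files).
lines = [
    'aaa@xxx.com',
    'not an address',
    'bbb@yyy.net',
]

# Compile once, outside the loop, then reuse the pattern object.
address = re.compile(r'([a-z]+)@([a-z]+)\.([a-z]+)')

local_parts = []
for line in lines:
    m = address.match(line)
    if m:  # match() returns None when the line does not match
        local_parts.append(m.group(1))

print(local_parts)
# ['aaa', 'bbb']
```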

Match object

Functions like match() and search() return a match object.

s = 'aaa@xxx.com'

m = re.match(r'[a-z]+@[a-z]+\.[a-z]+', s)
print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

print(type(m))
# <class 're.Match'>

source: re_match_object.py

Match objects have various useful methods, such as:

Get matched positions: start(), end(), span()
Get matched strings: group()
Get strings of each group: groups()

print(m.start())
# 0

print(m.end())
# 11

print(m.span())
# (0, 11)

print(m.group())
# aaa@xxx.com

source: re_match_object.py

Enclosing part of the regex pattern in parentheses () makes that part a group. You can then get the strings matched by each group as a tuple with groups().

m = re.match(r'([a-z]+)@([a-z]+)\.([a-z]+)', s)
print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

print(m.groups())
# ('aaa', 'xxx', 'com')

source: re_match_object.py

When grouping, you can specify a number as an argument for group() to get the string of any group. If omitted or 0 is specified, the entire match is returned, and if a number from 1 is specified, the strings of each group are returned in order.

print(m.group())
# aaa@xxx.com

print(m.group(0))
# aaa@xxx.com

print(m.group(1))
# aaa

print(m.group(2))
# xxx

print(m.group(3))
# com

source: re_match_object.py

See the following article for more information on match objects.

How to use regex match objects in Python
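The group-related methods above can be combined in a few ways. The following sketch unpacks groups() into variables, passes a group number to span(), and passes several numbers to group() at once. The variable names `local`, `sld`, and `tld` are chosen here only for illustration.

```python
import re

m = re.match(r'([a-z]+)@([a-z]+)\.([a-z]+)', 'aaa@xxx.com')

# groups() returns a tuple, so it can be unpacked directly.
local, sld, tld = m.groups()
print(local, sld, tld)
# aaa xxx com

# start(), end(), and span() also accept a group number.
print(m.span(2))
# (4, 7)

# group() accepts several group numbers and then returns a tuple.
print(m.group(1, 3))
# ('aaa', 'com')
```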

Match string beginning: `match()`

match() returns a match object if the beginning of the string matches the pattern. As mentioned above, you can use the match object to extract the matched substring or check for a match.

match() only checks the beginning of the string and returns None if there is no match.

s = 'aaa@xxx.com bbb@yyy.net ccc@zzz.org'

print(re.match(r'[a-z]+@[a-z]+\.com', s))
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

print(re.match(r'[a-z]+@[a-z]+\.net', s))
# None

source: re_match_search_fullmatch.py
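Because match() returns None when there is no match, calling a method such as group() on the result without checking it first raises AttributeError. A minimal sketch of the usual guard follows; the helper name `first_domain` is hypothetical.

```python
import re

def first_domain(text):
    """Return the domain of an address at the start of text, or None."""
    m = re.match(r'[a-z]+@([a-z]+\.[a-z]+)', text)
    if m is None:  # no match at the beginning of the string
        return None
    return m.group(1)

print(first_domain('aaa@xxx.com bbb@yyy.net'))
# xxx.com

print(first_domain('!!! aaa@xxx.com'))
# None
```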

Match anywhere in the string: `search()`

search() searches the whole string and returns a match object if a match is found. If multiple matches are found, only the first one is returned.

s = 'aaa@xxx.com bbb@yyy.net ccc@zzz.org'

print(re.search(r'[a-z]+@[a-z]+\.net', s))
# <re.Match object; span=(12, 23), match='bbb@yyy.net'>

print(re.search(r'[a-z]+@[a-z]+\.[a-z]+', s))
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

source: re_match_search_fullmatch.py

If you want to get all matching parts, use findall() or finditer() described later.

Match entire string: `fullmatch()`

Use fullmatch() to check if the entire string matches the regex pattern. It returns a match object if the entire string matches, and None otherwise.

s = 'aaa@xxx.com'
print(re.fullmatch(r'[a-z]+@[a-z]+\.com', s))
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

s = '!!!aaa@xxx.com!!!'
print(re.fullmatch(r'[a-z]+@[a-z]+\.com', s))
# None

source: re_match_search_fullmatch.py
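The difference from match() shows up when extra characters follow the matching part. The sketch below compares the two on the same input and wraps fullmatch() in a small validation helper. The name `is_com_address` is an assumption made for this example.

```python
import re

pattern = r'[a-z]+@[a-z]+\.com'
s = 'aaa@xxx.com!!!'

# match() only requires the beginning of the string to match.
print(re.match(pattern, s))
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

# fullmatch() requires the whole string to match.
print(re.fullmatch(pattern, s))
# None

def is_com_address(text):
    """Return True only if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(is_com_address('aaa@xxx.com'))
# True

print(is_com_address(s))
# False
```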

Get all matches in a list: `findall()`

findall() returns all matching substrings as a list. Note that the elements of the list are strings, not match objects.

s = 'aaa@xxx.com bbb@yyy.net ccc@zzz.org'

result = re.findall(r'[a-z]+@[a-z]+\.[a-z]+', s)
print(result)
# ['aaa@xxx.com', 'bbb@yyy.net', 'ccc@zzz.org']

source: re_findall_finditer.py

To check the number of matched parts, use the built-in len() function.

print(len(result))
# 3

source: re_findall_finditer.py

If the regex pattern contains two or more groups in parentheses (), a list of tuples containing the strings of each group is returned. With exactly one group, the list contains that group's strings instead.

print(re.findall(r'([a-z]+)@([a-z]+)\.([a-z]+)', s))
# [('aaa', 'xxx', 'com'), ('bbb', 'yyy', 'net'), ('ccc', 'zzz', 'org')]

source: re_findall_finditer.py

Since the grouping parentheses () can be nested, if you want to get the entire match as well, you can enclose the entire pattern in parentheses ().

print(re.findall(r'(([a-z]+)@([a-z]+)\.([a-z]+))', s))
# [('aaa@xxx.com', 'aaa', 'xxx', 'com'), ('bbb@yyy.net', 'bbb', 'yyy', 'net'), ('ccc@zzz.org', 'ccc', 'zzz', 'org')]

source: re_findall_finditer.py

If there is no match, an empty list is returned.

print(re.findall('[0-9]+', s))
# []

source: re_findall_finditer.py
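Parentheses are sometimes needed only for alternation or repetition, not for extraction. In that case a non-capturing group (?:...) keeps findall() returning the whole matches. This is a small sketch that goes slightly beyond the samples above. The variable names `captured` and `whole` are illustrative.

```python
import re

s = 'aaa@xxx.com bbb@yyy.net ccc@zzz.org'

# One capturing group: findall() returns only that group's strings.
captured = re.findall(r'[a-z]+@[a-z]+\.(com|net)', s)
print(captured)
# ['com', 'net']

# Non-capturing group (?:...): findall() returns the whole matches.
whole = re.findall(r'[a-z]+@[a-z]+\.(?:com|net)', s)
print(whole)
# ['aaa@xxx.com', 'bbb@yyy.net']
```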

Get all matches as an iterator: `finditer()`

finditer() returns all matching parts as an iterator with match objects as elements.

The iterator does not display its elements when printed with print(). To extract elements one by one, use the built-in next() function or a for loop.

s = 'aaa@xxx.com bbb@yyy.net ccc@zzz.org'

result = re.finditer(r'[a-z]+@[a-z]+\.[a-z]+', s)
print(result)
# <callable_iterator object at 0x107863070>

print(type(result))
# <class 'callable_iterator'>

for m in result:
    print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>
# <re.Match object; span=(12, 23), match='bbb@yyy.net'>
# <re.Match object; span=(24, 35), match='ccc@zzz.org'>

source: re_findall_finditer.py

You can also convert the iterator to a list using list().

l = list(re.finditer(r'[a-z]+@[a-z]+\.[a-z]+', s))
print(l)
# [<re.Match object; span=(0, 11), match='aaa@xxx.com'>, <re.Match object; span=(12, 23), match='bbb@yyy.net'>, <re.Match object; span=(24, 35), match='ccc@zzz.org'>]

print(l[0])
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

print(type(l[0]))
# <class 're.Match'>

print(l[0].span())
# (0, 11)

source: re_findall_finditer.py

If you want to get the positions of all matching parts, using list comprehensions is more convenient than using list().

List comprehensions in Python

print([m.span() for m in re.finditer(r'[a-z]+@[a-z]+\.[a-z]+', s)])
# [(0, 11), (12, 23), (24, 35)]

source: re_findall_finditer.py

Iterators access elements in order. Be careful not to try to access elements after reaching the end, as nothing will be left.

result = re.finditer(r'[a-z]+@[a-z]+\.[a-z]+', s)

for m in result:
    print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>
# <re.Match object; span=(12, 23), match='bbb@yyy.net'>
# <re.Match object; span=(24, 35), match='ccc@zzz.org'>

print(list(result))
# []

source: re_findall_finditer.py

Replace matching parts: `sub()`, `subn()`

sub() allows you to replace matched parts with another string. Specify the regex pattern as the first argument, the replacement string as the second, and the target string as the third.

s = 'aaa@xxx.com bbb@yyy.net ccc@zzz.org'

print(re.sub('[a-z]+@', 'ABC@', s))
# ABC@xxx.com ABC@yyy.net ABC@zzz.org

source: re_sub_subn.py

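You can limit the number of replacements with the count argument. Passing it by keyword is the safer choice, since passing count positionally (as the fourth argument) is deprecated as of Python 3.13.

print(re.sub('[a-z]+@', 'ABC@', s, count=2))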
# ABC@xxx.com ABC@yyy.net ccc@zzz.org

source: re_sub_subn.py

If you group using parentheses (), you can use the matched string in the replacement string.

By default, \1, \2, and \3 correspond to the parts matched by the first (), second (), and third (). When using regular strings instead of raw strings, you need to escape the backslash like '\\1'.

print(re.sub('([a-z]+)@([a-z]+)', '\\2@\\1', s))
# xxx@aaa.com yyy@bbb.net zzz@ccc.org

print(re.sub('([a-z]+)@([a-z]+)', r'\2@\1', s))
# xxx@aaa.com yyy@bbb.net zzz@ccc.org

source: re_sub_subn.py

By writing ?P<xxx> at the beginning of () in the regex pattern to give the group a name, you can refer to the group by name with \g<xxx> instead of by number like \1.

print(re.sub('(?P<local>[a-z]+)@(?P<SLD>[a-z]+)', r'\g<SLD>@\g<local>', s))
# xxx@aaa.com yyy@bbb.net zzz@ccc.org

source: re_sub_subn.py

You can also specify a function that takes a match object as an argument for the second argument, allowing for more complex processing.

def func(matchobj):
    return matchobj.group(2).upper() + '@' + matchobj.group(1)

print(re.sub('([a-z]+)@([a-z]+)', func, s))
# XXX@aaa.com YYY@bbb.net ZZZ@ccc.org

source: re_sub_subn.py

Lambda expressions can also be used.

Lambda expressions in Python

print(re.sub('([a-z]+)@([a-z]+)', lambda m: m.group(2).upper() + '@' + m.group(1), s))
# XXX@aaa.com YYY@bbb.net ZZZ@ccc.org

source: re_sub_subn.py
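A replacement function is useful when the new text depends on a lookup rather than a fixed template. The following sketch replaces top-level domains using a dictionary and leaves unknown ones unchanged. The mapping `new_tld` and the function name `swap_tld` are made up for this example.

```python
import re

s = 'aaa@xxx.com bbb@yyy.net ccc@zzz.org'

# Hypothetical mapping from old to new top-level domains.
new_tld = {'com': 'example', 'net': 'test'}

def swap_tld(m):
    """Return the replacement for one match of r'\\.([a-z]+)'."""
    tld = m.group(1)
    # Fall back to the original string when there is no mapping.
    return '.' + new_tld.get(tld, tld)

result = re.sub(r'\.([a-z]+)', swap_tld, s)
print(result)
# aaa@xxx.example bbb@yyy.test ccc@zzz.org
```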

subn() returns a tuple containing the replaced string and the number of replaced parts (the number of matches to the pattern).

t = re.subn('[a-z]*@', 'ABC@', s)
print(t)
# ('ABC@xxx.com ABC@yyy.net ABC@zzz.org', 3)

print(type(t))
# <class 'tuple'>

print(t[0])
# ABC@xxx.com ABC@yyy.net ABC@zzz.org

print(t[1])
# 3

source: re_sub_subn.py

The arguments are the same as for sub(). You can refer to groups captured with () and specify the count argument.

print(re.subn('([a-z]+)@([a-z]+)', r'\2@\1', s, count=2))
# ('xxx@aaa.com yyy@bbb.net ccc@zzz.org', 2)

source: re_sub_subn.py

For more information on string replacement, see the following article.

Replace strings in Python (replace, translate, re.sub, re.subn)

Split a string using a regex pattern: `split()`

split() splits a string at the parts that match the pattern and returns the result as a list.

When the pattern matches at the beginning or end of the string, be aware that an empty string '' is included at the corresponding end of the resulting list.

s = '111aaa222bbb333'

print(re.split('[a-z]+', s))
# ['111', '222', '333']

print(re.split('[0-9]+', s))
# ['', 'aaa', 'bbb', '']

source: re_split.py
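If the empty strings are not wanted, one simple option is to filter them out with a list comprehension. This is only a sketch of one approach. The names `parts` and `non_empty` are illustrative.

```python
import re

s = '111aaa222bbb333'

parts = re.split('[0-9]+', s)
print(parts)
# ['', 'aaa', 'bbb', '']

# An empty string is falsy, so "if p" drops the empty elements.
non_empty = [p for p in parts if p]
print(non_empty)
# ['aaa', 'bbb']
```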

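You can limit the number of splits with the maxsplit argument. Passing it by keyword is preferable, since passing maxsplit positionally (as the third argument) is deprecated as of Python 3.13.

print(re.split('[a-z]+', s, maxsplit=1))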
# ['111', '222bbb333']

source: re_split.py

For more information on splitting strings, see the following article.

Split a string in Python (delimiter, line break, regex, and more)

Metacharacters and special sequences in the `re` module

The main regex metacharacters and special sequences that can be used in the re module are as follows.

Metacharacter	Description
`.`	Matches any character except newline
`^`	Matches the start of a string
`$`	Matches the end of a string
`*`	Matches 0 or more repetitions of the preceding pattern
`+`	Matches 1 or more repetitions of the preceding pattern
`?`	Matches 0 or 1 repetitions of the preceding pattern
`{m}`	Matches `m` repetitions of the preceding pattern
`{m,n}`	Matches between `m` and `n` repetitions of the preceding pattern (write it without a space after the comma)
`[]`	Matches a character set (any one character within the `[]`)
`|`	OR (`A|B` matches either `A` or `B` where `A` and `B` are patterns)

Special Sequence	Description
`\d`	Unicode decimal digit
`\D`	Opposite of `\d`
`\s`	Unicode whitespace character
`\S`	Opposite of `\s`
`\w`	Unicode word character and `_`
`\W`	Opposite of `\w`
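To make the tables above concrete, here is a short sketch that tries several of the metacharacters and special sequences with findall(). The test strings are invented for illustration.

```python
import re

# . matches any single character except newline
print(re.findall(r'a.c', 'abc a-c ac'))
# ['abc', 'a-c']

# ^ and $ anchor the match to the start and end of the string
print(re.findall(r'^\w+', 'foo bar'))
# ['foo']
print(re.findall(r'\w+$', 'foo bar'))
# ['bar']

# *, +, ? control how often the preceding pattern repeats
print(re.findall(r'ab*', 'a ab abbb'))
# ['a', 'ab', 'abbb']
print(re.findall(r'ab+', 'a ab abbb'))
# ['ab', 'abbb']
print(re.findall(r'ab?', 'a ab abbb'))
# ['a', 'ab', 'ab']

# {m} and {m,n} give explicit repetition counts
print(re.findall(r'\d{3}', '12 345 6789'))
# ['345', '678']
print(re.findall(r'\d{2,3}', '1 22 333 4444'))
# ['22', '333', '444']

# [] is a character set, | is OR
print(re.findall(r'[abc]+|\d+', 'abc-123-xyz'))
# ['abc', '123']

# \w matches word characters, including the underscore
print(re.findall(r'\w+', 'foo_bar baz-1'))
# ['foo_bar', 'baz', '1']
```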

For a complete list, check the official documentation:

re - Regular Expression Syntax — Regular expression operations — Python 3.11.3 documentation

Note that some characters have different meanings in Python 2.

7.2. re - Regular Expression Syntax — Regular expression operations — Python 2.7.18 documentation

For basic examples of these characters, see the following article:

Extract a substring from a string in Python (position, regex)

Flags

Flags affect the behavior of metacharacters and special sequences in regex patterns.

Only the main flags are introduced here. For others, refer to the official documentation.

re - Flags — Regular expression operations — Python 3.11.3 documentation

Match ASCII characters only: `re.ASCII`, `re.A`

By default, unlike standard regular expressions, \w is not equivalent to [a-zA-Z0-9_]. For example, \w matches full-width alphanumeric characters, Japanese, etc.

print(re.match(r'\w+', 'あいう漢字ＡＢＣ１２３'))
# <re.Match object; span=(0, 11), match='あいう漢字ＡＢＣ１２３'>

print(re.match('[a-zA-Z0-9_]+', 'あいう漢字ＡＢＣ１２３'))
# None

source: re_flag.py

To match only ASCII characters, use re.ASCII as the flags argument in each function or add the inline flag (?a) at the beginning of the regex pattern string. In this case, \w is equivalent to [a-zA-Z0-9_]

print(re.match(r'\w+', 'あいう漢字ＡＢＣ１２３', flags=re.ASCII))
# None

print(re.match(r'(?a)\w+', 'あいう漢字ＡＢＣ１２３'))
# None

source: re_flag.py

The same applies when using re.compile() to compile the pattern. Use the flags argument or the inline flag.

p = re.compile(r'\w+', flags=re.ASCII)
print(p)
# re.compile('\\w+', re.ASCII)

print(p.match('あいう漢字ＡＢＣ１２３'))
# None

p = re.compile(r'(?a)\w+')
print(p)
# re.compile('(?a)\\w+', re.ASCII)

print(p.match('あいう漢字ＡＢＣ１２３'))
# None

source: re_flag.py

re.ASCII is also available as re.A.

print(re.ASCII is re.A)
# True

source: re_flag.py

\W, which represents the opposite of \w, is also affected by re.ASCII or (?a).

print(re.match(r'\W+', 'あいう漢字ＡＢＣ１２３'))
# None

print(re.match(r'\W+', 'あいう漢字ＡＢＣ１２３', flags=re.ASCII))
# <re.Match object; span=(0, 11), match='あいう漢字ＡＢＣ１２３'>

source: re_flag.py

\d and \s match both half-width and full-width characters by default. re.ASCII or (?a) limits them to half-width characters.

print(re.match(r'\d+', '123'))
# <re.Match object; span=(0, 3), match='123'>

print(re.match(r'\d+', '１２３'))
# <re.Match object; span=(0, 3), match='１２３'>

print(re.match(r'\d+', '123', flags=re.ASCII))
# <re.Match object; span=(0, 3), match='123'>

print(re.match(r'\d+', '１２３', flags=re.ASCII))
# None

print(re.match(r'\s+', '　'))  # full-width space
# <re.Match object; span=(0, 1), match='\u3000'>

print(re.match(r'\s+', '　', flags=re.ASCII))
# None

source: re_flag.py

Their opposites, \D and \S, are also affected by re.ASCII or (?a).
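A short sketch of that effect on \D and \S, using the same full-width digits as above and a full-width space written as the escape '\u3000' so that it is visible in the code:

```python
import re

# Full-width digits are digits in Unicode mode, so \D does not match them.
print(re.match(r'\D+', '１２３'))
# None

# With re.ASCII, only 0-9 count as digits, so \D matches full-width digits.
print(re.match(r'\D+', '１２３', flags=re.ASCII))
# <re.Match object; span=(0, 3), match='１２３'>

# The full-width space is whitespace in Unicode mode, so \S does not match it.
print(re.match(r'\S+', '\u3000'))
# None

# With re.ASCII, it is no longer treated as whitespace, so \S matches it.
print(re.match(r'\S+', '\u3000', flags=re.ASCII))
# <re.Match object; span=(0, 1), match='\u3000'>
```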

Case-insensitive matching: `re.IGNORECASE`, `re.I`

By default, the case is considered when matching.

Use re.IGNORECASE for case-insensitive matching, which is equivalent to the i flag in standard regex.

print(re.match('[a-zA-Z]+', 'abcABC'))
# <re.Match object; span=(0, 6), match='abcABC'>

print(re.match('[a-z]+', 'abcABC', flags=re.IGNORECASE))
# <re.Match object; span=(0, 6), match='abcABC'>

print(re.match('[A-Z]+', 'abcABC', flags=re.IGNORECASE))
# <re.Match object; span=(0, 6), match='abcABC'>

source: re_flag.py

You can also use the inline flag (?i) or the abbreviation re.I.
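For completeness, a brief sketch of the inline flag and the abbreviation, using the same test string as above:

```python
import re

# Inline flag at the beginning of the pattern string
m_inline = re.match('(?i)[a-z]+', 'abcABC')
print(m_inline)
# <re.Match object; span=(0, 6), match='abcABC'>

# Abbreviated flag name
m_short = re.match('[a-z]+', 'abcABC', flags=re.I)
print(m_short)
# <re.Match object; span=(0, 6), match='abcABC'>

print(re.IGNORECASE is re.I)
# True
```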

Match at the beginning and end of each line: `re.MULTILINE`, `re.M`

^ matches the start of a string.

By default, it only matches the start of the entire string. Use re.MULTILINE to match the start of each line, equivalent to the m flag in standard regex.

s = '''aaa-xxx
bbb-yyy
ccc-zzz'''

print(s)
# aaa-xxx
# bbb-yyy
# ccc-zzz

print(re.findall('^[a-z]+', s))
# ['aaa']

print(re.findall('^[a-z]+', s, flags=re.MULTILINE))
# ['aaa', 'bbb', 'ccc']

source: re_flag.py

Similarly, $ matches the end of a string. By default, it matches the end of the entire string. Use re.MULTILINE to match the end of each line.

print(re.findall('[a-z]+$', s))
# ['zzz']

print(re.findall('[a-z]+$', s, flags=re.MULTILINE))
# ['xxx', 'yyy', 'zzz']

source: re_flag.py

You can also use the inline flag (?m) or the abbreviation re.M.
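The same results can be obtained with the inline flag or the abbreviation, as in this small sketch. It redefines the three-line string with explicit \n so that the block runs on its own.

```python
import re

s = 'aaa-xxx\nbbb-yyy\nccc-zzz'

# Inline flag (?m) at the beginning of the pattern
starts = re.findall('(?m)^[a-z]+', s)
print(starts)
# ['aaa', 'bbb', 'ccc']

# Abbreviation re.M
ends = re.findall('[a-z]+$', s, flags=re.M)
print(ends)
# ['xxx', 'yyy', 'zzz']
```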

Specify multiple flags

Use | to enable multiple flags simultaneously. For inline flags, write like (?am).

s = '''aaa-xxx
あああ-んんん
bbb-zzz'''

print(s)
# aaa-xxx
# あああ-んんん
# bbb-zzz

print(re.findall(r'^\w+', s, flags=re.M))
# ['aaa', 'あああ', 'bbb']

print(re.findall(r'^\w+', s, flags=re.M | re.A))
# ['aaa', 'bbb']

print(re.findall(r'(?am)^\w+', s))
# ['aaa', 'bbb']

source: re_flag.py

Greedy and non-greedy matching

This is a general issue in regular expressions and not specific to Python, but it is worth mentioning as it can be a common pitfall.

By default, *, +, and ? perform greedy matching, matching the longest possible string.

s = 'aaa@xxx.com bbb@yyy.com'

m = re.match(r'.+com', s)
print(m)
# <re.Match object; span=(0, 23), match='aaa@xxx.com bbb@yyy.com'>

print(m.group())
# aaa@xxx.com bbb@yyy.com

source: re_greedy.py

Adding ? after the quantifier (i.e., *?, +?, ??) switches to non-greedy (minimal) matching, which matches the shortest possible string.

m = re.match(r'.+?com', s)
print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

print(m.group())
# aaa@xxx.com

source: re_greedy.py
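The difference is easier to see with findall(). The sketch below also shows a related pitfall: a non-greedy . still matches the separating space, so a more specific pattern such as \S (chosen here only as one possible alternative) may be closer to what is intended.

```python
import re

s = 'aaa@xxx.com bbb@yyy.com'

# Greedy: one long match that runs to the last 'com'
greedy = re.findall(r'.+com', s)
print(greedy)
# ['aaa@xxx.com bbb@yyy.com']

# Non-greedy: stops at the first 'com', but the second match
# starts right after the first and so includes the space.
lazy = re.findall(r'.+?com', s)
print(lazy)
# ['aaa@xxx.com', ' bbb@yyy.com']

# \S excludes whitespace, so the space is not part of any match.
no_space = re.findall(r'\S+?com', s)
print(no_space)
# ['aaa@xxx.com', 'bbb@yyy.com']
```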
