Regular Expressions in Python: the re Module

Tags: Python, String, Regex

In Python, the re module allows you to work with regular expressions (regex) to extract, replace, and split strings based on specific patterns.

This article first explains the functions and methods of the re module, then explains the metacharacters (special characters) and special sequences available in the re module. While the syntax is mostly standard for regular expressions, be cautious with flags, especially re.ASCII.

Compile a regex pattern: compile()

There are two ways to execute regex processing with the re module.

Execute with module-level functions

The first method uses module-level functions. Functions like re.match() and re.sub() allow string extraction and replacement based on regex patterns.

These functions take a regex pattern string and the target string as arguments. Details will be explained later.

import re

s = 'aaa@xxx.com bbb@yyy.net ccc@zzz.org'

print(re.match(r'([a-z]+)@([a-z]+)\.com', s))
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

print(re.sub(r'([a-z]+)@([a-z]+)\.com', 'NEW_ADDRESS', s))
# NEW_ADDRESS bbb@yyy.net ccc@zzz.org

In this example, the regex pattern [a-z] matches any character from a to z, and the + symbol indicates that the preceding pattern is repeated one or more times. So, [a-z]+ matches a run of one or more lowercase letters.

. is a metacharacter (a character with special meaning), so to match a literal dot it must be escaped as \.

Regex pattern strings often use a backslash \, so using raw strings, like in this example, can be helpful.
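As a small illustration of why raw strings help, the sketch below writes the same pattern as a raw string and as a regular string with a doubled backslash. The variable names are chosen only for this example.

```python
import re

# The same pattern written as a raw string and as a regular string.
raw_pattern = r'[a-z]+\.com'
escaped_pattern = '[a-z]+\\.com'

# Both literals produce exactly the same characters.
same = raw_pattern == escaped_pattern
print(same)
# True

m = re.search(raw_pattern, 'aaa@xxx.com')
print(m.group())
# xxx.com
```

The raw string form is usually easier to read because each backslash appears only once.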

Execute with regex pattern object methods

The second method involves using regex pattern object methods.

You can compile a regex pattern string into a regex pattern object using re.compile().

p = re.compile(r'([a-z]+)@([a-z]+)\.com')

print(p)
# re.compile('([a-z]+)@([a-z]+)\\.com')

print(type(p))
# <class 're.Pattern'>

The same processing done by module-level functions like re.match() and re.sub() can be executed using regex object methods such as match() and sub().

print(p.match(s))
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

print(p.sub('NEW_ADDRESS', s))
# NEW_ADDRESS bbb@yyy.net ccc@zzz.org

All functions such as re.xxx() described below are also provided as methods of regex objects.

It is more efficient to create and reuse a regex object when repeatedly performing the same pattern-based processing.

The sample code below uses the module-level functions without compiling. If you use the same pattern repeatedly, pre-compiling it and calling the methods of the regex object is recommended.
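A minimal sketch of reusing a compiled pattern across several strings might look like this. The names EMAIL_COM and addresses are illustrative and do not come from the original examples.

```python
import re

# Compile once, then reuse the pattern object for every input string.
# EMAIL_COM and addresses are hypothetical names for this sketch.
EMAIL_COM = re.compile(r'([a-z]+)@([a-z]+)\.com')

addresses = ['aaa@xxx.com', 'bbb@yyy.net', 'ccc@zzz.com']

# Keep only the strings whose beginning matches the pattern.
matched = [a for a in addresses if EMAIL_COM.match(a)]
print(matched)
# ['aaa@xxx.com', 'ccc@zzz.com']
```

The re module also caches recently used patterns internally, so the gain from compiling explicitly is mostly noticeable when many different patterns are in use or when the call sits in a tight loop.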

Match object

Functions like match() and search() return a match object.

s = 'aaa@xxx.com'

m = re.match(r'[a-z]+@[a-z]+\.[a-z]+', s)
print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

print(type(m))
# <class 're.Match'>

Match objects have various useful methods, such as:

  • Get matched positions: start(), end(), span()
  • Get matched strings: group()
  • Get strings of each group: groups()

print(m.start())
# 0

print(m.end())
# 11

print(m.span())
# (0, 11)

print(m.group())
# aaa@xxx.com

Enclosing part of a regex pattern in parentheses () makes that part a group. groups() then returns the strings matched by each group as a tuple.

m = re.match(r'([a-z]+)@([a-z]+)\.([a-z]+)', s)
print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

print(m.groups())
# ('aaa', 'xxx', 'com')

When the pattern contains groups, you can pass a number to group() to get the string of a specific group. If the argument is omitted or 0, the entire match is returned; numbers from 1 upward return the strings of the groups in order.

print(m.group())
# aaa@xxx.com

print(m.group(0))
# aaa@xxx.com

print(m.group(1))
# aaa

print(m.group(2))
# xxx

print(m.group(3))
# com

Match string beginning: match()

match() returns a match object if the beginning of the string matches the pattern. As mentioned above, you can use the match object to extract the matched substring or check for a match.

match() only checks the beginning of the string and returns None if there is no match.

s = 'aaa@xxx.com bbb@yyy.net ccc@zzz.org'

print(re.match(r'[a-z]+@[a-z]+\.com', s))
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

print(re.match(r'[a-z]+@[a-z]+\.net', s))
# None
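Because match() can return None, calling group() directly on its result may raise an AttributeError. One possible way to guard against that is sketched below; first_address is a hypothetical helper name.

```python
import re

s = 'aaa@xxx.com bbb@yyy.net ccc@zzz.org'

def first_address(pattern, text):
    """Return the text matched at the start of `text`, or None."""
    m = re.match(pattern, text)
    # A match object is always truthy, so `if m` is a safe test.
    if m:
        return m.group()
    return None

print(first_address(r'[a-z]+@[a-z]+\.com', s))
# aaa@xxx.com

print(first_address(r'[a-z]+@[a-z]+\.net', s))
# None
```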

Search for a match anywhere in the string: search()

search() searches the whole string and returns a match object if a match is found. If multiple matches are found, only the first one is returned.

s = 'aaa@xxx.com bbb@yyy.net ccc@zzz.org'

print(re.search(r'[a-z]+@[a-z]+\.net', s))
# <re.Match object; span=(12, 23), match='bbb@yyy.net'>

print(re.search(r'[a-z]+@[a-z]+\.[a-z]+', s))
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

If you want to get all matching parts, use findall() or finditer() described later.

Match entire string: fullmatch()

Use fullmatch() to check if the entire string matches the regex pattern. It returns a match object if the entire string matches, and None otherwise.

s = 'aaa@xxx.com'
print(re.fullmatch(r'[a-z]+@[a-z]+\.com', s))
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

s = '!!!aaa@xxx.com!!!'
print(re.fullmatch(r'[a-z]+@[a-z]+\.com', s))
# None
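fullmatch() is convenient for simple input validation. The following sketch assumes a deliberately simplified address pattern; SIMPLE_ADDRESS and is_simple_address are hypothetical names, and the pattern is not a complete email validator.

```python
import re

# Hypothetical, deliberately simple pattern; not a full email validator.
SIMPLE_ADDRESS = re.compile(r'[a-z]+@[a-z]+\.[a-z]+')

def is_simple_address(text):
    """Return True only if the whole string matches the pattern."""
    return SIMPLE_ADDRESS.fullmatch(text) is not None

print(is_simple_address('aaa@xxx.com'))
# True

print(is_simple_address('!!!aaa@xxx.com!!!'))
# False

# A trailing newline is part of the string, so it is rejected too.
print(is_simple_address('aaa@xxx.com\n'))
# False
```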

Get all matches in a list: findall()

findall() returns all matching substrings as a list. Note that the elements of the list are strings, not match objects.

s = 'aaa@xxx.com bbb@yyy.net ccc@zzz.org'

result = re.findall(r'[a-z]+@[a-z]+\.[a-z]+', s)
print(result)
# ['aaa@xxx.com', 'bbb@yyy.net', 'ccc@zzz.org']

To check the number of matched parts, use the built-in len() function.

print(len(result))
# 3

If you use parentheses () for grouping in the regex pattern, a list of tuples containing the strings of each group is returned.

print(re.findall(r'([a-z]+)@([a-z]+)\.([a-z]+)', s))
# [('aaa', 'xxx', 'com'), ('bbb', 'yyy', 'net'), ('ccc', 'zzz', 'org')]

Since the grouping parentheses () can be nested, if you want to get the entire match as well, you can enclose the entire pattern in parentheses ().

print(re.findall(r'(([a-z]+)@([a-z]+)\.([a-z]+))', s))
# [('aaa@xxx.com', 'aaa', 'xxx', 'com'), ('bbb@yyy.net', 'bbb', 'yyy', 'net'), ('ccc@zzz.org', 'ccc', 'zzz', 'org')]

If there is no match, an empty list is returned.

print(re.findall('[0-9]+', s))
# []
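As a short sketch of working with the grouped results, the tuples returned by findall() can be unpacked in a for loop. The example also shows that a pattern with exactly one group yields plain strings rather than tuples. Variable names are illustrative.

```python
import re

s = 'aaa@xxx.com bbb@yyy.net ccc@zzz.org'

# Each tuple holds the strings of the three groups, in order.
triples = re.findall(r'([a-z]+)@([a-z]+)\.([a-z]+)', s)

for local, domain, tld in triples:
    print(local, domain, tld)
# aaa xxx com
# bbb yyy net
# ccc zzz org

# With exactly one group, findall() returns strings, not tuples.
locals_only = re.findall(r'([a-z]+)@', s)
print(locals_only)
# ['aaa', 'bbb', 'ccc']
```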

Get all matches as an iterator: finditer()

finditer() returns all matching parts as an iterator with match objects as elements.

The iterator does not display its elements when printed with print(). To extract elements one by one, use the built-in next() function or a for loop.

s = 'aaa@xxx.com bbb@yyy.net ccc@zzz.org'

result = re.finditer(r'[a-z]+@[a-z]+\.[a-z]+', s)
print(result)
# <callable_iterator object at 0x107863070>

print(type(result))
# <class 'callable_iterator'>

for m in result:
    print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>
# <re.Match object; span=(12, 23), match='bbb@yyy.net'>
# <re.Match object; span=(24, 35), match='ccc@zzz.org'>

You can also convert the iterator to a list using list().

l = list(re.finditer(r'[a-z]+@[a-z]+\.[a-z]+', s))
print(l)
# [<re.Match object; span=(0, 11), match='aaa@xxx.com'>, <re.Match object; span=(12, 23), match='bbb@yyy.net'>, <re.Match object; span=(24, 35), match='ccc@zzz.org'>]

print(l[0])
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

print(type(l[0]))
# <class 're.Match'>

print(l[0].span())
# (0, 11)

If you want to get the positions of all matching parts, using list comprehensions is more convenient than using list().

print([m.span() for m in re.finditer(r'[a-z]+@[a-z]+\.[a-z]+', s)])
# [(0, 11), (12, 23), (24, 35)]

An iterator yields its elements once, in order. After it reaches the end, nothing is left, so any further access produces no results.

result = re.finditer(r'[a-z]+@[a-z]+\.[a-z]+', s)

for m in result:
    print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>
# <re.Match object; span=(12, 23), match='bbb@yyy.net'>
# <re.Match object; span=(24, 35), match='ccc@zzz.org'>

print(list(result))
# []
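If the matches need to be traversed more than once, one option is to store them in a list first. This is a minimal sketch with illustrative variable names.

```python
import re

s = 'aaa@xxx.com bbb@yyy.net ccc@zzz.org'
pattern = r'[a-z]+@[a-z]+\.[a-z]+'

# Store the match objects in a list so they can be traversed repeatedly.
matches = list(re.finditer(pattern, s))

spans = [m.span() for m in matches]
texts = [m.group() for m in matches]

print(spans)
# [(0, 11), (12, 23), (24, 35)]

print(texts)
# ['aaa@xxx.com', 'bbb@yyy.net', 'ccc@zzz.org']
```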

Replace matching parts: sub(), subn()

sub() allows you to replace matched parts with another string. Specify the regex pattern as the first argument, the replacement string as the second, and the target string as the third.

s = 'aaa@xxx.com bbb@yyy.net ccc@zzz.org'

print(re.sub('[a-z]+@', 'ABC@', s))
# ABC@xxx.com ABC@yyy.net ABC@zzz.org

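You can specify the maximum number of replacements with the count argument. Passing it as a keyword argument is recommended, because passing count positionally is deprecated as of Python 3.13.

print(re.sub('[a-z]+@', 'ABC@', s, count=2))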
# ABC@xxx.com ABC@yyy.net ccc@zzz.org

If you group using parentheses (), you can use the matched string in the replacement string.

By default, \1, \2, and \3 correspond to the parts matched by the first (), second (), and third (). When using regular strings instead of raw strings, you need to escape the backslash like '\\1'.

print(re.sub('([a-z]+)@([a-z]+)', '\\2@\\1', s))
# xxx@aaa.com yyy@bbb.net zzz@ccc.org

print(re.sub('([a-z]+)@([a-z]+)', r'\2@\1', s))
# xxx@aaa.com yyy@bbb.net zzz@ccc.org

By writing ?P<xxx> at the beginning of () in the regex pattern to give the group a name, you can refer to the group by name with \g<xxx> instead of by number, like \1.

print(re.sub('(?P<local>[a-z]+)@(?P<SLD>[a-z]+)', r'\g<SLD>@\g<local>', s))
# xxx@aaa.com yyy@bbb.net zzz@ccc.org

You can also specify a function that takes a match object as an argument for the second argument, allowing for more complex processing.

def func(matchobj):
    return matchobj.group(2).upper() + '@' + matchobj.group(1)

print(re.sub('([a-z]+)@([a-z]+)', func, s))
# XXX@aaa.com YYY@bbb.net ZZZ@ccc.org

Lambda expressions can also be used.

print(re.sub('([a-z]+)@([a-z]+)', lambda m: m.group(2).upper() + '@' + m.group(1), s))
# XXX@aaa.com YYY@bbb.net ZZZ@ccc.org
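A replacement function is also useful when the new text depends on a lookup. The sketch below assumes a hypothetical mapping, new_domains, from old domain names to new ones; domains missing from the mapping are left unchanged.

```python
import re

s = 'aaa@xxx.com bbb@yyy.net ccc@zzz.org'

# Hypothetical mapping from old domain names to new ones.
new_domains = {'xxx': 'example', 'yyy': 'sample'}

def replace_domain(m):
    # Keep the local part; swap the domain if it is in the mapping.
    local, domain = m.group(1), m.group(2)
    return local + '@' + new_domains.get(domain, domain)

result = re.sub(r'([a-z]+)@([a-z]+)', replace_domain, s)
print(result)
# aaa@example.com bbb@sample.net ccc@zzz.org
```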

subn() returns a tuple containing the replaced string and the number of replaced parts (the number of matches to the pattern).

t = re.subn('[a-z]*@', 'ABC@', s)
print(t)
# ('ABC@xxx.com ABC@yyy.net ABC@zzz.org', 3)

print(type(t))
# <class 'tuple'>

print(t[0])
# ABC@xxx.com ABC@yyy.net ABC@zzz.org

print(t[1])
# 3

The arguments are the same as for sub(). You can refer to groups captured with () and specify the count argument.

print(re.subn('([a-z]+)@([a-z]+)', r'\2@\1', s, count=2))
# ('xxx@aaa.com yyy@bbb.net ccc@zzz.org', 2)
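The count returned by subn() makes it easy to tell whether anything was replaced. A small sketch follows; mask_addresses is a hypothetical helper name.

```python
import re

def mask_addresses(text):
    """Mask the local part of each address and report how many were masked."""
    new_text, n = re.subn(r'[a-z]+@', '***@', text)
    return new_text, n

masked, n_masked = mask_addresses('aaa@xxx.com bbb@yyy.net')
print(masked, n_masked)
# ***@xxx.com ***@yyy.net 2

unchanged, n_unchanged = mask_addresses('no address here')
print(unchanged, n_unchanged)
# no address here 0
```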

Split a string using a regex pattern: split()

split() splits a string at the parts that match the pattern and returns the result as a list.

Note that if the pattern matches at the beginning or end of the string, the resulting list contains an empty string '' at the corresponding end.

s = '111aaa222bbb333'

print(re.split('[a-z]+', s))
# ['111', '222', '333']

print(re.split('[0-9]+', s))
# ['', 'aaa', 'bbb', '']

You can specify the maximum number of splits with the maxsplit argument.

print(re.split('[a-z]+', s, maxsplit=1))
# ['111', '222bbb333']
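When the empty strings produced by matches at the beginning or end of the string are unwanted, one simple approach is to filter them out with a list comprehension, as in this sketch.

```python
import re

s = '111aaa222bbb333'

parts = re.split('[0-9]+', s)
print(parts)
# ['', 'aaa', 'bbb', '']

# Drop the empty strings produced by matches at either end.
non_empty = [p for p in parts if p]
print(non_empty)
# ['aaa', 'bbb']
```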

Metacharacters and special sequences in the re module

The main regex metacharacters and special sequences that can be used in the re module are as follows.

Metacharacter     Description
.                 Matches any character except newline
^                 Matches the start of a string
$                 Matches the end of a string
*                 Matches 0 or more repetitions of the preceding pattern
+                 Matches 1 or more repetitions of the preceding pattern
?                 Matches 0 or 1 repetitions of the preceding pattern
{m}               Matches m repetitions of the preceding pattern
{m,n}             Matches between m and n repetitions of the preceding pattern
[]                Matches a character set (any one character within the [])
|                 OR (A|B matches either A or B, where A and B are patterns)

Special Sequence  Description
\d                Matches a Unicode decimal digit
\D                Opposite of \d
\s                Matches a Unicode whitespace character
\S                Opposite of \s
\w                Matches a Unicode word character, including _
\W                Opposite of \w
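The following sketch tries out a few of the metacharacters and special sequences listed above. The sample strings and variable names are made up for illustration.

```python
import re

# {m,n}: between m and n repetitions (no space after the comma).
repeats = re.findall(r'a{2,3}', 'a aa aaa aaaa')
print(repeats)
# ['aa', 'aaa', 'aaa']

# |: either of two patterns.
either = re.findall(r'com|net', 'aaa@xxx.com bbb@yyy.net ccc@zzz.org')
print(either)
# ['com', 'net']

# \d and \w: digits and word characters.
digits = re.findall(r'\d+', 'id 42, code 7')
words = re.findall(r'\w+', 'snake_case, CamelCase')
print(digits, words)
# ['42', '7'] ['snake_case', 'CamelCase']

# ^ and $: anchor the pattern to the start and end of the string.
anchored = [bool(re.search(r'^abc$', t)) for t in ('abc', 'xabc')]
print(anchored)
# [True, False]
```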

For a complete list, check the official documentation of the re module.

Note that some characters have different meanings in Python 2.

Flags

Flags affect the behavior of metacharacters and special sequences in regex patterns.

Only the main flags are introduced here. For others, refer to the official documentation.

Match ASCII characters only: re.ASCII, re.A

By default, unlike standard regular expressions, \w is not equivalent to [a-zA-Z0-9_]. For example, \w matches full-width alphanumeric characters, Japanese, etc.

print(re.match(r'\w+', 'あいう漢字ABC123'))
# <re.Match object; span=(0, 11), match='あいう漢字ABC123'>

print(re.match('[a-zA-Z0-9_]+', 'あいう漢字ABC123'))
# None

To match only ASCII characters, use re.ASCII as the flags argument in each function or add the inline flag (?a) at the beginning of the regex pattern string. In this case, \w is equivalent to [a-zA-Z0-9_].

print(re.match(r'\w+', 'あいう漢字ABC123', flags=re.ASCII))
# None

print(re.match(r'(?a)\w+', 'あいう漢字ABC123'))
# None

The same applies when using re.compile() to compile the pattern. Use the flags argument or the inline flag.

p = re.compile(r'\w+', flags=re.ASCII)
print(p)
# re.compile('\\w+', re.ASCII)

print(p.match('あいう漢字ABC123'))
# None

p = re.compile(r'(?a)\w+')
print(p)
# re.compile('(?a)\\w+', re.ASCII)

print(p.match('あいう漢字ABC123'))
# None

re.ASCII is also available as re.A.

print(re.ASCII is re.A)
# True

\W, which represents the opposite of \w, is also affected by re.ASCII or (?a).

print(re.match(r'\W+', 'あいう漢字ABC123'))
# None

print(re.match(r'\W+', 'あいう漢字ABC123', flags=re.ASCII))
# <re.Match object; span=(0, 5), match='あいう漢字'>

\d and \s match both half-width and full-width characters by default. re.ASCII or (?a) limits them to half-width characters.

print(re.match(r'\d+', '123'))
# <re.Match object; span=(0, 3), match='123'>

print(re.match(r'\d+', '１２３'))  # full-width digits
# <re.Match object; span=(0, 3), match='１２３'>

print(re.match(r'\d+', '123', flags=re.ASCII))
# <re.Match object; span=(0, 3), match='123'>

print(re.match(r'\d+', '１２３', flags=re.ASCII))  # full-width digits
# None

print(re.match(r'\s+', '\u3000'))  # full-width space (U+3000)
# <re.Match object; span=(0, 1), match='\u3000'>

print(re.match(r'\s+', '\u3000', flags=re.ASCII))
# None

Their opposites, \D and \S, are also affected by re.ASCII or (?a).

Case-insensitive matching: re.IGNORECASE, re.I

By default, the case is considered when matching.

Use re.IGNORECASE for case-insensitive matching, which is equivalent to the i flag in standard regex.

print(re.match('[a-zA-Z]+', 'abcABC'))
# <re.Match object; span=(0, 6), match='abcABC'>

print(re.match('[a-z]+', 'abcABC', flags=re.IGNORECASE))
# <re.Match object; span=(0, 6), match='abcABC'>

print(re.match('[A-Z]+', 'abcABC', flags=re.IGNORECASE))
# <re.Match object; span=(0, 6), match='abcABC'>

You can also use the inline flag (?i) or the abbreviation re.I.
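For reference, a brief sketch of the inline flag and the abbreviated form, reusing the sample string from the example above:

```python
import re

# Inline flag (?i) at the start of the pattern.
m_inline = re.match(r'(?i)[a-z]+', 'abcABC')

# Abbreviated flag re.I passed through the flags argument.
m_short = re.match(r'[a-z]+', 'abcABC', flags=re.I)

print(m_inline.group(), m_short.group())
# abcABC abcABC

print(re.I is re.IGNORECASE)
# True
```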

Match at the beginning and end of each line: re.MULTILINE, re.M

^ matches the start of a string.

By default, it only matches the start of the entire string. Use re.MULTILINE to match the start of each line, equivalent to the m flag in standard regex.

s = '''aaa-xxx
bbb-yyy
ccc-zzz'''

print(s)
# aaa-xxx
# bbb-yyy
# ccc-zzz

print(re.findall('^[a-z]+', s))
# ['aaa']

print(re.findall('^[a-z]+', s, flags=re.MULTILINE))
# ['aaa', 'bbb', 'ccc']

Similarly, $ matches the end of a string. By default, it matches the end of the entire string. Use re.MULTILINE to match the end of each line.

print(re.findall('[a-z]+$', s))
# ['zzz']

print(re.findall('[a-z]+$', s, flags=re.MULTILINE))
# ['xxx', 'yyy', 'zzz']

You can also use the inline flag (?m) or the abbreviation re.M.

Specify multiple flags

Use | to enable multiple flags simultaneously. For inline flags, combine the letters, as in (?am).

s = '''aaa-xxx
あああ-んんん
bbb-zzz'''

print(s)
# aaa-xxx
# あああ-んんん
# bbb-zzz

print(re.findall(r'^\w+', s, flags=re.M))
# ['aaa', 'あああ', 'bbb']

print(re.findall(r'^\w+', s, flags=re.M | re.A))
# ['aaa', 'bbb']

print(re.findall(r'(?am)^\w+', s))
# ['aaa', 'bbb']

Greedy and non-greedy matching

This is a general issue in regular expressions and not specific to Python, but it is worth mentioning as it can be a common pitfall.

By default, *, +, and ? perform greedy matching, matching the longest possible string.

s = 'aaa@xxx.com bbb@yyy.com'

m = re.match(r'.+com', s)
print(m)
# <re.Match object; span=(0, 23), match='aaa@xxx.com bbb@yyy.com'>

print(m.group())
# aaa@xxx.com bbb@yyy.com

By adding ? (i.e., *?, +?, ??), non-greedy (minimal) matching is performed, matching the shortest possible string.

m = re.match(r'.+?com', s)
print(m)
# <re.Match object; span=(0, 11), match='aaa@xxx.com'>

print(m.group())
# aaa@xxx.com
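The difference becomes more visible with findall(), which keeps scanning after each match. The sketch below reuses the same sample string; the variable names are illustrative.

```python
import re

s = 'aaa@xxx.com bbb@yyy.com'

# Greedy: .+ consumes as much as possible, so there is a single match.
greedy = re.findall(r'.+com', s)
print(greedy)
# ['aaa@xxx.com bbb@yyy.com']

# Non-greedy: .+? stops at the first 'com', so there are two matches.
non_greedy = re.findall(r'.+?com', s)
print(non_greedy)
# ['aaa@xxx.com', ' bbb@yyy.com']
```

Note that the second non-greedy match starts with a space, because . also matches the space that separates the two addresses.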
