Extract a Substring from a String in Python (Position, Regex)
This article explains how to extract a substring from a string in Python.
You can extract a substring by specifying its position and length, or by using regular expression (regex) patterns.
For information on how to find the position of a substring or replace it with another string, refer to the following articles:
- Search for a string in Python (Check if a substring is included/Get a substring position)
- Replace strings in Python (replace, translate, re.sub, re.subn)
If you want to extract a substring from the contents of a text file, first read the file as a string.
Extract a substring by position and length
Extract a character by index
You can get a character at a specific position by specifying its index in []. Indexes start at 0 (zero-based indexing).
s = 'abcde'
print(s[0])
# a
print(s[4])
# e
You can also specify an index from the end of the string by using negative values. The index -1 refers to the last character.
print(s[-1])
# e
print(s[-5])
# a
If you specify an index that does not exist, an error will occur.
# print(s[5])
# IndexError: string index out of range
# print(s[-6])
# IndexError: string index out of range
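If an out-of-range index should produce a fallback value instead of an error, one option is to catch the IndexError. The following is a minimal sketch; char_at and its default parameter are hypothetical names introduced only for illustration.

```python
def char_at(s, i, default=''):
    """Return s[i], or default when i is out of range (hypothetical helper)."""
    try:
        return s[i]
    except IndexError:
        return default

s = 'abcde'
print(char_at(s, 4))
# e
print(char_at(s, 5) == '')
# True
print(char_at(s, -6, '?'))
# ?
```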
Extract a substring by slicing
You can extract a substring within the range start <= x < stop using the syntax [start:stop]. If start is omitted, slicing begins from the start of the string. If stop is omitted, it continues to the end of the string.
s = 'abcde'
print(s[1:3])
# bc
print(s[:3])
# abc
print(s[1:])
# bcde
Negative values are also supported.
print(s[-4:-2])
# bc
print(s[:-2])
# abc
print(s[-4:])
# bcde
If start > stop, no error is raised; instead, an empty string ('') is returned.
print(s[3:1])
#
print(s[3:1] == '')
# True
Out-of-range values are automatically adjusted without raising an error.
print(s[-100:100])
# abcde
In addition to start and stop, you can also specify a step value using [start:stop:step]. If step is negative, the substring will be returned in reverse order.
print(s[1:4:2])
# bd
print(s[::2])
# ace
print(s[::3])
# ad
print(s[::-1])
# edcba
print(s[::-2])
# eca
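To extract a substring by position and length, the slice can be written as [start:start + length]. The sketch below assumes a non-negative start; the variable names start and length are chosen here for illustration.

```python
s = 'abcde'
start = 1   # position of the first character to extract
length = 3  # number of characters to extract

print(s[start:start + length])
# bcd

# If the requested length runs past the end, the slice is simply shorter.
print(s[3:3 + 10])
# de
```

With a negative start, start + length can reach 0 and produce an empty string (for example, s[-2:0]), so this form is safest with non-negative positions.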
Extract a substring based on character count
The built-in len() function returns the number of characters in a string. You can use it to get the central character or extract the first or second half of a string by slicing.
Note that only integers (int) are allowed for indexing [] and slicing [:]. If you attempt to use division / inside indexing or slicing, it will raise an error because the result is a floating-point number (float).
The following example uses floor division //, which rounds the result down to an integer.
s = 'abcdefghi'
print(len(s))
# 9
# print(s[len(s) / 2])
# TypeError: string indices must be integers
print(s[len(s) // 2])
# e
print(s[:len(s) // 2])
# abcd
print(s[len(s) // 2:])
# efghi
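When the length is even, there is no single central character. One possible approach, sketched below, is to return the two middle characters in that case. The helper name middle is hypothetical.

```python
def middle(s):
    """Return the middle character, or the middle two for even lengths (hypothetical helper)."""
    half, odd = divmod(len(s), 2)
    return s[half] if odd else s[half - 1:half + 1]

print(middle('abcdefghi'))
# e
print(middle('abcdefgh'))
# de
```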
Extract a substring with regex: re.search(), re.findall()
In Python, you can use regular expressions (regex) with the re module of the standard library.
Use re.search() to extract the first substring that matches a regex pattern. Pass the regex pattern as the first argument and the target string as the second argument.
import re
s = '012-3456-7890'
print(re.search(r'\d+', s))
# <re.Match object; span=(0, 3), match='012'>
In regex, \d matches a digit character, while + matches one or more occurrences of the preceding pattern. Therefore, \d+ matches one or more consecutive digits.
Since backslashes \ are used in special sequences like \d, it is convenient to use raw string notation by prefixing the string with r.
If a match is found, re.search() returns a match object. You can retrieve the matched substring using the group() method of the match object.
m = re.search(r'\d+', s)
print(m.group())
# 012
print(type(m.group()))
# <class 'str'>
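If no match is found, re.search() returns None, and calling group() on None raises an AttributeError. A minimal sketch of checking the result first (the sample string is made up for illustration):

```python
import re

s = 'abc-def'
m = re.search(r'\d+', s)
print(m)
# None

# Check the result before calling group() to avoid an AttributeError.
if m:
    print(m.group())
else:
    print('no match')
# no match
```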
As shown in the example above, re.search() returns only the first match, even if multiple matches exist. If you want to retrieve all matches, use re.findall(), which returns a list of all matching substrings.
print(re.findall(r'\d+', s))
# ['012', '3456', '7890']
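If the position of each match is also needed, re.finditer() may be a useful alternative: it yields a match object for every match instead of a plain string. A short sketch using the same string:

```python
import re

s = '012-3456-7890'

# Each m is a match object, so both the text and its span are available.
for m in re.finditer(r'\d+', s):
    print(m.group(), m.span())
# 012 (0, 3)
# 3456 (4, 8)
# 7890 (9, 13)
```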
Examples of regex patterns
This section provides examples of regex patterns using metacharacters and special sequences.
Wildcard-like patterns
The dot . matches any single character except a newline, and * matches zero or more repetitions of the preceding pattern.
For example, a.*b matches a string that starts with a and ends with b. Since * can match zero occurrences, it also matches ab.
print(re.findall('a.*b', 'axyzb'))
# ['axyzb']
print(re.findall('a.*b', 'a---b'))
# ['a---b']
print(re.findall('a.*b', 'aあいうえおb'))
# ['aあいうえおb']
print(re.findall('a.*b', 'ab'))
# ['ab']
+ matches one or more repetitions of the preceding pattern. Therefore, a.+b does not match ab.
print(re.findall('a.+b', 'ab'))
# []
print(re.findall('a.+b', 'axb'))
# ['axb']
print(re.findall('a.+b', 'axxxxxxb'))
# ['axxxxxxb']
? matches zero or one occurrence of the preceding pattern. The pattern a.?b matches ab and any string with exactly one character between a and b.
print(re.findall('a.?b', 'ab'))
# ['ab']
print(re.findall('a.?b', 'axb'))
# ['axb']
print(re.findall('a.?b', 'axxb'))
# []
Greedy and non-greedy matching
*, +, and ? are greedy: they match as much text as possible. In contrast, *?, +?, and ?? are non-greedy (minimal): they match as little text as possible.
s = 'axb-axxxxxxb'
print(re.findall('a.*b', s))
# ['axb-axxxxxxb']
print(re.findall('a.*?b', s))
# ['axb', 'axxxxxxb']
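The difference matters in practice when a delimiter appears more than once. As an illustration with a made-up string, the following sketch extracts double-quoted parts:

```python
import re

s = 'say "hello" and "goodbye"'

# Greedy: runs from the first quote to the last quote.
print(re.findall('".*"', s))
# ['"hello" and "goodbye"']

# Non-greedy: stops at the nearest closing quote.
print(re.findall('".*?"', s))
# ['"hello"', '"goodbye"']
```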
Extract parts of the pattern with parentheses
You can enclose part of a regex pattern in parentheses () to extract only that part of the match.
print(re.findall('a(.*)b', 'axyzb'))
# ['xyz']
To match literal parentheses (), escape them with a backslash \.
print(re.findall(r'\(.+\)', 'abc(def)ghi'))
# ['(def)']
print(re.findall(r'\((.+)\)', 'abc(def)ghi'))
# ['def']
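When a pattern contains more than one pair of parentheses, re.findall() returns a list of tuples, and a match object from re.search() exposes each part through group(1), group(2), and so on. The sketch below uses a made-up string and \w, which matches a word character (letters, digits, and underscore).

```python
import re

s = 'key1=abc, key2=xyz'

# Two groups: each match becomes a tuple of the two captured parts.
print(re.findall(r'(\w+)=(\w+)', s))
# [('key1', 'abc'), ('key2', 'xyz')]

m = re.search(r'(\w+)=(\w+)', s)
print(m.group(0))  # the whole match
# key1=abc
print(m.group(1))
# key1
print(m.group(2))
# abc
```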
Match any single character
Square brackets [] allow you to match any single character contained within.
Placing a hyphen - between two characters (e.g., [a-z]) creates a range that covers every character whose Unicode code point lies between them. For example, [a-z] matches any single lowercase ASCII letter.
print(re.findall('[abc]x', 'ax-bx-cx'))
# ['ax', 'bx', 'cx']
print(re.findall('[abc]+', 'abc-aaa-cba'))
# ['abc', 'aaa', 'cba']
print(re.findall('[a-z]+', 'abc-xyz'))
# ['abc', 'xyz']
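Several ranges can be combined inside one pair of brackets, and a leading ^ inside the brackets negates the set. A short sketch with a made-up string:

```python
import re

s = 'abc-123_XYZ'

# Lowercase or uppercase letters.
print(re.findall('[a-zA-Z]+', s))
# ['abc', 'XYZ']

# Lowercase letters or digits.
print(re.findall('[a-z0-9]+', s))
# ['abc', '123']

# Anything that is NOT a lowercase letter.
print(re.findall('[^a-z]+', s))
# ['-123_XYZ']
```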
Match the start/end of the string
^ matches the start of a string, while $ matches the end (or the position just before a trailing newline).
s = 'abc-def-ghi'
print(re.findall('[a-z]+', s))
# ['abc', 'def', 'ghi']
print(re.findall('^[a-z]+', s))
# ['abc']
print(re.findall('[a-z]+$', s))
# ['ghi']
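Using ^ and $ together requires the whole string to fit the pattern, which can serve as a simple check that a string consists only of certain characters. A minimal sketch:

```python
import re

# The entire string is lowercase letters, so it matches.
print(re.findall('^[a-z]+$', 'abcdef'))
# ['abcdef']

# The hyphen breaks the run, so nothing matches.
print(re.findall('^[a-z]+$', 'abc-def'))
# []
```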
Extract by multiple patterns
| allows you to match a substring that satisfies any one of multiple patterns. For example, to match substrings that follow either pattern A or pattern B, use A|B.
s = 'axxxb-012'
print(re.findall('a.*b', s))
# ['axxxb']
print(re.findall(r'\d+', s))
# ['012']
print(re.findall(r'a.*b|\d+', s))
# ['axxxb', '012']
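To apply | to only part of a pattern, the alternatives can be wrapped in a group. Because a capturing group changes what re.findall() returns, a non-capturing group (?:...) is one way to keep the whole match. The file names below are made up for illustration.

```python
import re

s = 'img01.png, img02.jpg, doc01.txt'

# Non-capturing group: | applies only to the extension, whole match is returned.
print(re.findall(r'\w+\.(?:png|jpg)', s))
# ['img01.png', 'img02.jpg']

# Capturing group: only the captured extension is returned.
print(re.findall(r'\w+\.(png|jpg)', s))
# ['png', 'jpg']
```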
Case-insensitive matching
By default, matching with the re module is case-sensitive. To perform case-insensitive matching, pass re.IGNORECASE to the flags argument.
s = 'abc-Abc-ABC'
print(re.findall('[a-z]+', s))
# ['abc', 'bc']
print(re.findall('[A-Z]+', s))
# ['A', 'ABC']
print(re.findall('[a-z]+', s, flags=re.IGNORECASE))
# ['abc', 'Abc', 'ABC']
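The flag also has a short alias, re.I, and the same behavior can be requested inside the pattern itself by starting it with (?i). A brief sketch of both forms:

```python
import re

s = 'abc-Abc-ABC'

# re.I is a short alias for re.IGNORECASE.
print(re.findall('abc', s, flags=re.I))
# ['abc', 'Abc', 'ABC']

# The inline flag (?i) at the start of the pattern has the same effect.
print(re.findall('(?i)abc', s))
# ['abc', 'Abc', 'ABC']
```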