Extract a Substring from a String in Python (Position, Regex)
This article explains how to extract a substring from a string in Python.
You can extract a substring by specifying its position and length, or by using regular expression (regex) patterns.
For information on how to find the position of a substring or replace it with another string, refer to the following articles:
- Search for a string in Python (Check if a substring is included/Get a substring position)
- Replace strings in Python (replace, translate, re.sub, re.subn)
If you want to extract a substring from the contents of a text file, first read the file as a string.
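As a minimal sketch of that workflow, the example below writes a throwaway file and reads it back as a single string before slicing it. The file name sample.txt and its contents are made up for illustration.

```python
import tempfile
from pathlib import Path

# Create a temporary file so the example is self-contained (hypothetical content).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'sample.txt'
    path.write_text('Hello, Python!\n', encoding='utf-8')

    # Read the whole file into one string, then slice it like any other str.
    text = path.read_text(encoding='utf-8')
    first_word = text[:5]

print(first_word)
# Hello
```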
Extract a substring by position and length
Extract a character by index
You can get a character at a specific position by specifying its index in []. Indexes start at 0 (zero-based indexing).
s = 'abcde'
print(s[0])
# a
print(s[4])
# e
You can also specify an index from the end of the string by using negative values. -1 refers to the last character.
print(s[-1])
# e
print(s[-5])
# a
If you specify an index that does not exist, an error will occur.
# print(s[5])
# IndexError: string index out of range
# print(s[-6])
# IndexError: string index out of range
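If an out-of-range index is a realistic possibility, one option is to catch the IndexError and fall back to a default value. The helper name char_at below is hypothetical, not a built-in.

```python
def char_at(s, index, default=''):
    """Return s[index], or default if the index is out of range."""
    try:
        return s[index]
    except IndexError:
        return default

s = 'abcde'
print(char_at(s, 4))
# e
print(repr(char_at(s, 5)))
# ''
print(char_at(s, -6, default='?'))
# ?
```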
Extract a substring by slicing
You can extract a substring within the range start <= x < stop using the syntax [start:stop]. If start is omitted, slicing begins from the start of the string. If stop is omitted, it continues to the end of the string.
s = 'abcde'
print(s[1:3])
# bc
print(s[:3])
# abc
print(s[1:])
# bcde
Negative values are also supported.
print(s[-4:-2])
# bc
print(s[:-2])
# abc
print(s[-4:])
# bcde
If start > stop, no error is raised; instead, an empty string ('') is returned.
print(s[3:1])
#
print(s[3:1] == '')
# True
Out-of-range values are automatically adjusted without raising an error.
print(s[-100:100])
# abcde
In addition to start and stop, you can also specify a step value using [start:stop:step]. If step is negative, the substring will be returned in reverse order.
print(s[1:4:2])
# bd
print(s[::2])
# ace
print(s[::3])
# ad
print(s[::-1])
# edcba
print(s[::-2])
# eca
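Python strings have no dedicated "position and length" method, but a slice of the form [start:start + length] covers that case. The following is a rough sketch; substring is a hypothetical helper name, and rejecting negative arguments is a design choice made here, not a language requirement.

```python
def substring(s, start, length):
    """Return up to `length` characters of s beginning at index `start`."""
    if start < 0 or length < 0:
        raise ValueError('start and length must be non-negative')
    return s[start:start + length]

s = 'abcde'
print(substring(s, 1, 3))
# bcd
print(substring(s, 3, 10))
# de
```

Because out-of-range slice bounds are adjusted automatically, asking for more characters than remain simply returns the rest of the string.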
Extract a substring based on character count
The built-in len() function returns the number of characters in a string. You can use it to get the central character or extract the first or second half of a string by slicing.
Note that only integers (int) are allowed for indexing [] and slicing [:]. If you attempt to use division / inside indexing or slicing, it will raise an error because the result is a floating-point number (float).
The following example uses floor division //, which returns an integer (int) when both operands are integers. The result is rounded down, so 9 // 2 is 4.
s = 'abcdefghi'
print(len(s))
# 9
# print(s[len(s) / 2])
# TypeError: string indices must be integers
print(s[len(s) // 2])
# e
print(s[:len(s) // 2])
# abcd
print(s[len(s) // 2:])
# efghi
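The same idea extends to splitting a string into fixed-size pieces by combining len() with range(). This is an illustrative sketch; the variable names mid, size, and chunks are chosen here for the example.

```python
s = 'abcdefghij'  # 10 characters

# Split into two halves at the midpoint.
mid = len(s) // 2
print(s[:mid], s[mid:])
# abcde fghij

# Split into chunks of 3 characters; the last chunk may be shorter.
size = 3
chunks = [s[i:i + size] for i in range(0, len(s), size)]
print(chunks)
# ['abc', 'def', 'ghi', 'j']
```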
Extract a substring with regex: re.search(), re.findall()
In Python, you can use regular expressions (regex) with the re module of the standard library.
Use re.search() to extract the first substring that matches a regex pattern. Pass the regex pattern as the first argument and the target string as the second argument.
import re
s = '012-3456-7890'
print(re.search(r'\d+', s))
# <re.Match object; span=(0, 3), match='012'>
In regex, \d matches a digit character, while + matches one or more occurrences of the preceding pattern. Therefore, \d+ matches one or more consecutive digits.
Since backslashes \ are used in special sequences like \d, it is convenient to use raw string notation by prefixing the string with r.
If a match is found, re.search() returns a match object; otherwise, it returns None. You can retrieve the matched substring using the group() method of the match object.
m = re.search(r'\d+', s)
print(m.group())
# 012
print(type(m.group()))
# <class 'str'>
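Because a failed search yields None rather than a match object, calling group() directly on the result can raise AttributeError. A minimal sketch of a guarded lookup, where first_number is a hypothetical helper name:

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    m = re.search(r'\d+', text)
    return m.group() if m else None

print(first_number('012-3456-7890'))
# 012
print(first_number('no digits here'))
# None
```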
As shown in the example above, re.search() returns only the first match, even if multiple matches exist. If you want to retrieve all matches, use re.findall(), which returns a list of all matching substrings.
print(re.findall(r'\d+', s))
# ['012', '3456', '7890']
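re.findall() returns only the matched strings. If the positions are needed as well, re.finditer() yields a match object for each match, and span() gives its start and end indexes. A short sketch using the same string:

```python
import re

s = '012-3456-7890'

# Collect each matched substring together with its (start, end) position.
matches = [(m.group(), m.span()) for m in re.finditer(r'\d+', s)]
print(matches)
# [('012', (0, 3)), ('3456', (4, 8)), ('7890', (9, 13))]
```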
Examples of regex patterns
This section provides examples of regex patterns using metacharacters and special sequences.
Wildcard-like patterns
. matches any single character except a newline, and * matches zero or more repetitions of the preceding pattern.
For example, a.*b matches a string that starts with a and ends with b. Since * can match zero occurrences, it also matches ab.
print(re.findall('a.*b', 'axyzb'))
# ['axyzb']
print(re.findall('a.*b', 'a---b'))
# ['a---b']
print(re.findall('a.*b', 'aあいうえおb'))
# ['aあいうえおb']
print(re.findall('a.*b', 'ab'))
# ['ab']
+ matches one or more repetitions of the preceding pattern. Therefore, a.+b does not match ab.
print(re.findall('a.+b', 'ab'))
# []
print(re.findall('a.+b', 'axb'))
# ['axb']
print(re.findall('a.+b', 'axxxxxxb'))
# ['axxxxxxb']
? matches zero or one occurrence of the preceding pattern. Therefore, a.?b matches ab as well as any string with exactly one character between a and b.
print(re.findall('a.?b', 'ab'))
# ['ab']
print(re.findall('a.?b', 'axb'))
# ['axb']
print(re.findall('a.?b', 'axxb'))
# []
Greedy and non-greedy matching
*, +, and ? are greedy quantifiers: they match as much text as possible. In contrast, *?, +?, and ?? are non-greedy (minimal) quantifiers: they match as little text as possible.
s = 'axb-axxxxxxb'
print(re.findall('a.*b', s))
# ['axb-axxxxxxb']
print(re.findall('a.*?b', s))
# ['axb', 'axxxxxxb']
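A common place where the difference matters is extracting delimited pieces such as quoted text. The sample sentence below is made up for illustration.

```python
import re

s = 'say "hello" and "goodbye"'

# Greedy: .* runs to the last closing quote in the string.
greedy = re.findall(r'".*"', s)
print(greedy)
# ['"hello" and "goodbye"']

# Non-greedy: .*? stops at the nearest closing quote.
lazy = re.findall(r'".*?"', s)
print(lazy)
# ['"hello"', '"goodbye"']
```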
Extract parts of the pattern with parentheses
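You can enclose part of a regex pattern in parentheses () to extract only that part of the match. When the pattern contains one such group, re.findall() returns the text captured by the group instead of the whole match.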
print(re.findall('a(.*)b', 'axyzb'))
# ['xyz']
To match literal parentheses (), escape them with a backslash \.
print(re.findall(r'\(.+\)', 'abc(def)ghi'))
# ['(def)']
print(re.findall(r'\((.+)\)', 'abc(def)ghi'))
# ['def']
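When a pattern contains more than one group, re.findall() returns a list of tuples, one element per group. To group part of a pattern without capturing it, (?:...) can be used. The key=value string below is a made-up sample.

```python
import re

s = 'width=640, height=480'

# Two capturing groups: each result is a (name, value) tuple.
pairs = re.findall(r'(\w+)=(\d+)', s)
print(pairs)
# [('width', '640'), ('height', '480')]

# (?:...) groups the alternatives without capturing them,
# so only the digits are returned.
values = re.findall(r'(?:width|height)=(\d+)', s)
print(values)
# ['640', '480']
```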
Match any single character
A set of characters in square brackets [] matches any single one of the characters listed inside.
Placing a hyphen - between two characters specifies a range of consecutive Unicode code points. For example, [a-z] matches any single lowercase ASCII letter.
print(re.findall('[abc]x', 'ax-bx-cx'))
# ['ax', 'bx', 'cx']
print(re.findall('[abc]+', 'abc-aaa-cba'))
# ['abc', 'aaa', 'cba']
print(re.findall('[a-z]+', 'abc-xyz'))
# ['abc', 'xyz']
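Several ranges can be combined inside one pair of brackets, and a leading ^ inside the brackets negates the set. A brief sketch with a made-up sample string:

```python
import re

s = 'abc-123-XYZ'

# Two ranges in one set: lowercase or uppercase ASCII letters.
letters = re.findall('[a-zA-Z]+', s)
print(letters)
# ['abc', 'XYZ']

# Digits only.
digits = re.findall('[0-9]+', s)
print(digits)
# ['123']

# [^-] matches any character except a hyphen.
parts = re.findall('[^-]+', s)
print(parts)
# ['abc', '123', 'XYZ']
```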
Match the start/end of the string
^ matches the start of a string, while $ matches the end of the string (or the position just before a newline at the very end).
s = 'abc-def-ghi'
print(re.findall('[a-z]+', s))
# ['abc', 'def', 'ghi']
print(re.findall('^[a-z]+', s))
# ['abc']
print(re.findall('[a-z]+$', s))
# ['ghi']
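For strings that span several lines, the re.MULTILINE flag makes ^ and $ match at the start and end of each line as well. The two-line string below is made up for illustration.

```python
import re

s = 'abc-def\nghi-jkl'

# Default: ^ matches only at the very start of the string.
default_start = re.findall('^[a-z]+', s)
print(default_start)
# ['abc']

# With re.MULTILINE, ^ also matches right after each newline.
line_starts = re.findall('^[a-z]+', s, flags=re.MULTILINE)
print(line_starts)
# ['abc', 'ghi']

# Likewise, $ also matches right before each newline.
line_ends = re.findall('[a-z]+$', s, flags=re.MULTILINE)
print(line_ends)
# ['def', 'jkl']
```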
Extract by multiple patterns
| allows you to match a substring that satisfies any one of multiple patterns. For example, to match substrings that follow either pattern A or pattern B, use A|B.
s = 'axxxb-012'
print(re.findall('a.*b', s))
# ['axxxb']
print(re.findall(r'\d+', s))
# ['012']
print(re.findall(r'a.*b|\d+', s))
# ['axxxb', '012']
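One detail worth keeping in mind is that alternatives are tried from left to right, and the first one that matches wins, even if a later alternative would match more text. The sample words below are chosen only to show the effect.

```python
import re

s = 'cat category'

# 'cat' is tried first, so it also wins at the start of 'category'.
short_first = re.findall('cat|category', s)
print(short_first)
# ['cat', 'cat']

# Putting the longer alternative first lets it match in full.
long_first = re.findall('category|cat', s)
print(long_first)
# ['cat', 'category']
```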
Case-insensitive matching
By default, matching with the re module is case-sensitive. To perform case-insensitive matching, pass re.IGNORECASE to the flags argument.
s = 'abc-Abc-ABC'
print(re.findall('[a-z]+', s))
# ['abc', 'bc']
print(re.findall('[A-Z]+', s))
# ['A', 'ABC']
print(re.findall('[a-z]+', s, flags=re.IGNORECASE))
# ['abc', 'Abc', 'ABC']
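The same flag can also be attached to a compiled pattern with re.compile(), or written inline as (?i) at the start of the pattern. A short sketch, with the variable name pattern chosen here for illustration:

```python
import re

s = 'abc-Abc-ABC'

# Compile once with the flag, then reuse the pattern object.
pattern = re.compile('[a-z]+', flags=re.IGNORECASE)
compiled_result = pattern.findall(s)
print(compiled_result)
# ['abc', 'Abc', 'ABC']

# Inline flag: (?i) at the start of the pattern has the same effect.
inline_result = re.findall('(?i)[a-z]+', s)
print(inline_result)
# ['abc', 'Abc', 'ABC']
```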