Extract a Substring from a String in Python (Position, Regex)

Tags: Python, String, Regex

This article explains how to extract a substring from a string in Python.

You can extract a substring by specifying its position and length, or by using regular expression (regex) patterns.

If you want to extract a substring from the contents of a text file, first read the file as a string.
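
As a minimal sketch, assuming a UTF-8 text file, the whole file can be read into one string and then sliced like any other string. The temporary file, its name (sample.txt), and its contents are made up for illustration.

```python
import tempfile
from pathlib import Path

# Create a throwaway file so the example is self-contained.
# The file name and contents are hypothetical.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'sample.txt'
    path.write_text('abcde\nfghij\n', encoding='utf-8')

    # Read the entire file as a single string.
    text = path.read_text(encoding='utf-8')

# From here on, text is an ordinary str and can be sliced.
first_three = text[:3]
print(first_three)
# abc
```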

Extract a substring by position and length

Extract a character by index

You can get a character at a specific position by specifying its index in []. Indexes start at 0 (zero-based indexing).

s = 'abcde'

print(s[0])
# a

print(s[4])
# e

You can also specify an index from the end of the string by using negative values. -1 refers to the last character.

print(s[-1])
# e

print(s[-5])
# a

If you specify an index that does not exist, an error will occur.

# print(s[5])
# IndexError: string index out of range

# print(s[-6])
# IndexError: string index out of range
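
If an out-of-range index is a normal case in your program rather than a bug, one option is to catch the IndexError. The helper below is only a sketch; char_at and its default parameter are hypothetical names, not part of Python.

```python
def char_at(s, index, default=''):
    """Return s[index], or default if the index is out of range.

    char_at is a hypothetical helper written for this example.
    """
    try:
        return s[index]
    except IndexError:
        return default


s = 'abcde'

print(char_at(s, 4))
# e

print(char_at(s, 5, default='?'))
# ?
```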

Extract a substring by slicing

You can extract a substring within the range start <= x < stop using the syntax [start:stop]. If start is omitted, slicing begins from the start of the string. If stop is omitted, it continues to the end of the string.

s = 'abcde'

print(s[1:3])
# bc

print(s[:3])
# abc

print(s[1:])
# bcde

Negative values are also supported.

print(s[-4:-2])
# bc

print(s[:-2])
# abc

print(s[-4:])
# bcde

If start > stop, no error is raised; instead, an empty string ('') is returned.

print(s[3:1])
# 

print(s[3:1] == '')
# True

Out-of-range values are automatically adjusted without raising an error.

print(s[-100:100])
# abcde
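
Slicing takes a start and a stop, not a length, so extracting by position and length means computing stop as start + length. The sketch below assumes a non-negative start; substr is a hypothetical helper name.

```python
def substr(s, start, length):
    """Return up to `length` characters of s beginning at index `start`.

    Hypothetical helper; assumes start >= 0 and length >= 0.
    """
    return s[start:start + length]


s = 'abcde'

print(substr(s, 1, 3))
# bcd

# A length that runs past the end is adjusted automatically by slicing.
print(substr(s, 3, 10))
# de
```

With a negative start this simple formula can give surprising results (for example, s[-2:0] is empty), so handle that case separately if you need it.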

In addition to start and stop, you can also specify a step value using [start:stop:step]. If step is negative, characters are taken from the end toward the start, so the result is in reverse order.

print(s[1:4:2])
# bd

print(s[::2])
# ace

print(s[::3])
# ad

print(s[::-1])
# edcba

print(s[::-2])
# eca
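
When a negative step is combined with explicit start and stop values, start must be the larger index. The following sketch illustrates this with the same string; the variable names are chosen only for this example.

```python
s = 'abcde'

# Walk backward from index 3 down to, but not including, index 0.
backward = s[3:0:-1]
print(backward)
# dcb

# With a negative step, start <= stop yields an empty string.
empty = s[0:3:-1]
print(empty == '')
# True

# Reverse only the first three characters by slicing twice.
reversed_head = s[:3][::-1]
print(reversed_head)
# cba
```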

Extract a substring based on character count

The built-in len() function returns the number of characters in a string. You can use it to get the central character or extract the first or second half of a string by slicing.

Note that only integers (int) are allowed for indexing [] and slicing [:]. If you attempt to use division / inside indexing or slicing, it will raise an error because the result is a floating-point number (float).

The following example uses floor division //, which rounds the result down to an integer (int).

s = 'abcdefghi'

print(len(s))
# 9

# print(s[len(s) / 2])
# TypeError: string indices must be integers

print(s[len(s) // 2])
# e

print(s[:len(s) // 2])
# abcd

print(s[len(s) // 2:])
# efghi
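
A string of even length has no single central character. One possible approach, sketched below, is to return the two middle characters in that case. The function name middle is hypothetical, and the choice of returning two characters is just one convention.

```python
def middle(s):
    """Return the middle character (odd length) or the middle two (even length).

    Hypothetical helper written for this example.
    """
    mid, remainder = divmod(len(s), 2)
    if remainder:
        # Odd length: exactly one central character.
        return s[mid]
    # Even length: take the two characters around the midpoint.
    return s[mid - 1:mid + 1]


print(middle('abcdefghi'))
# e

print(middle('abcdefgh'))
# de
```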

Extract a substring with regex: re.search(), re.findall()

In Python, you can use regular expressions (regex) with the re module of the standard library.

Use re.search() to extract the first substring that matches a regex pattern. Pass the regex pattern as the first argument and the target string as the second argument.

import re

s = '012-3456-7890'

print(re.search(r'\d+', s))
# <re.Match object; span=(0, 3), match='012'>

In regex, \d matches a digit character, while + matches one or more occurrences of the preceding pattern. Therefore, \d+ matches one or more consecutive digits.

Since backslashes \ are used in special sequences like \d, it is convenient to use raw string notation by prefixing the string with r.

If a match is found, re.search() returns a match object; otherwise, it returns None. You can retrieve the matched substring using the group() method of the match object.

m = re.search(r'\d+', s)

print(m.group())
# 012

print(type(m.group()))
# <class 'str'>
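
Because re.search() returns None when nothing matches, calling group() on the result without a check raises AttributeError. A minimal sketch of guarding against that follows; first_number is a hypothetical helper name.

```python
import re


def first_number(s):
    """Return the first run of digits in s, or None if there is none.

    Hypothetical helper written for this example.
    """
    m = re.search(r'\d+', s)
    # m is None when the pattern does not match anywhere in s.
    return m.group() if m else None


print(first_number('012-3456-7890'))
# 012

print(first_number('no digits here'))
# None
```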

As shown in the example above, re.search() returns only the first match, even if multiple matches exist. If you want to retrieve all matches, use re.findall(), which returns a list of all matching substrings.

print(re.findall(r'\d+', s))
# ['012', '3456', '7890']
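
re.findall() returns only the matched strings. If you also need the position of each match, re.finditer() yields match objects instead. The sketch below uses the span() method to connect regex matches back to slicing.

```python
import re

s = '012-3456-7890'

# Collect the (start, end) position of every match.
spans = [m.span() for m in re.finditer(r'\d+', s)]
print(spans)
# [(0, 3), (4, 8), (9, 13)]

# Each span can be used directly as a slice.
for start, end in spans:
    print(s[start:end])
# 012
# 3456
# 7890
```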

Examples of regex patterns

This section provides examples of regex patterns using metacharacters and special sequences.

Wildcard-like patterns

. matches any single character except a newline, and * matches zero or more repetitions of the preceding pattern.

For example, a.*b matches a string that starts with a and ends with b. Since * can match zero occurrences, it also matches ab.

print(re.findall('a.*b', 'axyzb'))
# ['axyzb']

print(re.findall('a.*b', 'a---b'))
# ['a---b']

print(re.findall('a.*b', 'aあいうえおb'))
# ['aあいうえおb']

print(re.findall('a.*b', 'ab'))
# ['ab']

+ matches one or more repetitions of the preceding pattern. Therefore, a.+b does not match ab.

print(re.findall('a.+b', 'ab'))
# []

print(re.findall('a.+b', 'axb'))
# ['axb']

print(re.findall('a.+b', 'axxxxxxb'))
# ['axxxxxxb']

? matches zero or one occurrence of the preceding pattern. Therefore, a.?b matches ab as well as a string with exactly one character between a and b.

print(re.findall('a.?b', 'ab'))
# ['ab']

print(re.findall('a.?b', 'axb'))
# ['axb']

print(re.findall('a.?b', 'axxb'))
# []

Greedy and non-greedy matching

*, +, and ? are greedy: they match as much text as possible. In contrast, *?, +?, and ?? are non-greedy (minimal): they match as few characters as possible.

s = 'axb-axxxxxxb'

print(re.findall('a.*b', s))
# ['axb-axxxxxxb']

print(re.findall('a.*?b', s))
# ['axb', 'axxxxxxb']
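
The difference matters most when a delimiter appears more than once. As an illustration, the sketch below extracts angle-bracket tags from a made-up sample string.

```python
import re

# Sample text invented for this example.
s = '<b>bold</b> and <i>italic</i>'

# Greedy: .* runs to the last '>' in the string.
greedy = re.findall('<.*>', s)
print(greedy)
# ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .*? stops at the first '>' after each '<'.
non_greedy = re.findall('<.*?>', s)
print(non_greedy)
# ['<b>', '</b>', '<i>', '</i>']
```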

Extract parts of the pattern with parentheses

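You can enclose part of a regex pattern in parentheses () to extract only that part of the match. When the pattern contains a group, re.findall() returns the contents of the group instead of the whole match.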

print(re.findall('a(.*)b', 'axyzb'))
# ['xyz']

To match literal parentheses (), escape them with a backslash \.

print(re.findall(r'\(.+\)', 'abc(def)ghi'))
# ['(def)']

print(re.findall(r'\((.+)\)', 'abc(def)ghi'))
# ['def']
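
A pattern can contain more than one pair of parentheses. The sketch below shows how the groups are then exposed by a match object and by re.findall(); the sample strings are made up for illustration.

```python
import re

s = '012-3456-7890'

m = re.search(r'(\d+)-(\d+)-(\d+)', s)

# group(0) is the whole match; group(n) is the n-th parenthesized part.
print(m.group(0))
# 012-3456-7890

print(m.group(2))
# 3456

# groups() returns all parenthesized parts as a tuple.
print(m.groups())
# ('012', '3456', '7890')

# With two or more groups, findall() returns a list of tuples.
pairs = re.findall(r'([a-z]+)=(\d+)', 'x=1, y=22')
print(pairs)
# [('x', '1'), ('y', '22')]
```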

Match any single character

Square brackets [] allow you to match any single character contained within.

A hyphen - between two characters (e.g., [a-z]) defines a range that covers every character whose Unicode code point lies between them. For example, [a-z] matches any single lowercase ASCII letter.

print(re.findall('[abc]x', 'ax-bx-cx'))
# ['ax', 'bx', 'cx']

print(re.findall('[abc]+', 'abc-aaa-cba'))
# ['abc', 'aaa', 'cba']

print(re.findall('[a-z]+', 'abc-xyz'))
# ['abc', 'xyz']
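
Several ranges can be combined in one pair of brackets, a leading ^ inside the brackets negates the set, and a hyphen placed at the end is treated as a literal character. The following sketch shows each case on a made-up sample string.

```python
import re

# Sample text invented for this example.
s = 'abc-XYZ_012'

# Combine several ranges in one character set.
alnum = re.findall('[a-zA-Z0-9]+', s)
print(alnum)
# ['abc', 'XYZ', '012']

# A leading ^ negates the set: anything except '-' and '_'.
not_separators = re.findall('[^-_]+', s)
print(not_separators)
# ['abc', 'XYZ', '012']

# A hyphen at the end of the set is a literal hyphen.
with_hyphen = re.findall('[a-z-]+', s)
print(with_hyphen)
# ['abc-']
```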

Match the start/end of the string

^ matches the start of the string, while $ matches the end of the string (or the position just before a trailing newline).

s = 'abc-def-ghi'

print(re.findall('[a-z]+', s))
# ['abc', 'def', 'ghi']

print(re.findall('^[a-z]+', s))
# ['abc']

print(re.findall('[a-z]+$', s))
# ['ghi']
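
For a string that contains several lines, ^ and $ by default still refer to the string as a whole. Passing re.MULTILINE to the flags argument makes them match at the start and end of each line as well, as the following sketch shows.

```python
import re

s = 'abc-def\nghi-jkl'

# Default: ^ matches only at the very start of the string.
default_start = re.findall('^[a-z]+', s)
print(default_start)
# ['abc']

# re.MULTILINE: ^ also matches right after each newline.
line_starts = re.findall('^[a-z]+', s, flags=re.MULTILINE)
print(line_starts)
# ['abc', 'ghi']

# re.MULTILINE: $ also matches right before each newline.
line_ends = re.findall('[a-z]+$', s, flags=re.MULTILINE)
print(line_ends)
# ['def', 'jkl']
```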

Extract by multiple patterns

| allows you to match a substring that satisfies any one of multiple patterns. For example, to match substrings that follow either pattern A or pattern B, use A|B.

s = 'axxxb-012'

print(re.findall('a.*b', s))
# ['axxxb']

print(re.findall(r'\d+', s))
# ['012']

print(re.findall(r'a.*b|\d+', s))
# ['axxxb', '012']
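
Note that | applies to the entire pattern on each side. To limit the alternatives to part of a pattern, wrap them in a group. The sketch below uses a non-capturing group (?:...) so that re.findall() still returns the whole match; the sample string is made up for illustration.

```python
import re

# Sample text invented for this example.
s = 'gray grey groy'

# Without grouping, the alternatives are 'gra' and 'ey'.
ungrouped = re.findall('gra|ey', s)
print(ungrouped)
# ['gra', 'ey']

# (?:a|e) restricts the choice to a single position.
grouped = re.findall('gr(?:a|e)y', s)
print(grouped)
# ['gray', 'grey']
```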

Case-insensitive matching

By default, matching with the re module is case-sensitive. To perform case-insensitive matching, pass re.IGNORECASE to the flags argument.

s = 'abc-Abc-ABC'

print(re.findall('[a-z]+', s))
# ['abc', 'bc']

print(re.findall('[A-Z]+', s))
# ['A', 'ABC']

print(re.findall('[a-z]+', s, flags=re.IGNORECASE))
# ['abc', 'Abc', 'ABC']
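
re.IGNORECASE also has the short alias re.I, and the same behavior can be requested inside the pattern itself by putting the inline flag (?i) at its start. A brief sketch of both forms:

```python
import re

s = 'abc-Abc-ABC'

# re.I is an alias for re.IGNORECASE.
with_alias = re.findall('[a-z]+', s, flags=re.I)
print(with_alias)
# ['abc', 'Abc', 'ABC']

# The inline flag (?i) must appear at the start of the pattern.
with_inline = re.findall('(?i)[a-z]+', s)
print(with_inline)
# ['abc', 'Abc', 'ABC']
```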
