Split strings in Python (delimiter, line break, regex, etc.)

Posted: 2019-05-29 / Tags: Python, String, Regular expression

Here's how to split strings by delimiters, line breaks, regular expressions, and the number of characters in Python.

  • Split by delimiter: split()
    • Specify the delimiter: sep
    • Specify the maximum number of splits: maxsplit
  • Split from right by delimiter: rsplit()
  • Split by line break: splitlines()
  • Split by regular expression: re.split()
    • Split by multiple different delimiters
  • Concatenate list of strings
  • Split based on the number of characters: slice

Split by delimiter: split()

Use the split() method to split a string by a single delimiter.

If the argument is omitted, the string is split on whitespace. Whitespace includes spaces, newlines \n, and tabs \t; consecutive whitespace characters are treated as a single delimiter.

A list of the words is returned.

s_blank = 'one two     three\nfour\tfive'
print(s_blank)
# one two     three
# four  five

print(s_blank.split())
# ['one', 'two', 'three', 'four', 'five']

print(type(s_blank.split()))
# <class 'list'>

Use join(), described below, to concatenate a list into a string.

Specify the delimiter: sep

Specify the delimiter as the first parameter, sep. The delimiter can be longer than one character.

s_comma = 'one,two,three,four,five'

print(s_comma.split(','))
# ['one', 'two', 'three', 'four', 'five']

print(s_comma.split('three'))
# ['one,two,', ',four,five']
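
As a minimal sketch of how an explicit sep changes the result (the variable name s_spaces is made up for this illustration): when sep is given, consecutive delimiters are not merged, so empty strings appear in the list.

```python
s_spaces = 'one  two'  # two spaces between the words

# No argument: a run of whitespace counts as one delimiter.
print(s_spaces.split())
# ['one', 'two']

# Explicit sep: every single space is a delimiter.
print(s_spaces.split(' '))
# ['one', '', 'two']

# The same applies to any other delimiter.
print('one,,two'.split(','))
# ['one', '', 'two']
```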

If you want to specify multiple delimiters, use regular expressions as described later.

Specify the maximum number of splits: maxsplit

Specify the maximum number of splits as the second parameter, maxsplit.

If maxsplit is given, at most maxsplit splits are done, so the list has at most maxsplit + 1 elements.

print(s_comma.split(',', 2))
# ['one', 'two', 'three,four,five']

This is useful, for example, when you want to delete the first line from a string.

With sep='\n' and maxsplit=1, you get a list of strings split at the first newline character \n. The second element [1] of this list is the string without its first line. As it is also the last element, it can be specified as [-1].

s_lines = 'one\ntwo\nthree\nfour'
print(s_lines)
# one
# two
# three
# four

print(s_lines.split('\n', 1))
# ['one', 'two\nthree\nfour']

print(s_lines.split('\n', 1)[0])
# one

print(s_lines.split('\n', 1)[1])
# two
# three
# four

print(s_lines.split('\n', 1)[-1])
# two
# three
# four

Similarly, to delete the first two lines:

print(s_lines.split('\n', 2)[-1])
# three
# four
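
Note that this idiom returns the last line rather than an empty string when the text has fewer lines than you try to delete. One way to guard against that is a small helper; drop_first_lines is a hypothetical name used only for this sketch, and it assumes lines are separated by \n.

```python
def drop_first_lines(text, n):
    # Split off at most n leading lines; the remainder is the last element.
    parts = text.split('\n', n)
    # With fewer than n newlines, nothing is left after removing n lines.
    return parts[-1] if len(parts) > n else ''

s_lines = 'one\ntwo\nthree\nfour'

print(drop_first_lines(s_lines, 2))
# three
# four

print(repr(drop_first_lines('one', 1)))
# ''
```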

Split from right by delimiter: rsplit()

rsplit() splits from the right of the string.

The result is different from split() only when the second parameter maxsplit is given.
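
A short sketch of that difference, using the comma-separated string from the earlier examples: without maxsplit both methods return the same list, and with maxsplit=1 they cut at opposite ends.

```python
s_comma = 'one,two,three,four,five'

print(s_comma.split(',') == s_comma.rsplit(','))
# True

print(s_comma.split(',', 1))
# ['one', 'two,three,four,five']

print(s_comma.rsplit(',', 1))
# ['one,two,three,four', 'five']
```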

Just as split() can delete the first line, rsplit() can delete the last line.

print(s_lines.rsplit('\n', 1))
# ['one\ntwo\nthree', 'four']

print(s_lines.rsplit('\n', 1)[0])
# one
# two
# three

print(s_lines.rsplit('\n', 1)[1])
# four

To delete the last two lines:

print(s_lines.rsplit('\n', 2)[0])
# one
# two

Split by line break: splitlines()

There is also splitlines(), which splits at line boundaries.

As in the previous examples, split() and rsplit() split on whitespace, including line breaks, by default, and you can also specify a line break with the parameter sep.

However, it is often better to use splitlines().

For example, consider a string that contains both \n (LF), used by Unix-like OSes including macOS, and \r\n (CR + LF), used by Windows.

s_lines_multi = '1 one\n2 two\r\n3 three\n'
print(s_lines_multi)
# 1 one
# 2 two
# 3 three

When split() is called without arguments, it splits not only at line breaks but also at spaces.

print(s_lines_multi.split())
# ['1', 'one', '2', 'two', '3', 'three']

Since only one newline sequence can be specified in sep, a string with mixed newline characters is not split cleanly: the \r of \r\n is left in the result. A newline at the end of the string also produces an empty string as the last element.

print(s_lines_multi.split('\n'))
# ['1 one', '2 two\r', '3 three', '']

splitlines() splits at various newline characters but not at other whitespace.

print(s_lines_multi.splitlines())
# ['1 one', '2 two', '3 three']

If the first argument keepends is set to True, each element keeps the newline characters at the end of the line.

print(s_lines_multi.splitlines(True))
# ['1 one\n', '2 two\r\n', '3 three\n']
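
Two more behaviors may be worth knowing; this is a small sketch, and the variable names are only for illustration. splitlines() also treats a lone \r as a line break, and joining the result of splitlines(True) restores the original string.

```python
s_mixed = 'a\r\nb\rc\n'

print(s_mixed.splitlines())
# ['a', 'b', 'c']

s_kept = s_mixed.splitlines(True)
print(s_kept)
# ['a\r\n', 'b\r', 'c\n']

# The line endings are preserved, so nothing is lost.
print(''.join(s_kept) == s_mixed)
# True

# An empty string gives an empty list, unlike split('\n').
print(''.splitlines())
# []
print(''.split('\n'))
# ['']
```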

Split by regular expression: re.split()

split() and rsplit() split only where sep matches exactly.

If you want to split at substrings that match a regular expression rather than an exact string, use split() of the re module.

In re.split(), specify the regular expression pattern as the first parameter and the target string as the second parameter.

The following example splits on runs of consecutive digits.

import re

s_nums = 'one1two22three333four'

print(re.split(r'\d+', s_nums))
# ['one', 'two', 'three', 'four']

The maximum number of splits can be specified in the third parameter maxsplit.

print(re.split(r'\d+', s_nums, 2))
# ['one', 'two', 'three333four']
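
If the delimiters themselves are needed, re.split() keeps them when the pattern is wrapped in a capturing group (). A minimal sketch with the same string:

```python
import re

s_nums = 'one1two22three333four'

# Parentheses make the matched digits part of the result.
print(re.split(r'(\d+)', s_nums))
# ['one', '1', 'two', '22', 'three', '333', 'four']

# A match at the start of the string yields an empty string there.
print(re.split(r'\d+', '1one2two'))
# ['', 'one', 'two']
```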

Split by multiple different delimiters

The following two patterns are useful to remember even if you are not familiar with regular expressions.

Enclose characters in [] to match any single one of them. This can be used to split by multiple different characters.

s_marks = 'one-two+three#four'

print(re.split('[-+#]', s_marks))
# ['one', 'two', 'three', 'four']

If patterns are separated by |, any one of them matches. Each pattern can use regular expression special characters, but plain strings work as well. This can be used to split by multiple different strings.

s_strs = 'oneXXXtwoYYYthreeZZZfour'

print(re.split('XXX|YYY|ZZZ', s_strs))
# ['one', 'two', 'three', 'four']
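
When the delimiters come from a list, or contain characters that are special in regular expressions such as . or +, the pattern can be built with re.escape(). The helper split_on_any below is a hypothetical name for this sketch, not part of the re module, and it assumes a non-empty list of non-empty delimiter strings.

```python
import re

def split_on_any(text, delimiters):
    # Try longer delimiters first so that '==' is not consumed as two '='.
    ordered = sorted(delimiters, key=len, reverse=True)
    # re.escape() makes each delimiter match literally.
    pattern = '|'.join(re.escape(d) for d in ordered)
    return re.split(pattern, text)

print(split_on_any('one.two+three|four', ['.', '+', '|']))
# ['one', 'two', 'three', 'four']

print(split_on_any('a==b=c', ['=', '==']))
# ['a', 'b', 'c']
```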

Concatenate list of strings

In the previous examples, we split a string and got a list.

If you want to concatenate a list of strings into one string, use the string method join().

Call join() on the separator string and pass the list of strings to concatenate as the argument.

l = ['one', 'two', 'three']

print(','.join(l))
# one,two,three

print('\n'.join(l))
# one
# two
# three

print(''.join(l))
# onetwothree
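
join() accepts only strings, so a list of numbers has to be converted first. A minimal sketch (the list nums is made up for the example); it also shows that join() undoes split() when the same separator is used.

```python
nums = [1, 2, 3]

# ','.join(nums) would raise TypeError because the items are not str.
print(','.join(str(n) for n in nums))
# 1,2,3

s_comma = 'one,two,three'
print(','.join(s_comma.split(',')) == s_comma)
# True
```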

Split based on the number of characters: slice

Use slicing to split a string based on the number of characters.

s = 'abcdefghij'

print(s[:5])
# abcde

print(s[5:])
# fghij

The parts can be obtained as a tuple or assigned to separate variables.

s_tuple = s[:5], s[5:]

print(s_tuple)
# ('abcde', 'fghij')

print(type(s_tuple))
# <class 'tuple'>

s_first, s_last = s[:5], s[5:]

print(s_first)
# abcde

print(s_last)
# fghij

Split into three:

s_first, s_second, s_last = s[:3], s[3:6], s[6:]

print(s_first)
# abc

print(s_second)
# def

print(s_last)
# ghij

The number of characters can be obtained with the built-in function len(). Using it, a string can be split into halves.

half = len(s) // 2
print(half)
# 5

s_first, s_last = s[:half], s[half:]

print(s_first)
# abcde

print(s_last)
# fghij
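
To split into pieces of a fixed length rather than a fixed number of parts, slicing can be combined with range(). This is a sketch; chunk_string is a hypothetical helper name.

```python
def chunk_string(text, size):
    # Step through the string in increments of size; the last piece may be shorter.
    return [text[i:i + size] for i in range(0, len(text), size)]

s = 'abcdefghij'

print(chunk_string(s, 3))
# ['abc', 'def', 'ghi', 'j']

print(chunk_string(s, 5))
# ['abcde', 'fghij']
```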

If you want to concatenate strings, use the + operator.

print(s_first + s_last)
# abcdefghij