pandas: Split string columns by delimiters or regular expressions

Tags: Python, pandas

In pandas, you can split a string column into multiple columns by a delimiter or a regular expression pattern using the string methods str.split() and str.extract().

This article covers the following topics.

  • Split with delimiter or regular expression pattern: str.split()
    • Specify delimiter or regular expression pattern: pat, regex
    • Split into multiple columns: expand
    • Specify the maximum number of splits: n
  • Split by extracting parts matching regular expressions: str.extract()
  • Apply the methods to pandas.DataFrame

Note that str.split() and str.extract() are methods of pandas.Series.

Split with delimiter or regular expression pattern: str.split()

To split strings using a delimiter or a regular expression pattern, use the str.split() method.

Specify delimiter or regular expression pattern: pat, regex

Consider the following pandas.Series as an example.

import pandas as pd

print(pd.__version__)
# 1.5.3

s_org = pd.Series(['aaa@xxx.com', 'bbb@yyy.com', 'ccc'], index=['A', 'B', 'C'])
print(s_org)
# A    aaa@xxx.com
# B    bbb@yyy.com
# C            ccc
# dtype: object

print(type(s_org))
# <class 'pandas.core.series.Series'>

Specify the delimiter in the first argument, pat. A pandas.Series whose elements are lists of the split strings is returned. If pat is omitted, the strings are split by whitespace.

s = s_org.str.split('@')
print(s)
# A    [aaa, xxx.com]
# B    [bbb, yyy.com]
# C             [ccc]
# dtype: object

print(type(s))
# <class 'pandas.core.series.Series'>
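
As mentioned above, if pat is omitted, the strings are split on whitespace. Here is a minimal sketch; s_ws is a throwaway Series introduced only for this illustration.

s_ws = pd.Series(['a b  c', 'x y'])
print(s_ws.str.split())
# 0    [a, b, c]
# 1       [x, y]
# dtype: object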

The regex argument was added in pandas 1.4.0; depending on its value, pat is interpreted either as a normal string or as a regular expression pattern.

By default (regex=None), the specified string is treated as a normal string if its length is 1 and as a regular expression pattern if its length is greater than 1.

print(s_org.str.split(r'@.+\.'))
# A    [aaa, com]
# B    [bbb, com]
# C         [ccc]
# dtype: object

You can also specify a compiled regular expression pattern for pat.

import re

pat = re.compile(r'@.+\.')
print(s_org.str.split(pat))
# A    [aaa, com]
# B    [bbb, com]
# C         [ccc]
# dtype: object

With regex=True, pat is always treated as a regular expression pattern; with regex=False, it is always treated as a normal string. Set regex explicitly when you need a one-character regular expression pattern or a multi-character normal string. Keep in mind that passing a compiled regular expression pattern as pat together with regex=False raises an error.
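
For illustration, the following sketch contrasts the two settings with the same s_org as above. With regex=True, the one-character pat '.' is interpreted as a regular expression that matches any character; with regex=False, the multi-character pat 'a.a' is taken literally and therefore matches nothing here.

print(s_org.str.split('.', regex=True)['C'])
# ['', '', '', '']

print(s_org.str.split('a.a', regex=False))
# A    [aaa@xxx.com]
# B    [bbb@yyy.com]
# C            [ccc]
# dtype: object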

Split into multiple columns: expand

To get the split result as a pandas.DataFrame with multiple columns, set the expand argument to True. The default is expand=False.

If there are fewer splits in a row than the number of columns, the missing elements will be set to None.

df = s_org.str.split('@', expand=True)
print(df)
#      0        1
# A  aaa  xxx.com
# B  bbb  yyy.com
# C  ccc     None

print(type(df))
# <class 'pandas.core.frame.DataFrame'>

The column names of the obtained pandas.DataFrame are sequential numbers starting from 0. You can change them using the columns attribute.

df.columns = ['local', 'domain']
print(df)
#   local   domain
# A   aaa  xxx.com
# B   bbb  yyy.com
# C   ccc     None

Specify the maximum number of splits: n

You can set the maximum number of splits using the n argument. By default, n=-1, and the string will be split at all matched positions.

s_org = pd.Series(['a-b-c-d', 'x-y-z', '1'], index=['A', 'B', 'C'])
print(s_org)
# A    a-b-c-d
# B      x-y-z
# C          1
# dtype: object

print(s_org.str.split('-'))
# A    [a, b, c, d]
# B       [x, y, z]
# C             [1]
# dtype: object

print(s_org.str.split('-', n=1))
# A    [a, b-c-d]
# B      [x, y-z]
# C           [1]
# dtype: object

Split by extracting parts matching regular expressions: str.extract()

To split strings by extracting parts of the string that match regular expressions, use the str.extract() method.

Consider the following pandas.Series as an example:

s_org = pd.Series(['aaa@xxx.com', 'bbb@yyy.com', 'ccc'], index=['A', 'B', 'C'])
print(s_org)
# A    aaa@xxx.com
# B    bbb@yyy.com
# C            ccc
# dtype: object

Specify the regular expression pattern in the first argument, pat. The parts of each string that match the groups enclosed in () in the pattern are extracted as separate columns. If multiple groups are captured, a pandas.DataFrame is returned. Elements with no match are set to NaN.

df = s_org.str.extract(r'(.+)@(.+)\.(.+)')
print(df)
#      0    1    2
# A  aaa  xxx  com
# B  bbb  yyy  com
# C  NaN  NaN  NaN

If there is only one group, setting expand=True returns a pandas.DataFrame, while setting expand=False returns a pandas.Series. The default is expand=True.

df = s_org.str.extract(r'(\w+)', expand=True)
print(df)
#      0
# A  aaa
# B  bbb
# C  ccc

print(type(df))
# <class 'pandas.core.frame.DataFrame'>

s = s_org.str.extract(r'(\w+)', expand=False)
print(s)
# A    aaa
# B    bbb
# C    ccc
# dtype: object

print(type(s))
# <class 'pandas.core.series.Series'>

When you use a named group (?P<name>...) in the regular expression pattern, the group names are used as column names.

df_name = s_org.str.extract(
    r'(?P<local>.*)@(?P<second_LD>.*)\.(?P<TLD>.*)', expand=True
)
print(df_name)
#   local second_LD  TLD
# A   aaa       xxx  com
# B   bbb       yyy  com
# C   NaN       NaN  NaN

The flags argument allows you to specify regular expression flags, such as re.IGNORECASE.
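
As a quick sketch of flags (s_upper is a hypothetical Series introduced only for this example), re.IGNORECASE lets the lowercase character class also match uppercase letters:

import re

s_upper = pd.Series(['AAA@XXX.COM'])
print(s_upper.str.extract(r'(?P<local>[a-z]+)@', flags=re.IGNORECASE))
#   local
# 0   AAA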

Note that str.extract() only extracts the first match when there are multiple matches. To extract all matches, use the str.extractall() method.
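
The following sketch illustrates the difference; s_multi is a hypothetical Series whose strings contain multiple matches.

s_multi = pd.Series(['a1a2', 'b1', 'c1'], index=['A', 'B', 'C'])

print(s_multi.str.extract(r'[ab](\d)'))
#      0
# A    1
# B    1
# C  NaN

print(s_multi.str.extractall(r'[ab](\d)'))
#          0
#   match
# A 0      1
#   1      2
# B 0      1

str.extractall() returns one row per match, indexed by the original index and a match level; strings with no match ('c1' here) are dropped from the result.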

Apply the methods to pandas.DataFrame

Here are examples of replacing a specific column in a pandas.DataFrame with multiple columns obtained by splitting it. There may be more efficient ways to achieve this.

Although str.split() is used in the following examples, the same concept applies to str.extract().

Consider the previously created pandas.DataFrame:

print(df)
#   local   domain
# A   aaa  xxx.com
# B   bbb  yyy.com
# C   ccc     None

Using str.split() with expand=True on a specific column returns a pandas.DataFrame.

print(df['domain'].str.split('.', expand=True))
#       0     1
# A   xxx   com
# B   yyy   com
# C  None  None

Concatenate it with the original pandas.DataFrame using pd.concat(), then delete the original column with the drop() method.

df2 = pd.concat([df, df['domain'].str.split('.', expand=True)], axis=1).drop(
    'domain', axis=1
)
print(df2)
#   local     0     1
# A   aaa   xxx   com
# B   bbb   yyy   com
# C   ccc  None  None

Another method is to select only the columns you need from the original pandas.DataFrame when concatenating.

df3 = pd.concat([df['local'], df['domain'].str.split('.', expand=True)], axis=1)
print(df3)
#   local     0     1
# A   aaa   xxx   com
# B   bbb   yyy   com
# C   ccc  None  None

To change specific column names, use the rename() method.

df3.rename(columns={0: 'second_LD', 1: 'TLD'}, inplace=True)
print(df3)
#   local second_LD   TLD
# A   aaa       xxx   com
# B   bbb       yyy   com
# C   ccc      None  None
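
As noted above, the same approach works with str.extract(). For example, using named groups gives the new columns their names directly, so rename() is not needed. This is a sketch based on the df defined earlier; df4 is just an illustrative name.

df4 = pd.concat(
    [df['local'], df['domain'].str.extract(r'(?P<second_LD>.+)\.(?P<TLD>.+)')], axis=1
)
print(df4)
#   local second_LD  TLD
# A   aaa       xxx  com
# B   bbb       yyy  com
# C   ccc       NaN  NaN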
