pandas: Split string columns by delimiters or regular expressions
In pandas, you can split a string column into multiple columns using delimiters or regular expression patterns with the string methods str.split() and str.extract().
This article covers the following topics:
- Split with delimiter or regular expression pattern: str.split()
- Specify delimiter or regular expression pattern: pat, regex
- Split into multiple columns: expand
- Specify the maximum number of splits: n
- Split by extracting parts matching regular expressions: str.extract()
- Apply the methods to pandas.DataFrame

Note that str.split() and str.extract() are methods of pandas.Series.
Split with delimiter or regular expression pattern: str.split()
To split strings using a delimiter or a regular expression pattern, use the str.split() method.
Specify delimiter or regular expression pattern: pat, regex
Consider the following pandas.Series as an example.
import pandas as pd
print(pd.__version__)
# 1.5.3
s_org = pd.Series(['aaa@xxx.com', 'bbb@yyy.com', 'ccc'], index=['A', 'B', 'C'])
print(s_org)
# A aaa@xxx.com
# B bbb@yyy.com
# C ccc
# dtype: object
print(type(s_org))
# <class 'pandas.core.series.Series'>
Specify the delimiter in the first argument, pat. A pandas.Series whose elements are lists of the split substrings is returned. If pat is omitted, the strings are split by whitespace.
s = s_org.str.split('@')
print(s)
# A [aaa, xxx.com]
# B [bbb, yyy.com]
# C [ccc]
# dtype: object
print(type(s))
# <class 'pandas.core.series.Series'>
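As a quick illustration of the default behavior (a minimal sketch; the Series here is ad hoc), omitting pat splits on runs of whitespace, like Python's built-in str.split():

```python
import pandas as pd

# With pat omitted, splitting happens on whitespace,
# collapsing consecutive spaces and tabs
s_ws = pd.Series(['a b  c', 'x\ty'])
result = s_ws.str.split()
print(result)
# 0    [a, b, c]
# 1       [x, y]
# dtype: object
```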
In pandas versions before 1.4.0, the pat argument was always treated as a normal string. In version 1.4.0, the regex argument was added, and the interpretation of pat now depends on its setting.

By default (regex=None), the specified string is treated as a normal string if its length is 1, and as a regular expression pattern if its length is greater than 1.
print(s_org.str.split(r'@.+\.'))
# A [aaa, com]
# B [bbb, com]
# C [ccc]
# dtype: object
You can also specify a compiled regular expression pattern for pat.
import re
pat = re.compile(r'@.+\.')
print(s_org.str.split(pat))
# A [aaa, com]
# B [bbb, com]
# C [ccc]
# dtype: object
pat is always treated as a regular expression pattern if regex=True, and as a normal string if regex=False. If you want to use a one-character regular expression pattern, or a normal string of two or more characters, set regex to True or False explicitly. Keep in mind that setting regex=False while passing a compiled regular expression pattern as pat raises an error.
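To illustrate the difference (a minimal sketch with an ad hoc Series): the two-character string '.+' is interpreted as a regex by default, so regex=False is needed to split on it literally:

```python
import pandas as pd

s = pd.Series(['a.+b.+c'])

# By default, a pat longer than one character is treated as a regex:
# '.+' matches the entire string, leaving only empty edges
res_regex = s.str.split('.+')
print(res_regex)
# 0    [, ]
# dtype: object

# regex=False forces the two-character string '.+' to be taken literally
res_literal = s.str.split('.+', regex=False)
print(res_literal)
# 0    [a, b, c]
# dtype: object
```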
Split into multiple columns: expand
To get the split result as a pandas.DataFrame with multiple columns, set the expand argument to True. The default is expand=False.

If a row has fewer splits than the number of columns, the missing elements are set to None.
df = s_org.str.split('@', expand=True)
print(df)
# 0 1
# A aaa xxx.com
# B bbb yyy.com
# C ccc None
print(type(df))
# <class 'pandas.core.frame.DataFrame'>
The column names of the resulting pandas.DataFrame are sequential numbers starting from 0. You can change them via the columns attribute.
df.columns = ['local', 'domain']
print(df)
# local domain
# A aaa xxx.com
# B bbb yyy.com
# C ccc None
Specify the maximum number of splits: n
You can set the maximum number of splits with the n argument. By default, n=-1, and the string is split at all matched positions.
s_org = pd.Series(['a-b-c-d', 'x-y-z', '1'], index=['A', 'B', 'C'])
print(s_org)
# A a-b-c-d
# B x-y-z
# C 1
# dtype: object
print(s_org.str.split('-'))
# A [a, b, c, d]
# B [x, y, z]
# C [1]
# dtype: object
print(s_org.str.split('-', n=1))
# A [a, b-c-d]
# B [x, y-z]
# C [1]
# dtype: object
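n can be combined with expand=True to cap the number of resulting columns (a sketch reusing the Series above):

```python
import pandas as pd

s_org = pd.Series(['a-b-c-d', 'x-y-z', '1'], index=['A', 'B', 'C'])

# At most one split per row, expanded into two columns;
# rows with fewer splits are padded with None
df_n = s_org.str.split('-', n=1, expand=True)
print(df_n)
#    0      1
# A  a  b-c-d
# B  x    y-z
# C  1   None
```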
Split by extracting parts matching regular expressions: str.extract()
To split strings by extracting the parts that match regular expressions, use the str.extract() method.
Consider the following pandas.Series as an example:
s_org = pd.Series(['aaa@xxx.com', 'bbb@yyy.com', 'ccc'], index=['A', 'B', 'C'])
print(s_org)
# A aaa@xxx.com
# B bbb@yyy.com
# C ccc
# dtype: object
Specify the regular expression pattern in the first argument, pat. The string is split according to the parts that match the groups enclosed in () in the pattern. If multiple groups are extracted, a pandas.DataFrame is returned. For rows with no match, the elements are set to NaN.
df = s_org.str.extract(r'(.+)@(.+)\.(.+)')
print(df)
# 0 1 2
# A aaa xxx com
# B bbb yyy com
# C NaN NaN NaN
If there is only one group, setting expand=True returns a pandas.DataFrame, while setting expand=False returns a pandas.Series. The default is expand=True.
df = s_org.str.extract(r'(\w+)', expand=True)
print(df)
# 0
# A aaa
# B bbb
# C ccc
print(type(df))
# <class 'pandas.core.frame.DataFrame'>
s = s_org.str.extract(r'(\w+)', expand=False)
print(s)
# A aaa
# B bbb
# C ccc
# dtype: object
print(type(s))
# <class 'pandas.core.series.Series'>
When you use a named group (?P<name>...) in the regular expression pattern, the group names are used as column names.
df_name = s_org.str.extract(
r'(?P<local>.*)@(?P<second_LD>.*)\.(?P<TLD>.*)', expand=True
)
print(df_name)
# local second_LD TLD
# A aaa xxx com
# B bbb yyy com
# C NaN NaN NaN
The flags argument allows you to specify regular expression flags, such as re.IGNORECASE.
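For example (a minimal sketch with an ad hoc Series), re.IGNORECASE makes a lowercase character class match both cases:

```python
import pandas as pd
import re

s = pd.Series(['ABC123', 'abc456'])

# Without flags, [a-z]+ matches only the lowercase row
res_plain = s.str.extract(r'([a-z]+)', expand=False)
print(res_plain)
# 0    NaN
# 1    abc
# dtype: object

# With re.IGNORECASE, it matches regardless of case
res_ic = s.str.extract(r'([a-z]+)', flags=re.IGNORECASE, expand=False)
print(res_ic)
# 0    ABC
# 1    abc
# dtype: object
```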
Note that str.extract() only extracts the first match when there are multiple matches. To extract all matches, use the str.extractall() method.
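A brief sketch of the difference, using an ad hoc Series with repeated matches:

```python
import pandas as pd

s = pd.Series(['a1b2', 'c3'])

# extract() keeps only the first match per row
df_first = s.str.extract(r'([a-z])(\d)')
print(df_first)
#    0  1
# 0  a  1
# 1  c  3

# extractall() returns every match, with a MultiIndex whose
# second level ('match') numbers the matches within each row
df_all = s.str.extractall(r'([a-z])(\d)')
print(df_all)
```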
Apply the methods to pandas.DataFrame
Here are examples of updating a specific column in a pandas.DataFrame by splitting it into multiple columns. There may be more efficient ways to achieve this.
Although str.split() is used in the following examples, the same approach applies to str.extract().
Consider the previously created pandas.DataFrame:
print(df)
# local domain
# A aaa xxx.com
# B bbb yyy.com
# C ccc None
Applying str.split() with expand=True to a specific column returns a pandas.DataFrame.
print(df['domain'].str.split('.', expand=True))
# 0 1
# A xxx com
# B yyy com
# C None None
Concatenate it with the original pandas.DataFrame using pd.concat(), and delete the original column using the drop() method.
- pandas: Concat multiple DataFrame/Series with concat()
- pandas: Delete rows/columns from DataFrame with drop()
df2 = pd.concat([df, df['domain'].str.split('.', expand=True)], axis=1).drop(
'domain', axis=1
)
print(df2)
# local 0 1
# A aaa xxx com
# B bbb yyy com
# C ccc None None
Another approach is to select only the columns you need from the original pandas.DataFrame when concatenating.
df3 = pd.concat([df['local'], df['domain'].str.split('.', expand=True)], axis=1)
print(df3)
# local 0 1
# A aaa xxx com
# B bbb yyy com
# C ccc None None
To change specific column names, use the rename() method.
df3.rename(columns={0: 'second_LD', 1: 'TLD'}, inplace=True)
print(df3)
# local second_LD TLD
# A aaa xxx com
# B bbb yyy com
# C ccc None None
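As yet another option (a sketch, not shown in the examples above): assign the expanded result directly to new columns, then drop the original column. The names second_LD and TLD follow the earlier example; rows whose domain is missing become NaN:

```python
import pandas as pd

# Rebuild the example DataFrame (domain is None for row C)
df = pd.DataFrame({'local': ['aaa', 'bbb', 'ccc'],
                   'domain': ['xxx.com', 'yyy.com', None]},
                  index=['A', 'B', 'C'])

# Assign the two split columns in one step, then drop the original column
df[['second_LD', 'TLD']] = df['domain'].str.split('.', expand=True)
df = df.drop('domain', axis=1)
print(df)
#   local second_LD  TLD
# A   aaa       xxx  com
# B   bbb       yyy  com
# C   ccc       NaN  NaN
```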