Convert Between Unicode Code Point and Character: chr, ord
In Python, the built-in chr() and ord() functions convert between Unicode code points and characters.
In string literals, characters can also be written as hexadecimal Unicode code points using \x, \u, or \U.
You can look up the correspondence between Unicode code points and characters in the Unicode Consortium's code charts.
Convert characters to Unicode code points: ord()
ord() returns the Unicode code point as an integer (int) when given a single-character string.
i = ord('A')
print(i)
# 65
print(type(i))
# <class 'int'>
Specifying a string of two or more characters will result in an error.
# ord('abc')
# TypeError: ord() expected a character, but string of length 3 found
Unicode code points are often represented in hexadecimal notation. To convert an integer to a hexadecimal string, use the built-in hex() function.
s = hex(i)
print(s)
# 0x41
print(type(s))
# <class 'str'>
The built-in format() function can be used for more detailed formatting, such as zero-padding and including or excluding the 0x prefix.
print(format(i, '04x'))
# 0041
print(format(i, '#06x'))
# 0x0041
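The same format specifications can also be used in f-strings. The short sketch below reuses the variable i from the examples above and shows that both forms give the same result.

```python
i = ord('A')

# f-strings accept the same format specification as format().
print(f'{i:04x}')
# 0041
print(f'{i:#06x}')
# 0x0041
print(f'{i:04x}' == format(i, '04x'))
# True
```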
In summary, the hexadecimal Unicode code point for a specific character can be obtained as follows.
print(format(ord('X'), '#08x'))
# 0x000058
print(format(ord('💯'), '#08x'))
# 0x01f4af
Some emojis, including flags, are represented by multiple Unicode code points, known as emoji sequences.
As of Python 3.11.7, ord() raises an error for an emoji sequence, because the string contains more than one code point. For the same reason, the built-in len() function returns the number of Unicode code points rather than the number of displayed characters.
# ord('🇯🇵')
# TypeError: ord() expected a character, but string of length 2 found
print(len('🇯🇵'))
# 2
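To inspect such a string, one option is to call ord() on each code point in turn. The following is a minimal sketch; the flag is written with \U escapes (regional indicator symbols J and P) so the example does not depend on the file's encoding, and the variable names are only illustrative.

```python
# A flag emoji made of two code points, written with \U escapes.
flag = '\U0001f1ef\U0001f1f5'

# Iterating over a string yields one code point at a time.
code_points = [format(ord(c), '#08x') for c in flag]
print(code_points)
# ['0x01f1ef', '0x01f1f5']
print(len(code_points) == len(flag))
# True
```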
Convert Unicode code points to characters: chr()
chr() converts a specified integer (int) to its corresponding Unicode character string (str).
print(chr(65))
# A
print(type(chr(65)))
# <class 'str'>
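As a quick check, chr() and ord() undo each other. The sketch below also shows that chr() only accepts integers in the Unicode range, 0 through 0x10ffff, and raises ValueError outside it.

```python
# ord() and chr() are inverses of each other.
print(chr(ord('A')))
# A
print(ord(chr(65)))
# 65

# The largest valid code point is 0x10ffff.
print(hex(ord(chr(0x10ffff))))
# 0x10ffff

# Anything beyond that range is rejected.
try:
    chr(0x110000)
except ValueError:
    print('out of range')
# out of range
```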
Since integers can be written in hexadecimal by prefixing them with 0x in Python, you can pass a hexadecimal Unicode code point directly to chr(), regardless of zero-padding.
print(65 == 0x41)
# True
print(chr(0x41))
# A
print(chr(0x000041))
# A
To convert a hexadecimal string to a Unicode character, use int() with base 16 to convert the string to an integer, and then pass the result to chr().
s = '0x0041'
print(int(s, 16))
# 65
print(chr(int(s, 16)))
# A
When converting a hexadecimal string to an integer using int(), the second argument can be 0 if the string is prefixed with 0x.
Unicode code points are often represented as U+XXXX. To convert such a string to its corresponding character, use slicing to extract the numeric part.
s = 'U+0041'
print(s[2:])
# 0041
print(chr(int(s[2:], 16)))
# A
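The conversion can be wrapped in small helpers for both directions. This is only a sketch: from_u_notation() and to_u_notation() are hypothetical names chosen for illustration, not built-in functions.

```python
def from_u_notation(s):
    """Convert a string like 'U+0041' to the character it names."""
    if not s.upper().startswith('U+'):
        raise ValueError(f'expected U+XXXX notation, got {s!r}')
    return chr(int(s[2:], 16))


def to_u_notation(ch):
    """Convert a single character to U+XXXX notation (at least 4 digits)."""
    return f'U+{ord(ch):04X}'


print(from_u_notation('U+0041'))
# A
print(to_u_notation('A'))
# U+0041
print(to_u_notation(from_u_notation('U+1F4AF')))
# U+1F4AF
```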
Use Unicode code points in string literals: \x, \u, \U
In string literals, characters can be represented by their hexadecimal Unicode code points using \x, \u, or \U.
The sequences \xXX, \uXXXX, and \UXXXXXXXX require exactly 2, 4, and 8 hexadecimal digits, respectively. Supplying fewer digits raises a SyntaxError; any extra digits are treated as ordinary characters that follow the escape.
print('\x41')
# A
print('\u0041')
# A
print('\U00000041')
# A
print('\U0001f4af')
# 💯
# print('\u041')
# SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-4: truncated \uXXXX escape
# print('\U0000041')
# SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-8: truncated \UXXXXXXXX escape
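Because each form has a fixed width, the shortest usable escape depends on the size of the code point. The helper below is a sketch of that rule; to_escape() is a hypothetical name used only for this illustration.

```python
def to_escape(ch):
    """Return the shortest hexadecimal escape sequence for one character."""
    cp = ord(ch)
    if cp <= 0xff:
        return f'\\x{cp:02x}'   # 2 digits cover up to 0xff
    if cp <= 0xffff:
        return f'\\u{cp:04x}'   # 4 digits cover up to 0xffff
    return f'\\U{cp:08x}'       # 8 digits cover every code point


print(to_escape('A'))
# \x41
print(to_escape('\u3042'))
# \u3042
print(to_escape('\U0001f4af'))
# \U0001f4af
```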
Each code is treated as one character.
print('\u0041\u0042\u0043')
# ABC
print(len('\u0041\u0042\u0043'))
# 3
In raw strings, where escape sequences are ignored, these sequences are treated as regular text.
print(r'\u0041\u0042\u0043')
# \u0041\u0042\u0043
print(len(r'\u0041\u0042\u0043'))
# 18
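If the escape sequences arrive as plain text at runtime, for example from a raw string, one way to interpret them is the unicode_escape codec. This is a sketch that assumes the text is ASCII-only; the codec treats other bytes as Latin-1, so non-ASCII input may not decode as expected.

```python
raw = r'\u0041\u0042\u0043'

# Encode to bytes, then decode while interpreting the escape sequences.
decoded = raw.encode('ascii').decode('unicode_escape')
print(decoded)
# ABC
print(len(decoded))
# 3
```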