Convert Between Unicode Code Point and Character: chr, ord

Tags: Python, String

In Python, the built-in chr() and ord() functions allow you to convert between Unicode code points and characters.

Additionally, in string literals, characters can be represented by their hexadecimal Unicode code points using \x, \u, or \U.

You can check the correspondence between Unicode code points and characters in the code charts published by the Unicode Consortium.

Convert characters to Unicode code points: ord()

ord() returns the Unicode code point as an integer (int) when given a single-character string.

i = ord('A')
print(i)
# 65

print(type(i))
# <class 'int'>
source: chr_ord.py

Specifying a string of two or more characters will result in an error.

# ord('abc')
# TypeError: ord() expected a character, but string of length 3 found
source: chr_ord.py
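To get the code points of a longer string, one option is to call ord() on each character in turn. The following is a minimal sketch; the variable names text and code_points are only illustrative.

```python
# Collect the code point of every character in a multi-character string.
text = 'abc'
code_points = [ord(c) for c in text]

print(code_points)
# [97, 98, 99]
```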

Unicode code points are often represented in hexadecimal notation. To convert an integer to a hexadecimal string, use the built-in hex() function.

s = hex(i)
print(s)
# 0x41

print(type(s))
# <class 'str'>
source: chr_ord.py

The built-in format() function can be used for more detailed formatting, such as zero-padding and including or excluding 0x.

print(format(i, '04x'))
# 0041

print(format(i, '#06x'))
# 0x0041
source: chr_ord.py
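The same format specifications should also work in f-strings, which can be convenient when the value is embedded in a longer message. A short sketch, reusing i = ord('A') from above; the names padded and prefixed are illustrative.

```python
i = ord('A')

# Same format specs as format(i, '04x') and format(i, '#06x').
padded = f'{i:04x}'
prefixed = f'{i:#06x}'

print(padded)
# 0041

print(prefixed)
# 0x0041
```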

In summary, the hexadecimal Unicode code point for a specific character can be obtained as follows.

print(format(ord('X'), '#08x'))
# 0x000058

print(format(ord('💯'), '#08x'))
# 0x01f4af
source: chr_ord.py
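If you prefer the common U+XXXX notation over a 0x prefix, the steps above can be wrapped in a small function. This is only a sketch: to_u_notation is a hypothetical helper name, not a built-in, and the emoji is written as an escape sequence to keep the example ASCII-only.

```python
def to_u_notation(char):
    """Return the U+XXXX notation for a single character (hypothetical helper)."""
    # '04X' pads to at least four digits and uses uppercase hex letters.
    return f'U+{ord(char):04X}'


print(to_u_notation('X'))
# U+0058

print(to_u_notation('\U0001f4af'))
# U+1F4AF
```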

Some emojis, including flags, are represented by multiple Unicode code points, known as emoji sequences.

As of Python 3.11.7, ord() cannot process emoji sequences and raises an error. The built-in len() function returns the number of Unicode code points in such an emoji, not 1.

# ord('🇯🇵')
# TypeError: ord() expected a character, but string of length 2 found

print(len('🇯🇵'))
# 2
source: chr_ord.py
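To inspect the individual code points of an emoji sequence, you can iterate over the string and apply ord() to each element. In this sketch the Japanese flag is written with \U escapes (the two regional indicator symbols J and P); the names flag and code_points are illustrative.

```python
# The Japanese flag as two regional indicator symbols (J and P).
flag = '\U0001f1ef\U0001f1f5'

# ord() works on each code point individually.
code_points = [f'U+{ord(c):04X}' for c in flag]

print(code_points)
# ['U+1F1EF', 'U+1F1F5']
```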

Convert Unicode code points to characters: chr()

chr() converts a specified integer (int) to its corresponding Unicode character string (str).

print(chr(65))
# A

print(type(chr(65)))
# <class 'str'>
source: chr_ord.py
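chr() is the inverse of ord(), and it accepts only integers in the valid Unicode range (0 through 0x10FFFF). The following sketch checks the round trip and shows what happens just outside the range; the names round_trip and error_name are illustrative.

```python
# chr() and ord() undo each other for a single character.
round_trip = chr(ord('A'))
print(round_trip)
# A

# 0x10FFFF is the largest valid code point; one past it raises ValueError.
error_name = None
try:
    chr(0x110000)
except ValueError as e:
    error_name = type(e).__name__

print(error_name)
# ValueError
```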

Since integers can be written in hexadecimal by prefixing them with 0x in Python, you can directly specify a hexadecimal Unicode code point as an argument in chr(), regardless of zero-padding.

print(65 == 0x41)
# True

print(chr(0x41))
# A

print(chr(0x000041))
# A
source: chr_ord.py

To convert a hexadecimal string to a Unicode character, use int() to convert the string to an integer with base 16, and then pass it to chr().

s = '0x0041'

print(int(s, 16))
# 65

print(chr(int(s, 16)))
# A
source: chr_ord.py

When converting a hexadecimal string to an integer using int(), the second argument can be 0 if the string is prefixed with 0x. In that case, the base is inferred from the prefix.
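As a brief illustration of the base argument, the sketch below parses a prefixed string with base 0 and an unprefixed string with base 16. The variable names are illustrative.

```python
# With a 0x prefix, base 0 lets int() infer hexadecimal from the prefix.
prefixed = int('0x0041', 0)
print(prefixed)
# 65

print(chr(prefixed))
# A

# Without a prefix, base 0 does not assume hexadecimal, so pass 16 explicitly.
unprefixed = int('0041', 16)
print(unprefixed)
# 65
```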

Unicode code points are often represented as U+XXXX. To convert such a string to its corresponding character, use slicing to extract the numeric part.

s = 'U+0041'

print(s[2:])
# 0041

print(chr(int(s[2:], 16)))
# A
source: chr_ord.py
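The slicing approach can be wrapped in a function that also checks the prefix before converting. This is a sketch only: from_u_notation is a hypothetical helper, and the prefix check is one possible validation choice.

```python
def from_u_notation(s):
    """Convert a string such as 'U+0041' to its character (hypothetical helper)."""
    if not s.upper().startswith('U+'):
        raise ValueError(f'expected U+XXXX notation, got {s!r}')
    # Drop the 'U+' prefix and parse the rest as hexadecimal.
    return chr(int(s[2:], 16))


print(from_u_notation('U+0041'))
# A

# Several space-separated code points can be converted and joined.
print(''.join(from_u_notation(t) for t in 'U+0041 U+0042 U+0043'.split()))
# ABC
```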

Use Unicode code points in string literals: \x, \u, \U

In string literals, characters can be represented by their hexadecimal Unicode code points using \x, \u, or \U.

The sequences \xXX, \uXXXX, and \UXXXXXXXX require exactly 2, 4, and 8 hexadecimal digits, respectively. An error occurs if too few digits are given.

print('\x41')
# A

print('\u0041')
# A

print('\U00000041')
# A

print('\U0001f4af')
# 💯

# print('\u041')
# SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-4: truncated \uXXXX escape

# print('\U0000041')
# SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-8: truncated \UXXXXXXXX escape
source: chr_ord.py
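The three escape forms produce the same string when they name the same code point. Note also that supplying extra digits is not an error: the escape consumes its fixed number of digits and the remainder is ordinary text. A small sketch, with illustrative variable names:

```python
# All three escapes, and chr(), yield the same one-character string.
same = '\x41' == '\u0041' == '\U00000041' == chr(0x41)
print(same)
# True

# \x consumes exactly two hex digits; the trailing '1' is a literal character.
extra = '\x411'
print(extra)
# A1

print(len(extra))
# 2
```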

Each escape sequence is treated as one character.

print('\u0041\u0042\u0043')
# ABC

print(len('\u0041\u0042\u0043'))
# 3
source: chr_ord.py

In raw strings, where escape sequences are ignored, these sequences are treated as regular text.

print(r'\u0041\u0042\u0043')
# \u0041\u0042\u0043

print(len(r'\u0041\u0042\u0043'))
# 18
source: chr_ord.py
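If you receive such escape text as data (for example, read from a file) and want the actual characters, one possible approach is to encode it to bytes and decode with the unicode_escape codec. This sketch assumes the text contains only ASCII characters; non-ASCII input may not round-trip correctly with this codec.

```python
# A raw string holding literal backslash-u sequences (18 characters).
raw = r'\u0041\u0042\u0043'

# Assumption: the text is pure ASCII, so encoding to ASCII bytes is safe.
decoded = raw.encode('ascii').decode('unicode_escape')

print(decoded)
# ABC

print(len(decoded))
# 3
```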
