note.nkmk.me

Convert Unicode code point and character to each other (chr, ord)

Posted: 2021-09-21 / Tags: Python, String

In Python, the built-in functions chr() and ord() are used to convert between Unicode code points and characters.

A character can also be represented by writing a hexadecimal Unicode code point with \x, \u, or \U in a string literal.

This article covers the following topics.

  • Convert character to Unicode code point: ord()
  • Convert Unicode code point to character: chr()
  • Use Unicode code points in strings: \x, \u, \U

Convert character to Unicode code point: ord()

When you pass a one-character string to ord(), it returns the Unicode code point of that character as an integer (int).

i = ord('A')
print(i)
# 65

print(type(i))
# <class 'int'>
source: chr_ord.py

An error is raised if you specify a string of two or more characters.

# ord('abc')
# TypeError: ord() expected a character, but string of length 3 found
source: chr_ord.py

Unicode code points are often written in hexadecimal notation. Use the built-in function hex() to convert an integer to a hexadecimal string.

s = hex(i)
print(s)
# 0x41

print(type(s))
# <class 'str'>
source: chr_ord.py

The built-in function format() can be used to specify more detailed formatting, such as zero-filling and prefix 0x.

print(format(i, '04x'))
# 0041

print(format(i, '#06x'))
# 0x0041
source: chr_ord.py

In summary, the hexadecimal Unicode code point for a particular character can be obtained as follows.

print(format(ord('X'), '#08x'))
# 0x000058

print(format(ord('💯'), '#08x'))
# 0x01f4af
source: chr_ord.py

Flags and some other emoji are represented by a sequence of multiple Unicode code points.

Note that as of Python 3.7.3, ord() does not support such emoji and raises an error. If you pass one of those emoji to the built-in function len(), it returns the number of Unicode code points, not 1.

# ord('🇯🇵')
# TypeError: ord() expected a character, but string of length 2 found

print(len('🇯🇵'))
# 2
source: chr_ord.py
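
Because ord() accepts only a single code point, one way to inspect such an emoji is to iterate over the string and call ord() on each element. The following is a minimal sketch; the names flag and code_points are chosen only for illustration, and the flag is written with \U escapes so that the example does not depend on the encoding of the source file.

```python
# The Japanese flag emoji, written as its two regional indicator code points.
flag = '\U0001F1EF\U0001F1F5'

# Iterating over a str yields one code point at a time,
# so ord() can be applied to each element.
code_points = [format(ord(c), '#08x') for c in flag]

print(code_points)
# ['0x01f1ef', '0x01f1f5']
```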

Convert Unicode code point to character: chr()

chr() takes an integer (int) and returns a string (str) containing the character whose Unicode code point is that integer.

print(chr(65))
# A

print(type(chr(65)))
# <class 'str'>
source: chr_ord.py

In Python, an integer can be written in hexadecimal with the 0x prefix, so you can pass such a literal directly to chr(). Leading zeros do not change the value.

print(65 == 0x41)
# True

print(chr(0x41))
# A

print(chr(0x000041))
# A
source: chr_ord.py
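
chr() accepts only integers within the Unicode range, from 0 to 0x10FFFF, and raises ValueError for values outside it. The sketch below wraps that check in a small helper; the name is_valid_code_point is hypothetical and is not part of the standard library.

```python
def is_valid_code_point(i):
    """Return True if chr(i) succeeds (hypothetical helper for illustration)."""
    try:
        chr(i)
    except ValueError:
        # chr() raises ValueError for integers outside 0..0x10FFFF.
        return False
    return True

print(is_valid_code_point(0x10FFFF))
# True

print(is_valid_code_point(0x110000))
# False

print(is_valid_code_point(-1))
# False
```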

If you want to convert a hexadecimal string representing a Unicode code point to a character, convert the string to an integer and then pass it to chr().

Use int() to convert a hexadecimal string into an integer. Specify the radix 16 as the second argument.

s = '0x0041'

print(int(s, 16))
# 65

print(chr(int(s, 16)))
# A
source: chr_ord.py

The second argument can be 0 if the string is prefixed with 0x. In that case, int() determines the radix from the prefix.
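
As a short illustration of the radix 0 behavior, the sketch below compares it with an explicit radix of 16. The variable names are only for illustration.

```python
prefixed = '0x0041'

# With radix 0, int() reads the 0x prefix and parses the rest as hexadecimal.
from_prefix = int(prefixed, 0)
print(from_prefix)
# 65

print(chr(from_prefix))
# A

# Without a prefix, radix 0 parses the string as decimal,
# so the radix 16 must be given explicitly for hexadecimal digits.
print(int('41', 0))
# 41

print(int('41', 16))
# 65
```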

Unicode code points are often written in the form U+XXXX. To convert such a string to the character at that code point, extract the hexadecimal part of the string with a slice.

s = 'U+0041'

print(s[2:])
# 0041

print(chr(int(s[2:], 16)))
# A
source: chr_ord.py
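
The conversion can be done in both directions. The following is a minimal sketch of a pair of helpers; the names to_u_plus and from_u_plus are hypothetical, and the input check in from_u_plus is an assumption about how strictly you want to validate the notation.

```python
def to_u_plus(char):
    """Format a single character in U+XXXX notation (hypothetical helper)."""
    # '04X' zero-fills to at least four uppercase hexadecimal digits.
    return f'U+{ord(char):04X}'

def from_u_plus(s):
    """Convert a U+XXXX string to its character (hypothetical helper)."""
    if not s.upper().startswith('U+'):
        raise ValueError(f'expected a string like U+0041, got {s!r}')
    # Drop the 'U+' prefix and parse the rest as hexadecimal.
    return chr(int(s[2:], 16))

print(to_u_plus('A'))
# U+0041

print(to_u_plus('\U0001f4af'))
# U+1F4AF

print(from_u_plus('U+0041'))
# A
```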

Use Unicode code points in strings: \x, \u, \U

If you write \x, \u, or \U and a hexadecimal Unicode code point in a string literal, it is treated as that character.

The number of hexadecimal digits must be exactly 2, 4, or 8, as in \xXX, \uXXXX, and \UXXXXXXXX, respectively, so \x can only express code points up to U+00FF. An error is raised if the number of digits is not correct.

print('\x41')
# A

print('\u0041')
# A

print('\U00000041')
# A

print('\U0001f4af')
# 💯

# print('\u041')
# SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-4: truncated \uXXXX escape

# print('\U0000041')
# SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-8: truncated \UXXXXXXXX escape
source: chr_ord.py

Each escape sequence is treated as one character. You can check this with the built-in function len(), which returns the number of characters.

print('\u0041\u0042\u0043')
# ABC

print(len('\u0041\u0042\u0043'))
# 3
source: chr_ord.py
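
The three escape forms and chr() all produce the same str object contents for a given code point. The short check below is a sketch of that equivalence; the variable names are only for illustration.

```python
# The same code point written with each escape form and with chr().
same = '\x41' == '\u0041' == '\U00000041' == chr(0x41)
print(same)
# True

# ord() on an escaped character returns the code point used in the escape.
round_trip = ord('\U0001f4af') == 0x1f4af
print(round_trip)
# True
```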

Note that in raw strings, where escape sequences are disabled, the backslash, the letter, and the digits are kept as literal characters.

print(r'\u0041\u0042\u0043')
# \u0041\u0042\u0043

print(len(r'\u0041\u0042\u0043'))
# 18
source: chr_ord.py
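
If you receive such unescaped text at runtime, for example from a file, one possible way to interpret the escapes is the standard unicode_escape codec. This is a sketch under the assumption that the text contains only ASCII characters; it is not part of the original sample code.

```python
raw = r'\u0041\u0042\u0043'

# Encode to bytes first, then let the unicode_escape codec
# interpret the \uXXXX sequences.
decoded = raw.encode('ascii').decode('unicode_escape')

print(decoded)
# ABC

print(len(decoded))
# 3
```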