Convert between Unicode code points and characters in Python (chr, ord, \x, \u, \U)

Modified: 2023-09-19 | Tags: Python, Strings, Unicode

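In Python, the built-in functions chr() and ord() convert between Unicode code points and characters. Use chr() to get the character for a given code point, and ord() to get the code point of a given character.

Inside a string literal, you can also write a character as \x, \u, or \U followed by its code point in hexadecimal.

Convert a character to a Unicode code point: ord()
Convert a Unicode code point to a character: chr()
Write strings using Unicode code points: \x, \u, \U

You can look up which character corresponds to which code point on the following Unicode Consortium page. Enter a hexadecimal code point or a character in the form and click show to see the details of that character.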

Unicode Utilities: Character Properties

Unicode Utilities can also list the characters that match a given Unicode property, such as a block or a script.

Related article: Checking lists and details of Unicode code points and properties

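Convert a character to a Unicode code point: ord()

Pass a character (a string of length 1) to ord() and it returns that character's Unicode code point as an integer (int).

Built-in Functions - ord() (Python 3.11.5 documentation)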

i = ord('A')
print(i)
# 65

print(type(i))
# <class 'int'>

source: chr_ord.py

Passing a string of two or more characters raises a TypeError.

# ord('abc')
# TypeError: ord() expected a character, but string of length 3 found

source: chr_ord.py

Unicode code points are usually written in hexadecimal. To convert an integer to a hexadecimal string, use the built-in function hex().

s = hex(i)
print(s)
# 0x41

print(type(s))
# <class 'str'>

source: chr_ord.py

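The built-in function format() gives finer control over the output, such as zero-padding and whether to include the 0x prefix.

Related article: Format conversion with format in Python (zero-padding, exponential notation, hexadecimal, etc.)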

print(format(i, '04x'))
# 0041

print(format(i, '#06x'))
# 0x0041

source: chr_ord.py
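As a small supplementary sketch, the same format specifications also work inside f-strings, and an uppercase X in the specification produces uppercase hexadecimal digits. The variable names below are only for illustration.

```python
i = ord('A')  # 65

# f-strings accept the same format specs as format()
padded = f'{i:04x}'
print(padded)
# 0041

prefixed = f'{i:#06x}'
print(prefixed)
# 0x0041

# An uppercase 'X' yields uppercase hex digits
upper = f'{0x1f4af:04X}'
print(upper)
# 1F4AF
```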

Putting these together, you can get the hexadecimal code point of a character in a single expression.

print(format(ord('X'), '#08x'))
# 0x000058

print(format(ord('💯'), '#08x'))
# 0x01f4af

source: chr_ord.py
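If you prefer the U+XXXX notation commonly used in Unicode charts, a helper along the following lines may be handy. This is a minimal sketch; the function name to_u_notation is hypothetical and not part of the standard library.

```python
def to_u_notation(char):
    """Return the U+XXXX label of a single character (hypothetical helper)."""
    # At least four uppercase hex digits, as in Unicode charts
    return f'U+{ord(char):04X}'


print(to_u_notation('X'))
# U+0058

print(to_u_notation('💯'))
# U+1F4AF
```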

Some emoji, known as emoji sequences, are represented by multiple Unicode code points. Examples include flag emoji and profession emoji.

Unicode Utilities: Character Properties - 🇯🇵

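ord() accepts exactly one code point, so passing such an emoji raises a TypeError (as of Python 3.11.5). Note also that the built-in function len() returns the number of code points in these emoji, not the number of characters as displayed.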

# ord('🇯🇵')
# TypeError: ord() expected a character, but string of length 2 found

print(len('🇯🇵'))
# 2

source: chr_ord.py
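To inspect the individual code points of such an emoji, one option is to iterate over the string and call ord() on each element. The following is a minimal sketch using the flag above, which consists of two regional indicator symbols.

```python
flag = '🇯🇵'

# Iterating over a str yields one code point at a time
code_points = [f'{ord(c):#x}' for c in flag]
print(code_points)
# ['0x1f1ef', '0x1f1f5']
```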

Convert a Unicode code point to a character: chr()

Pass an integer to chr() and it returns the character with that Unicode code point as a string (str).

Built-in Functions - chr() (Python 3.11.5 documentation)

print(chr(65))
# A

print(type(chr(65)))
# <class 'str'>

source: chr_ord.py
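As a supplementary sketch: chr() and ord() are inverses of each other, and chr() accepts only integers from 0 to 0x10FFFF (the upper limit of Unicode). A value outside that range raises a ValueError. The variable names are only for illustration.

```python
# chr() and ord() round-trip
round_trip = chr(ord('A')) == 'A' and ord(chr(0x1f4af)) == 0x1f4af
print(round_trip)
# True

# 0x10FFFF is the largest valid code point
largest = chr(0x10FFFF)
print(len(largest))
# 1

# Anything above it is rejected
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
print(out_of_range_rejected)
# True
```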

In Python, an integer literal prefixed with 0x is written in hexadecimal, so if you know a code point in hexadecimal you can pass it to chr() directly. Leading zeros are not a problem.

print(65 == 0x41)
# True

print(chr(0x41))
# A

print(chr(0x000041))
# A

source: chr_ord.py

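To get the character from a string that holds a hexadecimal code point, first convert the string to an integer (int) and then pass it to chr().

Use int() to convert a hexadecimal string to an integer. Pass the string as the first argument and the base, 16, as the second.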

s = '0x0041'

print(int(s, 16))
# 65

print(chr(int(s, 16)))
# A

source: chr_ord.py

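When converting a hexadecimal string with int(), the second argument may also be 0 if the string has the 0x prefix. For more on handling hexadecimal numbers and strings, see the following article.

Related article: Convert binary, octal, and hexadecimal numbers and strings to each other in Python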
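The difference between a base of 16 and a base of 0 can be sketched as follows. With base 0 the string is interpreted like an integer literal in source code, so the 0x prefix is required and bare digits with leading zeros are rejected. The variable names are only for illustration.

```python
s = '0x0041'

# Base 0 infers the base from the prefix
with_prefix = int(s, 0)
print(with_prefix)
# 65

# Base 16 works with or without the prefix
without_prefix = int('0041', 16)
print(without_prefix)
# 65

# Base 0 rejects bare digits with leading zeros
try:
    int('0041', 0)
    base0_accepts_bare_digits = True
except ValueError:
    base0_accepts_bare_digits = False
print(base0_accepts_bare_digits)
# False
```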

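Unicode code points are also often written in the form U+XXXX. To convert such a string to its character, use a slice to select only the numeric part.

Related article: Selecting and assigning parts of lists and strings with slices in Python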

s = 'U+0041'

print(s[2:])
# 0041

print(chr(int(s[2:], 16)))
# A

source: chr_ord.py
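The slicing approach can be wrapped in a small function that also handles several whitespace-separated U+XXXX labels, which is convenient for emoji sequences. This is a sketch under that assumed input format; the name from_u_notation is hypothetical.

```python
def from_u_notation(text):
    """Convert whitespace-separated 'U+XXXX' labels to a string (hypothetical helper)."""
    chars = []
    for label in text.split():
        if not label.upper().startswith('U+'):
            raise ValueError(f'expected U+XXXX, got {label!r}')
        # Drop the 'U+' prefix and parse the rest as hexadecimal
        chars.append(chr(int(label[2:], 16)))
    return ''.join(chars)


print(from_u_notation('U+0041'))
# A

print(from_u_notation('U+1F1EF U+1F1F5') == '🇯🇵')
# True
```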

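Write strings using Unicode code points: \x, \u, \U

Inside a string literal, \x, \u, or \U followed by a code point in hexadecimal is treated as the corresponding character. Because \x takes only two digits, it can express code points up to U+00FF only.

Unicode HOWTO - Python's Unicode Support (Python 3.11.5 documentation)

The forms are \xXX, \uXXXX, and \UXXXXXXXX, which take exactly 2, 4, and 8 hexadecimal digits respectively. Too few digits raises a SyntaxError, while any extra digits are treated as ordinary characters following the escape.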

print('\x41')
# A

print('\u0041')
# A

print('\U00000041')
# A

print('\U0001f4af')
# 💯

# print('\u041')
# SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-4: truncated \uXXXX escape

# print('\U0000041')
# SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-8: truncated \UXXXXXXXX escape

source: chr_ord.py
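As a quick check, the three escape forms and chr() all produce the same string, and a digit beyond the fixed width is simply a separate character. A minimal sketch:

```python
# All of these denote the same one-character string
same = '\x41' == '\u0041' == '\U00000041' == chr(0x41) == 'A'
print(same)
# True

# \u consumes exactly four digits; the trailing '1' is a separate character
extra = '\u00411'
print(extra)
# A1

print(len(extra))
# 2
```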

Each escape is treated as a single character.

print('\u0041\u0042\u0043')
# ABC

print(len('\u0041\u0042\u0043'))
# 3

source: chr_ord.py

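Note that in raw strings, where escape sequences are disabled, the escape is kept as a literal run of characters.

Related article: Raw strings in Python to ignore (disable) escape sequences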

print(r'\u0041\u0042\u0043')
# \u0041\u0042\u0043

print(len(r'\u0041\u0042\u0043'))
# 18

source: chr_ord.py
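If you already have such a string at runtime, for example read from a file, one possible way to interpret the escapes is the unicode_escape codec. The following sketch assumes the input contains only ASCII characters; with non-ASCII input the encode('ascii') step raises a UnicodeEncodeError.

```python
raw = r'\u0041\u0042\u0043'

# Encode to bytes, then let the unicode_escape codec interpret the escapes
decoded = raw.encode('ascii').decode('unicode_escape')
print(decoded)
# ABC

print(len(decoded))
# 3
```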
