
Using the Python unicodedata module

Editor: Python

Contents

Introduction to the UCD

unicodedata.lookup(name)

unicodedata.name(chr[,default])

unicodedata.decimal(chr[, default])

unicodedata.digit(chr[, default])

unicodedata.numeric(chr[, default])

unicodedata.category(chr)

unicodedata.bidirectional(chr)

unicodedata.combining(chr)

unicodedata.east_asian_width(chr)

unicodedata.mirrored(chr)

unicodedata.decomposition(chr)

unicodedata.normalize(form, unistr)

unicodedata.unidata_version

Introduction to the UCD

UCD is the abbreviation of Unicode Character Database.

The UCD consists of a set of plain-text and HTML files that describe the properties of Unicode characters and the relationships between them.

Most of the text files in the UCD contain Unicode data in a form suitable for parsing by programs. The HTML files explain how the database is organized and describe the format and meaning of the data.

By far the largest file in the UCD is Unihan.txt, which describes the properties of Chinese characters.

In UCD 5.0.0, Unihan.txt is 28,221 KB in size. It contains many valuable indexes, such as radicals, stroke counts, pinyin, usage frequency, and four-corner codes. These indexes are based on authoritative dictionaries, but most of them cover only a subset of Chinese characters.

unicodedata.lookup(name)

Looks up a character by name. If a character with that name exists, it is returned; otherwise a KeyError is raised.

>>> import unicodedata
>>> print(unicodedata.lookup('LEFT CURLY BRACKET'))
{
>>> print(unicodedata.lookup('LEFT'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: "undefined character name 'LEFT'"
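When the name comes from user input, it can be convenient to catch the KeyError and return a fallback instead. A minimal sketch follows; the helper name char_from_name is made up for this illustration and is not part of the module:

```python
import unicodedata

def char_from_name(name, fallback=None):
    """Return the character called `name`, or `fallback` if no such name exists.

    Hypothetical helper for illustration; only unicodedata.lookup() is real API.
    """
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return fallback

print(char_from_name('LEFT CURLY BRACKET'))   # {
print(char_from_name('LEFT'))                 # None
print(char_from_name('LEFT', '?'))            # ?
```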

unicodedata.name(chr[,default])

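Looks up the name of a character. On success the name is returned; if the character has no name, default is returned, or a ValueError is raised when default is not given. Passing a string longer than one character raises a TypeError, as the last call below shows.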

>>> import unicodedata
>>> print(unicodedata.name('{'))
LEFT CURLY BRACKET
>>> print(unicodedata.name('@'))
COMMERCIAL AT
>>> print(unicodedata.name('{{'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: name() argument 1 must be a unicode character, not str
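The optional default argument is worth a quick illustration, since control characters have no name in the database. The sketch below uses U+0000 as the unnamed character; the variable names are only for this example:

```python
import unicodedata

# U+0000 is a control character, which has no name in the database,
# so the default is returned instead of raising ValueError.
label = unicodedata.name('\x00', '<unnamed>')
print(label)   # <unnamed>

# Without a default, the same call raises ValueError.
try:
    unicodedata.name('\x00')
except ValueError:
    print('no name defined for U+0000')

# For a named character, lookup() reverses name().
round_trip = unicodedata.lookup(unicodedata.name('@'))
print(round_trip)   # @
```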

unicodedata.decimal(chr[, default])

Returns the decimal value assigned to the character, as an integer. If the character has no decimal value, default is returned, or a ValueError is raised when default is not given. The argument must be a single character.

>>> import unicodedata
>>> print(unicodedata.decimal('7'))
7
>>> print(unicodedata.decimal('7a'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: decimal() argument 1 must be a unicode character, not str

unicodedata.digit(chr[, default])

Returns the digit value assigned to the character, as an integer; for example, the characters 0 to 9 are converted to the corresponding values. If the character has no digit value, default is returned, or a ValueError is raised when default is not given.

>>> import unicodedata
>>> print(unicodedata.digit('9', None))
9
>>> print(unicodedata.digit('9a', None))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: digit() argument 1 must be a unicode character, not str

unicodedata.numeric(chr[, default])

Returns the numeric value assigned to the character, as a floating-point number. For example, it converts both '8' and '四' to numbers. Unlike digit(), it covers any character that has a numeric value and is not limited to the digits 0 to 9. If the character has no numeric value, default is returned, or a ValueError is raised when default is not given.

>>> import unicodedata
>>> print(unicodedata.numeric('四', None))
4.0
>>> print(unicodedata.numeric('8', None))
8.0
>>> print(unicodedata.numeric('8a', None))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: numeric() argument 1 must be a unicode character, not str
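The difference between decimal(), digit() and numeric() is easiest to see side by side. The following sketch calls all three on a few characters, using None as the default; numeric_views is a name invented for this example:

```python
import unicodedata

def numeric_views(ch):
    """Return (decimal, digit, numeric) for `ch`, with None where undefined."""
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

for ch in ['7', '²', '½', '四', 'a']:
    print(repr(ch), numeric_views(ch))
# '7' (7, 7, 7.0)
# '²' (None, 2, 2.0)
# '½' (None, None, 0.5)
# '四' (None, None, 4.0)
# 'a' (None, None, None)
```

Only '7' has all three values; the superscript two has a digit value but no decimal value, while the fraction and the Chinese numeral have a numeric value only.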

unicodedata.category(chr)

Returns the Unicode general category of the character, as a string. The categories are as follows:

Code Description
[Cc] Other, Control
[Cf] Other, Format
[Cn] Other, Not Assigned (no characters in the file have this property)
[Co] Other, Private Use
[Cs] Other, Surrogate
[LC] Letter, Cased
[Ll] Letter, Lowercase
[Lm] Letter, Modifier
[Lo] Letter, Other
[Lt] Letter, Titlecase
[Lu] Letter, Uppercase
[Mc] Mark, Spacing Combining
[Me] Mark, Enclosing
[Mn] Mark, Nonspacing
[Nd] Number, Decimal Digit
[Nl] Number, Letter
[No] Number, Other
[Pc] Punctuation, Connector
[Pd] Punctuation, Dash
[Pe] Punctuation, Close
[Pf] Punctuation, Final quote (may behave like Ps or Pe depending on usage)
[Pi] Punctuation, Initial quote (may behave like Ps or Pe depending on usage)
[Po] Punctuation, Other
[Ps] Punctuation, Open
[Sc] Symbol, Currency
[Sk] Symbol, Modifier
[Sm] Symbol, Math
[So] Symbol, Other
[Zl] Separator, Line
[Zp] Separator, Paragraph
[Zs] Separator, Space

>>> import unicodedata
>>> print(unicodedata.category('四'))
Lo
>>> print(unicodedata.category('8'))
Nd
>>> print(unicodedata.category('a'))
Ll
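Because the first letter of the code identifies the major class (L for letters, P for punctuation, and so on), category() is handy for filtering text. A small sketch, with helper names chosen only for this illustration:

```python
import unicodedata
from collections import Counter

def category_counts(text):
    """Count how many characters of `text` fall into each general category."""
    return Counter(unicodedata.category(ch) for ch in text)

def strip_punctuation(text):
    """Drop every character whose category starts with 'P' (punctuation)."""
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith('P'))

sample = 'Hello, World 42!'
print(dict(category_counts(sample)))
print(strip_punctuation(sample))   # Hello World 42
```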

unicodedata.bidirectional(chr)

Returns the bidirectional class assigned to the character, as a string; it indicates, for example, whether the character is laid out left to right or right to left. If no such value is defined, an empty string is returned.

>>> import unicodedata
>>> print(unicodedata.bidirectional('9'))
EN
>>> print(unicodedata.bidirectional(u'\u0660'))
AN
>>> print(unicodedata.bidirectional('中'))
L
>>> print(unicodedata.bidirectional('a'))
L
>>> print(unicodedata.category(u'\u0660'))
Nd

Here EN means European Number, AN means Arabic Number, L means Left-to-Right, and Nd is the general category Number, Decimal Digit.
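One practical use of the bidirectional class is checking whether a string contains right-to-left text. The sketch below treats the classes R (right-to-left) and AL (Arabic letter) as right-to-left; contains_rtl is a hypothetical helper, and a full implementation of the Unicode bidirectional algorithm involves considerably more than this:

```python
import unicodedata

# Bidirectional classes used for strongly right-to-left characters.
RTL_CLASSES = {'R', 'AL'}

def contains_rtl(text):
    """Return True if any character in `text` has a right-to-left class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)

print(contains_rtl('hello'))                       # False
print(contains_rtl('\u05e9\u05dc\u05d5\u05dd'))    # True  (Hebrew letters, class R)
print(contains_rtl('abc \u0645'))                  # True  (Arabic letter, class AL)
```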

unicodedata.combining(chr)

Returns the canonical combining class assigned to the character, as an integer. If no combining class is defined, 0 is returned. During normalization, combining marks are sorted by this value, with larger values placed after smaller ones.
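The sorting by combining class can be observed directly. In this sketch, a base letter is followed by two combining marks in "wrong" order, and NFD normalization reorders them; the variable names are only for illustration:

```python
import unicodedata

acute = '\u0301'       # COMBINING ACUTE ACCENT, combining class 230
dot_below = '\u0323'   # COMBINING DOT BELOW, combining class 220

text = 'a' + acute + dot_below
reordered = unicodedata.normalize('NFD', text)

print([unicodedata.combining(ch) for ch in text])        # [0, 230, 220]
print([unicodedata.combining(ch) for ch in reordered])   # [0, 220, 230]
```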

>>> import unicodedata
>>> print(unicodedata.combining('9'))
0
>>> print(unicodedata.combining('A'))
0

unicodedata.east_asian_width(chr)

Returns the East Asian width assigned to the character, as a string. The possible values are:

'F' (Fullwidth), 'H' (Halfwidth), 'W' (Wide), 'Na' (Narrow), 'A' (Ambiguous), or 'N' (Neutral).

>>> import unicodedata
>>> print(unicodedata.east_asian_width('9'))
Na
>>> print(unicodedata.east_asian_width('A'))
Na
>>> print(unicodedata.east_asian_width('蔡'))
W
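A common use of this property is estimating how many columns a string occupies in a fixed-width terminal. The following is a rough sketch that assumes Wide and Fullwidth characters take two columns and everything else takes one; display_width is a made-up helper, and it deliberately ignores zero-width combining marks and the Ambiguous class:

```python
import unicodedata

def display_width(text):
    """Estimate terminal columns: 2 for 'W'/'F' characters, 1 for all others."""
    return sum(2 if unicodedata.east_asian_width(ch) in ('W', 'F') else 1
               for ch in text)

print(display_width('abc'))    # 3
print(display_width('中文'))   # 4
print(display_width('a中'))    # 3
```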

unicodedata.mirrored(chr)

Returns the mirrored property assigned to the character, as an integer: 1 if the character is mirrored in bidirectional text, 0 otherwise.

>>> import unicodedata
>>> print(unicodedata.mirrored('9'))
0
>>> print(unicodedata.mirrored('A'))
0
>>> print(unicodedata.mirrored('蔡'))
0
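For contrast, here is a short sketch of characters for which mirrored() returns 1. Brackets and comparison signs are typical cases; the sample string is arbitrary:

```python
import unicodedata

sample = '()[]<>Aa9'
mirrored_chars = [ch for ch in sample if unicodedata.mirrored(ch)]
print(mirrored_chars)   # ['(', ')', '[', ']', '<', '>']
```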

unicodedata.decomposition(chr)

Returns the decomposition mapping assigned to the character, as a string of hexadecimal code points separated by spaces. If the character cannot be decomposed, an empty string is returned.

>>> import unicodedata
>>> print(unicodedata.decomposition('9'))

>>> print(unicodedata.decomposition('-'))

>>> print(unicodedata.decomposition('蔡'))

>>> print(unicodedata.decomposition('ガ'))
30AB 3099
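The returned string can be turned back into characters by parsing the hexadecimal values. Compatibility decompositions start with a tag in angle brackets, such as <fraction>, so the sketch below separates that tag first; parse_decomposition is a hypothetical helper written for this example:

```python
import unicodedata

def parse_decomposition(ch):
    """Split a decomposition into (tag, characters).

    `tag` is None for canonical decompositions and a string such as
    '<fraction>' for compatibility decompositions.
    """
    fields = unicodedata.decomposition(ch).split()
    tag = None
    if fields and fields[0].startswith('<'):
        tag = fields.pop(0)
    return tag, ''.join(chr(int(field, 16)) for field in fields)

print(parse_decomposition('ガ'))   # (None, 'カ' followed by U+3099)
print(parse_decomposition('½'))    # ('<fraction>', '1', U+2044, '2')
print(parse_decomposition('9'))    # (None, '')
```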

unicodedata.normalize(form, unistr)

Converts the Unicode string unistr to the normal form given by form, which can be 'NFC', 'NFKC', 'NFD', or 'NFKD'. Some text elements can be represented either as a single precomposed character or as a dynamically composed sequence of characters. Unicode treats these different representations of the same character as equivalent. When two or more sequences are equivalent, the Unicode Standard does not designate any one of them as the correct one; each is simply regarded as equivalent to the others.

If you need a single, unique representation, you can use a normalized form of the text to eliminate unwanted differences. The Unicode Standard defines four normalization forms: Normalization Form D (NFD), Normalization Form KD (NFKD), Normalization Form C (NFC), and Normalization Form KC (NFKC). Roughly speaking, NFD and NFKD decompose characters wherever possible, while NFC and NFKC compose characters wherever possible.

>>> import unicodedata
>>> print(unicodedata.normalize('NFKD', 'aあä').encode('ascii', 'ignore'))
b'aa'
>>> title = "Klüft skräms inför på fédéral électoral große"
>>> print(title.encode('ascii', 'ignore'))
b'Klft skrms infr p fdral lectoral groe'
>>> # Encoding directly drops every accented character.
>>> unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
b'Kluft skrams infor pa federal electoral groe'
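The equivalence described above matters most when comparing strings. The sketch below shows two representations of the same letter that compare unequal until both are normalized; same_text is a helper invented for this example:

```python
import unicodedata

def same_text(a, b, form='NFC'):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = '\u00e9'      # e with acute accent as one precomposed code point
decomposed = 'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT

print(composed == decomposed)            # False
print(same_text(composed, decomposed))   # True

# The K forms additionally fold compatibility characters such as ligatures.
ligature = '\ufb01'      # LATIN SMALL LIGATURE FI
print(unicodedata.normalize('NFC', ligature) == ligature)   # True
print(unicodedata.normalize('NFKC', ligature))              # fi
```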

unicodedata.unidata_version

The version of the Unicode database used by this module.

unicodedata.ucd_3_2_0

An object that offers the same methods as the module itself but uses version 3.2 of the Unicode database, for compatibility with applications that require that version, such as IDNA.

>>> import unicodedata
>>> print(unicodedata.unidata_version)
9.0.0
>>> print(unicodedata.ucd_3_2_0)
<unicodedata.UCD object at 0x00000215E3EA3B70>
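To see what the older database means in practice, the sketch below queries a character that was added to Unicode after version 3.2 (U+1F600, chosen here as an example) through both the module and the ucd_3_2_0 object. The first printed version depends on the Python build in use:

```python
import unicodedata

old = unicodedata.ucd_3_2_0
face = '\U0001F600'   # GRINNING FACE, not yet assigned in Unicode 3.2

print(unicodedata.unidata_version)    # depends on the Python version
print(old.unidata_version)            # 3.2.0

print(unicodedata.name(face, None))   # GRINNING FACE
print(old.name(face, None))           # None
print(old.category(face))             # Cn (not assigned)
```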

Let's take a closer look at the Unicode data for a single character:

U+0062 is the Unicode hex value of the character Latin Small Letter B, which is categorized as “lowercase letter” in the Unicode 6.0 character table.

Unicode Character Information
Unicode Hex U+0062
Character Name LATIN SMALL LETTER B
General Category Lowercase Letter [Code: Ll]
Canonical Combining Class 0
Bidirectional Category L
Mirrored N
Uppercase Version U+0042
Titlecase Version U+0042
Unicode Character Encodings
Latin Small Letter B HTML Entity &#98; (decimal entity), &#x62; (hex entity)
Windows Key Code Alt 0098 or Alt +0062
Programming Source Code Encodings Python hex: u"\u0062", Hex for C++ and Java: "\u0062"
UTF-8 Hexadecimal Encoding 0x62

Most of the functions above query this kind of data and return the corresponding value.
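As an illustration, several rows of such a character record can be rebuilt with the functions covered in this article. This is only a sketch: describe is a made-up helper, and the uppercase and UTF-8 fields come from the built-in str methods rather than from unicodedata:

```python
import unicodedata

def describe(ch):
    """Collect a few Unicode properties of a single character in a dict."""
    return {
        'code_point': 'U+%04X' % ord(ch),
        'name': unicodedata.name(ch, None),
        'category': unicodedata.category(ch),
        'combining': unicodedata.combining(ch),
        'bidirectional': unicodedata.bidirectional(ch),
        'mirrored': bool(unicodedata.mirrored(ch)),
        'uppercase': ch.upper(),               # str method, not unicodedata
        'utf8_hex': ch.encode('utf-8').hex(),  # str/bytes methods
    }

info = describe('b')
for key, value in info.items():
    print(key, value)
```

For 'b' this prints U+0062, LATIN SMALL LETTER B, Ll, 0, L, False, B and 62, matching the record shown above.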
