Converting Chinese characters to Pinyin is useful in common scenarios such as batch phonetic annotation of Chinese text, sorting words, and retrieving words by their pronunciation.
There are many Pinyin conversion tools on the Internet, and many open source modules for Python. Today I will introduce one of the most feature-rich modules: pypinyin. The features it supports are walked through in the sections that follow.
Before you start, make sure that Python and pip are installed on your computer. If they are not, install them first.
(Option 1) If you use Python for data analysis, you can install Anaconda directly: it ships with Python and pip built in.
(Option 2) The VSCode editor is also recommended; it has many advantages.
Choose one of the following ways to open a command prompt, then enter the command below to install the dependency:
1. Windows: open Cmd (Start > Run > CMD).
2. macOS: open Terminal (Command+Space, then type Terminal).
3. If you use the VSCode editor or PyCharm, you can use the built-in Terminal directly.
pip install pypinyin
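To confirm that the installation worked, a small check like the following can help. It is only a sketch that uses the standard library's importlib.metadata (Python 3.8 or later); the helper name installed_version is hypothetical and not part of pypinyin.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Hypothetical helper: report whether pypinyin is available in this environment.
found = installed_version("pypinyin")
if found is None:
    print("pypinyin is not installed; run: pip install pypinyin")
else:
    print("pypinyin", found, "is installed")
```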
The most common ways to convert text to Pinyin are as follows:

    # Python Practical Treasure
    from pypinyin import pinyin

    pinyin('中心')
    # [['zhōng'], ['xīn']]
Recognize polyphonic characters (characters with more than one reading):

    from pypinyin import pinyin

    pinyin('中心', heteronym=True)  # enable polyphonic mode
    # [['zhōng', 'zhòng'], ['xīn']]
Set the output style, for example to return only the first letter of each syllable:

    from pypinyin import pinyin, Style

    pinyin('中心', style=Style.FIRST_LETTER)  # set the Pinyin style
    # [['z'], ['x']]
Change where the tone is written: as a digit right after the letter that carries the tone, or as a digit at the end of each syllable:

    from pypinyin import pinyin, Style

    # TONE2: the tone digit follows the letter that carries the tone
    pinyin('中心', style=Style.TONE2, heteronym=True)
    # [['zho1ng', 'zho4ng'], ['xi1n']]

    # TONE3: the tone digit comes at the end of the syllable
    pinyin('中心', style=Style.TONE3, heteronym=True)
    # [['zhong1', 'zhong4'], ['xin1']]
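To make the difference between the two numeric styles concrete, here is a small pure-Python sketch that moves the tone digit from its TONE2 position to the end of the syllable. The function move_tone_to_end is a hypothetical helper written for this illustration; it is not how pypinyin implements its styles.

```python
def move_tone_to_end(syllable):
    """Move any tone digit inside a syllable to the end of that syllable."""
    letters = "".join(ch for ch in syllable if not ch.isdigit())
    digits = "".join(ch for ch in syllable if ch.isdigit())
    return letters + digits


for tone2 in ["zho1ng", "zho4ng", "xi1n"]:
    print(tone2, "->", move_tone_to_end(tone2))
```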
Ignore polyphony and return one toneless reading per character:

    from pypinyin import lazy_pinyin

    lazy_pinyin('中心')  # polyphonic readings are not considered
    # ['zhong', 'xin']
Output ü instead of v:

    from pypinyin import lazy_pinyin

    lazy_pinyin('战略', v_to_u=True)  # write ü rather than v
    # ['zhan', 'lüe']
Mark the neutral tone:

    from pypinyin import lazy_pinyin, Style

    # Use 5 to mark the neutral tone
    lazy_pinyin('衣裳', style=Style.TONE3, neutral_tone_with_five=True)
    # ['yi1', 'shang5']
Convert text to Pinyin with a single command on the command line:

    python -m pypinyin 音乐
    # yīn yuè
Customizing the Pinyin Style
We can use register() to implement a custom Pinyin style:

    from pypinyin import lazy_pinyin
    from pypinyin.style import register

    @register('kiss')
    def kiss(pinyin, **kwargs):
        return ' {0}'.format(pinyin)

    lazy_pinyin('么么', style='kiss')
    # [' me', ' me']
As you can see, by defining a kiss function and applying the register decorator, we have created a new style. This style can be passed directly as the style parameter of the conversion functions, which is very convenient.
In addition, the styles that ship with the module, and the effect of each, are as follows:

    from enum import IntEnum, unique

    @unique
    class Style(IntEnum):
        """Pinyin style"""

        #: Normal style, without tones. E.g. 中国 -> ``zhong guo``
        NORMAL = 0
        #: Standard tone style, with the tone mark on the first letter of the final (default style). E.g. 中国 -> ``zhōng guó``
        TONE = 1
        #: Tone style 2: the tone is a digit [1-4] placed after the letter that carries it. E.g. 中国 -> ``zho1ng guo2``
        TONE2 = 2
        #: Tone style 3: the tone is a digit [1-4] placed after each syllable. E.g. 中国 -> ``zhong1 guo2``
        TONE3 = 8
        #: Initials style: return only the initial consonant of each syllable (note: some syllables have no initial, see `#27`_). E.g. 中国 -> ``zh g``
        INITIALS = 3
        #: First-letter style: return only the first letter of each syllable. E.g. 中国 -> ``z g``
        FIRST_LETTER = 4
        #: Finals style: return only the final of each syllable, without tones. E.g. 中国 -> ``ong uo``
        FINALS = 5
        #: Standard finals style, with the tone mark on the first letter of the final. E.g. 中国 -> ``ōng uó``
        FINALS_TONE = 6
        #: Finals style 2: the tone is a digit [1-4] placed after the letter that carries it. E.g. 中国 -> ``o1ng uo2``
        FINALS_TONE2 = 7
        #: Finals style 3: the tone is a digit [1-4] placed after each final. E.g. 中国 -> ``ong1 uo2``
        FINALS_TONE3 = 9
        #: Zhuyin (Bopomofo) style, with tones; the first (high level) tone is not marked. E.g. 中国 -> ``ㄓㄨㄥ ㄍㄨㄛˊ``
        BOPOMOFO = 10
        #: Zhuyin (Bopomofo) style, first letter only. E.g. 中国 -> ``ㄓ ㄍ``
        BOPOMOFO_FIRST = 11
        #: Hanyu Pinyin transcribed into the Russian (Cyrillic) alphabet; the tone is a digit [1-4] placed after each syllable. E.g. 中国 -> ``чжун1 го2``
        CYRILLIC = 12
        #: Hanyu Pinyin transcribed into the Russian (Cyrillic) alphabet, first letter only. E.g. 中国 -> ``ч г``
        CYRILLIC_FIRST = 13
Handling Special Characters
By default, special characters in the text are not processed and are returned as is:

    pinyin('你好☆☆')
    # [['nǐ'], ['hǎo'], ['☆☆']]
However, you can also control how these characters are handled through the errors parameter, for example:
ignore: skip the character:

    pinyin('你好☆☆', errors='ignore')
    # [['nǐ'], ['hǎo']]
replace: replace the character with its Unicode code point, without the \u prefix:

    pinyin('你好☆☆', errors='replace')
    # [['nǐ'], ['hǎo'], ['26062606']]
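The value 26062606 is the hexadecimal code point of ☆ (U+2606) written twice. The sketch below reproduces that string in plain Python, assuming lowercase hexadecimal without padding; hex_code_points is a hypothetical helper for illustration and is not presented as pypinyin's own implementation.

```python
def hex_code_points(chars):
    """Concatenate the hexadecimal code point of every character."""
    return "".join(format(ord(ch), "x") for ch in chars)


print(hex_code_points("☆☆"))  # 26062606
```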
callable object: provide a callback function that receives the character or string that has no Pinyin as its argument. Supported return value types are unicode, list, or None:

    pinyin('你好☆☆', errors=lambda x: 'star')
    # [['nǐ'], ['hǎo'], ['star']]

    pinyin('你好☆☆', errors=lambda x: None)
    # [['nǐ'], ['hǎo']]
When the return value is a list, the list is expanded automatically:

    pinyin('你好☆☆', errors=lambda x: ['star' for _ in x])
    # [['nǐ'], ['hǎo'], ['star'], ['star']]

    # Specify multiple readings
    pinyin('你好☆☆', heteronym=True, errors=lambda x: [['star', '☆'] for _ in x])
    # [['nǐ'], ['hǎo'], ['star', '☆'], ['star', '☆']]
Custom Pinyin Library
If you are not satisfied with the module's output, or you need some special handling, you can use load_single_dict() or load_phrases_dict() to change the results with a custom Pinyin dictionary:

    from pypinyin import lazy_pinyin, load_phrases_dict, load_single_dict, Style

    hans = '桔子'
    lazy_pinyin(hans, style=Style.TONE2)
    # ['jie2', 'zi3']
    load_phrases_dict({'桔子': [['jú'], ['zǐ']]})  # add the phrase "桔子"
    lazy_pinyin(hans, style=Style.TONE2)
    # ['ju2', 'zi3']

    hans = '还没'
    lazy_pinyin(hans, style=Style.TONE2)
    # ['hua2n', 'me2i']
    load_single_dict({ord('还'): 'hái,huán'})  # change the order of readings for "还"
    lazy_pinyin('还没', style=Style.TONE2)
    # ['ha2i', 'me2i']