Regular expressions match Chinese characters, which are very common in practical applications.
For example: crawler web page text extraction, validation of user input standards, etc.
Take the following text string as an example, match all Chinese characters in the string asstr.
import reastr = '''aaaaaWhen when, see Nanxue snow, I am with plum blossomTwo white heads'''
Two methods are described below (the environment of this article is python3)
1. Use Unicode encoding to match Chinese
Common Chinese Unicode encoding range: \u4e00-\u9fa5
Implement matching code: re.findall('[\u4e00-\u9fa5]', astr)
import reastr = '''aaaaaWhen when, see Nanxue snow, I am with plum blossomTwo white heads'''res = re.findall('[\u4e00-\u9fa5]', astr)print(res)
Matches:
Second,Directly use Chinese characters to achieve Chinese matching
If you haven’t used it, you may not know it. Chinese matching can also be done in this way
The matching code is: re.findall('[一-羥]', asstr)
import reastr = '''aaaaaWhen when, see Nanxue snow, I am with plum blossomTwo white heads'''res = re.findall('[一-羥]', astr)print(res)
Match result:
Note: Actually here"The Unicode code corresponding to "1" is "\u4e00", and the Unicode code corresponding to "徥" (yù) is "\u9fa5".
Common non-English characters Unicode encoding range:
u4e00-u9fa5 (Chinese)
u0800-u4e00 (Japanese)
uac00-ud7ff (Korean)