Garbled string encoding is a long-standing problem and a real headache. While writing a regular expression to match Chinese text recently, encoding once again became a stumbling block, so I am recording two points here: first, string encoding; second, matching Chinese with regular expressions.
Early text was encoded in ASCII, which uses a single byte per character. For example, uppercase A is encoded as 65. One byte is clearly not enough for Chinese, which needs at least two bytes and must not conflict with ASCII, so China defined the GB2312 encoding to cover Chinese characters. South Korea and Japan likewise developed their own standards, and as a result text that mixes several languages shows up garbled. Hence Unicode came into being. Unicode unifies all languages into one encoding, so the garbling problem goes away. But a new problem appears: if your text is almost entirely English, storing it as Unicode with two bytes per character takes twice as much space as ASCII, which is wasteful for both storage and transmission.
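The storage cost described above can be checked with a short sketch. It assumes that "Unicode encoding" here means a fixed two-byte form, for which UTF-16 is used as a stand-in; the sample text `Hello` is only an illustration.

```python
# -*- coding: utf-8 -*-
# Compare how many bytes the same English text needs under two encodings.
text = u"Hello"

ascii_bytes = text.encode("ascii")      # 1 byte per character
utf16_bytes = text.encode("utf-16-le")  # 2 bytes per character, no byte-order mark

print(ord(u"A"))         # 65, the ASCII code of uppercase A
print(len(ascii_bytes))  # 5
print(len(utf16_bytes))  # 10
```

For pure English text the two-byte form is exactly twice as large, which is the waste that motivates a variable-length encoding.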
Therefore, to save space, UTF-8 appeared: a variable-length encoding of Unicode. UTF-8 encodes each Unicode character into 1 to 4 bytes depending on its code point (the original design allowed up to 6 bytes). Common English letters take 1 byte, Chinese characters usually take 3 bytes, and only very rare characters take 4 bytes. If the text you want to transmit contains a lot of English, encoding it in UTF-8 saves space.

Python 2 added support for Unicode, and a Unicode string is written as u'ABC'. Although the string 'xxx' is ASCII, it can also be regarded as UTF-8, whereas u'xxx' can only be Unicode. Use the encode('utf-8') method to convert u'xxx' into the UTF-8 encoded 'xxx':
>>> u'ABC'.encode('utf-8')
'ABC'
>>> u'中文'.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
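Using the same encode method, the byte counts mentioned earlier can be checked directly. This is a minimal sketch; the three sample characters are my own choice, with U+20000 standing in for a "very rare" character.

```python
# -*- coding: utf-8 -*-
# Count the UTF-8 bytes used by characters from different ranges.
samples = [
    (u"A", "English letter"),
    (u"\u4e2d", "common Chinese character"),
    (u"\U00020000", "rare CJK Extension B character"),
]

utf8_lengths = []
for ch, label in samples:
    encoded = ch.encode("utf-8")
    utf8_lengths.append(len(encoded))
    print("%s: %d byte(s)" % (label, len(encoded)))

# utf8_lengths is now [1, 3, 4]
```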
Conversely, use the decode('utf-8') method to convert a UTF-8 encoded string 'xxx' into the Unicode string u'xxx':
>>> 'abc'.decode('utf-8')
u'abc'
>>> '\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
u'\u4e2d\u6587'
>>> print '\xe4\xb8\xad\xe6\x96\x87'.decode('utf-8')
中文
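The sketch below puts encode and decode together and shows how garbled text arises when the bytes are decoded with the wrong codec. Latin-1 is used as the "wrong" codec purely as an assumption for illustration, because it accepts any byte sequence; a mismatch with GB2312 or another encoding produces different garbage or a decoding error.

```python
# -*- coding: utf-8 -*-
original = u"\u4e2d\u6587"             # the two characters of "中文"
utf8_bytes = original.encode("utf-8")  # Unicode -> UTF-8 bytes

# Decoding with the same codec restores the text.
restored = utf8_bytes.decode("utf-8")

# Decoding with a different codec does not fail here, but yields garbage.
garbled = utf8_bytes.decode("latin-1")

print(restored == original)  # True
print(len(garbled))          # 6: every byte became its own character
```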
Since Python source code is also a text file, when your source code contains Chinese you need to save it with UTF-8 encoding. So that the Python interpreter reads the source as UTF-8, we usually write these two lines at the beginning of the file:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
The first comment line tells Linux/OS X that this is an executable Python program; Windows ignores it.
The second comment line tells the Python interpreter to read the source code as UTF-8; otherwise, Chinese text written in the source may come out garbled.
As for matching Chinese with Python regular expressions, all that really matters is that the encodings agree. I use Python 2.7 on my computer, so I put u in front of both the string and the regular expression.
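import re

# Renamed from str so the built-in type is not shadowed.
text = u"【Psychological proverb】Reality is a filthy river,To accept the filthy river without being polluted, we must become the sea. =-=4845/.?'"
# pattern = re.compile(u'[\u4e00-\u9fa5]')  # would match one character at a time
# Only runs of characters in U+4E00..U+9FA5 match; ASCII text, digits and
# punctuation such as 【】 are skipped.
pattern = re.compile(u"[\u4e00-\u9fa5]+")
result = pattern.findall(text)
# findall returns a list of strings, not a match object, so result.group() does not apply.
for w in result:
    print w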
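To see the pattern pick out Chinese from mixed text, here is a small sketch that runs on both Python 2.7 and Python 3. The sample string is hypothetical and chosen only to mix Chinese, English, digits and punctuation.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical sample text mixing Chinese, English, digits and punctuation.
text = u"Python正则 regex 123 匹配中文!"

# One or more consecutive characters in the range U+4E00..U+9FA5.
han_run = re.compile(u"[\u4e00-\u9fa5]+")
matches = han_run.findall(text)

for m in matches:
    print(m)
# prints two lines: 正则 and 匹配中文
```

Because both the text and the pattern are Unicode strings, the character class compares code points rather than raw bytes, which is why the u prefix matters on Python 2.7.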
For more on regular expression matching, you can see this blog post
Reference: blog posts by Liao Xuefeng
Additional:
I stumbled across a blog post today whose analysis of Python 2.7 encoding errors and the principles behind them is spot on: PYTHON-Advanced-Encoding Processing Summary
Publisher: Full-stack programmer. When reposting, please credit the source: https://javaforall.cn/128943.html
Original link: https://javaforall.cn