
A First Look at Unicode Encoding in Python


The previous article discussed the different character encoding schemes. This one focuses on how Python encodes and decodes text.

Python 2

Python 2 has two main string types: str and unicode. A plain string literal is a str; prefixing the literal with u makes it a unicode object.
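The difference between the two types can be sketched as below. This is a minimal illustration written to run on both Python 2 and Python 3; the b prefix and the variable names are assumptions added here (in Python 2 an unprefixed literal is already a byte string of type str, and in Python 3 the type names differ).

```python
# A u'' literal is text; a b'' literal is a byte string.
# On Python 2 these are unicode and str, on Python 3 they are str and bytes.
text = u'caf\u00e9'        # 4 characters
data = b'caf\xc3\xa9'      # the same word stored as 5 UTF-8 bytes

# The two literals never share a type, whichever Python version runs this.
is_distinct = type(text) is not type(data)

print('%s has length %d' % (type(text).__name__, len(text)))
print('%s has length %d' % (type(data).__name__, len(data)))
```

The lengths differ because the text object counts characters while the byte string counts encoded bytes.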

Each type has a corresponding factory function: str() and unicode().

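The unicode() function takes a str together with an encoding and uses that encoding to convert the str into a unicode object. When no encoding is given and the str contains non-ASCII bytes, the call raises an error, because Python 2's default encoding is ASCII: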

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
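The behaviour of unicode() with a suitable and an unsuitable encoding can be sketched as follows. This is only a sketch under stated assumptions: text_type is a hypothetical alias introduced so the code runs on both Python 2 and Python 3, and 'ascii' is passed explicitly to mimic the Python 2 default, since Python 3 behaves differently when the encoding is omitted.

```python
import sys

# Hypothetical alias (not from the article): unicode() on Python 2, str() on Python 3.
# Both accept a byte string plus an encoding name.
text_type = unicode if sys.version_info[0] == 2 else str  # noqa: F821

raw = b'Hi \xe2\x84\x99'    # UTF-8 bytes for u'Hi \u2119'

# With the encoding the bytes were written in, the conversion succeeds.
decoded = text_type(raw, 'utf-8')

# With ASCII (what Python 2 assumes when no encoding is given), byte 0xe2 is rejected.
try:
    text_type(raw, 'ascii')
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(repr(decoded))
print(ascii_error)
```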

The str and unicode types provide the str.decode() and unicode.encode() methods respectively. Let's try them:

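>>> a_unicode = u'Hi \u2119'                   # 4 characters
>>> to_string = a_unicode.encode('utf-8')
>>> to_string                                  # 6 bytes
'Hi \xe2\x84\x99'
>>> type(to_string)
<type 'str'>
>>> to_new_unicode = to_string.decode('utf-8')
>>> to_new_unicode == a_unicode                # the round trip restores the original text
True
>>> asc_string = a_unicode.encode()            # the default codec is ascii, which covers only 128 characters; \u2119 is out of range
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2119' in position 3: ordinal not in range(128)
>>> asc_unicode = to_string.decode()           # decoding UTF-8 bytes as ascii fails for the same reason
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)
>>> to_string.encode('utf-8')                  # an implicit ascii decode runs first and fails
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 3: ordinal not in range(128)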

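Pay particular attention to the last error. Calling encode('utf-8') on a str makes Python 2 perform two steps internally: it first converts the str to unicode, then encodes that unicode object as UTF-8, that is: to_string.decode(sys.getdefaultencoding()).encode('utf-8'). Because the default encoding is ASCII, which cannot decode the byte \xe2, the first step raises the error.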
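The two implicit steps can be spelled out explicitly, as in the sketch below. The function implicit_encode and its default parameter are hypothetical names used only for illustration; 'ascii' is hard-coded as the default codec to match Python 2, so the code behaves the same on Python 3.

```python
def implicit_encode(raw, target='utf-8', default='ascii'):
    """Spell out the two steps Python 2 takes for str.encode(target)."""
    as_text = raw.decode(default)     # step 1: implicit decode with the default codec
    return as_text.encode(target)     # step 2: the encode that was actually requested

utf8_bytes = u'Hi \u2119'.encode('utf-8')

try:
    implicit_encode(utf8_bytes)
    failure = None
except UnicodeDecodeError as exc:     # raised in step 1, before any encoding happens
    failure = exc

# Decoding with the codec the bytes were written in avoids the problem.
round_trip = utf8_bytes.decode('utf-8').encode('utf-8')

print(failure)
print(round_trip == utf8_bytes)
```

Pure ASCII input passes through both steps unharmed, which is why this kind of bug often stays hidden until the first non-ASCII byte arrives.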
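These errors point to two things to keep in mind:

When converting, be clear about the type of the string you have: confirm whether it is a str or a unicode object.
Understand the implicit conversions that Python performs internally.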
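One way to act on both points is to check the type before converting, so that no implicit conversion is ever triggered. The helpers below are a minimal sketch, not part of any library: ensure_text and ensure_bytes are hypothetical names, and UTF-8 is assumed as the encoding.

```python
def ensure_text(value, encoding='utf-8'):
    """Return text: decode byte strings, pass text through unchanged."""
    if isinstance(value, bytes):      # bytes is an alias of str on Python 2
        return value.decode(encoding)
    return value

def ensure_bytes(value, encoding='utf-8'):
    """Return a byte string: encode text, pass byte strings through unchanged."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)

text = ensure_text(b'Hi \xe2\x84\x99')     # bytes in, text out
same = ensure_text(u'Hi \u2119')           # text in, returned as-is
data = ensure_bytes(u'Hi \u2119')          # text in, UTF-8 bytes out

print(text == same)
print(repr(data))
```

Because each helper always names the encoding explicitly, the ASCII default never comes into play.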
