This article introduces the difference between bytes and str in Python.
Updated: 2022 / 6 / 16
1. Python has two types that can represent character sequences
bytes: instances contain raw, unsigned 8-bit values (usually displayed according to the ASCII standard).
str: instances contain Unicode code points, which correspond to text characters in human languages.
a = b'h\x6511o'
print(list(a))
# [104, 101, 49, 49, 111]
print(a)
# b'he11o'
a = 'a\\u300 propos'
print(list(a))
# ['a', '\\', 'u', '3', '0', '0', ' ', 'p', 'r', 'o', 'p', 'o', 's']
print(a)
# a\u300 propos
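As a rough illustration of the distinction, the sketch below looks at one piece of text both ways. The sample word and the variable names `text` and `data` are chosen here for illustration and do not come from the original article.

```python
# The same text viewed as str (code points) and as bytes (raw 8-bit values).
text = "caf\u00e9"              # 'café': four Unicode code points
data = text.encode("utf-8")     # UTF-8 stores the last character as two bytes

print(len(text))    # 4
print(len(data))    # 5
print(list(data))   # [99, 97, 102, 195, 169]
print(data[3])      # 195 (indexing bytes yields an int, not a character)
```

Indexing a str yields a one-character str, while indexing bytes yields an integer between 0 and 255, which is why the two lengths differ for non-ASCII text.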
2. Converting between Unicode data and binary data
To convert Unicode data into binary data, you must call the encode method of str (encoding).
FileContent = 'This is file content.'
print(FileContent)
# This is file content.
print(type(FileContent))
# <class 'str'>
FileContent = FileContent.encode(encoding='utf-8')
print(FileContent)
# b'This is file content.'
print(type(FileContent))
# <class 'bytes'>
To convert binary data into Unicode data, you must call the decode method of bytes (decoding).
FileContent = b'This is file content.'
print(FileContent)
# b'This is file content.'
print(type(FileContent))
# <class 'bytes'>
FileContent = FileContent.decode(encoding='utf-8')
print(FileContent)
# This is file content.
print(type(FileContent))
# <class 'str'>
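One common way to keep these conversions in one place is a pair of small helpers at the edges of a program. The sketch below is only an assumption about how that might look; `to_str` and `to_bytes` are hypothetical names, not functions from the standard library.

```python
def to_str(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding it first if it is str."""
    if isinstance(value, str):
        return value.encode(encoding)
    return value


print(repr(to_str(b"This is file content.")))    # 'This is file content.'
print(repr(to_bytes("This is file content.")))   # b'This is file content.'
```

With helpers like these, the core of a program can work with str only and convert to bytes at the boundary where data is read or written.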
When calling these methods, you can specify the character encoding explicitly or accept the default, which is usually UTF-8.
To check the default encoding of the current operating system, run this one line of Python from the command line:
python3 -c 'import locale; print(locale.getpreferredencoding())'
# UTF-8
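The same check can be done from inside a script. The sketch below prints two related values: the codec that `str.encode` and `bytes.decode` use when no argument is given, and the locale's preferred encoding, which `open` typically relies on in text mode when no `encoding` argument is passed. The variable names are illustrative.

```python
import locale
import sys

# Codec used by str.encode() / bytes.decode() when no encoding is passed.
default_codec = sys.getdefaultencoding()

# Encoding preferred by the current locale; this value is platform dependent.
preferred = locale.getpreferredencoding(False)

print(default_codec)   # utf-8 on Python 3
print(preferred)

# Calling encode() with no argument gives the same bytes as explicit UTF-8.
same = "\u00e9".encode() == "\u00e9".encode("utf-8")
print(same)            # True
```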
3. Using raw 8-bit values and Unicode strings
There are two issues to watch for when working with raw 8-bit values and Unicode strings (that is, when using bytes and str).
First, bytes and str are not compatible with each other.
Using the + operator:
# bytes + bytes
print(b'a' + b'1')
# b'a1'
# str + str
print('b' + '2')
# b2
# str + bytes
print('c' + b'2')
# TypeError: can only concatenate str (not "bytes") to str
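To combine the two types, one side has to be converted first. A minimal sketch, assuming the data is ASCII; the variable names are illustrative:

```python
prefix = "c"     # str
payload = b"2"   # bytes

# Decode the bytes to get a str result, or encode the str to get a bytes result.
as_text = prefix + payload.decode("ascii")
as_bytes = prefix.encode("ascii") + payload

print(as_text)    # c2
print(as_bytes)   # b'c2'
```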
Binary operators can also compare instances, as long as both are of the same type:
# bytes and bytes
assert b'c' > b'a'
assert b'c' < b'a'
# AssertionError
print(b'a' == b'a')
# True
# str and str
assert 'c' > 'a'
assert 'c' < 'a'
# AssertionError
print('a' == 'a')
# True
# bytes and str
assert b'c' > 'a'
# TypeError: '>' not supported between instances of 'bytes' and 'str'
print('a' == b'a')
# False
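Because `==` between bytes and str returns False instead of raising an error, a mismatch can go unnoticed. One possible safeguard is to normalize both sides before comparing. The sketch below is only an illustration; `same_text` is a hypothetical helper name.

```python
def same_text(left, right, encoding="utf-8"):
    """Compare two values as text, decoding any bytes argument first."""
    if isinstance(left, bytes):
        left = left.decode(encoding)
    if isinstance(right, bytes):
        right = right.decode(encoding)
    return left == right


print("a" == b"a")            # False
print(same_text("a", b"a"))   # True
```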
Using %s in a format string:
Instances of both types can appear on the right side of the % operator, where they replace the %s in the format string on the left. But if the format string is of type bytes, a str instance cannot replace %s, because Python does not know which character encoding to use for that str.
# bytes % str
print(b'red %s' % 'blue')
# TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'
# str % bytes
print('red %s' % b'blue')
# red b'blue'
# Here Python calls the __repr__ method of the bytes instance and substitutes the result for %s, so the program prints b'blue' rather than blue.
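Both formatting cases work once the argument matches the type of the format string. A minimal sketch, assuming UTF-8 is the intended encoding; the variable names are illustrative:

```python
color = "blue"   # str
raw = b"blue"    # bytes

# bytes format string: encode the str argument first.
bytes_result = b"red %s" % color.encode("utf-8")

# str format string: decode the bytes argument first.
str_result = "red %s" % raw.decode("utf-8")

print(bytes_result)   # b'red blue'
print(str_result)     # red blue
```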
Second, file operations default to Unicode strings, and raw bytes cannot be used.
w mode writes in text mode, so writing binary data to the file raises an error:
# Write binary data
with open('test.txt', "w+") as f:
    f.write(b"\xf1\xf2")
# TypeError: write() argument must be str, not bytes
wb mode writes binary data normally:
# Write binary data
with open('test.txt', "wb") as f:
    f.write(b"\xf1\xf2")
r mode reads in text mode, so reading binary data from the file raises an error:
# Read binary data
with open('test.txt', "r+") as f:
    f.read()
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 0: invalid continuation byte
# When a file handle is in text mode, the system uses the default text encoding to interpret binary data: it decodes with `bytes.decode` when reading and encodes with `str.encode` when writing. On most systems the default text encoding is `UTF-8`, so the system tries to decode `b'\xf1\xf2'` as a `UTF-8` string, which causes the error above.
rb mode reads binary data normally:
# Read binary data
with open('test.txt', "rb") as f:
    print(b"\xf1\xf2" == f.read())
# True
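The write and read cases can be checked together in one self-contained run. The sketch below uses a temporary directory instead of `test.txt` so it leaves no file behind; the file name `test.bin` and the variable names are assumptions made for illustration.

```python
import os
import tempfile

payload = b"\xf1\xf2"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.bin")

    # Binary mode: bytes go in and come back out unchanged.
    with open(path, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        restored = f.read()

    # Text mode with UTF-8: these bytes are not valid UTF-8, so decoding fails.
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

print(restored == payload)   # True
print(text_mode_failed)      # True
```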
Another fix is to set the encoding parameter to specify the string encoding:
with open('test.txt', "r", encoding="cp1252") as f:
    print(f.read())
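To see what the `encoding` parameter changes, the sketch below writes the same two bytes and reads them back as cp1252 text. It uses a temporary directory rather than `test.txt`; the file and variable names are illustrative assumptions.

```python
import os
import tempfile

payload = b"\xf1\xf2"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.txt")
    with open(path, "wb") as f:
        f.write(payload)

    # In cp1252 each of these bytes maps to exactly one character.
    with open(path, "r", encoding="cp1252") as f:
        text = f.read()

print(len(text))                          # 2
print(text == "\u00f1\u00f2")             # True
print(text == payload.decode("cp1252"))   # True
```

Passing `encoding` explicitly also avoids surprises when the platform default differs from the encoding the file was written in.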