Do you encounter any of the following types while learning about reptiles . If you are interested in learning or understanding relevant knowledge , I don't think I'm too talented to learn , You can refer to it . Welcome your comments .
You may see the following garbled behavior in the request parameters :
Then you will find out content-type The data type is x-protobuf type , Then maybe you need to learn protobuf Agreement to continue your crawler .
I don't know if that's right , For reference only , Can provide an idea . Let's start with a normal data content-type The data type is
Under the circumstances . Web page according to utf-8 Encoding decodes data . But if content-type The data type is x-protobuf when , He can't rely on protobuf Protocol to parse , So there will be garbled code .
Let's move on to our main topic .
First of all, this article will introduce you protobuf The definition and resolution of the protocol , So that you can have a deeper understanding of protobuf agreement , In the next section, we will introduce how to encounter protobuf
The practical operation of how the protocol is resolved .
protobuf (protocol buffer) Is Google's internal mixed language data standard . By serializing structured data ( Serialization ), Used for communication protocol 、 Language independence in areas such as data storage 、 Platform independent 、 Extensible serialization structure data format .
As shown in the figure, the programmer has written proto Procedure of documents , Then it is programmed into a package suitable for the programming language . This process can be done by downloading the link below .( This process will be described later )
https://github.com/protocolbuffers/protobuf/releases/
take proto File into the package you need . What you need to do is write proto The content of the document .
Then through the compiled package, the data and binary can be converted, which is called serialization and deserialization .
Some people may be curious , We just want to convert a garbled code into data we can understand. Why should we learn to write this file . Then you can look at the picture above first , If the parameters carried in your request data are garbled , If you want to create such garbled data, you need to learn how to pass proto Compiled package ( Because the description of this article is python Language then this is py file ), To convert data into binary files . To request data normally .
First, write a simple proto file
syntax = "proto3";
message Panda {
int32 id = 1;
string name = 2;
}
Save the file name as xxx.proto file . So far, we have finished writing proto The process of , Next we need to finish the proto The file needs to be compiled into the package used by our program , download https://github.com/protocolbuffers/protobuf/releases/
Files in the web site , download windows edition
Set the environment variable
function cmd, find proto File office
Translate it into python file
So far we have finished proto The compilation of the file .
Next we start to serialize the data into binary
You need to install protobuf==3.20 google And the deserialization library blackboxprotobuf
So far you can complete proto Compilation of documents , Here's a slightly more complicated code , You can see the code according to the data type table above . The following code can be used to parse some proto The relationship between nested data in
syntax = "proto3";
message Message
{
int32 id_2 = 2;
int32 id_3 = 3;
Info id_5 = 5;
message Info
{
string id_1 = 1;
repeated int32 id_2 = 2 [packed=false];
int32 id_3 = 3;
Number id_5 = 5;
message Number
{
int32 id_2 = 2;
}
repeated int32 id_8 = 8 [packed=false];
int32 id_6 = 6;
int32 id_7 = 7;
int32 id_9 = 9;
int32 id_11 = 11;
}
string id_6 = 6;
}
Reference blog https://www.cxymm.net/article/mine_song/76691817
First we need to understand Varints code , And then through Varints Coding understanding protobuf The decoding process of
Varint It's a compact way to represent numbers . It uses one or more bytes to represent a number , Smaller numbers use fewer bytes . This reduces the number of bytes used to represent numbers .
Varint Every byte in ( Except for the last byte ) The most significant bit is set (msb), This bit indicates that more bytes will appear . Low per byte 7 Bits are used to 7 A binary complement representation of a number in the form of a bit set , First place in the least effective group .
If it doesn't work 1 Bytes , Then the most significant bit is set to 0 , As the following example ,1 One byte can represent , therefore msb by 0.
0000 0001
If you need more than one byte to represent ,msb It should be set to 1 . for example 300, If you use Varint What they say :
1010 1100 0000 0010
If it is calculated according to the normal binary system , This means 88068 (65536 + 16384 + 4096 + 2048 + 4).
But if you follow Varint The way of coding , First look at the first byte :10101100, The highest position is 1, The rest is 0101100,msb by 1, Indicates that there are still bytes left to be read , The second byte 00000010, The highest position is 0, The rest is 0000010,msb by 0, Indicates that there are no bytes left . Put two 7 Put binary numbers together , Is the target value
0000010 0101100 => 4 + 8 + 32 + 256 = 300
Here is the small end mode , Little Endian , Read it first , High position behind , Read it later . therefore 0000010 To be calculated later .
First, let's write a simple protobuf code , As shown in the figure
Then assign a value
So get the binary bit
08 01 12 04 74 65 73 74
Explain our decoding process , In the process of analyzing and decoding , We need to understand Wire Type, Each message item is preceded by a corresponding tag, To resolve the corresponding data type , Express tag The data type of is also Varint.
tag Calculation method of : (field_number << 3) | wire_type
Each data type has a corresponding wire_type:
therefore wire_type At most, it can only support 8 Kind of , There are 6 Kind of .
therefore 08 The corresponding binary is :
After the filling is
0 0001 000
Why do you write like this ?
First, the last three are wire_type The type of 0 , It means int32 Consistent with our definition above
id stay protobuf Is not shown in , Only the following identifiers are displayed 1
As shown in the following figure :
And then because it's int32 Type, so we directly take the value as 01, This is the value we assigned , namely
This starts parsing the string
Division 0 0010 010
That is, the identifier is 2 type Wire Type by Length-delimited string.
The following value is the length of the string 04 Then the next four data are string data
namely 74 65 73 74 ASCII Convert to test This completes the coding process .
You can try to learn the more difficult decoding process , Next section , We describe how to use... In reptiles .