Do you encounter any of the following types while learning about reptiles . If you are interested in learning or understanding relevant knowledge , I don't think I'm too talented to learn , You can refer to it . Welcome your comments .
You may see the following garbled behavior in the request parameters :
Then you will find out content-type The data type is x-protobuf type , Then maybe you need to learn protobuf Agreement to continue your crawler .
I don't know if that's right , For reference only , Can provide an idea . Let's start with a normal data content-type The data type is
Under the circumstances . Web page according to utf-8 Encoding decodes data . But if content-type The data type is x-protobuf when , He can't rely on protobuf Protocol to parse , So there will be garbled code .
Let's move on to our main topic .
First of all, this article will introduce you protobuf The definition and resolution of the protocol , So that you can have a deeper understanding of protobuf agreement , In the next section, we will introduce how to encounter protobuf
The practical operation of how the protocol is resolved .
protobuf (protocol buffer) Is Google's internal mixed language data standard . By serializing structured data ( Serialization ), Used for communication protocol 、 Language independence in areas such as data storage 、 Platform independent 、 Extensible serialization structure data format .
As shown in the figure, the programmer has written proto Procedure of documents , Then it is programmed into a package suitable for the programming language . This process can be done by downloading the link below .( This process will be described later )
https://github.com/protocolbuffers/protobuf/releases/
take proto File into the package you need . What you need to do is write proto The content of the document .
Then through the compiled package, the data and binary can be converted, which is called serialization and deserialization .
Some people may be curious , We just want to convert a garbled code into data we can understand. Why should we learn to write this file . Then you can look at the picture above first , If the parameters carried in your request data are garbled , If you want to create such garbled data, you need to learn how to pass proto Compiled package ( Because the description of this article is python Language then this is py file ), To convert data into binary files . To request data normally .
First, write a simple proto file
syntax = "proto3";
message Panda {
int32 id = 1;
string name = 2;
}
The first line determines proto Protocol used , Now most of them use proto3 instead of proto2
Then define a message body to store some data fields you need ,
Each of these data fields has a type , A name and a value make up . This value is not in your data , Instead, look for the ID of the type that defines the data field . I have not tried to use non numeric situations . The specific writing format can be according to the following figure proto Write in the format of . You can define the data type you need according to your own situation .
Save the file name as xxx.proto file . So far, we have finished writing proto The process of , Next we need to finish the proto The file needs to be compiled into the package used by our program , download https://github.com/protocolbuffers/protobuf/releases/
Files in the web site , download windows edition
Set the environment variable
function cmd, find proto File office
Translate it into python file
So far we have finished proto The compilation of the file .
Next we start to serialize the data into binary
You need to install protobuf==3.20 google And the deserialization library blackboxprotobuf
So far you can complete proto Compilation of documents , Here's a slightly more complicated code , You can see the code according to the data type table above . The following code can be used to parse some proto The relationship between nested data in
syntax = "proto3";
message Message
{
int32 id_2 = 2;
int32 id_3 = 3;
Info id_5 = 5;
message Info
{
string id_1 = 1;
repeated int32 id_2 = 2 [packed=false];
int32 id_3 = 3;
Number id_5 = 5;
message Number
{
int32 id_2 = 2;
}
repeated int32 id_8 = 8 [packed=false];
int32 id_6 = 6;
int32 id_7 = 7;
int32 id_9 = 9;
int32 id_11 = 11;
}
string id_6 = 6;
}
Call result display
![image](https://img2022.cnblogs.com/blog/2636039/202206/2636039-20220618184512495-166331078.png)
Okay , The above description explains to you how to write proto The process of , The following will show you how to parse binary proto The file of . Take a break first .
Reference blog https://www.cxymm.net/article/mine_song/76691817
First we need to understand Varints code , And then through Varints Coding understanding protobuf The decoding process of
Varint It's a compact way to represent numbers . It uses one or more bytes to represent a number , Smaller numbers use fewer bytes . This reduces the number of bytes used to represent numbers .
Varint Every byte in ( Except for the last byte ) The most significant bit is set (msb), This bit indicates that more bytes will appear . Low per byte 7 Bits are used to 7 A binary complement representation of a number in the form of a bit set , First place in the least effective group .
If it doesn't work 1 Bytes , Then the most significant bit is set to 0 , As the following example ,1 One byte can represent , therefore msb by 0.
0000 0001
If you need more than one byte to represent ,msb It should be set to 1 . for example 300, If you use Varint What they say :
1010 1100 0000 0010
If it is calculated according to the normal binary system , This means 88068 (65536 + 16384 + 4096 + 2048 + 4).
But if you follow Varint The way of coding , First look at the first byte :10101100, The highest position is 1, The rest is 0101100,msb by 1, Indicates that there are still bytes left to be read , The second byte 00000010, The highest position is 0, The rest is 0000010,msb by 0, Indicates that there are no bytes left . Put two 7 Put binary numbers together , Is the target value
0000010 0101100 => 4 + 8 + 32 + 256 = 300
Here is the small end mode , Little Endian , Read it first , High position behind , Read it later . therefore 0000010 To be calculated later .
First, let's write a simple protobuf code , As shown in the figure
Then assign a value
So get the binary bit
08 01 12 04 74 65 73 74
Explain our decoding process , In the process of analyzing and decoding , We need to understand Wire Type, Each message item is preceded by a corresponding tag, To resolve the corresponding data type , Express tag The data type of is also Varint.
tag Calculation method of : (field_number << 3) | wire_type
Each data type has a corresponding wire_type:
therefore wire_type At most, it can only support 8 Kind of , There are 6 Kind of .
therefore 08 The corresponding binary is :
After the filling is
0 0001 000
Why do you write like this ?
First, the last three are wire_type The type of 0 , It means int32 Consistent with our definition above
id stay protobuf Is not shown in , Only the following identifiers are displayed 1
As shown in the following figure :
And then because it's int32 Type, so we directly take the value as 01, This is the value we assigned , namely
This starts parsing the string
Division 0 0010 010
That is, the identifier is 2 type Wire Type by Length-delimited string.
The following value is the length of the string 04 Then the next four data are string data
namely 74 65 73 74 ASCII Convert to test This completes the coding process .
You can try to learn the more difficult decoding process , Next section , We describe how to use... In reptiles .
Spider-01- Introduction to reptiles Python Reptiles don't have a lot of knowledge , But you need to deal with the web all the time , Every page is different , So there are some requirements for strain capacity Reptile preparation Reference material Master Python The crawler frame Scrap ...
A practical example of a reptile : Search engine ( Baidu . Google .360 Search, etc ). Bole Online . Huihui shopping assistant . Data analysis and research ( Data iceberg column ). Ticket grabbing software, etc . What is a web crawler : Easy to understand : A crawler is a process that simulates human behavior in requesting websites ...
One .requests be based on cookie operation introduction : Sometimes , We are using crawler program to crawl some data about users ( Take Zhang San “ Renren.com ” Personal home page data ) when , If before use requests During the normal operation of the module , Often can not reach us ...
One .urllib library urllib yes Python It comes with a library for crawlers , Its main function is to send requests through the code simulation browser . Its commonly used sub modules are in Python3 For in the urllib.request and urllib. ...
HTTP agreement : HTTP(Hypertext Transfer Protocol): Hypertext transfer protocol .URL It's through HTTP Protocol access to resources Internet route , One URL Corresponding to a data resource . HTTP agreement ...
One . What is? Scrapy? Scrapy It's a way of crawling website data , Application framework for extracting structural data , Very famous , Very strong . A framework is one that has been integrated with various functions ( High performance asynchronous download , queue , Distributed , analysis , Persistence, etc ) ...
Understand in a simple way HTTP agreement HTTP(HyperText Transfer Protocol, Hypertext transfer protocol ) It is the most widely used network protocol on the Internet . The most commonly used version is HTTP 1.1. HTTP ...
HTTP agreement : HTTP(Hypertext Transfer Protocol): Hypertext transfer protocol .URL It's through HTTP Protocol access to resources Internet route , One URL Corresponding to a data resource . HTTP agreement ...
One :Requests: Give Way HTTP Serving humanity although Python In the standard library of urllib2 The module already contains most of the functions we normally use , But it's API It doesn't feel good to use it , and Requests Claim to be ...
Introduce # Introduce : Use requests It can simulate the request of browser , Compared to what was used before urllib,requests Modular api More convenient ( The essence is encapsulation urllib3) # Be careful :requests The library sends a request to drop the content of the web page ...
http://blog.csdn.net/akof1314/article/details/6968587 Delphi complete Bug Resolution tools EurekaLog Use label : delphi Tools ftp ...
MySQL There are no more than the following reasons for garbled Chinese : server Set the problem by itself , For example, still staying in latin1 table The problem of language family setting in English ( contain character And collation) Client program ( for example php) Our company ...
React-Native Getting Started tutorial series catalog One . preparation ( Completed ) Two . Project introduction and debugging 3、 ... and .CSS Style vs Flex Layout Four . Commonly used UI Use of control 5、 ... and .JSX stay React-Native Application in 6、 ... and . things ...
CREATE OR REPLACE PROCEDURE SCOTT.HTML_EMAIL( P_TO IN VARCHAR2, -- Addressee address P_SUBJECT IN VARCHAR2, -- Mail master ...
subject Solution code and comments #include <stdio.h> #include <stdlib.h> #include <math.h> #i ...
It's no use using a good formula ... It's easy to think about it , First of all, we should analyze , Because one at a time LCM Is greater than or equal to any of these numbers , Then I LCM Add to which number , That number is going to be big , Think so , We knew , Every (x,y) Corresponding to one kind of situation . ...
First of all, we noticed ai<=50w because 2*3*5*7*11*13*17=510510 So it contains at most 6 A quality factor We separate the contribution of each number , To add is to add the coprime pairs related to this number Deleting is equal to subtracting this ...
One . essential information title :UML Analysis and application of system modeling Time :2016 Publishing sources : Wireless Internet technology Domain classification : Unified modeling language Two . Research background Problem definition : Analysis and application of unified modeling language difficulty : Master and understand the business environment of relevant systems , Good mastery ...
sublime text3 Address :https://packagecontrol.io/installation#st3 1. open Preferences——Browse Packages, Open a folder C ...
java.util.concurrent.atomic The package provides 13 Classes . Belong to 4 There are kinds of ways to update atoms , These are the basic types of atomic renewal , Atomic update array , Atom update references and atom update properties .Atomic The classes in the package basically make ...