
Advanced Python crawling: breaking through anti-script (anti-crawling) mechanisms

Editor: Python

Preface

If you write crawlers or automation scripts, you will sooner or later run into anti-crawling (or anti-script) mechanisms. The most common one is a verification step at login. In my experience of helping readers write scripts, almost any website of reasonable quality has some anti-script mechanism, and large websites have stronger and more complex ones. Taobao and 12306 are examples: if their defenses were weak, ordinary users would stand no chance during flash sales or ticket rushes. This article explains how to get past the common anti-crawling mechanisms. By my estimate, after reading it you should be able to script against about 80% of the websites on the Internet.

Anti-script mechanisms

Visible anti-script mechanisms

1. Low difficulty

Image CAPTCHAs, image CAPTCHAs with interference lines, image CAPTCHAs that ask for the result of a calculation, simple encryption of request parameters (the ciphertext is the same every time), and so on.

2. Medium difficulty

AES encryption of request parameters (the ciphertext differs every time), the click verification used by BOSS Zhipin, the slider verification used by Lagou, and verification that asks you to click the characters in an image in a given order.

3. High difficulty

Baidu's verification that asks you to rotate an image to the correct angle, and Google's verification that asks you to click the pictures containing a given object.

Invisible anti-script mechanisms

If you think major websites rely only on the mechanisms above, you are being naive. Those exist mainly to discourage ordinary crawler engineers. The ones below are what really stop scripts.
1. For some flash-sale items, the product's parameters are not generated until the sale starts. If you cannot get those parameters in advance, you cannot prepare the order request ahead of time, and by the time your script has fetched them, ordinary users have already finished buying. That achieves the anti-script purpose.
2. When your requests come too fast, or your script's crawling otherwise does not look like human behavior, the site may pop up a verification challenge, return errors or useless data, or even serve circular links that send the script into an infinite loop.
These are only some common anti-script strategies. In practice, large websites go well beyond them.
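The second point suggests two defensive habits for a script: pace the requests, and never follow the same link twice. The sketch below is a minimal illustration under stated assumptions, not a complete crawler. `fetch_links` is a hypothetical callable that returns the links found on a page, and the delay bounds and page cap are arbitrary values chosen for the example.

```python
import random
import time
from collections import deque


def crawl(start_url, fetch_links, max_pages=50,
          min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Breadth-first crawl that paces itself and never revisits a URL.

    fetch_links(url) is a hypothetical callable returning the links on a page.
    """
    visited = set()
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue  # a circular link is dropped here instead of looping forever
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):
            if link not in visited:
                queue.append(link)
        # A random pause makes the request rhythm less machine-like.
        sleep(random.uniform(min_delay, max_delay))
    return order
```

The `visited` set handles links that point back to earlier pages. The `max_pages` cap is a second safety net for traps that generate an endless supply of new URLs.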

Breaking through anti-script mechanisms

1. Low difficulty

1. Simple image CAPTCHAs can be recognized with the third-party Python library pytesseract.
2. For CAPTCHAs with interference lines, there are algorithms for removing the lines, but they stop working well once the lines are made a little nastier.
3. For CAPTCHAs whose interference lines cannot be removed, or that require a calculation, you can save the image and type in the answer manually.
4. For simply encrypted request parameters, just capture the encrypted values and reuse them directly.
Most websites have no anti-script mechanism at all, and where one exists it is usually simple. Once you can get past this low-difficulty tier, you can script against about 60% of websites.
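Items 1 and 2 can be sketched as follows. The noise filter is one simple approach (clearing dark pixels that have too few dark neighbours), not necessarily the algorithm the author had in mind, and it only handles thin lines that do not overlap the characters. The threshold, the neighbour count and the `--psm 7` setting (treat the image as a single line of text) are assumptions for illustration. `ocr_captcha` needs pytesseract, Pillow and the Tesseract binary installed.

```python
def binarize(gray, threshold=128):
    """Turn a grayscale grid (0-255) into 1 for dark ink, 0 for background."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def remove_thin_noise(bitmap, min_neighbors=3):
    """Clear dark pixels that have fewer than min_neighbors dark neighbours.

    A pixel on a one-pixel-wide line has at most 2 dark neighbours,
    while a pixel inside a thicker character stroke has 3 or more.
    """
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    cleaned = [row[:] for row in bitmap]
    for y in range(height):
        for x in range(width):
            if not bitmap[y][x]:
                continue
            neighbors = sum(
                bitmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbors < min_neighbors:
                cleaned[y][x] = 0
    return cleaned


def ocr_captcha(path):
    """Recognize a simple CAPTCHA image file with Tesseract."""
    from PIL import Image      # third-party: Pillow
    import pytesseract         # third-party: pytesseract
    image = Image.open(path).convert("L")  # convert to grayscale
    return pytesseract.image_to_string(image, config="--psm 7").strip()
```

When the interference lines are as thick as the character strokes, this filter cannot tell them apart, which matches the remark in item 2.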

2. Medium difficulty

Today's focus: AES encryption of request parameters (the ciphertext differs every time)
This anti-script mechanism generally has the following characteristics:
1. You enter the same account and password in the input boxes, but the request parameters are different every time.
2. Besides the values you entered, the request also carries some constantly changing parameters.
Below are the login request parameters for one website. The account and password are both 20001111.
[Screenshot: first request]
[Screenshot: second request]
Of course, some simpler sites do not use AES. They may use some odd scheme instead, such as mixing in a randomly generated string. That does not matter, because the solution is the same.

Solution

Defeating this anti-script strategy requires some front-end knowledge. The process is also known as front-end reverse engineering.

1. Find the front-end encryption algorithm

1. Narrow the scope
Use the browser's debugger to inspect the front-end code, and narrow the scope by searching for keywords. The keywords are usually words for "encrypt" or "encode", in English or in pinyin. They are usually English, and I have not run into pinyin yet. Examples: encrypt, encode.
2. Find the algorithm by debugging or by understanding the code logic
A keyword search usually returns many hits. Some are obviously not the encryption routine. For the ones you are unsure about, set breakpoints in the browser and run the login once, which narrows things down further, because code that belongs to other features will not execute. At this point you can usually find the routine. If several uncertain places do execute, the only option is to read the source carefully and judge whether each one performs the encryption.
[Screenshot: part of the encryption algorithm]
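The keyword search from step 1 can also be run offline against a downloaded script file. The sketch below is a rough aid rather than a replacement for the debugger. The `encrypt` and `encode` keywords come from the text above, while `aes` and `salt` are extra guesses added for illustration.

```python
import re

# "encrypt" and "encode" come from the article; "aes" and "salt" are guesses.
KEYWORDS = ("encrypt", "encode", "aes", "salt")


def find_candidates(js_source, keywords=KEYWORDS):
    """Return (line_number, line) pairs whose text mentions any keyword."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    hits = []
    for lineno, line in enumerate(js_source.splitlines(), start=1):
        if pattern.search(line):
            hits.append((lineno, line.strip()))
    return hits


# Hypothetical front-end code, used only to exercise the function.
sample = """function showBanner() { return 1; }
function encryptAES(data, key) { return _gas(data, key); }
var form = document.getElementById("login");
form.password.value = encryptAES(pwd, salt);"""

for lineno, line in find_candidates(sample):
    print(lineno, line)
```

Each hit is only a candidate. Breakpoints are still needed to see which of them actually runs during login.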

2. Write the same algorithm

Once you understand the site's algorithm, reimplement the same algorithm in your own script.
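The original screenshot of the algorithm is not available, so the following is only a sketch of what such a reimplementation might look like. It assumes a scheme that would explain why the ciphertext differs on every request: AES in CBC mode with a fresh random IV and a random prefix prepended to the password. The 16-character salt used as the key, the prefix length of 64 and the Base64 output are all assumptions and must be checked against the site's actual JavaScript. The encryption itself needs the third-party pycryptodome package.

```python
import base64
import secrets
import string

ALPHABET = string.ascii_letters + string.digits


def random_string(length):
    """Random alphanumeric string, like the helper such front ends often use."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def encrypt_password(password, salt):
    """Encrypt a password with AES-CBC (assumed scheme, see the text above).

    salt is assumed to be a 16-character string that serves as the AES key.
    """
    from Crypto.Cipher import AES            # third-party: pycryptodome
    from Crypto.Util.Padding import pad
    key = salt.encode("utf-8")
    iv = random_string(16).encode("utf-8")   # new IV on every call
    plaintext = (random_string(64) + password).encode("utf-8")
    cipher = AES.new(key, AES.MODE_CBC, iv=iv)
    encrypted = cipher.encrypt(pad(plaintext, AES.block_size))
    return base64.b64encode(encrypted).decode("ascii")
```

Because both the IV and the prefix are random, two calls with the same password and salt return different strings, which is the behavior seen in the two login requests.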

3. Get the parameters required by the algorithm

Remember the constantly changing parameter mentioned earlier? On this website, that parameter is lt. Debugging shows, however, that the encryption does not use lt. It uses another parameter, pwdDefaultEncryptSalt.
Now that the parameters are known, just fetch the values of pwdDefaultEncryptSalt and lt. lt is sent with the request, and pwdDefaultEncryptSalt is used for the encryption.
With that, the script has broken through the anti-script strategy. Once you master the method above, about 80% of websites should pose no problem.
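Fetching the two values might look like the sketch below. The article does not show where they live in the page, so this assumes that lt is a hidden form input and that pwdDefaultEncryptSalt is either a hidden input or a JavaScript variable assignment. The sample HTML is made up for illustration.

```python
import re
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect hidden <input> values, keyed by their name or id."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        key = attr.get("name") or attr.get("id")
        if key and attr.get("type") == "hidden":
            self.fields[key] = attr.get("value", "")


def extract_login_params(html):
    """Return (lt, salt) from a login page, or None for a missing value."""
    parser = HiddenInputParser()
    parser.feed(html)
    lt = parser.fields.get("lt")
    salt = parser.fields.get("pwdDefaultEncryptSalt")
    if salt is None:
        # Fallback: the salt may be assigned in an inline script instead.
        match = re.search(r'pwdDefaultEncryptSalt\s*=\s*["\']([^"\']+)["\']', html)
        salt = match.group(1) if match else None
    return lt, salt


# Made-up login page fragment.
page = """<form id="loginForm">
  <input type="hidden" name="lt" value="LT-12345-abc"/>
  <input type="hidden" id="pwdDefaultEncryptSalt" value="0123456789abcdef"/>
</form>"""
print(extract_login_params(page))
```

Both values change between page loads, so the script has to fetch the login page, extract them, and submit the login within the same session.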

Other medium- and high-difficulty mechanisms

I have not tackled the others myself, but they are best handled with selenium. Their request parameters are often complex, so sticking with the previous method costs too much. With selenium you simply simulate mouse movement and clicks. For a slider, locate the gap and then slide the way a human would: fast at first, then slow. As for Baidu's and Google's verifications, they are too extreme, and I have no ideas for them.
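The "fast first, then slow" slide can be sketched as a list of shrinking mouse offsets. The quadratic ease-out curve, the step count and the pause length are assumptions picked for illustration. Locating the gap is not covered here: `distance` is assumed to be known already. `drag_slider` needs the third-party selenium package plus a driver and slider element supplied by the caller.

```python
def build_track(distance, steps=20):
    """Split distance pixels into integer moves that start large and shrink.

    Uses quadratic ease-out: position(t) = distance * (1 - (1 - t) ** 2).
    """
    track = []
    previous = 0
    for i in range(1, steps + 1):
        t = i / steps
        position = round(distance * (1 - (1 - t) ** 2))
        track.append(position - previous)
        previous = position
    return track


def drag_slider(driver, slider, distance):
    """Drag a slider element along the track with short pauses between moves."""
    from selenium.webdriver import ActionChains  # third-party: selenium
    actions = ActionChains(driver)
    actions.click_and_hold(slider)
    for step in build_track(distance):
        actions.move_by_offset(step, 0)
        actions.pause(0.02)
    actions.release().perform()
```

The moves always add up to `distance`, because the last position on the curve is the full distance. Sites that analyze mouse traces more closely may also expect small vertical jitter or a slight overshoot, which this sketch does not model.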

Conclusion

In fact, unless you are a professional crawler engineer, getting past the low and medium tiers is enough. I covered the low-difficulty cases only briefly, but the basic ideas should be clear after reading. The medium-difficulty case was not covered in great detail either. How to put it: this is personal experience. I look at a mechanism, guess how it might have been built, test the guess, and then break through it. After all, once you know how it was done, you basically know how to get past it. Perhaps it helps that I am also a developer: I put myself in the other position and ask how I would build this anti-script feature. It is like network security. If you know how to attack, you know how to defend, and if you know how to defend, you know how to attack. The two are interlinked.