This is a detailed introductory tutorial on Python crawlers. It starts from hands-on practice and is suitable for beginners. Readers only need to follow the ideas in the article while reading and work out the corresponding code; in 30 minutes you can learn to write a simple Python crawler.
This Python crawler tutorial mainly covers the following 5 parts:
Understanding web pages;
Using the requests library to fetch website data;
Parsing web pages with Beautiful Soup;
Cleaning and organizing data;
Crawler attack and defense.
Understanding web pages
Take the home page of the China Tourism Network (http://www.cntour.cn/) as an example: grab the first piece of news on the home page (its title and link), data that appears in plaintext in the page source code. On the home page, press the shortcut key 【Ctrl+U】 to open the source page, as shown in Figure 1.
Figure 1: Source code of the China Tourism Network home page
Understand the structure of web pages
A web page generally consists of three parts: HTML (Hypertext Markup Language), CSS (Cascading Style Sheets), and JavaScript (the active scripting language).
HTML
HTML is the structure of the whole page, equivalent to the skeleton of the whole site. Everything wrapped in "<" and ">" is an HTML tag, and tags appear in pairs.
Common tags are as follows:
<html>..</html> marks the element between the tags as a web page
<body>..</body> wraps the content visible to the user
<div>..</div> represents a block (division)
<p>..</p> represents a paragraph
<li>..</li> represents a list item
<img> displays an image
<h1>..</h1> represents a heading
<a href="">..</a> represents a hyperlink
CSS
CSS describes the presentation style. Line 13 in Figure 1, <style type="text/css">, indicates that a CSS stylesheet is referenced below; the page's appearance is defined in the CSS.
JavaScript
JavaScript provides the behavior. Interactive content and all kinds of special effects live in JavaScript, which describes the various functions of the site.
If we compare a web page to a human body, HTML is the skeleton and defines where the mouth, eyes, and ears go. CSS is the person's outward appearance: what the mouth looks like, whether the eyes are double- or single-lidded, big or small, whether the skin is dark or fair. JavaScript represents the person's skills, such as dancing, singing, or playing an instrument.
Writing a simple HTML page
By writing and modifying HTML yourself, you can understand it better. First open Notepad, then type the following:
<html>
<head>
    <title>Python 3 crawler and data cleaning: introduction and practice</title>
</head>
<body>
    <div>
        <p>Python 3 crawler and data cleaning: introduction and practice</p>
    </div>
    <div>
        <ul>
            <li><a href="http://c.biancheng.net">Crawlers</a></li>
            <li>Data cleaning</li>
        </ul>
    </div>
</body>
</html>
After entering the code, save the Notepad file, then change the file name and extension to "HTML.html".
The effect of opening the file in a browser is shown in Figure 2.
Figure 2. This code uses only HTML; readers can modify the text in the code themselves and then observe the change.
About the legality of crawlers
Almost every website has a file named robots.txt, although of course some websites do not set one. For websites without robots.txt, any data that is not password-protected can be obtained by a web crawler; in other words, all page data of the site can be crawled. If the website has a robots.txt file, you need to determine whether it declares data that visitors are not allowed to obtain.
Take Taobao as an example. Visit https://www.taobao.com/robots.txt in a browser, as shown in Figure 3.
Figure 3: Contents of Taobao's robots.txt file. Taobao allows some crawlers to access some of its paths; for user agents that are not allowed, crawling is forbidden entirely. The rule reads as follows:
User-Agent: *
Disallow: /
This rule means that apart from the crawlers specified above, no other crawler is allowed to crawl any data from the site.
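As a side note (this sketch is not part of the original tutorial), Python's standard-library module urllib.robotparser can check whether a given user agent is allowed to fetch a URL; the crawler name below is just a made-up placeholder:
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.taobao.com/robots.txt')
rp.read()  # download and parse the robots.txt file
# check whether a hypothetical crawler called "MyCrawler" may fetch the home page
print(rp.can_fetch('MyCrawler', 'https://www.taobao.com/'))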
Using the requests library to request a website
Installing the requests library
First install the requests library in PyCharm. To do this, open PyCharm, click the "File" menu and choose the "Settings for New Projects..." command, as shown in Figure 4.
Figure 4. Choose the "Project Interpreter" command, confirm the currently selected interpreter, then click the plus sign in the upper right corner, as shown in Figure 5.
Figure 5. Type requests in the search box (be careful to type it in full, otherwise it is easy to make a mistake), then click the "Install Package" button in the lower left corner, as shown in Figure 6.
Figure 6. After installation, "Package 'requests' installed successfully" is displayed in the Install Package window, as shown in Figure 7; if the installation fails, an error message is displayed instead.
Figure 7: Installation successful
The basic principle of crawlers
The process of requesting a web page has two steps:
Request: every web page displayed to a user must go through this step, that is, sending an access request to the server.
Response: after receiving the request, the server verifies its validity and then sends the response content to the user (the client); the client receives the response content and displays it. This is the familiar web request, as shown in Figure 8.
Figure 8: Request and Response. There are also two ways of requesting a web page:
GET: the most common way, generally used to obtain or query resource information; it is also the method most websites use, and the response is fast.
POST: compared with GET, it adds the ability to upload parameters in a form, so besides querying information it can also modify information.
Therefore, before writing a crawler, first determine to whom the request should be sent and how it should be sent.
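As a quick illustration of the difference (a minimal sketch using the httpbin.org test service rather than the tutorial's target site):
import requests

# GET: parameters travel in the URL as a query string
r_get = requests.get('http://httpbin.org/get', params={'q': 'python'})
# POST: parameters travel in the request body as form data
r_post = requests.post('http://httpbin.org/post', data={'q': 'python'})
print(r_get.status_code, r_post.status_code)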
Using GET to grab data
Copy the title of the first piece of news on the home page, press the 【Ctrl+F】 key combination on the source page to bring up the search box, paste the title into the search box, and then press the 【Enter】 key.
As shown in Figure 8, the title can be found in the source code. The object of the request is www.cntour.cn and the request method is GET (all data requests in the source code are GET), as shown in Figure 9.
Figure 9. After determining the request object and method, enter the following code in PyCharm:
import requests        # import the requests package
url = 'http://www.cntour.cn/'
strhtml = requests.get(url)        # fetch the web page with a GET request
print(strhtml.text)
The running result is shown in Figure 10 (Figure 10: running result). The statement used to load a library is import followed by the library name; here, the statement that loads the requests library is import requests.
To fetch data with GET, call the get method of the requests library by typing a dot after requests, as shown below:
requests.get
Save the acquired data in the strhtml variable; the code is as follows:
strhtml = requests.get(url)
At this point strhtml is a Response object that represents the whole web page, but what we need for now is only the page source code, which the following expression gives:
strhtml.text
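Besides .text, the Response object offers a few other attributes that are often useful when debugging a crawler (a small sketch, not part of the original article):
import requests

strhtml = requests.get('http://www.cntour.cn/')
print(strhtml.status_code)   # HTTP status code; 200 means the request succeeded
print(strhtml.encoding)      # the encoding requests guessed for the page
print(strhtml.text[:200])    # the first 200 characters of the page source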
Using POST to grab data
First open the Youdao Translate website http://fanyi.youdao.com/ to enter the Youdao Translate page.
Press the shortcut key F12 to enter developer mode and click Network; at this point the panel is empty, as shown in Figure 11.
Figure 11. Type "I love China" into Youdao Translate and click the "Translate" button, as shown in Figure 12.
Figure 12. In developer mode, click the "Network" button and then the "XHR" button to find the translation data, as shown in Figure 13.
Figure 13. Click Headers and you will find that the request method is POST, as shown in Figure 14.
Figure 14. After finding where the data is and determining the request method, we can start writing the crawler.
First, copy the URL from Headers and assign it to url (it appears in the full code below).
Getting data with POST differs from GET: a POST request must first build the request parameters (the form data) before it can be sent.
The request parameters are shown under Form Data, as in Figure 15.
Figure 15. Copy them and build a new dictionary:
From_data={'i':' I love China ','from':'zh-CHS','to':'en','smartresult':'dict','client':'fanyideskweb','salt':'15477056211258','sign':'b3589f32c38bc9e3876a570b8a992604','ts':'1547705621125','bv':'b33a2f3f9d09bde064c9275bcb33d94e','doctype':'json','version':'2.1','keyfrom':'fanyi.web','action':'FY_BY_REALTIME','typoResult':'false'}
Next, use the requests.post method to send the form data, convert the string returned by the server into JSON-format data, extract the translation result according to the data structure, and print it. The complete code is as follows:
import requests  # import the requests package
import json

def get_translate_date(word=None):
    url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
    From_data = {'i': word, 'from': 'zh-CHS', 'to': 'en', 'smartresult': 'dict', 'client': 'fanyideskweb', 'salt': '15477056211258', 'sign': 'b3589f32c38bc9e3876a570b8a992604', 'ts': '1547705621125', 'bv': 'b33a2f3f9d09bde064c9275bcb33d94e', 'doctype': 'json', 'version': '2.1', 'keyfrom': 'fanyi.web', 'action': 'FY_BY_REALTIME', 'typoResult': 'false'}
    # send the form data with a POST request
    response = requests.post(url, data=From_data)
    # convert the JSON-format string into a dictionary
    content = json.loads(response.text)
    print(content)
    # print only the translated text
    # print(content['translateResult'][0][0]['tgt'])

if __name__ == '__main__':
    get_translate_date('I love China')
Using Beautiful Soup to parse web pages
With the requests library we can already fetch the page source code; the next step is to find and extract the data from that source. Beautiful Soup is a Python library whose main job is to grab data from web pages. Beautiful Soup has been moved into the bs4 package, so importing Beautiful Soup requires installing the bs4 library.
The way to install the bs4 library is shown in Figure 16.
Figure 16. After installing bs4, you also need to install the lxml library. If lxml is not installed, the default Python parser is used instead. Although Beautiful Soup supports both the HTML parser in the Python standard library and several third-party parsers, lxml is more powerful and faster, so the author recommends installing it.
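If you prefer the command line to PyCharm's package manager, the same libraries can also be installed with pip (assuming pip is available on your system):
pip install requests beautifulsoup4 lxml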
After installing these third-party Python libraries, enter the following code to begin your Beautiful Soup journey:
import requests # Import requests package
from bs4 import BeautifulSoup
url='http://www.cntour.cn/'
strhtml=requests.get(url)
soup=BeautifulSoup(strhtml.text,'lxml')
data = soup.select('#main>div>div.mtop.firstMod.clearfix>div.centerBox>ul.newsList>li>a')
print(data)
The result of running the code is shown in Figure 17.
Figure 17. The Beautiful Soup library makes it easy to parse web page information. It is integrated into the bs4 package and can be called from bs4 when needed. The import statement is as follows:
from bs4 import BeautifulSoup
First, the HTML document is converted to Unicode, and then Beautiful Soup chooses the most suitable parser to parse it; here the lxml parser is specified. Parsing turns the complex HTML document into a tree structure in which every node is a Python object. Here the parsed document is stored in the new variable soup; the code is as follows:
soup=BeautifulSoup(strhtml.text,'lxml')
Next, use select (a CSS selector) to locate the data. To locate it, you need the browser's developer mode: hover the mouse over the piece of data on the page, right-click it, and choose the "Inspect" command from the shortcut menu, as shown in Figure 18.
Figure 18. The developer panel then pops up on the right side of the browser; the highlighted code on the right (see Figure 19(b)) corresponds to the highlighted data text on the left (see Figure 19(a)). Right-click the highlighted code on the right and choose "Copy" → "Copy selector" from the shortcut menu to copy the selector path automatically.
Figure 19: copying the path. Paste the path into your editor; it looks like this:
#main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li:nth-child(1) > a
Because this path is the path of only the first item selected, and we need all the headlines, delete the part of li:nth-child(1) from the colon onward (including the colon), leaving the following:
#main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li > a
Pass this path to soup.select; the code is as follows:
data = soup.select('#main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li > a')
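As a quick check (not part of the original code), soup.select returns a list of Tag objects, so you can see how many news items were matched and inspect the first one (assuming the selector matched something):
print(len(data))   # number of <a> tags matched by the selector
print(data[0])     # the first matched <a> tag, containing the title text and href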
Cleaning and organizing data
At this point we have obtained the target HTML code, but the data has not yet been extracted. Next, enter the following code in PyCharm:
for item in data:
    result = {
        'title': item.get_text(),
        'link': item.get('href')
    }
    print(result)
The result of running the code is shown in Figure 20 (Figure 20: running result). First, be clear that the data to extract is the title and the link. The title sits in the <a> tag, and a tag's text is extracted with the get_text() method. The link sits in the href attribute of the <a> tag, and an attribute is extracted from a tag with the get() method, passing the attribute name in parentheses, i.e. get('href').
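As a tiny standalone illustration of these two methods (the HTML string and URL below are made up for the example):
from bs4 import BeautifulSoup

tag = BeautifulSoup('<a href="http://www.cntour.cn/news/1.html">Example title</a>', 'lxml').a
print(tag.get_text())    # prints: Example title
print(tag.get('href'))   # prints: http://www.cntour.cn/news/1.html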
As Figure 20 shows, the link of each article contains a numeric ID. Below, this ID is extracted with a regular expression. The regular-expression symbols needed are as follows:
\d  matches a digit
+   matches the preceding character one or more times
Regular expressions in Python are used through the re library, which needs no installation and can be called directly. Enter the following code in PyCharm:
import re
for item in data:
    result = {
        "title": item.get_text(),
        "link": item.get('href'),
        'ID': re.findall('\d+', item.get('href'))
    }
    print(result)
The result is shown in Figure 21.
Figure 21. The findall method of the re library is used here: its first argument is the regular expression and its second argument is the text to search.
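For example (the URL below is made up just to show the behavior), re.findall returns a list of every substring that matches the pattern:
import re

print(re.findall(r'\d+', 'http://www.cntour.cn/news/detail/12345.html'))  # prints: ['12345']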
Crawler attack and defense
A crawler simulates human browsing behavior and fetches data in batches. As the volume of fetched data grows, it puts great pressure on the server being visited and may even bring it down. In other words, servers do not like having their data grabbed in bulk. Websites therefore often adopt anti-crawling measures against such crawlers.
The first way a server can identify a crawler is to check the User-Agent of the connection to tell browser access from code access. If it is code access and the number of visits keeps growing, the server will simply block the visitor's IP.
So how do we deal with this basic anti-crawling mechanism?
Take the crawler created earlier as an example. When making a request, we can not only find the URL and Form Data in developer mode, but also construct browser-like request headers from Request Headers to disguise ourselves. The server recognizes browser access by checking whether the keyword under Request Headers is the User-Agent, as shown in Figure 22.
Figure 22. So we only need to construct the request-header parameters. Create the request-header information; the code is as follows:
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
response = requests.get(url, headers=headers)
Writing this far, many readers will think that modifying the User-Agent is too simple. It really is simple, but a normal person looks at one picture per second, while a crawler can grab many pictures per second, say hundreds, so the pressure on the server is bound to increase. In other words, if pictures are downloaded in batches from a single IP, this behavior does not match normal human behavior, and that IP is bound to be blocked.
The principle is very simple: count the access frequency of each IP, and when the frequency exceeds a threshold, return a verification code. If it is really a user, the user fills it in and continues visiting; if it is code access, the IP gets blocked.
There are two solutions to this problem. The first is the commonly used added delay, fetching only once every 3 seconds; the code is as follows:
import time
time.sleep(3)
However, the whole point of writing a crawler is to fetch data in batches efficiently, and fetching once every 3 seconds is far too slow. In fact, there is a more important solution, one that tackles the problem at its root.
However we access the site, the server's goal is to figure out which visits are code access and then block the IP. Solution: to avoid having the IP blocked, proxies are often used in data collection, and requests has a corresponding proxies parameter.
First, build your own pool of proxy IPs, assign it to proxies in the form of a dictionary, and then pass it to requests; the code is as follows:
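The original article does not show this code, so the following is only a minimal sketch; the proxy addresses are placeholders that you would replace with working proxies from your own pool:
import requests

proxies = {
    'http': 'http://10.10.1.10:3128',    # placeholder HTTP proxy
    'https': 'http://10.10.1.10:1080',   # placeholder HTTPS proxy
}
response = requests.get('http://www.cntour.cn/', proxies=proxies)
print(response.status_code)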
This article only gives a brief introduction to Python crawlers and how they are implemented; it can give beginners a basic understanding of Python crawlers, but it will not make you fully master them.
If you want a deeper understanding of Python crawlers, the following reading is recommended: