This is a detailed introductory tutorial on Python crawlers. It starts from hands-on practice and is suitable for beginners. Readers only need to follow the ideas in the article while reading and work out the corresponding code; in 30 minutes you can learn to write a simple Python crawler.
This Python crawler tutorial mainly covers the following 5 parts:
Understanding web pages;
Using the requests library to fetch website data;
Parsing web pages with Beautiful Soup;
Cleaning and organizing data;
Crawler attack and defense.
Understanding web pages
Take the home page of the China Tourism Network (http://www.cntour.cn/) as an example: grab the first piece of news on the home page (its title and link), data that appears in plaintext in the page source code. On the home page, press the shortcut key 【Ctrl+U】 to open the source page, as shown in Figure 1.
Figure 1: Source code of the China Tourism Network home page
Understand the structure of web pages
A web page generally consists of three parts: HTML (Hypertext Markup Language), CSS (Cascading Style Sheets), and JavaScript (the active scripting language).
HTML
HTML is the structure of the whole page, equivalent to the skeleton of the whole site. Everything wrapped in "<" and ">" is an HTML tag, and tags appear in pairs.
Common tags are as follows:
<html>..</html> marks the element between the tags as a web page
<body>..</body> wraps the content visible to the user
<div>..</div> represents a block (division)
<p>..</p> represents a paragraph
<li>..</li> represents a list item
<img> displays an image
<h1>..</h1> represents a heading
<a href="">..</a> represents a hyperlink
CSS
CSS describes the presentation style. Line 13 in Figure 1, <style type="text/css">, indicates that a CSS stylesheet is referenced below; the page's appearance is defined in the CSS.
JavaScript
JavaScript provides the behavior. Interactive content and all kinds of special effects live in JavaScript, which describes the various functions of the site.
If we compare a web page to a human body, HTML is the skeleton and defines where the mouth, eyes, and ears go. CSS is the person's outward appearance: what the mouth looks like, whether the eyes are double- or single-lidded, big or small, whether the skin is dark or fair. JavaScript represents the person's skills, such as dancing, singing, or playing an instrument.
Writing a simple HTML page
By writing and modifying HTML yourself, you can understand it better. First open Notepad, then type the following:
<html>
<head>
    <title>Python 3 crawler and data cleaning: introduction and practice</title>
</head>
<body>
    <div>
        <p>Python 3 crawler and data cleaning: introduction and practice</p>
    </div>
    <div>
        <ul>
            <li><a href="http://c.biancheng.net">Crawlers</a></li>
            <li>Data cleaning</li>
        </ul>
    </div>
</body>
</html>
After entering the code, save the Notepad file, then change the file name and extension to "HTML.html".
The effect of opening the file in a browser is shown in Figure 2.
Figure 2. This code uses only HTML; readers can modify the text in the code themselves and then observe the change.
About the legality of crawlers
Almost every website has a file named robots.txt, although of course some websites do not set one. For websites without robots.txt, any data that is not password-protected can be obtained by a web crawler; in other words, all page data of the site can be crawled. If the website has a robots.txt file, you need to determine whether it declares data that visitors are not allowed to obtain.
Take Taobao as an example. Visit https://www.taobao.com/robots.txt in a browser, as shown in Figure 3.
Figure 3: Contents of Taobao's robots.txt file. Taobao allows some crawlers to access some of its paths; for user agents that are not allowed, crawling is forbidden entirely. The rule reads as follows:
User-Agent: *
Disallow: /
This rule means that apart from the crawlers specified above, no other crawler is allowed to crawl any data from the site.
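As a side note (this sketch is not part of the original tutorial), Python's standard-library module urllib.robotparser can check whether a given user agent is allowed to fetch a URL; the crawler name below is just a made-up placeholder:
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.taobao.com/robots.txt')
rp.read()  # download and parse the robots.txt file
# check whether a hypothetical crawler called "MyCrawler" may fetch the home page
print(rp.can_fetch('MyCrawler', 'https://www.taobao.com/'))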
Using the requests library to request a website
Installing the requests library
First install the requests library in PyCharm. To do this, open PyCharm, click the "File" menu and choose the "Settings for New Projects..." command, as shown in Figure 4.
Figure 4. Choose the "Project Interpreter" command, confirm the currently selected interpreter, then click the plus sign in the upper right corner, as shown in Figure 5.
Figure 5. Type requests in the search box (be careful to type it in full, otherwise it is easy to make a mistake), then click the "Install Package" button in the lower left corner, as shown in Figure 6.
Figure 6. After installation, "Package 'requests' installed successfully" is displayed in the Install Package window, as shown in Figure 7; if the installation fails, an error message is displayed instead.
Figure 7: Installation successful
The basic principle of crawlers
The process of requesting a web page has two steps:
Request: every web page displayed to a user must go through this step, that is, sending an access request to the server.
Response: after receiving the request, the server verifies its validity and then sends the response content to the user (the client); the client receives the response content and displays it. This is the familiar web request, as shown in Figure 8.
Figure 8: Request and Response. There are also two ways of requesting a web page:
GET: the most common way, generally used to obtain or query resource information; it is also the method most websites use, and the response is fast.
POST: compared with GET, it adds the ability to upload parameters in a form, so besides querying information it can also modify information.
Therefore, before writing a crawler, first determine to whom the request should be sent and how it should be sent.
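As a quick illustration of the difference (a minimal sketch using the httpbin.org test service rather than the tutorial's target site):
import requests

# GET: parameters travel in the URL as a query string
r_get = requests.get('http://httpbin.org/get', params={'q': 'python'})
# POST: parameters travel in the request body as form data
r_post = requests.post('http://httpbin.org/post', data={'q': 'python'})
print(r_get.status_code, r_post.status_code)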
Using GET to grab data
Copy the title of the first piece of news on the home page, press the 【Ctrl+F】 key combination on the source page to bring up the search box, paste the title into the search box, and then press the 【Enter】 key.
As shown in Figure 8, the title can be found in the source code. The object of the request is www.cntour.cn and the request method is GET (all data requests in the source code are GET), as shown in Figure 9.
Figure 9. After determining the request object and method, enter the following code in PyCharm:
import requests        # import the requests package
url = 'http://www.cntour.cn/'
strhtml = requests.get(url)        # fetch the web page with a GET request
print(strhtml.text)
The running result is shown in Figure 10 (Figure 10: running result). The statement used to load a library is import followed by the library name; here, the statement that loads the requests library is import requests.
To fetch data with GET, call the get method of the requests library by typing a dot after requests, as shown below:
requests.get
Save the acquired data in the strhtml variable; the code is as follows:
strhtml = requests.get(url)
At this point strhtml is a Response object that represents the whole web page, but what we need for now is only the page source code, which the following expression gives:
strhtml.text
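Besides .text, the Response object offers a few other attributes that are often useful when debugging a crawler (a small sketch, not part of the original article):
import requests

strhtml = requests.get('http://www.cntour.cn/')
print(strhtml.status_code)   # HTTP status code; 200 means the request succeeded
print(strhtml.encoding)      # the encoding requests guessed for the page
print(strhtml.text[:200])    # the first 200 characters of the page source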
Using POST to grab data
First open the Youdao Translate website http://fanyi.youdao.com/ to enter the Youdao Translate page.
Press the shortcut key F12 to enter developer mode and click Network; at this point the panel is empty, as shown in Figure 11.
Figure 11. Type "I love China" into Youdao Translate and click the "Translate" button, as shown in Figure 12.
Figure 12. In developer mode, click the "Network" button and then the "XHR" button to find the translation data, as shown in Figure 13.
Figure 13. Click Headers and you will find that the request method is POST, as shown in Figure 14.
Figure 14. After finding where the data is and determining the request method, we can start writing the crawler.
First, copy the URL from Headers and assign it to url (it appears in the full code below).
Getting data with POST differs from GET: a POST request must first build the request parameters (the form data) before it can be sent.
The request parameters are shown under Form Data, as in Figure 15.
Figure 15. Copy them and build a new dictionary:
From_data={'i':' I love China ','from':'zh-CHS','to':'en','smartresult':'dict','client':'fanyideskweb','salt':'15477056211258','sign':'b3589f32c38bc9e3876a570b8a992604','ts':'1547705621125','bv':'b33a2f3f9d09bde064c9275bcb33d94e','doctype':'json','version':'2.1','keyfrom':'fanyi.web','action':'FY_BY_REALTIME','typoResult':'false'}
Next, use the requests.post method to send the form data, convert the string returned by the server into JSON-format data, extract the translation result according to the data structure, and print it. The complete code is as follows:
import requests  # import the requests package
import json

def get_translate_date(word=None):
    url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
    From_data = {'i': word, 'from': 'zh-CHS', 'to': 'en', 'smartresult': 'dict', 'client': 'fanyideskweb', 'salt': '15477056211258', 'sign': 'b3589f32c38bc9e3876a570b8a992604', 'ts': '1547705621125', 'bv': 'b33a2f3f9d09bde064c9275bcb33d94e', 'doctype': 'json', 'version': '2.1', 'keyfrom': 'fanyi.web', 'action': 'FY_BY_REALTIME', 'typoResult': 'false'}
    # send the form data with a POST request
    response = requests.post(url, data=From_data)
    # convert the JSON-format string into a dictionary
    content = json.loads(response.text)
    print(content)
    # print only the translated text
    # print(content['translateResult'][0][0]['tgt'])

if __name__ == '__main__':
    get_translate_date('I love China')
Using Beautiful Soup to parse web pages
With the requests library we can already fetch the page source code; the next step is to find and extract the data from that source. Beautiful Soup is a Python library whose main job is to grab data from web pages. Beautiful Soup has been moved into the bs4 package, so importing Beautiful Soup requires installing the bs4 library.
The way to install the bs4 library is shown in Figure 16.
Figure 16. After installing bs4, you also need to install the lxml library. If lxml is not installed, the default Python parser is used instead. Although Beautiful Soup supports both the HTML parser in the Python standard library and several third-party parsers, lxml is more powerful and faster, so the author recommends installing it.
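If you prefer the command line to PyCharm's package manager, the same libraries can also be installed with pip (assuming pip is available on your system):
pip install requests beautifulsoup4 lxml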
After installing these third-party Python libraries, enter the following code to begin your Beautiful Soup journey:
import requests # Import requests package
from bs4 import BeautifulSoup
url='http://www.cntour.cn/'
strhtml=requests.get(url)
soup=BeautifulSoup(strhtml.text,'lxml')
data = soup.select('#main>div>div.mtop.firstMod.clearfix>div.centerBox>ul.newsList>li>a')
print(data)
The result of running the code is shown in Figure 17.
Figure 17. The Beautiful Soup library makes it easy to parse web page information. It is integrated into the bs4 package and can be called from bs4 when needed. The import statement is as follows:
from bs4 import BeautifulSoup
First, the HTML document is converted to Unicode, and then Beautiful Soup chooses the most suitable parser to parse it; here the lxml parser is specified. Parsing turns the complex HTML document into a tree structure in which every node is a Python object. Here the parsed document is stored in the new variable soup; the code is as follows:
soup=BeautifulSoup(strhtml.text,'lxml')
Next, use select (a CSS selector) to locate the data. To locate it, you need the browser's developer mode: hover the mouse over the piece of data on the page, right-click it, and choose the "Inspect" command from the shortcut menu, as shown in Figure 18.
Figure 18. The developer panel then pops up on the right side of the browser; the highlighted code on the right (see Figure 19(b)) corresponds to the highlighted data text on the left (see Figure 19(a)). Right-click the highlighted code on the right and choose "Copy" → "Copy selector" from the shortcut menu to copy the selector path automatically.
Figure 19: copying the path. Paste the path into your editor; it looks like this:
#main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li:nth-child(1) > a
Because this path is the path of only the first item selected, and we need all the headlines, delete the part of li:nth-child(1) from the colon onward (including the colon), leaving the following:
#main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li > a
Pass this path to soup.select; the code is as follows:
data = soup.select('#main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li > a')
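As a quick check (not part of the original code), soup.select returns a list of Tag objects, so you can see how many news items were matched and inspect the first one (assuming the selector matched something):
print(len(data))   # number of <a> tags matched by the selector
print(data[0])     # the first matched <a> tag, containing the title text and href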
Cleaning and organizing data
At this point we have obtained the target HTML code, but the data has not yet been extracted. Next, enter the following code in PyCharm:
for item in data:
    result = {
        'title': item.get_text(),
        'link': item.get('href')
    }
    print(result)
The result of running the code is shown in Figure 20 (Figure 20: running result). First, be clear that the data to extract is the title and the link. The title sits in the <a> tag, and a tag's text is extracted with the get_text() method. The link sits in the href attribute of the <a> tag, and an attribute is extracted from a tag with the get() method, passing the attribute name in parentheses, i.e. get('href').
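As a tiny standalone illustration of these two methods (the HTML string and URL below are made up for the example):
from bs4 import BeautifulSoup

tag = BeautifulSoup('<a href="http://www.cntour.cn/news/1.html">Example title</a>', 'lxml').a
print(tag.get_text())    # prints: Example title
print(tag.get('href'))   # prints: http://www.cntour.cn/news/1.html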
As Figure 20 shows, the link of each article contains a numeric ID. Below, this ID is extracted with a regular expression. The regular-expression symbols needed are as follows:
\d  matches a digit
+   matches the preceding character one or more times
Regular expressions in Python are used through the re library, which needs no installation and can be called directly. Enter the following code in PyCharm:
import re
for item in data:
    result = {
        "title": item.get_text(),
        "link": item.get('href'),
        'ID': re.findall('\d+', item.get('href'))
    }
    print(result)
The result is shown in Figure 21.
Figure 21. The findall method of the re library is used here: its first argument is the regular expression and its second argument is the text to search.
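For example (the URL below is made up just to show the behavior), re.findall returns a list of every substring that matches the pattern:
import re

print(re.findall(r'\d+', 'http://www.cntour.cn/news/detail/12345.html'))  # prints: ['12345']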
Crawler attack and defense
A crawler simulates human browsing behavior and fetches data in batches. As the volume of fetched data grows, it puts great pressure on the server being visited and may even bring it down. In other words, servers do not like having their data grabbed in bulk. Websites therefore often adopt anti-crawling measures against such crawlers.
The first way a server can identify a crawler is to check the User-Agent of the connection to tell browser access from code access. If it is code access and the number of visits keeps growing, the server will simply block the visitor's IP.
So how do we deal with this basic anti-crawling mechanism?
Take the crawler created earlier as an example. When making a request, we can not only find the URL and Form Data in developer mode, but also construct browser-like request headers from Request Headers to disguise ourselves. The server recognizes browser access by checking whether the keyword under Request Headers is the User-Agent, as shown in Figure 22.
Figure 22. So we only need to construct the request-header parameters. Create the request-header information; the code is as follows:
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
response = requests.get(url, headers=headers)
Writing this far, many readers will think that modifying the User-Agent is too simple. It really is simple, but a normal person looks at one picture per second, while a crawler can grab many pictures per second, say hundreds, so the pressure on the server is bound to increase. In other words, if pictures are downloaded in batches from a single IP, this behavior does not match normal human behavior, and that IP is bound to be blocked.
The principle is very simple: count the access frequency of each IP, and when the frequency exceeds a threshold, return a verification code. If it is really a user, the user fills it in and continues visiting; if it is code access, the IP gets blocked.
There are two solutions to this problem. The first is the commonly used added delay, fetching only once every 3 seconds; the code is as follows:
import time
time.sleep(3)
However, the whole point of writing a crawler is to fetch data in batches efficiently, and fetching once every 3 seconds is far too slow. In fact, there is a more important solution, one that tackles the problem at its root.
However we access the site, the server's goal is to figure out which visits are code access and then block the IP. Solution: to avoid having the IP blocked, proxies are often used in data collection, and requests has a corresponding proxies parameter.
First, build your own pool of proxy IPs, assign it to proxies in the form of a dictionary, and then pass it to requests; the code is as follows:
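The original article does not show this code, so the following is only a minimal sketch; the proxy addresses are placeholders that you would replace with working proxies from your own pool:
import requests

proxies = {
    'http': 'http://10.10.1.10:3128',    # placeholder HTTP proxy
    'https': 'http://10.10.1.10:1080',   # placeholder HTTPS proxy
}
response = requests.get('http://www.cntour.cn/', proxies=proxies)
print(response.status_code)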
This article only gives a brief introduction to Python crawlers and how they are implemented; it can give beginners a basic understanding of Python crawlers, but it will not make you fully master them.
If you want a deeper understanding of Python crawlers, the following reading is recommended: