Python Crawler Tutorial for Beginners
This is a detailed, hands-on introductory tutorial on Python crawlers, aimed at beginners. Just follow the ideas in the article as you read, work out the corresponding code, and in about 30 minutes you can learn to write a simple Python crawler.

This Python crawler tutorial covers the following five parts:
  1. Learn about the web;
  2. Use the requests library to grab website data;
  3. Use Beautiful Soup to parse web pages;
  4. Cleaning and organizing data;
  5. Crawler attack and defense.

Learn about the web

Take the home page of the China Tourism Network (http://www.cntour.cn/) as an example: we will grab the first piece of news on the home page (its title and link). This data appears in the page source as plain text. On the home page, press the shortcut key Ctrl+U to open the source view, as shown in Figure 1.

Figure 1: Source code of the China Tourism Network home page

Understand the structure of web pages

A web page generally consists of three parts: HTML (Hypertext Markup Language), CSS (Cascading Style Sheets), and JavaScript (an active scripting language).

HTML

HTML defines the structure of the whole page; it is the skeleton of the site. Everything wrapped in "<" and ">" is an HTML tag, and tags generally appear in pairs.

Common tags are as follows:

<html>..</html> marks everything between the tags as a web page
<body>..</body> contains the content visible to the user
<div>..</div> defines a block or frame
<p>..</p> defines a paragraph
<li>..</li> defines a list item
<img> displays an image (a void tag with no closing counterpart)
<h1>..</h1> defines a heading
<a href="">..</a> defines a hyperlink

CSS

CSS defines the presentation. In line 13 of Figure 1, <style type="text/css"> indicates that a block of CSS is referenced below; the page's appearance is defined in that CSS.

JavaScript

JavaScript provides behavior. Interactive content and visual effects live in JavaScript, which describes the various functions of the site.

Using the human body as a metaphor, HTML is the skeleton and defines where the mouth, eyes, and ears go. CSS is the appearance: what the mouth looks like, whether the eyes have single or double eyelids and are big or small, whether the skin is dark or fair. JavaScript is the skills a person has, such as dancing, singing, or playing an instrument.

Write a simple HTML page

Writing and modifying HTML by hand is a good way to understand it. First open a text editor such as Notepad, then type the following:

<html>
<head>
    <title>Python 3 Introduction and practice of crawler and data cleaning</title>
</head>
<body>
    <div>
        <p>Python 3 Introduction and practice of crawler and data cleaning</p>
    </div>
    <div>
        <ul>
            <li><a href="http://c.biancheng.net">Crawlers</a></li>
            <li>Data cleaning</li>
        </ul>
    </div>
</body>
</html>

After entering the code, save the file and change its name and extension to "HTML.html".

Open the file in a browser; the result is shown in Figure 2.

Figure 2
This code uses only HTML. Readers can modify the text in the code and observe how the page changes.

About the legality of crawlers

Almost every website has a file named robots.txt, although some sites do not provide one. For a website without robots.txt, data that is not password-protected can be obtained with a web crawler; in other words, all of the site's page data may be crawled. If the website does have a robots.txt file, you need to check whether it lists data that visitors are not allowed to obtain.

Take Taobao as an example. Visit https://www.taobao.com/robots.txt in a browser, as shown in Figure 3.

Figure 3: Contents of Taobao's robots.txt file
Taobao allows certain crawlers to access some of its paths; for user agents that are not explicitly allowed, crawling is forbidden, as the following rule shows:

User-Agent:*
Disallow:/

This rule means that, apart from the crawlers explicitly listed above it, no other crawler is allowed to crawl any data from the site.
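If you prefer to check robots.txt programmatically rather than by reading it, Python's standard library provides urllib.robotparser. The following is a minimal sketch of my own (not part of the original tutorial); the page path passed to can_fetch is only an illustrative example:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.taobao.com/robots.txt')   # point the parser at the robots.txt file
rp.read()                                          # download and parse the rules

# can_fetch() returns True only if the given user agent may crawl the URL
print(rp.can_fetch('*', 'https://www.taobao.com/markets/'))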

Use the requests library to request a site

Install the requests library

First install the requests library in PyCharm. Open PyCharm, click the "File" menu, and choose the "Settings for New Projects..." command, as shown in Figure 4.

Figure 4
Choose the "Project Interpreter" command, confirm the currently selected interpreter, and then click the plus sign in the upper right corner, as shown in Figure 5.

Figure 5
Type requests in the search box (be careful to type the name in full, otherwise it is easy to pick the wrong package), then click the "Install Package" button in the lower left corner, as shown in Figure 6.

Figure 6
After installation, "Package 'requests' installed successfully" appears on the Install Package page, as shown in Figure 7; if the installation fails, an error message is shown instead.

Figure 7: Installation successful
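If you prefer the command line to PyCharm's package manager, pip install requests achieves the same thing. Either way, a quick sanity check (a small sketch of my own, not from the original text) is to import the library and print its version:

# from the command line the equivalent install is:  pip install requests
import requests

print(requests.__version__)   # a version string here confirms the installation worked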

The basic principle of crawlers

The process of requesting a web page involves two steps:
  1. Request: every page displayed to a user goes through this step, in which the client sends an access request to the server.
  2. Response: after receiving the request, the server verifies its validity and then sends the response content back to the user (the client); the client receives and displays that content. This is the familiar web request cycle, as shown in Figure 8.

Figure 8: The request and response process
There are also two main ways to request a web page:
  1. GET: the most common method, generally used to obtain or query resource information; it is the method most websites use and it responds quickly.
  2. POST: compared with GET, it adds the ability to upload parameters in a form, so besides querying information it can also submit or modify information.

Therefore, before writing a crawler, first determine whom to send the request to and how to send it.
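To make the difference concrete, here is a small sketch of my own (not from the original tutorial) using the public echo service httpbin.org as a stand-in target: with GET the query parameters ride in the URL, while with POST the form fields ride in the request body.

import requests

# GET: query parameters are appended to the URL
r1 = requests.get('https://httpbin.org/get', params={'q': 'python'})
print(r1.url)            # ends with ?q=python

# POST: form fields are carried in the request body instead of the URL
r2 = requests.post('https://httpbin.org/post', data={'q': 'python'})
print(r2.status_code)    # 200 if the request succeeded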

Use GET to grab data

Copy the title of the first piece of news on the home page, press the Ctrl+F key combination on the source page to open the search box, paste the title into it, and press the Enter key.

As Figure 8 shows, the title can be found in the source code. The request target is www.cntour.cn and the request method is GET (all the data requests in the source page are GET), as shown in Figure 9.

Figure 9
After determining the request target and method, enter the following code in PyCharm:

import requests        # import the requests package
url = 'http://www.cntour.cn/'
strhtml = requests.get(url)   # fetch the page data with a GET request
print(strhtml.text)
The result is shown in Figure 10.
Figure 10: Result of running the code
The statement for loading a library is import plus the library name. Here, the statement that loads the requests library is: import requests.

To fetch data with GET, call the get method of the requests library: write a dot after requests and then the method name, as shown below:

requests.get

Save the fetched data in the variable strhtml. The code is as follows:

strhtml = requests.get(url)

At this point strhtml is a Response object representing the entire web page; since we only need the page's source code here, the following expression returns it:

strhtml.text
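Before relying on strhtml.text, it is worth checking that the request succeeded and that the encoding was detected correctly. This is a small sketch of my own rather than part of the original code:

import requests

url = 'http://www.cntour.cn/'
strhtml = requests.get(url)

print(strhtml.status_code)    # 200 means the request succeeded
print(strhtml.encoding)       # encoding guessed from the response headers

# if Chinese text comes out garbled, fall back to the encoding sniffed from the body
strhtml.encoding = strhtml.apparent_encoding
print(strhtml.text[:200])     # first 200 characters of the page source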

Use POST to grab data

First open the Youdao Translate website at http://fanyi.youdao.com/ to reach the translation page.

Press the F12 shortcut key to enter developer mode and click Network; at this point the panel is empty, as shown in Figure 11.

Figure 11
Type "I love China" into Youdao Translate and click the "Translate" button, as shown in Figure 12.

Figure 12
In developer mode, click the "Network" button and then the "XHR" button to find the translation data, as shown in Figure 13.

Figure 13
Click Headers; you will find that the method of the request for this data is POST, as shown in Figure 14.

Figure 14
Having found where the data lives and how it is requested, we can start writing the crawler.

First, copy the URL from Headers and assign it to url. The code is as follows:

url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'

Fetching data with a POST request differs from GET: for POST you must first build the request's form data before sending it.

The request parameters under Form Data are shown in Figure 15.

Figure 15
Copy them and build a new dictionary:

From_data={'i':' I love China ','from':'zh-CHS','to':'en','smartresult':'dict','client':'fanyideskweb','salt':'15477056211258','sign':'b3589f32c38bc9e3876a570b8a992604','ts':'1547705621125','bv':'b33a2f3f9d09bde064c9275bcb33d94e','doctype':'json','version':'2.1','keyfrom':'fanyi.web','action':'FY_BY_REALTIME','typoResult':'false'}

Next use the requests.post method to send the form data. The code is as follows:

import requests        # import the requests package
response = requests.post(url, data=From_data)

Convert the string returned by the server into JSON-format data, extract the translation according to the data structure, and print the result. The code is as follows:

import json
content = json.loads(response.text)
print(content['translateResult'][0][0]['tgt'])
The complete code for grabbing the Youdao translation result with requests.post is as follows:

import requests        # import the requests package
import json

def get_translate_date(word=None):
    url = 'http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
    From_data = {'i': word, 'from': 'zh-CHS', 'to': 'en', 'smartresult': 'dict',
                 'client': 'fanyideskweb', 'salt': '15477056211258',
                 'sign': 'b3589f32c38bc9e3876a570b8a992604', 'ts': '1547705621125',
                 'bv': 'b33a2f3f9d09bde064c9275bcb33d94e', 'doctype': 'json',
                 'version': '2.1', 'keyfrom': 'fanyi.web', 'action': 'FY_BY_REALTIME',
                 'typoResult': 'false'}
    # send the form data with a POST request
    response = requests.post(url, data=From_data)
    # convert the JSON-format string into a dictionary
    content = json.loads(response.text)
    print(content)
    # print the translated data
    #print(content['translateResult'][0][0]['tgt'])

if __name__ == '__main__':
    get_translate_date('I love China')
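As a side note, requests can decode a JSON response directly with response.json(), which is equivalent to json.loads(response.text), and it is prudent to guard against the server returning an error object instead of a translation. A hedged fragment of my own, assuming response is the result of the requests.post call inside the function above:

content = response.json()                       # same as json.loads(response.text)
if 'translateResult' in content:                # guard against an error response
    print(content['translateResult'][0][0]['tgt'])
else:
    print(content)                              # e.g. an error code returned by the server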

Use Beautiful Soup to parse web pages

The requests library lets us fetch the page source; the next step is to locate and extract data from that source. Beautiful Soup is a Python library whose main purpose is to grab data from web pages. Beautiful Soup has been ported into the bs4 package, so importing Beautiful Soup requires installing the bs4 library first.

The way to install the bs4 library is shown in Figure 16.

Figure 16
After installing bs4, install the lxml library as well. If lxml is not installed, Python's default parser is used. Although Beautiful Soup supports both the HTML parser in the Python standard library and several third-party parsers, lxml is more powerful and faster, so I recommend installing it.
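Both libraries can also be installed from the command line with pip. The short sketch below (my own example, not from the original text) shows that the parser is simply the second argument to BeautifulSoup, so you can fall back to the built-in 'html.parser' if lxml is unavailable:

# command-line installation:  pip install beautifulsoup4 lxml
from bs4 import BeautifulSoup

html = '<html><body><p>Hello crawler</p></body></html>'

soup_lxml = BeautifulSoup(html, 'lxml')           # fast third-party parser
soup_std = BeautifulSoup(html, 'html.parser')     # parser bundled with Python

print(soup_lxml.p.get_text())   # Hello crawler
print(soup_std.p.get_text())    # Hello crawler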

After installing the third-party libraries, enter the following code to begin the Beautiful Soup journey:

import requests        # import the requests package
from bs4 import BeautifulSoup

url = 'http://www.cntour.cn/'
strhtml = requests.get(url)
soup = BeautifulSoup(strhtml.text, 'lxml')
data = soup.select('#main>div>div.mtop.firstMod.clearfix>div.centerBox>ul.newsList>li>a')
print(data)
The result of running the code is shown in Figure 17.

Figure 17
The Beautiful Soup library makes it easy to parse web page information. It is integrated into the bs4 package and can be imported from bs4 when needed. The import statement is:

from bs4 import BeautifulSoup

First, the HTML document is converted to Unicode, and then Beautiful Soup chooses the most suitable parser to parse it; here we explicitly specify the lxml parser. Parsing turns the complex HTML document into a tree structure in which every node is a Python object. Here the parsed document is stored in the new variable soup. The code is as follows:

soup=BeautifulSoup(strhtml.text,'lxml')
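Because every node is a Python object, you can also walk the tree directly instead of using a selector. A minimal sketch with a made-up page of my own (not from the tutorial):

from bs4 import BeautifulSoup

html = '''<html><head><title>Demo page</title></head>
<body><ul>
<li><a href="/news/1.html">First headline</a></li>
<li><a href="/news/2.html">Second headline</a></li>
</ul></body></html>'''

soup = BeautifulSoup(html, 'lxml')

print(soup.title.get_text())    # a tag can be reached as an attribute of the tree
for a in soup.find_all('a'):    # find_all returns every matching tag
    print(a.get_text(), a.get('href'))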

Next, use select (a CSS selector) to locate the data. Locating data requires the browser's developer mode: hover the mouse over the data you want, right-click, and choose the "Inspect" command from the shortcut menu, as shown in Figure 18.

Figure 18
The developer panel then pops up on the right side of the browser. The highlighted code on the right (see Figure 19(b)) corresponds to the highlighted data text on the left (see Figure 19(a)). Right-click the highlighted code and choose "Copy" → "Copy selector" from the shortcut menu to copy the path automatically.

Figure 19: Copying the path
Paste the path into the document; the result looks like this:

#main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li:nth-child(1) > a

Because this path points only to the first selected item and we want all the headlines, delete the colon and everything after it in li:nth-child(1), leaving just li. The code is as follows:

#main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li > a

Use soup.select to reference this path. The code is as follows:

data = soup.select('#main > div > div.mtop.firstMod.clearfix > div.centerBox > ul.newsList > li > a')

Cleaning and organizing data

At this point we have the target HTML code, but the data has not been extracted yet. Next, enter the following code in PyCharm:

for item in data:
    result = {
        'title': item.get_text(),
        'link': item.get('href')
    }
    print(result)
The result of running the code is shown in Figure 20.
Figure 20: Result of running the code
First, be clear about what we want to extract: the title and the link. The title is the text of the <a> tag and is extracted with the get_text() method. The link sits in the href attribute of the <a> tag and is extracted with the get() method, with the attribute name in parentheses, i.e. get('href').

As Figure 20 shows, the link of each article contains a numeric ID. Let's extract that ID with a regular expression. The regex symbols needed are as follows:

\d matches a digit
+ matches the preceding character one or more times

Regular expressions in Python are provided by the re library, which needs no installation and can be called directly. Enter the following code in PyCharm:

import re

for item in data:
    result = {
        "title": item.get_text(),
        "link": item.get('href'),
        'ID': re.findall(r'\d+', item.get('href'))
    }
    print(result)
The result is shown in Figure 21.

Figure 21
Here we use the findall method of the re library; its first argument is the regular expression and its second argument is the text to search.
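To see findall on its own, here is a quick sketch of my own using a made-up link in the same style as the site's article URLs; note that findall always returns a list of every match:

import re

link = 'http://www.cntour.cn/news/12386.html'   # hypothetical link for illustration

ids = re.findall(r'\d+', link)   # first argument: the pattern; second: the text to search
print(ids)                       # ['12386'] -- a list of all matches
print(ids[0])                    # take the first element for a single ID string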

Crawler attack and defense

A crawler simulates human browsing behavior and fetches data in bulk. As the volume of fetched data grows, it puts great pressure on the server being accessed and may even bring it down. In other words, servers do not like having their data scraped, so websites deploy anti-crawling measures against these crawlers.

The first way a server identifies a crawler is to check the User-Agent of the connection to tell browser access from code access. If the access comes from code, the server will block the visitor's IP outright once the number of visits grows.

So how do we deal with this basic anti-crawling mechanism?

Take the crawler we built earlier as an example. In the developer tools we can find not only the URL and the Form Data, but also the browser's request headers under Request Headers, which we can use to disguise our own requests. The server recognizes browser access by checking the User-Agent keyword under Request Headers, as shown in Figure 22.

Figure 22
Therefore, we only need to construct the parameters of the request header ourselves. Create the header information as follows:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
response = requests.get(url, headers=headers)

At this point many readers will think that modifying the User-Agent is too simple. It really is simple, but a normal person looks at one picture per second while a crawler can grab many pictures per second, for example hundreds, which inevitably increases the load on the server. In other words, if one IP downloads pictures in bulk, the behavior does not match normal human behavior, and that IP is bound to be blocked.

The principle is simple: the server counts the visit frequency of each IP. When the frequency exceeds a threshold, it returns a verification code. A real user fills it in and continues browsing; code cannot, so the IP gets blocked.

There are two solutions to this problem. The first is the commonly used approach of adding a delay, fetching once every 3 seconds, with the following code:

import time
time.sleep(3)
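A common refinement, my own suggestion rather than something from the original text, is to randomize the delay so the request pattern looks less mechanical:

import random
import time

for page in range(5):
    # ... fetch one page here ...
    time.sleep(random.uniform(1, 3))   # wait a random 1 to 3 seconds between requests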

However, we write crawlers to fetch data in bulk efficiently, and fetching once every 3 seconds is far too inefficient. In fact there is a more fundamental solution that addresses the problem at its root.

However the access is made, the server's goal is to identify which requests come from code and then block that IP. The solution: to avoid having our IP blocked, proxies are commonly used in data collection, and requests provides a corresponding proxies parameter.

First, build your own pool of proxy IPs, assign them to proxies in dictionary form, and pass that to requests. The code is as follows:

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}
response = requests.get(url, proxies=proxies)
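Putting the counter-measures together, here is a sketch of my own that combines a browser-like User-Agent, the placeholder proxy addresses from above, and a pause between requests; the target URL is simply the one used earlier in the tutorial:

import time
import requests

url = 'http://www.cntour.cn/'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'}
proxies = {
    "http": "http://10.10.1.10:3128",     # placeholder proxy addresses from the text
    "https": "http://10.10.1.10:1080",
}

for _ in range(3):
    response = requests.get(url, headers=headers, proxies=proxies)
    print(response.status_code)
    time.sleep(3)                          # pause between requests to stay polite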

Extended reading

This article gives only a brief introduction to Python crawlers and how to implement one. It can give beginners a basic understanding of Python crawlers, but it will not make you fully proficient in them.

If you want a deeper understanding of Python crawlers, I recommend reading:
  • An introduction to Python crawlers
  • An introduction to Python 3 web crawlers
  • A Python crawler tutorial
