Preface
--
In short, the Internet is a large network composed of sites and network devices. We visit a site through a browser; the site sends its HTML, JS, and CSS code back to the browser; the browser parses and renders that code and presents us with a colorful web page.
1. What is a crawler?
If we picture the Internet as a big spider web, with data stored at every node of the web, then a crawler is a little spider that moves along the web grabbing its prey (the data). In other words, a crawler is a program that sends requests to a website and, after obtaining the resources, analyzes them and extracts the useful data.
Technically speaking, a crawler simulates a browser's requests to a site, fetches whatever the site returns (HTML code, JSON data, or binary data such as pictures and video) down to the local machine, then extracts the data it needs and stores it for use.
2. The basic crawling workflow
How a user accesses network data:
Way 1: the browser submits a request ---> the page code is downloaded ---> it is parsed and rendered into a page
Way 2: a program impersonates a browser and sends the request (obtaining the page code) -> extracts the useful data -> stores it in a database or file
What a crawler does is way 2.
1. Send a request
Use an HTTP library to send a Request to the target site.
A Request contains the request headers, request body, and so on.
Limitation of the requests module: it cannot execute JS and CSS code.
2. Get the response content
If the server responds normally, you get a Response.
A Response contains HTML, JSON, pictures, video, etc.
3. Parse the content
Parsing HTML data: regular expressions (the re module), or third-party parsing libraries such as BeautifulSoup and pyquery.
Parsing JSON data: the json module.
Parsing binary data: write it to a file in 'wb' mode.
4. Save the data
Database (MySQL, MongoDB, Redis)
File
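The parsing step above can be sketched with the standard library alone; the HTML fragment, JSON string, and file name below are made-up stand-ins for downloaded content:

```python
import json
import re

# hypothetical HTML fragment, standing in for a downloaded page
html = '<div class="items"><a href="/p/1.html">first</a></div>'
links = re.findall(r'class="items".*?href="(.*?)"', html, re.S)
print(links)            # ['/p/1.html']

# hypothetical JSON body returned by an API endpoint
body = '{"status": "ok", "count": 2}'
data = json.loads(body)
print(data['count'])    # 2

# binary data (e.g. an image) is written in 'wb' mode
with open('demo.bin', 'wb') as f:
    f.write(b'\x89PNG')
```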
3. The HTTP protocol: Request and Response
Request: the user sends his information through the browser (the socket client) to the server (the socket server).
Response: the server receives the request, analyzes the request information the user sent, and returns the data (the returned data may contain links to other resources, such as pictures, JS, and CSS).
Note: after receiving the Response, the browser parses its content and displays it to the user, whereas a crawler, after simulating the browser's request and receiving the Response, extracts the useful data from it.
4. The Request
1. Request method
Common request methods: GET / POST
2. Request URL
URL stands for uniform resource locator; it identifies a unique resource on the Internet. For example, a picture, a file, or a video can each be uniquely determined by a URL.
URL encoding
https://www.baidu.com/s?wd=图片
The query word 图片 ("picture") gets percent-encoded (see the sample code).
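The encoding can be reproduced with the standard library's urllib.parse; a minimal sketch using the Baidu query above:

```python
from urllib.parse import quote, unquote

# the Chinese word 图片 ("picture") is UTF-8 encoded, then percent-escaped
encoded = quote('图片')
print(encoded)            # %E5%9B%BE%E7%89%87

url = 'https://www.baidu.com/s?wd=' + encoded
print(unquote(url))       # https://www.baidu.com/s?wd=图片
```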
The loading process of a web page:
when loading a page, the document file is usually loaded first; while parsing the document, whenever a link (for example to an image) is encountered, another request is sent to download that resource.
3. Request headers
User-Agent: if the request carries no User-Agent, the server may treat the client as an illegitimate user.
Cookie: cookies are used to keep login state.
Note: a crawler should generally add request headers.
Header fields to pay attention to:
(1) Referer: where the visit came from (some large websites use the Referer header for anti-hotlinking; a crawler should simulate it too)
(2) User-Agent: identifies the visiting browser (add it, or you will be treated as a crawler)
(3) Cookie: remember to carry it along
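A minimal sketch of attaching these headers with the standard library; the URL and header values are placeholders, not real browser strings:

```python
import urllib.request

headers = {
    'User-Agent': 'Mozilla/5.0 (demo)',       # placeholder UA string
    'Referer': 'http://www.example.com/',     # spelled "Referer" on the wire
    'Cookie': 'sessionid=abc123',             # placeholder login cookie
}
req = urllib.request.Request('http://www.example.com/page', headers=headers)

# urllib normalizes header names to Capitalized-lowercase form
print(req.get_header('User-agent'))   # Mozilla/5.0 (demo)
```

The Request object is only built here, not sent; sending it with urllib.request.urlopen(req) would transmit these headers.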
4. Request body
For a GET request, the request body is empty (a GET request's data travels in the URL parameters and is directly visible). For a POST request, the request body is form data.
Note:
1. Login forms, file uploads, and so on attach their information in the request body.
2. Log in with a wrong username and password, then submit: you can capture the POST. On a correct login the page usually redirects, so the POST cannot be captured.
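How the same form data rides in the URL (GET) versus in the body (POST) can be sketched with urllib.parse; the form field names are made up:

```python
from urllib.parse import urlencode

params = {'user': 'alice', 'pwd': 'secret'}    # made-up form fields
qs = urlencode(params)
print(qs)                                      # user=alice&pwd=secret

# GET: the data rides in the URL itself, the body stays empty
get_url = 'http://www.example.com/login?' + qs

# POST: the same string is sent as the request body instead
post_body = qs.encode('utf-8')
```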
5. The Response
1. Response status codes
200: success
301: redirect
404: file not found
403: access forbidden
502: server error
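The standard library knows the reason phrase behind each of these codes, which is handy when logging responses; a small sketch:

```python
from http import HTTPStatus

# look up the standard reason phrase for each status code
for code in (200, 301, 403, 404, 502):
    status = HTTPStatus(code)
    print(code, status.phrase)
# 200 OK
# 301 Moved Permanently
# 403 Forbidden
# 404 Not Found
# 502 Bad Gateway
```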
2. Response headers
Header fields to pay attention to:
(1) Set-Cookie: BDSVRTM=0; path=/ (there may be more than one): tells the browser to save the cookie.
(2) Location: if the server's response headers contain Location, the browser, after receiving the response, will go on to visit the other page (a redirect).
3. Response body (the "preview" in the browser's dev tools) is the source the site returned: the web page's HTML, JSON data, binary data such as pictures, and so on.
6. Summary
1. The crawling process in one line:
crawl ---> parse ---> store
2. Crawler tooling:
Request libraries: requests, selenium (selenium can drive a browser to parse and render CSS and JS, but at a performance cost: it loads useful and useless page resources alike)
Parsing libraries: regular expressions, BeautifulSoup, pyquery
Storage: files, MySQL, MongoDB, Redis
3. Example: crawling xiaohuar.com
Basic version:
import re
import requests

response = requests.get('http://www.xiaohuar.com/v/')
# print(response.status_code)  # status code of the response
# print(response.content)      # raw bytes
# print(response.text)         # decoded text
urls = re.findall(r'class="items".*?href="(.*?)"', response.text, re.S)  # re.S lets . match newlines
url = urls[5]
result = requests.get(url)
mp4_url = re.findall(r'id="media".*?src="(.*?)"', result.text, re.S)[0]
video = requests.get(mp4_url)
with open('D:\\a.mp4', 'wb') as f:
    f.write(video.content)
Refactored into functions:
import re
import requests
import hashlib
import time

def get_index(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text

def parse_index(res):
    urls = re.findall(r'class="items".*?href="(.*?)"', res, re.S)  # re.S lets . match newlines
    return urls

def get_detail(urls):
    for url in urls:
        if not url.startswith('http'):
            url = 'http://www.xiaohuar.com%s' % url
        result = requests.get(url)
        if result.status_code == 200:
            mp4_url_list = re.findall(r'id="media".*?src="(.*?)"', result.text, re.S)
            if mp4_url_list:
                mp4_url = mp4_url_list[0]
                print(mp4_url)
                # save(mp4_url)

def save(url):
    video = requests.get(url)
    if video.status_code == 200:
        m = hashlib.md5()
        m.update(url.encode('utf-8'))            # hash the URL plus a timestamp
        m.update(str(time.time()).encode('utf-8'))
        filename = '%s.mp4' % m.hexdigest()      # to get a unique file name
        filepath = r'D:\%s' % filename
        with open(filepath, 'wb') as f:
            f.write(video.content)

def main():
    for i in range(5):
        res1 = get_index('http://www.xiaohuar.com/list-3-%s.html' % i)
        res2 = parse_index(res1)
        get_detail(res2)

if __name__ == '__main__':
    main()
Concurrent version (if 30 videos need crawling in total and 30 threads are opened to do it, the total time taken is the time of the slowest task):
import re
import requests
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

p = ThreadPoolExecutor(30)  # create a thread pool holding 30 threads

def get_index(url):
    response = requests.get(url)
    if response.status_code == 200:
        return response.text

def parse_index(res):
    res = res.result()  # the callback receives a Future; .result() gets the real return value
    urls = re.findall(r'class="items".*?href="(.*?)"', res, re.S)
    for url in urls:
        p.submit(get_detail, url)  # submit each detail page to the thread pool

def get_detail(url):  # download a single video
    if not url.startswith('http'):
        url = 'http://www.xiaohuar.com%s' % url
    result = requests.get(url)
    if result.status_code == 200:
        mp4_url_list = re.findall(r'id="media".*?src="(.*?)"', result.text, re.S)
        if mp4_url_list:
            mp4_url = mp4_url_list[0]
            print(mp4_url)
            # save(mp4_url)

def save(url):
    video = requests.get(url)
    if video.status_code == 200:
        m = hashlib.md5()
        m.update(url.encode('utf-8'))
        m.update(str(time.time()).encode('utf-8'))
        filename = '%s.mp4' % m.hexdigest()
        filepath = r'D:\%s' % filename
        with open(filepath, 'wb') as f:
            f.write(video.content)

def main():
    for i in range(5):
        # 1. submit the index-page task (get_index) to the pool asynchronously
        # 2. when get_index finishes, add_done_callback() hands its Future to parse_index
        # 3. parse_index then submits a get_detail task for every detail-page URL it finds
        p.submit(get_index, 'http://www.xiaohuar.com/list-3-%s.html' % i).add_done_callback(parse_index)

if __name__ == '__main__':
    main()
Related knowledge: multithreading and multiprocessing.
CPU-bound tasks: use multiple processes. Because Python has the GIL, only multiple processes can exploit the advantage of multiple CPU cores.
IO-bound tasks: use multiple threads; switching during IO waits saves overall task time (concurrency).
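The IO-bound case can be demonstrated with a small thread pool and a simulated network wait; fake_download and its 0.2-second sleep are stand-ins for a real request:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(i):
    time.sleep(0.2)          # stands in for waiting on the network
    return i * 2

start = time.time()
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fake_download, range(5)))
elapsed = time.time() - start

print(results)               # [0, 2, 4, 6, 8]
# five 0.2 s waits overlap across threads, so the total is ~0.2 s, not ~1.0 s
print(elapsed < 1.0)         # True
```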