Python3 Web crawler development practice
The urllib library contains the following four basic modules:

- `request`: the most basic HTTP request module, used to simulate sending a request.
- `error`: the exception handling module.
- `parse`: a utility module that provides functions for splitting, parsing, and joining URLs.
- `robotparser`: mainly used to parse a website's robots.txt file, in which crawler permissions are declared, i.e. which crawlers the server allows to crawl which pages (a brief example follows this list).
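As a brief illustration of the robotparser module, here is a minimal sketch of checking crawl permissions with `urllib.robotparser.RobotFileParser` (the URLs are only illustrative):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')  # point the parser at the site's robots.txt
rp.read()                                       # download and parse the file
# can_fetch() reports whether the given user agent may crawl the URL
print(rp.can_fetch('*', 'https://www.baidu.com/s?wd=python'))
```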
This section records the basic usage of some API functions in the request module.
urllib.request.urlopen()

The API signature for sending a web page request:

```
urllib.request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)
```
Parameter explanation:

- `url`: the URL to request.
- `data`: data sent to the specified URL. When this parameter is given, the request method becomes POST; if it is not given, the method is GET. The data must be converted to byte-stream format with the `bytes()` method; an example is given below.
- `timeout`: sets the timeout. If no response is received within the set time, an exception is thrown.
- `cafile`, `capath`: the CA certificate and its path, respectively. `cadefault` and `context` are not covered here.

Example of use:
```python
import urllib.request

response = urllib.request.urlopen('https://www.baidu.com')
print(type(response))                   # print the data type of the response object
print(response.read().decode('utf-8'))  # print the HTML source of the page
```
After calling the `urlopen()` function, the object returned by the server is stored in `response`. Printing the data type of the `response` object shows that it is `http.client.HTTPResponse`.
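Besides `read()`, the `HTTPResponse` object also provides the status code and response headers; a minimal sketch using the standard `HTTPResponse` attributes:

```python
import urllib.request

response = urllib.request.urlopen('https://www.baidu.com')
print(response.status)               # HTTP status code, e.g. 200
print(response.getheaders())         # list of (name, value) header tuples
print(response.getheader('Server'))  # value of a single response header
```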
- If you want to attach data to the request, you can use the `data` parameter. Example of use:
```python
import urllib.request
import urllib.parse

dic = {'name': 'Tom'}
data = bytes(urllib.parse.urlencode(dic), encoding='utf-8')
response = urllib.request.urlopen('https://www.httpbin.org/post', data=data)
```
The dictionary passed via the `data` parameter must first be converted to a string with `urllib.parse.urlencode()` and then encoded to bytes with the `bytes()` method.
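To make the two conversion steps explicit, here is a small sketch of the intermediate values, using the same example dictionary:

```python
import urllib.parse

dic = {'name': 'Tom'}
encoded = urllib.parse.urlencode(dic)    # 'name=Tom' -- a URL-encoded query string
data = bytes(encoded, encoding='utf-8')  # b'name=Tom' -- the byte stream urlopen() expects
print(encoded)
print(data)
```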
- `timeout`: specifies the timeout period, in seconds. Example of use:
```python
response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
```
If no response is received from the server within 0.01 seconds, an exception is thrown.
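To keep the program from stopping on a timeout, the exception can be caught with the `error` module mentioned earlier; a minimal sketch, assuming the timeout surfaces as a `URLError` wrapping `socket.timeout`:

```python
import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    # urlopen() wraps the low-level socket timeout in a URLError
    if isinstance(e.reason, socket.timeout):
        print('Request timed out')
```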
The `urlopen()` function accepts only a few parameters, which also means there are very few request headers we can set. To construct a more complete request, use a `urllib.request.Request` object. This object encapsulates the request, so the request headers can be set separately instead of passing only a URL as in the previous method.
- The constructor of `Request`:

```
class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
```
Example of use:

```python
from urllib import request, parse

url = 'https://www.httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'www.httpbin.org'
}
dic = {'name': 'Tom'}
data = bytes(parse.urlencode(dic), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
```
A `Request` object is constructed in advance and then passed as a parameter to the `urlopen()` method.
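Headers can also be set after the `Request` object has been built, via its `add_header()` method; a brief sketch of this variant, reusing the same example data:

```python
from urllib import request, parse

url = 'https://www.httpbin.org/post'
data = bytes(parse.urlencode({'name': 'Tom'}), encoding='utf-8')
req = request.Request(url=url, data=data, method='POST')
# add_header() sets a single request header on the already-built Request object
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.status)
```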