Requests is a very practical Python HTTP client library that fully meets the needs of today's web crawlers. Compared with Urllib, Requests not only provides all of Urllib's functionality: its syntax is easy to understand and fully in line with Python's elegant, concise style; it is compatible with both Python 2 and Python 3; and it is highly applicable and more user-friendly to operate.
As a third-party Python library, Requests can be installed via pip, as shown below:
pip install requests
sudo pip install requests
Besides installing with pip, you can also download a whl file and install from it, but the steps are more complicated and are not covered here.
It is worth mentioning that Requests is an open-source library whose source code is hosted on GitHub: https://github.com/kennethreitz/requests. If you want the latest version, you can download the Requests source code directly from GitHub; the download link is https://github.com/kennethreitz/requests/releases. Unzip the source package, enter the unzipped folder, and run the setup.py file, as sketched below.
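For reference, a minimal sketch of those steps (the folder name requests-master is an assumption; the actual name depends on the release you downloaded):

cd requests-master
python setup.py install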
The most common HTTP requests are GET and POST, and Requests provides a corresponding method for each:
import requests

url = 'https://baidu.com/'
# The keyword arguments below (params, headers, proxies, cookies, data,
# files) are placeholders to be defined as needed; each is explained
# later in this section.
# GET request
r = requests.get(url, params=params, headers=headers, proxies=proxies, verify=True, cookies=cookies)
# POST request
r = requests.post(url, data=data, files=files, headers=headers, proxies=proxies, verify=True, cookies=cookies)
A GET request takes one of two forms, without parameters or with parameters, for example:
# Without parameters
https://www.baidu.com/
# With the parameter wd
https://www.baidu.com/s?wd=python
Whether a URL carries parameters can be judged by the symbol "?": if the address contains "?" after the domain name, the URL carries request parameters; otherwise it does not.
Requests implements GET requests for URLs with parameters in two ways:
import requests

# The first way: put the parameter directly in the URL
r = requests.get('https://www.baidu.com/s?wd=python')

# The second way: pass the parameters as a dictionary
url = 'https://www.baidu.com/s'
params = {'wd': 'python'}
r = requests.get(url, params=params)

# Output the generated URL
print(r.url)
Both ways work; the second passes the parameters and their values as a dictionary, and the effect is identical. The first method is recommended in actual development because the code is concise: if a parameter changes dynamically, you can set it in the URL with string formatting, for example 'https://www.baidu.com/s?wd=%s' % ('python'), as in the sketch below.
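For example, a minimal sketch of setting the parameter dynamically with string formatting (keyword is a hypothetical variable holding the search term):

import requests

# Format the search keyword into the URL dynamically
keyword = 'python'
url = 'https://www.baidu.com/s?wd=%s' % keyword
r = requests.get(url)
print(r.url)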
A POST request is what we usually call submitting a form; the form's data content is the POST request's parameters. **To implement a POST request, Requests needs the request data to be set through the data parameter; the data format can be a dictionary, tuple, list, or JSON,** and different data formats have different advantages.
# Dictionary type
data = {'key1': 'value1', 'key2': 'value2'}

# Tuples or lists
data = (('key1', 'value1'), ('key2', 'value2'))

# JSON
import json
data = {'key1': 'value1', 'key2': 'value2'}
# Convert the dictionary to a JSON string
data = json.dumps(data)

# Send the POST request
import requests
r = requests.post("https://www.baidu.com/", data=data)
print(r.text)
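Note that the formats behave differently: a dictionary (or a tuple/list of pairs) is sent as a form-encoded body, while the string produced by json.dumps is sent as a raw body. A minimal sketch of the difference, using httpbin.org purely as an echo service for illustration:

import json
import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# Dictionary: sent as a form-encoded body
r1 = requests.post('https://httpbin.org/post', data=payload)
print(r1.text)

# JSON string: sent as a raw request body
r2 = requests.post('https://httpbin.org/post', data=json.dumps(payload))
print(r2.text)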
Complex requests often involve request headers, proxy IPs, certificate verification, Cookies, and so on. Requests simplifies this kind of request: these features are passed as parameters when the request is sent and take effect on the request.
(1) Adding request headers: build the request headers as a dictionary, then set the headers parameter in the request to point to the defined headers.
headers = {
    'User-Agent': '......',
    '...': '...',
    ......
}
requests.get("https://www.baidu.com/", headers=headers)
Adding the headers parameter is one way to deal with anti-crawling measures in requests: it makes the server believe we entered the web page ourselves, disguising the fact that we are crawling data. It solves cases where the requested page blocks crawlers, such as the response text containing a "Sorry" message or access being denied.
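For illustration, a minimal sketch with a sample User-Agent value (the string below is only an example; copy the real one from your own browser's developer tools):

import requests

# A sample desktop-browser User-Agent; replace it with your browser's own
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
}
r = requests.get('https://www.baidu.com/', headers=headers)
print(r.status_code)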
(2) Using proxy IPs: proxy IPs are used the same way as request headers; just set the proxies parameter.
import requests

# Example proxy addresses; replace them with real, working proxies
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}
requests.get("https://www.baidu.com/", proxies=proxies)
When we crawl a website with a Python crawler, one fixed IP visits at a very high frequency, which does not match normal human behavior: a person cannot make such frequent visits within a few milliseconds. Some websites therefore set a threshold on per-IP access frequency; if an IP exceeds it, the site decides the visitor is not a person but a crawler. Our IP may then be blocked, and our real IP can also be traced, so in such cases we can use high-anonymity proxy IPs to meet larger-scale needs.
(3) Certificate verification: verification can usually be turned off. Setting the parameter verify=False on the request disables certificate verification; the default is True. If you need to supply a certificate file, set the verify parameter to the certificate path.
Most web pages have valid certificates, so this parameter is usually omitted from the request; but you may still visit a page without a valid certificate, in which case you can turn off certificate verification. Note that running with verification turned off produces a warning, but the program still works and is not otherwise affected.
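A minimal sketch of both settings (suppressing the warning through urllib3 is optional):

import requests
import urllib3

# With verification off, Requests emits an InsecureRequestWarning;
# it can be silenced like this if desired
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
r = requests.get('https://www.baidu.com/', verify=False)
print(r.status_code)

# To verify against a specific certificate file instead:
# r = requests.get('https://www.baidu.com/', verify='/path/to/certfile')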
(4) Timeout: after a request is sent, there is a delay between request and response because of the network, the server, and other factors. If you do not want the program to wait too long, you can set the timeout parameter to a number of seconds and stop waiting for a response after that time. If the server does not respond within timeout seconds, an exception is raised.
requests.get('https://www.baidu.com', timeout=5)
requests.post('https://www.baidu.com', timeout=5)
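A minimal sketch of catching the exception raised when the timeout is exceeded:

import requests

try:
    r = requests.get('https://www.baidu.com/', timeout=5)
    print(r.status_code)
except requests.exceptions.Timeout:
    # Raised when no response arrives within 5 seconds
    print('request timed out')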
(5) Setting Cookies: to use Cookies in a request, just set the cookies parameter. Cookies are used to identify the user; in Requests, a dictionary or a RequestsCookieJar object is passed as the parameter. Cookies are mainly obtained in two ways: read from the browser, or generated by the running program.
import requests

url = 'https://www.baidu.com/'
test_cookies = 'JSESSIONID=2C30FCABDBF3B92E358E3D4FCB69C95B; Max-Age=172800;'
cookies = {}
# Split the string on ';' and convert it into a dictionary
for i in test_cookies.split(';'):
    i = i.strip()
    if not i:
        continue  # skip the empty piece left by the trailing ';'
    key, value = i.split('=', 1)
    cookies[key] = value
r = requests.get(url, cookies=cookies)
print(r.text)
When the program sends a request without the cookies parameter, a RequestsCookieJar object is generated automatically; this object stores the Cookies information.
import requests

url = 'https://www.baidu.com/'
r = requests.get(url)
# r.cookies is a RequestsCookieJar object
print(r.cookies)
thecookies = r.cookies

# Convert the RequestsCookieJar to a dictionary
cookie_dict = requests.utils.dict_from_cookiejar(thecookies)
print(cookie_dict)

# Convert the dictionary back to a RequestsCookieJar
cookie_jar = requests.utils.cookiejar_from_dict(cookie_dict, cookiejar=None, overwrite=True)
print(cookie_jar)

# Add a Cookies dictionary to a RequestsCookieJar object
print(requests.utils.add_dict_to_cookiejar(thecookies, cookie_dict))
When a request is sent to a website (server), the website returns a corresponding response object containing the server's response information. Requests provides the following ways to get the response content.
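Commonly used attributes of the response object include the following (a brief sketch for reference):

import requests

r = requests.get('https://www.baidu.com/')
print(r.status_code)  # HTTP status code returned by the server
print(r.encoding)     # encoding Requests guessed from the response headers
print(r.text)         # response body decoded as text
print(r.content)      # raw response body as bytes
print(r.headers)      # response headers (a case-insensitive dictionary)
print(r.url)          # final URL after any redirects
print(r.cookies)      # Cookies the server set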
Note: you can use r.text to get the response content, but sometimes decoding errors occur and you get garbled text. This happens because the encoding Requests obtained is incorrect; you can check it with r.encoding, and you can also assign r.encoding = '…' to specify the correct encoding. Most web pages are encoded as utf-8 (some use gbk).
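For example, a minimal sketch of specifying the encoding manually:

import requests

r = requests.get('http://www.baidu.com')
print(r.encoding)     # the encoding Requests guessed
r.encoding = 'utf-8'  # manually specify the correct encoding
print(r.text)         # now decoded as utf-8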
But this manual approach is a bit clumsy. Here is a simpler way: chardet, a good string/file encoding detection module.
Install the module first: pip install chardet
After installation, chardet.detect() returns a dictionary in which confidence is the detection accuracy and encoding is the detected encoding.
import chardet
import requests

r = requests.get('http://www.baidu.com')
print(chardet.detect(r.content))
# Assign the encoding detected by chardet to r.encoding to decode correctly
r.encoding = chardet.detect(r.content)['encoding']
print(r.text)