Requests is a very practical Python HTTP client library that fully meets the needs of today's web crawlers. Compared with Urllib, Requests not only provides all of Urllib's functionality: its syntax is easy to understand and fully in line with Python's elegant, concise style; it is compatible with both Python 2 and Python 3; and it is highly applicable and more user-friendly to operate.
As a third-party Python library, Requests can be installed via pip, as shown below:
pip install requests
sudo pip install requests
Besides installing with pip, you can also download a whl file and install from it, but the steps are more complicated and are not covered here.
It is worth mentioning that Requests is an open-source library whose source code is hosted on GitHub: https://github.com/kennethreitz/requests. If you want the latest version, you can download the Requests source code directly from GitHub; the download link is https://github.com/kennethreitz/requests/releases. Unzip the source package, enter the unzipped folder, and run the setup.py file, as sketched below.
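For reference, a minimal sketch of those steps (the folder name requests-master is an assumption; the actual name depends on the release you downloaded):

cd requests-master
python setup.py install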
The most common HTTP requests are GET and POST, and Requests provides a corresponding method for each:
import requests

url = 'https://baidu.com/'
# The keyword arguments below (params, headers, proxies, cookies, data,
# files) are placeholders to be defined as needed; each is explained
# later in this section.
# GET request
r = requests.get(url, params=params, headers=headers, proxies=proxies, verify=True, cookies=cookies)
# POST request
r = requests.post(url, data=data, files=files, headers=headers, proxies=proxies, verify=True, cookies=cookies)
A GET request takes one of two forms, without parameters or with parameters, for example:
# Without parameters
https://www.baidu.com/
# With the parameter wd
https://www.baidu.com/s?wd=python
Whether a URL carries parameters can be judged by the symbol "?": if the address contains "?" after the domain name, the URL carries request parameters; otherwise it does not.
Requests implements GET requests for URLs with parameters in two ways:
import requests

# The first way: put the parameter directly in the URL
r = requests.get('https://www.baidu.com/s?wd=python')

# The second way: pass the parameters as a dictionary
url = 'https://www.baidu.com/s'
params = {'wd': 'python'}
r = requests.get(url, params=params)

# Output the generated URL
print(r.url)
Both ways work; the second passes the parameters and their values as a dictionary, and the effect is identical. The first method is recommended in actual development because the code is concise: if a parameter changes dynamically, you can set it in the URL with string formatting, for example 'https://www.baidu.com/s?wd=%s' % ('python'), as in the sketch below.
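For example, a minimal sketch of setting the parameter dynamically with string formatting (keyword is a hypothetical variable holding the search term):

import requests

# Format the search keyword into the URL dynamically
keyword = 'python'
url = 'https://www.baidu.com/s?wd=%s' % keyword
r = requests.get(url)
print(r.url)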
A POST request is what we usually call submitting a form; the form's data content is the POST request's parameters. **To implement a POST request, Requests needs the request data to be set through the data parameter; the data format can be a dictionary, tuple, list, or JSON,** and different data formats have different advantages.
# Dictionary type
data = {'key1': 'value1', 'key2': 'value2'}

# Tuples or lists
data = (('key1', 'value1'), ('key2', 'value2'))

# JSON
import json
data = {'key1': 'value1', 'key2': 'value2'}
# Convert the dictionary to a JSON string
data = json.dumps(data)

# Send the POST request
import requests
r = requests.post("https://www.baidu.com/", data=data)
print(r.text)
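Note that the formats behave differently: a dictionary (or a tuple/list of pairs) is sent as a form-encoded body, while the string produced by json.dumps is sent as a raw body. A minimal sketch of the difference, using httpbin.org purely as an echo service for illustration:

import json
import requests

payload = {'key1': 'value1', 'key2': 'value2'}

# Dictionary: sent as a form-encoded body
r1 = requests.post('https://httpbin.org/post', data=payload)
print(r1.text)

# JSON string: sent as a raw request body
r2 = requests.post('https://httpbin.org/post', data=json.dumps(payload))
print(r2.text)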
Complex requests often involve request headers, proxy IPs, certificate verification, Cookies, and so on. Requests simplifies this kind of request: these features are passed as parameters when the request is sent and take effect on the request.
(1) Adding request headers: build the request headers as a dictionary, then set the headers parameter in the request to point to the defined headers.
headers = {
    'User-Agent': '......',
    '...': '...',
    ......
}
requests.get("https://www.baidu.com/", headers=headers)
Adding the headers parameter is one way to deal with anti-crawling measures in requests: it makes the server believe we entered the web page ourselves, disguising the fact that we are crawling data. It solves cases where the requested page blocks crawlers, such as the response text containing a "Sorry" message or access being denied.
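For illustration, a minimal sketch with a sample User-Agent value (the string below is only an example; copy the real one from your own browser's developer tools):

import requests

# A sample desktop-browser User-Agent; replace it with your browser's own
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
}
r = requests.get('https://www.baidu.com/', headers=headers)
print(r.status_code)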
(2) Using proxy IPs: proxy IPs are used the same way as request headers; just set the proxies parameter.
import requests

# Example proxy addresses; replace them with real, working proxies
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}
requests.get("https://www.baidu.com/", proxies=proxies)
When we crawl a website with a Python crawler, one fixed IP visits at a very high frequency, which does not match normal human behavior: a person cannot make such frequent visits within a few milliseconds. Some websites therefore set a threshold on per-IP access frequency; if an IP exceeds it, the site decides the visitor is not a person but a crawler. Our IP may then be blocked, and our real IP can also be traced, so in such cases we can use high-anonymity proxy IPs to meet larger-scale needs.
(3) Certificate verification: verification can usually be turned off. Setting the parameter verify=False on the request disables certificate verification; the default is True. If you need to supply a certificate file, set the verify parameter to the certificate path.
Most web pages have valid certificates, so this parameter is usually omitted from the request; but you may still visit a page without a valid certificate, in which case you can turn off certificate verification. Note that running with verification turned off produces a warning, but the program still works and is not otherwise affected.
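A minimal sketch of both settings (suppressing the warning through urllib3 is optional):

import requests
import urllib3

# With verification off, Requests emits an InsecureRequestWarning;
# it can be silenced like this if desired
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
r = requests.get('https://www.baidu.com/', verify=False)
print(r.status_code)

# To verify against a specific certificate file instead:
# r = requests.get('https://www.baidu.com/', verify='/path/to/certfile')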
(4) Timeout: after a request is sent, there is a delay between request and response because of the network, the server, and other factors. If you do not want the program to wait too long, you can set the timeout parameter to a number of seconds and stop waiting for a response after that time. If the server does not respond within timeout seconds, an exception is raised.
requests.get('https://www.baidu.com', timeout=5)
requests.post('https://www.baidu.com', timeout=5)
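A minimal sketch of catching the exception raised when the timeout is exceeded:

import requests

try:
    r = requests.get('https://www.baidu.com/', timeout=5)
    print(r.status_code)
except requests.exceptions.Timeout:
    # Raised when no response arrives within 5 seconds
    print('request timed out')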
(5) Setting Cookies: to use Cookies in a request, just set the cookies parameter. Cookies are used to identify the user; in Requests, a dictionary or a RequestsCookieJar object is passed as the parameter. Cookies are mainly obtained in two ways: read from the browser, or generated by the running program.
import requests

url = 'https://www.baidu.com/'
test_cookies = 'JSESSIONID=2C30FCABDBF3B92E358E3D4FCB69C95B; Max-Age=172800;'
cookies = {}
# Split the string on ';' and convert it into a dictionary
for i in test_cookies.split(';'):
    i = i.strip()
    if not i:
        continue  # skip the empty piece left by the trailing ';'
    key, value = i.split('=', 1)
    cookies[key] = value
r = requests.get(url, cookies=cookies)
print(r.text)
When the program sends a request without the cookies parameter, a RequestsCookieJar object is generated automatically; this object stores the Cookies information.
import requests

url = 'https://www.baidu.com/'
r = requests.get(url)
# r.cookies is a RequestsCookieJar object
print(r.cookies)
thecookies = r.cookies

# Convert the RequestsCookieJar to a dictionary
cookie_dict = requests.utils.dict_from_cookiejar(thecookies)
print(cookie_dict)

# Convert the dictionary back to a RequestsCookieJar
cookie_jar = requests.utils.cookiejar_from_dict(cookie_dict, cookiejar=None, overwrite=True)
print(cookie_jar)

# Add a Cookies dictionary to a RequestsCookieJar object
print(requests.utils.add_dict_to_cookiejar(thecookies, cookie_dict))
When a request is sent to a website (server), the website returns a corresponding response object containing the server's response information. Requests provides the following ways to get the response content.
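Commonly used attributes of the response object include the following (a brief sketch for reference):

import requests

r = requests.get('https://www.baidu.com/')
print(r.status_code)  # HTTP status code returned by the server
print(r.encoding)     # encoding Requests guessed from the response headers
print(r.text)         # response body decoded as text
print(r.content)      # raw response body as bytes
print(r.headers)      # response headers (a case-insensitive dictionary)
print(r.url)          # final URL after any redirects
print(r.cookies)      # Cookies the server set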
Note: you can use r.text to get the response content, but sometimes decoding errors occur and you get garbled text. This happens because the encoding Requests obtained is incorrect; you can check it with r.encoding, and you can also assign r.encoding = '…' to specify the correct encoding. Most web pages are encoded as utf-8 (some use gbk).
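For example, a minimal sketch of specifying the encoding manually:

import requests

r = requests.get('http://www.baidu.com')
print(r.encoding)     # the encoding Requests guessed
r.encoding = 'utf-8'  # manually specify the correct encoding
print(r.text)         # now decoded as utf-8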
But this manual approach is a bit clumsy. Here is a simpler way: chardet, a good string/file encoding detection module.
Install the module first: pip install chardet
After installation, chardet.detect() returns a dictionary in which confidence is the detection accuracy and encoding is the detected encoding.
import chardet
import requests

r = requests.get('http://www.baidu.com')
print(chardet.detect(r.content))
# Assign the encoding detected by chardet to r.encoding to decode correctly
r.encoding = chardet.detect(r.content)['encoding']
print(r.text)