This blog post covers proxy-related knowledge in Scrapy.
Programmers who write crawler code can never get around using a proxy; sooner or later you will run into one of the situations described below.
The test site is still http://httpbin.org/; visiting http://httpbin.org/ip returns the IP address of the current request.
The HttpProxyMiddleware middleware is enabled by default. You can read its source code, focusing on the process_request() method.
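For reference, here is a simplified sketch of what that method does (paraphrased; the real Scrapy source differs slightly between versions): if the request's meta already carries a 'proxy' key, that proxy is applied, otherwise proxies picked up from the environment variables are used.

# Simplified sketch of HttpProxyMiddleware.process_request(), paraphrased
# from the Scrapy source; not the exact implementation.
def process_request(self, request, spider):
    if 'proxy' in request.meta:
        if request.meta['proxy'] is None:
            return
        # Normalize the proxy URL and extract optional credentials.
        creds, proxy_url = self._get_proxy(request.meta['proxy'], '')
        request.meta['proxy'] = proxy_url
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds
    elif self.proxies:
        # Fall back to proxies read from the *_proxy environment variables.
        self._set_proxy(request, urlparse_cached(request).scheme)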
Setting a proxy is therefore very simple: when creating a Request, just pass the proxy in the meta parameter.
import scrapy


class PtSpider(scrapy.Spider):
    name = 'pt'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0], meta={'proxy': 'http://202.5.116.49:8080'})

    def parse(self, response):
        print(response.text)
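Running the spider with scrapy crawl pt should print a JSON body whose origin field is the proxy's IP rather than your own.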
Next, grab proxy IPs from https://www.kuaidaili.com/free/ and test whether each proxy is actually usable.
import scrapy


class PtSpider(scrapy.Spider):
    name = 'pt'
    allowed_domains = ['httpbin.org', 'kuaidaili.com']
    start_urls = ['https://www.kuaidaili.com/free/']

    def parse(self, response):
        IP = response.xpath('//td[@data-title="IP"]/text()').getall()
        PORT = response.xpath('//td[@data-title="PORT"]/text()').getall()

        url = 'http://httpbin.org/ip'
        for ip, port in zip(IP, PORT):
            proxy = f"http://{ip}:{port}"
            meta = {
                'proxy': proxy,
                'dont_retry': True,
                'download_timeout': 10,
            }
            yield scrapy.Request(url=url, callback=self.check_proxy, meta=meta, dont_filter=True)

    def check_proxy(self, response):
        print(response.text)
Next, save the usable proxy IPs to a JSON file.
import scrapy


class PtSpider(scrapy.Spider):
    name = 'pt'
    allowed_domains = ['httpbin.org', 'kuaidaili.com']
    start_urls = ['https://www.kuaidaili.com/free/']

    def parse(self, response):
        IP = response.xpath('//td[@data-title="IP"]/text()').getall()
        PORT = response.xpath('//td[@data-title="PORT"]/text()').getall()

        url = 'http://httpbin.org/ip'
        for ip, port in zip(IP, PORT):
            proxy = f"http://{ip}:{port}"
            meta = {
                'proxy': proxy,
                'dont_retry': True,
                'download_timeout': 10,
                '_proxy': proxy
            }
            yield scrapy.Request(url=url, callback=self.check_proxy, meta=meta, dont_filter=True)

    def check_proxy(self, response):
        proxy_ip = response.json()['origin']
        if proxy_ip is not None:
            yield {
                'proxy': response.meta['_proxy']
            }
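One straightforward way to actually write the yielded items to a JSON file is Scrapy's built-in feed export, e.g. running scrapy crawl pt -o proxy.json (the output file name here is just an example).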
At the same time, modify the start_requests method to fetch 10 pages of proxies.
class PtSpider(scrapy.Spider):
    name = 'pt'
    allowed_domains = ['httpbin.org', 'kuaidaili.com']
    url_format = 'https://www.kuaidaili.com/free/inha/{}/'

    def start_requests(self):
        for page in range(1, 11):
            yield scrapy.Request(url=self.url_format.format(page))
Implementing a custom proxy middleware is also fairly easy. There are two approaches; the first is to subclass HttpProxyMiddleware and write code like the following:
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
from collections import defaultdict
import random


class RandomProxyMiddleware(HttpProxyMiddleware):
    def __init__(self, auth_encoding='latin-1'):
        self.auth_encoding = auth_encoding
        self.proxies = defaultdict(list)
        with open('./proxy.csv') as f:
            proxy_list = f.readlines()
            for proxy in proxy_list:
                scheme = 'http'
                url = proxy.strip()
                self.proxies[scheme].append(self._get_proxy(url, scheme))

    def _set_proxy(self, request, scheme):
        creds, proxy = random.choice(self.proxies[scheme])
        request.meta['proxy'] = proxy
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds
The core of this code overrides the __init__ constructor to load the proxy list and overrides the _set_proxy method so that a random proxy is chosen for each request.
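Note that the spider earlier exported JSON, while this middleware reads ./proxy.csv with one proxy URL per line. A minimal bridge sketch, assuming the exported file is named proxy.json and each item has a proxy field as in the spider above:

import json

# Convert the exported proxy.json (a list of {'proxy': ...} items) into
# proxy.csv with one proxy URL per line, the format the middleware reads.
with open('proxy.json') as f:
    items = json.load(f)

with open('proxy.csv', 'w') as f:
    for item in items:
        f.write(item['proxy'] + '\n')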
Modify the settings.py file accordingly:
DOWNLOADER_MIDDLEWARES = {
    'proxy_text.middlewares.RandomProxyMiddleware': 543,
}
The second approach is to create a brand-new proxy middleware class:
import random

from scrapy.exceptions import NotConfigured


class NRandomProxyMiddleware(object):
    def __init__(self, settings):
        # Read the proxy configuration PROXIES from settings
        self.proxies = settings.getlist("PROXIES")

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxies)

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool("HTTPPROXY_ENABLED"):
            raise NotConfigured
        return cls(crawler.settings)
You can see that this class reads the PROXIES list from settings.py, so modify the corresponding configuration as follows:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'proxy_text.middlewares.NRandomProxyMiddleware': 543,
}

# Proxies collected by the previous spider
PROXIES = ['http://140.249.48.241:6969',
           'http://47.96.16.149:80',
           'http://140.249.48.241:6969',
           'http://47.100.14.22:9006',
           'http://47.100.14.22:9006']
If you want to test your crawlers, you can also write a function that returns a random proxy for each request and use it in any spider code; that completes the task of this blog post.
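A minimal sketch of such a helper (the names PROXIES and get_random_proxy are only for illustration):

import random

# Hypothetical helper: return one proxy chosen at random per request.
PROXIES = [
    'http://140.249.48.241:6969',
    'http://47.96.16.149:80',
]


def get_random_proxy():
    return random.choice(PROXIES)

# Usage inside any spider's start_requests():
#     yield scrapy.Request(url, meta={'proxy': get_random_proxy()})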
When this post reaches 400 bookmarks, the next one will be published.
Today is day 261 of my 200-day continuous writing challenge.
Feel free to follow me, like, comment on, and bookmark this post.