When crawling web pages, if we send too many requests per unit of time, the website may block our IP address. To keep our crawler running normally, we use proxy IPs.
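For context, "using a proxy IP" with requests just means passing a proxies dictionary to the request. Here is a minimal sketch; the proxy address is a made-up placeholder and example.com stands in for whatever site you are crawling:

import requests

# Placeholder address; in practice it comes from the proxy pool we build below.
proxies = {"http": "http://1.2.3.4:8080"}
r = requests.get("http://example.com/", proxies=proxies, timeout=5)
print(r.status_code)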
Here's how to build your own IP pool.
We get our proxy IP addresses from Kuaidaili (快代理), which offers free domestic high-anonymity HTTP proxy IPs: https://free.kuaidaili.com/free/inha/
With the etree module from lxml, we can quickly locate the tags in the page source that hold the proxy IP address, port, and proxy type; scraping them is not difficult.
Once the IPs are scraped, we need to verify whether they actually work. Free proxies, after all, have a fairly low success rate, so further screening is required.
Here I share another website used for that verification:
http://icanhazip.com/
Visiting this site returns the IP address it sees for the current request. By comparing the returned value with the proxy IP we are using, we can tell whether a scraped proxy actually takes effect.
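In isolation, that check looks roughly like this (the proxy address is again a made-up placeholder; icanhazip.com returns the visible IP followed by a newline):

import requests

proxy = {"http": "http://1.2.3.4:8080"}  # placeholder; use one of the scraped proxies here
r = requests.get("http://icanhazip.com/", proxies=proxy, timeout=5)
# strip() removes the trailing newline before comparing with the proxy's own IP
print(r.text.strip() == "1.2.3.4")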
Here is the complete code:
import requests
from lxml import etree
import time

headers = {"User-Agent": "mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"}

def get_ips():
    # Scrape the free proxy list and return it as a list of requests-style proxy dicts.
    ls = []
    ipps = []
    for i in range(1, 3):  # pages 1 and 2 of the free list
        url = f"https://free.kuaidaili.com/free/inha/{i}/"
        page = requests.get(url=url, headers=headers).text
        tree = etree.HTML(page)
        rows = tree.xpath('//table[@class="table table-bordered table-striped"]/tbody/tr')
        for row in rows:
            try:
                ip = "http://" + row.xpath('./td[@data-title="IP"]/text()')[0] + ":" + \
                     row.xpath('./td[@data-title="PORT"]/text()')[0]
                ipps.append(ip)
            except IndexError:
                continue
        time.sleep(1)  # be polite between pages
    for ip in set(ipps):  # deduplicate
        ls.append({"http": ip})
    return ls

def check_ips(ls):
    # Keep only the proxies through which icanhazip.com reports the proxy's own IP.
    url = 'http://icanhazip.com/'
    for i in ls[::-1]:  # iterate over a reversed copy so removing from ls is safe
        try:
            r = requests.get(url=url, headers=headers, proxies=i, timeout=5)
            r.raise_for_status()
            # icanhazip returns the visible IP plus a newline; compare it with the proxy's IP
            if r.text.strip() != i["http"][7:].split(":")[0]:
                ls.remove(i)
        except Exception:
            ls.remove(i)
    return ls

def ips():
    a = get_ips()
    b = check_ips(a)
    return b

if __name__ == '__main__':
    print(ips())
With the code above I only scrape the first two pages of the proxy site. If you need more, change the number of loop iterations in get_ips(). This approach has a drawback, though: pages are crawled one by one, and then the scraped proxy IPs are checked one by one, which is slow. To work more efficiently, we can bring in a thread pool for both the crawling and the validation steps, which greatly speeds things up.
import requests
from lxml import etree
from multiprocessing.dummy import Pool  # thread pool

headers = {"User-Agent": "mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"}
ls = []
ipps = []

def get_ips(a):
    # Scrape one page of the free proxy list and collect the raw proxy URLs.
    url = f"https://free.kuaidaili.com/free/inha/{a}/"
    page = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page)
    rows = tree.xpath('//table[@class="table table-bordered table-striped"]/tbody/tr')
    for row in rows:
        try:
            ip = "http://" + row.xpath('./td[@data-title="IP"]/text()')[0] + ":" + \
                 row.xpath('./td[@data-title="PORT"]/text()')[0]
            ipps.append(ip)
        except IndexError:
            continue

# crawl pages 1 and 2 concurrently with a small thread pool
pool_1 = Pool(2)
pool_1.map(get_ips, list(range(1, 3)))
pool_1.close()
pool_1.join()

for ip in set(ipps):  # deduplicate and wrap in requests-style proxy dicts
    ls.append({"http": ip})
print(len(ls))

def check_ips(i):
    # Drop the proxy from ls if icanhazip.com does not report its IP.
    url = 'http://icanhazip.com/'
    try:
        r = requests.get(url=url, headers=headers, proxies=i, timeout=5)
        r.raise_for_status()
        # compare the stripped response with the proxy's own IP
        if r.text.strip() != i["http"][7:].split(":")[0]:
            ls.remove(i)
    except Exception:
        ls.remove(i)

# validate the collected proxies concurrently with a larger thread pool
pool_2 = Pool(15)
pool_2.map(check_ips, ls)
pool_2.close()
pool_2.join()

if __name__ == '__main__':
    print(ls)
Scraping only the first two pages of proxy IPs, you will find that roughly one in five proxies actually works. But they are free, after all, and all we added was one verification step, so overall it is still a pretty good deal.
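As a closing sketch (not part of the original script): once ips() from the first version returns the validated pool, you can pick a random proxy for each subsequent request, for example:

import random
import requests

pool = ips()  # validated proxy dicts, e.g. [{"http": "http://1.2.3.4:8080"}, ...]
if pool:
    proxy = random.choice(pool)
    r = requests.get("http://example.com/", proxies=proxy, timeout=5)
    print(r.status_code)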