Hello everyone, I'm Ning Yi. Today we are going to talk about Python web crawlers, and we will use Python to crawl job data from Lagou. Lagou's anti-crawler measures are quite strong: ordinary requests sent with a custom header keep coming back with "requests too frequent" messages instead of data.
So we will mainly use the selenium package to crawl the data. It simulates a real person operating the browser, automatically clicking through pages and reading their content. It is much less efficient than firing requests directly with a header, but it gets past the anti-crawler checks.
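For comparison, here is a minimal sketch of the plain-requests approach that Lagou tends to block (the User-Agent value and the exact response text are illustrative, not taken from the original article):
import requests  # plain HTTP requests, no browser involved

# Even with a browser-like User-Agent, Lagou usually answers with an
# anti-crawler notice instead of the job listings.
headers = {"User-Agent": "Mozilla/5.0"}
resp = requests.get("https://www.lagou.com/jobs/list_Python/", headers=headers)
print(resp.status_code)
print(resp.text[:200])  # typically a "too frequent" notice, not the data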
Let's see how it's done; just follow the code.
First we import four modules, used respectively to simulate a real user operating the browser, parse the web pages, handle timing, and work with files:
from selenium import webdriver  # simulate a real person operating a web page
import pyquery as pq  # parse web pages
import time  # time module
import os  # file module
In Chrome, open Help → About Google Chrome to see your browser version. Mine is 81.0.4044.138.
Open http://npm.taobao.org/mirrors/chromedriver/, pick the chromedriver that matches your Chrome version, download and unzip it, and put it in the same directory as the Python file we are editing.
Then point the code at the chromedriver executable; it lets selenium open and drive a Chrome window:
path = os.getcwd()
driver = webdriver.Chrome(executable_path=(path+"/chromedriver"))
# Implicitly wait up to 10 seconds for elements to appear before giving up
driver.implicitly_wait(10)
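You can sanity-check that the driver really talks to your Chrome; a small sketch (the capability key name varies across selenium/Chrome versions, so both are tried):
caps = driver.capabilities
print(caps.get("browserVersion") or caps.get("version"))  # Chrome version as seen by the driver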
The URL below opens the listing page for Python positions aimed at fresh graduates:
# website
lagou_http = "https://www.lagou.com/jobs/list_Python/p-city_2?px=default&gx=%E5%85%A8%E8%81%8C&gj=&isSchoolJob=1#filterBox"
# Define an empty list to save the found data
data = []
driver.get(lagou_http)
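The query string is URL-encoded; a quick sketch to decode it and confirm the filters (gx=%E5%85%A8%E8%81%8C decodes to 全职, i.e. full-time, and isSchoolJob=1 restricts to the new-graduate jobs the article targets):
from urllib.parse import urlparse, parse_qs

params = parse_qs(urlparse(lagou_http).query)  # parse_qs percent-decodes for us
print(params["gx"][0])           # '全职' — full-time
print(params["isSchoolJob"][0])  # '1' — campus / fresh-graduate positions only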
In a loop, we look for the next-page button and click it automatically, grabbing the current page's data each time around. The getData method (written further below) parses the page data, and its return value is appended to the data list.
When the next-page button can no longer be clicked, we print the data and break out of the loop.
while True:
    # Find the next-page button; its class tells us whether it is still clickable
    next_html = driver.find_element_by_css_selector(".pager_next").get_attribute('class')
    # While more pages remain the class is exactly 'pager_next ' (note the
    # trailing space); on the last page the class changes, so the check fails
    if next_html == 'pager_next ':
        # Parse the job items on the current page
        items = pq.PyQuery(driver.page_source).find(".con_list_item")
        # print(items)
        data += getData(items)
        time.sleep(2)
        # Click the next-page button
        driver.find_element_by_xpath("//span[@action='next']").click()
    else:
        print('End of data crawling')
        print(data)
        break
Here is part of the page markup we get back. These nodes are what we pass into getData to produce our final structured data:
<div>
  <div class="p_bot">
    <div class="li_b_l">
      <span class="money">10k-20k</span>
      <!--<i></i>-->Experience: fresh graduate / Bachelor
    </div>
  </div>
</div>
<div class="company">
  <div class="industry">
    Enterprise services, data services / Series C / 150-500 employees
  </div>
</div>
<div class="list_item_bot">
  <div class="li_b_l">
    <span>Server side</span>
    <span>Linux/Unix</span>
    <span>Hadoop</span>
    <span>Scala</span>
  </div>
  <div class="li_b_r">“Rapid growth, artificial intelligence, big data, good benefits”</div>
</div>
def getData(items):
    datalist = []
    for item in items.items():
        temp = dict()
        temp['Job title'] = item.attr('data-positionname')
        temp['Salary'] = item.attr('data-salary')
        temp['Company name'] = item.attr('data-company')
        temp['Company description'] = pq.PyQuery(item).find(".industry").text()
        temp['Work experience'] = pq.PyQuery(item).find(".p_bot>.li_b_l").remove(".money").text()
        datalist.append(temp)
    return datalist
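Given the markup above, each record getData returns is a small dict. Roughly (the job title and company name come from data-* attributes not shown in the excerpt, so those values are elided here):
items = pq.PyQuery(driver.page_source).find(".con_list_item")
print(getData(items)[0])
# e.g. {'Job title': '...', 'Salary': '10k-20k', 'Company name': '...',
#       'Company description': 'Enterprise services, data services / Series C / 150-500 employees',
#       'Work experience': 'Experience: fresh graduate / Bachelor'}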
Putting it all together, the complete script:
# coding=utf-8
from selenium import webdriver  # simulate a real person operating a web page
import pyquery as pq  # parse web pages
import time  # time module
import os  # file module
path = os.getcwd()
driver = webdriver.Chrome(executable_path=(path+"/chromedriver"))
# Implicitly wait up to 5 seconds for elements to appear before giving up
driver.implicitly_wait(5)
print(" Start crawling data ")
# website
lagou_http = "https://www.lagou.com/jobs/list_Ruby/p-city_0?px=default&gx=%E5%85%A8%E8%81%8C&gj=&isSchoolJob=1#filterBox"
# Define an empty list to save the found data
data = []
driver.get(lagou_http)
def getData(items):
    datalist = []
    for item in items.items():
        temp = dict()
        temp['Job title'] = item.attr('data-positionname')
        temp['Salary'] = item.attr('data-salary')
        temp['Company name'] = item.attr('data-company')
        temp['Company description'] = pq.PyQuery(item).find(".industry").text()
        temp['Work experience'] = pq.PyQuery(item).find(".p_bot>.li_b_l").remove(".money").text()
        datalist.append(temp)
    return datalist
while True:
    # Find the next-page button; its class tells us whether it is still clickable
    next_html = driver.find_element_by_css_selector(".pager_next").get_attribute('class')
    if next_html == 'pager_next ':
        # Parse the job items on the current page
        items = pq.PyQuery(driver.page_source).find(".con_list_item")
        # print(items)
        data += getData(items)
        time.sleep(2)
        # Click the next-page button
        driver.find_element_by_xpath("//span[@action='next']").click()
    else:
        print('End of data crawling')
        print(data)
        break
# Finally, save the collected data to a.txt
with open(path + "/a.txt", "w", encoding="utf-8") as file:
    file.write(str(data))
print('Write file successful')
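If you would rather keep the output machine-readable, one option is to dump the list as JSON instead of a stringified Python list (the a.json file name is just an example):
import json

with open(path + "/a.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)  # ensure_ascii=False keeps non-ASCII text readable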