Hello everyone, I'm Ning Yi. Today we are going to talk about Python web crawlers, and we will use Python to crawl job data from Lagou. Lagou's anti-crawler measures are quite strong: ordinary requests sent with a custom header keep coming back with "requests too frequent" messages instead of data.
So we will mainly use the selenium package to crawl the data. It simulates a real person operating the browser, automatically clicking through pages and reading their content. It is much less efficient than firing requests directly with a header, but it gets past the anti-crawler checks.
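For comparison, here is a minimal sketch of the plain-requests approach that Lagou tends to block (the User-Agent value and the exact response text are illustrative, not taken from the original article):
import requests  # plain HTTP requests, no browser involved

# Even with a browser-like User-Agent, Lagou usually answers with an
# anti-crawler notice instead of the job listings.
headers = {"User-Agent": "Mozilla/5.0"}
resp = requests.get("https://www.lagou.com/jobs/list_Python/", headers=headers)
print(resp.status_code)
print(resp.text[:200])  # typically a "too frequent" notice, not the data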
Let's see how it's done; just follow the code.
First we import four modules, used respectively to simulate a real user operating the browser, parse the web pages, handle timing, and work with files:
from selenium import webdriver  # simulate a real person operating a web page
import pyquery as pq  # parse web pages
import time  # time module
import os  # file module
In Chrome, open Help → About Google Chrome to see your browser version. Mine is 81.0.4044.138.
Open http://npm.taobao.org/mirrors/chromedriver/, pick the chromedriver that matches your Chrome version, download and unzip it, and put it in the same directory as the Python file we are editing.
Then point the code at the chromedriver executable; it lets selenium open and drive a Chrome window:
path = os.getcwd()
driver = webdriver.Chrome(executable_path=(path+"/chromedriver"))
# Implicitly wait up to 10 seconds for elements to appear before giving up
driver.implicitly_wait(10)
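You can sanity-check that the driver really talks to your Chrome; a small sketch (the capability key name varies across selenium/Chrome versions, so both are tried):
caps = driver.capabilities
print(caps.get("browserVersion") or caps.get("version"))  # Chrome version as seen by the driver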
The URL below opens the listing page for Python positions aimed at fresh graduates:
# website
lagou_http = "https://www.lagou.com/jobs/list_Python/p-city_2?px=default&gx=%E5%85%A8%E8%81%8C&gj=&isSchoolJob=1#filterBox"
# Define an empty list to save the found data
data = []
driver.get(lagou_http)
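The query string is URL-encoded; a quick sketch to decode it and confirm the filters (gx=%E5%85%A8%E8%81%8C decodes to 全职, i.e. full-time, and isSchoolJob=1 restricts to the new-graduate jobs the article targets):
from urllib.parse import urlparse, parse_qs

params = parse_qs(urlparse(lagou_http).query)  # parse_qs percent-decodes for us
print(params["gx"][0])           # '全职' — full-time
print(params["isSchoolJob"][0])  # '1' — campus / fresh-graduate positions only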
In a loop, we look for the next-page button and click it automatically, grabbing the current page's data each time around. The getData method (written further below) parses the page data, and its return value is appended to the data list.
When the next-page button can no longer be clicked, we print the data and break out of the loop.
while True:
    # Find the next-page button; its class tells us whether it is still clickable
    next_html = driver.find_element_by_css_selector(".pager_next").get_attribute('class')
    # While more pages remain the class is exactly 'pager_next ' (note the
    # trailing space); on the last page the class changes, so the check fails
    if next_html == 'pager_next ':
        # Parse the job items on the current page
        items = pq.PyQuery(driver.page_source).find(".con_list_item")
        # print(items)
        data += getData(items)
        time.sleep(2)
        # Click the next-page button
        driver.find_element_by_xpath("//span[@action='next']").click()
    else:
        print('End of data crawling')
        print(data)
        break
Here is part of the page markup we get back. These nodes are what we pass into getData to produce our final structured data:
<div>
  <div class="p_bot">
    <div class="li_b_l">
      <span class="money">10k-20k</span>
      <!--<i></i>-->Experience: fresh graduate / Bachelor
    </div>
  </div>
</div>
<div class="company">
  <div class="industry">
    Enterprise services, data services / Series C / 150-500 employees
  </div>
</div>
<div class="list_item_bot">
  <div class="li_b_l">
    <span>Server side</span>
    <span>Linux/Unix</span>
    <span>Hadoop</span>
    <span>Scala</span>
  </div>
  <div class="li_b_r">“Rapid growth, artificial intelligence, big data, good benefits”</div>
</div>
def getData(items):
    datalist = []
    for item in items.items():
        temp = dict()
        temp['Job title'] = item.attr('data-positionname')
        temp['Salary'] = item.attr('data-salary')
        temp['Company name'] = item.attr('data-company')
        temp['Company description'] = pq.PyQuery(item).find(".industry").text()
        temp['Work experience'] = pq.PyQuery(item).find(".p_bot>.li_b_l").remove(".money").text()
        datalist.append(temp)
    return datalist
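Given the markup above, each record getData returns is a small dict. Roughly (the job title and company name come from data-* attributes not shown in the excerpt, so those values are elided here):
items = pq.PyQuery(driver.page_source).find(".con_list_item")
print(getData(items)[0])
# e.g. {'Job title': '...', 'Salary': '10k-20k', 'Company name': '...',
#       'Company description': 'Enterprise services, data services / Series C / 150-500 employees',
#       'Work experience': 'Experience: fresh graduate / Bachelor'}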
Putting it all together, the complete script:
# coding=utf-8
from selenium import webdriver  # simulate a real person operating a web page
import pyquery as pq  # parse web pages
import time  # time module
import os  # file module
path = os.getcwd()
driver = webdriver.Chrome(executable_path=(path+"/chromedriver"))
# Implicitly wait up to 5 seconds for elements to appear before giving up
driver.implicitly_wait(5)
print(" Start crawling data ")
# website
lagou_http = "https://www.lagou.com/jobs/list_Ruby/p-city_0?px=default&gx=%E5%85%A8%E8%81%8C&gj=&isSchoolJob=1#filterBox"
# Define an empty list to save the found data
data = []
driver.get(lagou_http)
def getData(items):
    datalist = []
    for item in items.items():
        temp = dict()
        temp['Job title'] = item.attr('data-positionname')
        temp['Salary'] = item.attr('data-salary')
        temp['Company name'] = item.attr('data-company')
        temp['Company description'] = pq.PyQuery(item).find(".industry").text()
        temp['Work experience'] = pq.PyQuery(item).find(".p_bot>.li_b_l").remove(".money").text()
        datalist.append(temp)
    return datalist
while True:
    # Find the next-page button; its class tells us whether it is still clickable
    next_html = driver.find_element_by_css_selector(".pager_next").get_attribute('class')
    if next_html == 'pager_next ':
        # Parse the job items on the current page
        items = pq.PyQuery(driver.page_source).find(".con_list_item")
        # print(items)
        data += getData(items)
        time.sleep(2)
        # Click the next-page button
        driver.find_element_by_xpath("//span[@action='next']").click()
    else:
        print('End of data crawling')
        print(data)
        break
# Finally, save the collected data to a.txt
with open(path + "/a.txt", "w", encoding="utf-8") as file:
    file.write(str(data))
print('Write file successful')
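If you would rather keep the output machine-readable, one option is to dump the list as JSON instead of a stringified Python list (the a.json file name is just an example):
import json

with open(path + "/a.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)  # ensure_ascii=False keeps non-ASCII text readable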