您现在的位置：程式師世界 >> 編程語言 > >> 更多編程語言 >> Python

Python crawler (II) lxml Use of etree (five minutes to climb the part-time network of college students)

編輯：Python

All libraries and installation methods used

Need to install requests Module and lxml modular
utilize pip install

pip install requests
pip install lxml

Three minutes to solve lxml

About requests I use the library in Last one It's written in the article, so I won't repeat it here
here lxml We use it etree library
Usage is as follows ：
First, we need to find the website we want to crawl https://www.dxsjz.com/part/ Notice that at this point we slide to the lowest end
Select next page , Then select the next page
https://www.dxsjz.com/part/page_2.html
https://www.dxsjz.com/part/page_3.html
Comparing the two links, you can find the two url Only the last difference makes it reasonable to guess that the first page is except https://www.dxsjz.com/part/ still https://www.dxsjz.com/part/page_1.html

import requests
for num in range(100):
r = requests.get("https://www.dxsjz.com/part/page_%d.html"%(num+1))

Here, the contents of multiple pages can be crawled through the loop
Start using on this basis lxml

import requests
from lxml import etree
for num in range(100):
r = requests.get("https://www.dxsjz.com/part/page_%d.html"%(num))
html = etree.HTML(r.text)

Build a etree object , Then it can be called xpath
We open the link and press F12 Enter developer mode

Click the mouse icon in the upper left corner of the developer mode, and then click the content to be crawled to see the content on the right html Corresponding positioning

Right click the location to select Copy Xpath You can copy xpath route

 html.xpath("/html/body/div[5]/div[2]/div[3]/div[1]/a"))

From this line you can extract a list , There is an object in the list , We take it out by index and pass it through .text Property to get the content

 html.xpath("/html/body/div[5]/div[2]/div[3]/div[1]/a"))[0].text

Then through other xpath We can find rules
To write the final program 、

import requests
from lxml import etree
for num in range(100):
r = requests.get("https://www.dxsjz.com/part/page_%d.html"%(num))
html = etree.HTML(r.text)
with open(" Part time job network for college students %d.txt"%(num),'w') as f:
for i in range(10):
try:
html.xpath("/html/body/div[5]/div[2]/div[%d]/div[1]/span"%(i+3))[0].text
except:
f.write(html.xpath("/html/body/div[5]/div[2]/div[%d]/div[1]/a"%(i+3))[0].text)
f.write('\n')
f.write(html.xpath("/html/body/div[5]/div[2]/div[%d]/div[2]/span[1]"%(i+3))[0].text)
f.write('\n')
f.write(html.xpath("/html/body/div[5]/div[2]/div[%d]/div[2]/span[2]"%(i+3))[0].text)
f.write('\n')
else:
f.write(html.xpath("/html/body/div[5]/div[2]/div[%d]/div[2]/a"%(i+3))[0].text)
f.write('\n')
f.write(html.xpath("/html/body/div[5]/div[2]/div[%d]/div[3]/span[1]"%(i+3))[0].text)
f.write('\n')
f.write(html.xpath("/html/body/div[5]/div[2]/div[%d]/div[3]/span[2]"%(i+3))[0].text)
f.write('\n')