Some websites go through a long chain of redirects before reaching the real resource page, and the crawler then fails with requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
In that case you can turn off automatic redirects, check whether the response status is 301 or 302, and follow the redirect manually.
Reference: Python Requests: solving the TooManyRedirects problem.
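A minimal sketch of the manual-redirect approach (the starting URL is a hypothetical placeholder):

import requests

url = "https://example.com/start"  # hypothetical starting URL
session = requests.Session()
while True:
    # disable automatic redirects so we can see the 301/302 ourselves
    res = session.get(url, allow_redirects=False)
    if res.status_code in (301, 302):
        url = res.headers["Location"]  # follow the redirect manually
    else:
        break
print(res.status_code, res.url)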
After handling the redirects manually, I ran into another problem: asynchronous loading.
The only thing the crawler could fetch was a "Loading" placeholder, with no actual content.
The problem page was on a site often used for crawling practice; that loading screen was all the crawl ever returned.
Inspecting the page shows that the data is loaded asynchronously by JS scripts run in the browser; the content only appears once a browser has rendered the page.
I found two solutions: one is to analyze the source code and reproduce the rendering yourself; the other is to use requests-html, which provides render() for rendering and asynchronous HTML parsing.
requests-html looks fine on the surface and seems to have everything. It is even by the same author as requests. But while requests is still updated today, requests-html hasn't been updated in two years. Looks like a trap?
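For reference, the requests-html route looks roughly like this (a sketch I did not pursue, given the staleness concern above; the selector matches the Douban markup used later in this post):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://movie.douban.com/subject_search?search_text=A bouquet of love")
r.html.render()  # downloads Chromium on first use, then runs the page's JS
print(r.html.find("div.detail", first=True).text)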
So, rather than fighting the browser, why not let one do the rendering for us?
I silently picked up Selenium, which I hadn't touched in a long time.
Sure enough, however you try to work around it, you can't escape writing browser automation.
My versions:
- requests 2.27.1
- selenium 4.2.0
- Edge 102.0.1245.44
First, using Selenium requires downloading a browser driver; see the Selenium official website.
Pick the one for your favorite browser and you're done. I chose Edge.
Be careful: the driver version must match the browser version installed on your machine.
I ran into the following problems when using Selenium:
- an environment error, fixed by running pip3 install --upgrade requests;
- the call edge.find_elements(by="//div") is wrong; the right form is edge.find_elements(by='xpath', value="//div").
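As an aside, an equivalent and more idiomatic Selenium 4 spelling uses the By constants:

from selenium.webdriver.common.by import By

edge.find_elements(By.XPATH, "//div")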
A super simple working Selenium program follows:
from selenium import webdriver
from selenium.webdriver.edge.service import Service
import requests
from time import sleep

s = Service('./edgedriver_win64/msedgedriver.exe')
edge = webdriver.Edge(service=s)
edge.get("https://movie.douban.com/subject_search?search_text=A bouquet of love")

# requests library: build a session
session = requests.Session()
# get the cookies from the browser
cookies = edge.get_cookies()
# copy them into the session
for cookie in cookies:
    session.cookies.set(cookie['name'], cookie['value'])

# give the page a little time to load, or the find below reports no elements
sleep(2)
detail = edge.find_elements(by='xpath', value="//div[@class='detail']")
for i in range(0, len(detail)):
    title = detail[i].find_element(by='xpath', value="./div[@class='title']")
    rating_nums = detail[i].find_elements(by='xpath', value="./div/span[@class='rating_nums']")
    title_a = detail[i].find_element(by='xpath', value="./div[@class='title']/a").get_property('href')
    if len(rating_nums) == 0:
        print(i, title.text, 'no rating', title_a)
    else:
        print(i, title.text, rating_nums[0].text, title_a)
    if i >= 5:  # show at most 6 results
        break
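The session built above isn't actually used in this snippet; the point is handing the browser's cookies over to requests so that later requests-based fetches carry the same state. A hedged sketch of that handoff (the User-Agent value is an assumption; copy your browser's real one):

# hypothetical follow-up: reuse the browser's cookies in plain requests
headers = {'User-Agent': 'Mozilla/5.0'}  # assumed UA; match your browser's
r = session.get("https://movie.douban.com/", headers=headers)
print(r.status_code)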
I packed the commonly used xpath- and selenium-related code into the program above as much as I could.
As a learning aid, it should be a decent example.
For example, get_property, which reads a property from an element.
For example, the "directory" levels in xpath, like ./div[@class='title'] (the hierarchy really works like a file-system path; look at it a few times and you'll see how simple it is).
For example, the difference between find_elements and find_element.
Be careful: when find_element cannot find the element, it throws an exception (a NoSuchElementException) instead of returning empty. So when you are not sure the element exists, you can only use find_elements.
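A small demonstration of that difference, assuming the edge driver from the program above (the class name is a deliberately non-existent placeholder):

from selenium.common.exceptions import NoSuchElementException

# find_element raises when nothing matches
try:
    edge.find_element(by='xpath', value="//div[@class='no-such-class']")
except NoSuchElementException:
    print("not found")

# find_elements returns an empty list instead
els = edge.find_elements(by='xpath', value="//div[@class='no-such-class']")
print(len(els))  # 0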
Running it prints the first six search results, each with its title, rating (or "no rating"), and link.
Selenium's xpath syntax is not as concise as lxml's; I have always used lxml to parse web pages.
selenium page → page source text → parse with lxml: easy to do;
page text → selenium element objects: hard to do.
So my advice is: keep using the lxml library for parsing page text. It is more general, and the code is much shorter.
Compare selenium and lxml (selenium above, lxml below):
detail = edge.find_elements(by='xpath', value="//div[@class='detail']")
for i in range(0, len(detail)):
    title = detail[i].find_element(by='xpath', value="./div[@class='title']")
    rating_nums = detail[i].find_elements(by='xpath', value="./div/span[@class='rating_nums']")
    title_a = detail[i].find_element(by='xpath', value="./div[@class='title']/a").get_property('href')
    if len(rating_nums) == 0:
        print(i, title.text, 'no rating', title_a)
    else:
        print(i, title.text, rating_nums[0].text, title_a)
    if i >= 5:  # show at most 6 results
        break
Here is lxml:
from lxml import etree

html = etree.HTML(edge.page_source)  # that's right, one extra line converts it for lxml
detail = html.xpath("//div[@class='detail']")
for i in range(0, len(detail)):
    title = detail[i].xpath("./div[@class='title']/a/text()")
    rating_nums = detail[i].xpath("./div/span[@class='rating_nums']/text()")
    title_a = detail[i].xpath("./div[@class='title']/a/@href")
    if len(rating_nums) == 0:
        print(i, title[0], 'no rating', title_a[0])
    else:
        print(i, title[0], rating_nums[0], title_a[0])
    if i >= 5:  # show at most 6 results
        break
The output is the same as the program above.
The earlier program reads the page content directly through selenium, using find_elements and friends. Every read means opening the page in the browser and waiting for it to load, which is actually quite tedious.
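If one snapshot of the rendered page is enough, a cheap pattern (a sketch, assuming the edge driver from earlier) is to read page_source once, close the browser, and parse offline:

from lxml import etree

html_text = edge.page_source  # grab the rendered HTML once
edge.quit()                   # the browser is no longer needed
html = etree.HTML(html_text)  # parse as often as you like, no page loads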
For pages that are not loaded asynchronously, one url is all you need, and you can simply request it with requests.
There are several ways to make the request: requests.get, requests.post, session.get, and so on.
The first two are the most common, but I later found that session has fewer quirks: when a page redirects repeatedly, a session request reportedly does not lose the headers content.
Using a session only takes one extra step. For example:

import requests

headers = {
    # fill in your headers here, e.g. User-Agent and Cookie
}
sessions = requests.session()  # the one extra step
sessions.headers = headers
# url is your target address; passing headers= again is redundant once
# sessions.headers is set, but harmless
r = sessions.get(url, headers=headers)
I wrote a general function for requesting data. It returns res together with the lxml-parsed html; take it if you need it:
import requests
from lxml import etree
from time import sleep

session = requests.Session()
headers = {}  # fill in your own headers (User-Agent, Cookie, ...)

def getData(url):
    while True:
        # disable automatic redirects, otherwise requests follows them
        # itself and the 301/302 branch below is never reached
        res = session.get(url, headers=headers, allow_redirects=False)
        if res.status_code == requests.codes.ok:
            content = res.text
            html = etree.HTML(content)
            return res, html
        elif res.status_code == 302 or res.status_code == 301:
            print(url + " --redirects to--> " + res.headers["Location"])
            url = res.headers["Location"]
        else:
            print("Something's wrong!", res.status_code)
            cookieChange = int(input("Replace the Cookie? (enter 1 or 0): "))
            if cookieChange:
                headers['Cookie'] = input("Please enter the Cookie: ")
            sleep(5)
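A quick usage sketch (the URL is a hypothetical placeholder):

res, html = getData("https://example.com/page")  # hypothetical URL
print(res.status_code)
print(html.xpath("//title/text()"))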