As of this writing, the web version of Kingsoft Docs (Jinshan document, kdocs.cn) requires a paid membership if you want to download two or more files at once. The natural idea is to write a crawler that downloads the files one by one to achieve batch downloading.
Packet capture shows this is not complicated: select any file, click download, and watch the XHR requests — the following key packet can be captured:
Figure 1: Response of the single-file download request
The url field in the response on the right of the figure above is the file's download address. Note that the download address itself does not require login; it can be accessed and downloaded in any state (but it does expire after a while).
So this looks simple. Let's look at the request fields of this packet:
Figure 2: Request fields of the single-file download request
As marked in the figure, the request URL above contains only two variable fields: groupid and fileid. The groupid is easy to obtain — it can be read directly from the current page URL. There are several ways to get fileid; in fact, a <script> tag in the current page's source stores the fileid of every file (inside the variable window.__API_CACHED__, which you can also inspect by typing window.__API_CACHED__ in the console):
Figure 3: window.__API_CACHED__ in the page source
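If you are already driving the page with selenium (as the script later in this post does), a minimal sketch of reading this variable directly — my own shortcut, not part of the original workflow — would be:

# Assumes `driver` is a selenium WebDriver already showing the Kingsoft Docs folder page.
api_cached = driver.execute_script('return window.__API_CACHED__;')
print(type(api_cached))  # selenium converts the JS object into nested Python dicts/lists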
This window.__API_CACHED__ is quite long; here is the complete content extracted from that <script> tag:
Figure 4: Complete data structure of window.__API_CACHED__
As shown in the figure above, the information of every document (including its fileid) is stored in the field value marked by the red box (collapsed in the screenshot).
If digging fileid out of the page source feels too troublesome, you can instead call the API endpoint shown in the red box (it can also be found via packet capture):
https://drive.kdocs.cn/api/v5/groups/{group_id}/files?include=acl,pic_thumbnail&with_link=true&offset=0&count={count}
where {group_id} is as described above and {count} is the number of files to fetch — generally just set it to however many files the folder contains. Note that this interface does require a logged-in state. The data it returns is shown in the figure below (file names redacted; the three red boxes mark groupid, the file name, and fileid):
Figure 5: All file information returned by the interface
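Purely as an illustration of filling in that URL template (the group_id here is the same example value used in the full script below; count is just a guess at the folder size):

group_id = 1841861380   # read from the current page URL
count = 50              # at least the number of files in the folder
list_url = (f'https://drive.kdocs.cn/api/v5/groups/{group_id}/files'
            f'?include=acl,pic_thumbnail&with_link=true&offset=0&count={count}')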
With the groupid and fileid of every document in hand, the problem seems solved. But since this was worth writing up as a blog post, it is of course not that simple.
As mentioned in the preface, the key request is the one shown in Figure 2:
Figure 2: Request fields of the single-file download request (repeated)
Based on past experience, although this request must be made while logged in, in theory attaching all the request headers shown in the lower-right box should be enough to get a Figure 1-style response (i.e. the file download address). Resending the request from the browser still returns the Figure 1 result, which proves the request is not burn-after-reading. But simply replaying it with the requests library does not yield the same response:
# Convert the Cookie list returned by selenium's driver.get_cookies() into a Cookie header string
def cookie_to_string(cookies: list) -> str:
    string = ''
    for cookie in cookies:
        string += '{}={}; '.format(cookie['name'], cookie['value'])
    return string.strip()

# Convert a request-header string (as copied from the browser) into a dict for requests' headers argument
def headers_to_dict(headers: str) -> dict:
    lines = headers.splitlines()
    headers_dict = {}
    for line in lines:
        key, value = line.strip().split(':', 1)
        headers_dict[key.strip()] = value.strip()
    return headers_dict

url = f'https://drive.kdocs.cn/api/v5/groups/{group_id}/files/{file_id}/download?isblocks=false&support_checksums=md5,sha1,sha224,sha256,sha384,sha512'
cookies = driver.get_cookies()
headers_string = f"""Host: drive.kdocs.cn
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Cookie: {cookie_to_string(cookies=cookies)}
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1"""
r = requests.get(url, headers=headers_to_dict(headers=headers_string))  # fails: the expected JSON response is not returned
This seemed very strange. I tried for a long time, including switching to requests.Session, and it still did not work. Kingsoft Docs' anti-crawling is clearly effective. Judging from the crawler that eventually succeeded, the Cookie really does carry all the login information, so my guess is that Kingsoft Docs imposes some middleware restriction on the access flow, or restricts cross-origin requests — very unfriendly to crawlers either way.
But the problem still has to be solved, so we turn to the almighty selenium.
This part is explained together with the code, because there are quite a few pits — working through them is also a nice upgrade to one's crawling skills:
# -*- coding: utf-8 -*-
# @author: caoyang
# @email: [email protected]

import re
import json
import time
import requests

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

def get_download_urls(group_id=1841861380, count=50):
    # firefox_profile = webdriver.FirefoxProfile(r'C:\Users\caoyang\AppData\Roaming\Mozilla\Firefox\Profiles\sfwjk6ps.default-release')
    # driver = webdriver.Firefox(firefox_profile=firefox_profile)
    driver = webdriver.Firefox()
    driver.get('https://account.wps.cn/')  # the login page
    WebDriverWait(driver, 30).until(lambda driver: driver.find_element_by_xpath('//*[contains(text(), "VIU")]').is_displayed())  # VIU is my user name

    driver.get('https://www.kdocs.cn/latest')
    WebDriverWait(driver, 30).until(lambda driver: driver.find_element_by_xpath('//span[contains(text(), "分享")]').is_displayed())  # "分享" = the Share button

    def cookie_to_string(cookies: list) -> str:
        string = ''
        for cookie in cookies:
            string += '{}={}; '.format(cookie['name'], cookie['value'])
        return string.strip()

    def headers_to_dict(headers: str) -> dict:
        lines = headers.splitlines()
        headers_dict = {}
        for line in lines:
            key, value = line.strip().split(':', 1)
            headers_dict[key.strip()] = value.strip()
        return headers_dict

    # driver.get(f'https://drive.kdocs.cn/api/v5/groups/{group_id}/files?include=acl,pic_thumbnail&with_link=true&offset=0&count={count}')
    # time.sleep(3)
    # html = driver.page_source
    # windows = driver.window_handles
    # print(html)
    # print(len(windows))
    # print(driver.current_url)
    # https://drive.kdocs.cn/api/v5/groups/1841861380/files?include=acl,pic_thumbnail&with_link=true&offset=0&count=50

    cookies = driver.get_cookies()
    headers_string = f"""Host: drive.kdocs.cn
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Cookie: {cookie_to_string(cookies=cookies)}
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1"""
    r = requests.get(f'https://drive.kdocs.cn/api/v5/groups/{group_id}/files?include=acl,pic_thumbnail&with_link=true&offset=0&count={count}', headers=headers_to_dict(headers=headers_string))
    html = r.text
    json_response = json.loads(html)

    files = json_response['files']
    print(f'Total {len(files)} files')
    download_urls = []
    filenames = []

    for file_ in files:
        group_id = file_['groupid']
        file_id = file_['id']
        filename = file_['fname']
        print(filename, group_id, file_id)
        url = f'https://drive.kdocs.cn/api/v5/groups/{group_id}/files/{file_id}/download?isblocks=false&support_checksums=md5,sha1,sha224,sha256,sha384,sha512'
        # driver.get(url)
        # time.sleep(3)
        # html = driver.page_source
        cookies = driver.get_cookies()
        headers_string = f"""Host: drive.kdocs.cn
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Cookie: {cookie_to_string(cookies=cookies)}
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1"""
        r = requests.get(url, headers=headers_to_dict(headers=headers_string))
        html = r.text
        # print(html)
        json_response = json.loads(html)
        download_url = json_response['url']
        print(download_url)
        download_urls.append(download_url)
        filenames.append(filename)

    with open('d:/download_urls.txt', 'w') as f:
        for download_url, filename in zip(download_urls, filenames):
            f.write(filename + '\t' + download_url + '\n')

    driver.quit()

get_download_urls()
I recommend copying the code above first; the discussion below is organized by the line numbers of that listing, counting the coding declaration as line 1 (do not delete the commented-out lines — those are the pits).
First, lines 14-18:
At first I wondered whether importing browser user-profile data (see my earlier blog post on using profile data in crawlers) would let me skip the Kingsoft Docs login. Note that I did not test Chrome, but with Firefox it genuinely does not work: even if I first open a window and log in to Kingsoft Docs (at which point every additional window I open is in a logged-in state) and only then start selenium with the imported profile, it still gets stuck on the login page. So lines 14-15 have to stay commented out.
Line 17 then opens the login page, and line 18 waits for the login to succeed (during the wait you can click WeChat login and scan the QR code to confirm).
Anyone who has actually used selenium knows that if time.sleep is too long, selenium tends to fall over, and manually operating on the page (clicking, scrolling, typing, and so on) also tends to break it. I used to think that once selenium starts, the browser cannot be operated manually at all; now I find that as long as I use WebDriverWait (the XPath searches for the user name — VIU is my user name), I can click to scan the QR code and log in manually, which is very convenient.
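The listing above uses the Selenium 3 style find_element_by_xpath inside a lambda; on Selenium 4 a roughly equivalent wait (my adaptation, not tested against kdocs) would look like:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Block while the user scans the QR code by hand; "VIU" is the author's user name --
# replace it with text that only appears on the page once you are logged in.
WebDriverWait(driver, 30).until(
    EC.visibility_of_element_located((By.XPATH, '//*[contains(text(), "VIU")]'))
)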
Next, lines 20-21:
This one is a real pit: if you access the line-37 interface (i.e. Figure 5) directly after logging in, it does not return the Figure 5 result — it simply tells you "the user is not logged in". You therefore have to visit the Kingsoft Docs homepage first. I suspect this restriction on the access flow is itself an anti-crawling measure.
Lines 23-35 are two utility functions: cookie_to_string converts the Cookie list returned by driver.get_cookies() (of the form [{'name': name, 'value': value}, ...]) into a Cookie string for the request header, and headers_to_dict rewrites a request-header string copied from the browser into a dictionary (for the headers parameter of requests.get).
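As a quick sanity check, here is what the two helpers produce on toy inputs (the cookie names and header values below are made up):

cookies = [{'name': 'wps_sid', 'value': 'abc123'}, {'name': 'csrf', 'value': 'xyz'}]
print(cookie_to_string(cookies))
# -> wps_sid=abc123; csrf=xyz;

raw_headers = """Host: drive.kdocs.cn
Accept: text/html"""
print(headers_to_dict(raw_headers))
# -> {'Host': 'drive.kdocs.cn', 'Accept': 'text/html'}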
Lines 37-61:
Here comes the pit. At this point, opening the Figure 5 interface via line 37 does display the Figure 5 data in the browser, yet the driver.page_source read at line 39 returns the HTML of the Kingsoft Docs homepage opened at line 20, which is maddening. Lines 40-43 are a few tests I ran: they show there really is only one window (len(windows) is 1), and that driver.current_url reports a URL that is indeed not the Kingsoft Docs homepage.
This stumped me for a long while. I went through the driver API and found no method for reading a page's JSON data. I had not noticed before that a page whose response is JSON cannot be read via driver.page_source, so I was forced to rewrite this logic with the requests library (lines 47-61).
You may object: didn't the preface say this response cannot be obtained with requests? Indeed, if lines 47-61 are used on their own to access the Figure 5 interface, what comes back is still "the user is not logged in". But here I replace the Cookie copied directly from the browser with the Cookie information returned by driver.get_cookies(), and it miraculously works. To be honest, I don't fully understand why, and I don't know how the Kingsoft Docs backend decides whether a request was made by a crawler.
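For readers who prefer not to assemble the Cookie header string by hand, an alternative packaging of the same cookie transfer (my own variant, not the author's code, and not verified against kdocs) is to push selenium's cookies into a requests.Session:

import requests

session = requests.Session()
session.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:101.0) Gecko/20100101 Firefox/101.0'
for cookie in driver.get_cookies():
    # copy each selenium cookie into the session's cookie jar
    session.cookies.set(cookie['name'], cookie['value'], domain=cookie.get('domain'))
# subsequent session.get(...) calls then send the same Cookie header automatically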
Lines 63-66: parse the Figure 5 file information.
Lines 68-99:
What happens here is essentially the same as lines 37-61: we want the Figure 1 response (also JSON data). Accessing the interface with driver again leaves driver.page_source stuck on the Kingsoft Docs homepage HTML, so the same requests rewrite is used, and the file download address (download_url, near line 96) is obtained.
Likewise, as the preface describes, accessing the Figure 2 interface directly with requests is not feasible, but after selenium has completed the login it does work.
All the file download addresses are written to d:/download_urls.txt. Since these addresses work even without logging in, the finishing step is very simple:
with open('d:/download_urls.txt', 'r') as f:
    lines = f.read().splitlines()

for line in lines:
    filename, url = line.split('\t')
    r = requests.get(url)
    with open(f'd:/{filename}', 'wb') as f:
        f.write(r.content)
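If some of the files are large, a streamed variant of this loop (a suggestion of mine, not part of the original post) avoids holding an entire file in memory:

with open('d:/download_urls.txt', 'r') as f:
    lines = f.read().splitlines()

for line in lines:
    filename, url = line.split('\t')
    r = requests.get(url, stream=True)
    with open(f'd:/{filename}', 'wb') as fout:
        for chunk in r.iter_content(chunk_size=1 << 20):  # write in 1 MiB chunks
            fout.write(chunk)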
Below is equivalent JS code. In theory you could run it directly in the console to download everything, but in practice cross-origin request errors occur, so it is still not quite usable — I would love to hear from anyone who can solve this.
let groups = "1842648021";
let count = 54;
let res = await fetch(`https://drive.kdocs.cn/api/v5/groups/${groups}/files?include=acl,pic_thumbnail&with_link=true&offset=0&count=${count}&orderby=fname&order=ASC&filter=folder`);
let files = await res.json();
files = files.files;
let urls = [];
let fid, info, url;
for (let f of files) {
    fid = f.id;
    res = await fetch(`https://drive.kdocs.cn/api/v5/groups/${groups}/files/${fid}/download?isblocks=false&support_checksums=md5,sha1,sha224,sha256,sha384,sha512`, {
        "method": "GET",
        "mode": "cors",
        "credentials": "include"
    });
    info = await res.json();
    url = info.url;
    urls.push(url);
}
console.log("Number of files to be downloaded:", urls.length);
for (let i = 0; i < urls.length; i++) {
    let url = urls[i];
    let fname = files[i].fname;
    fetch(url).then(res => res.blob().then(blob => {
        let a = document.createElement('a');
        let url = window.URL.createObjectURL(blob);
        let filename = fname;
        a.href = url;
        a.download = filename;
        a.click();
        window.URL.revokeObjectURL(url);
    }))
}
The pressing problem that remains is how to read data when selenium visits a JSON page. One rather crude idea is to use from selenium.webdriver.common.keys import Keys and simply press Ctrl+A, Ctrl+C on the page to copy the data as a string — stupid, but it seems workable.
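For what it is worth, a minimal sketch of that clipboard idea (an untested assumption on my part; it relies on the third-party pyperclip package for clipboard access) might look like:

from selenium.webdriver.common.keys import Keys
import pyperclip  # assumed extra dependency, not used in the original script

driver.get(json_api_url)                    # json_api_url: any of the JSON endpoints above (placeholder name)
body = driver.find_element_by_tag_name('body')
body.send_keys(Keys.CONTROL, 'a')           # select the whole JSON page
body.send_keys(Keys.CONTROL, 'c')           # copy it
raw_json = pyperclip.paste()                # the page's JSON as a string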
Another open question is whether there is any way to batch-download Kingsoft Docs files with requests alone, without relying on selenium — and why the failure described in the preface happens at all. It really bothers me.
All in all, mixing requests and selenium is not very elegant; I suspect someone can come up with a cleaner solution.