Python multi-process: batch-crawl videos by title, automatically resolve the segment count and the .m3u8 link, and store the results by title

Editor: Python

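Note: this post is not a tutorial. It contains only code and a few comments, written for my own reference. I have been learning Python for just three days, so please be kind.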

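Features:

1. Parse the resources on the home page

2. Parse the resource page link behind each title on the home page

3. Automatically determine the number of .ts segments for each resource link

4. Automatically resolve the .m3u8 playlist for each resource (used to work out the segment count)

5. Store the downloads in folders named after the titles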
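
Features 3 and 4 come down to reading the .m3u8 playlist as text. The following is a minimal sketch, assuming a simple media playlist in which every line that does not start with `#` is a segment URI relative to the playlist URL. The helper name `list_segments`, the sample playlist, and the `example.com` URL are made up for illustration and are not taken from the target site.

```python
from urllib.parse import urljoin


def list_segments(playlist_text, playlist_url):
    """Return absolute URLs of the .ts segments named in an m3u8 playlist.

    Assumption: a simple media playlist where each line that does not
    start with '#' is a segment URI, relative to the playlist URL.
    """
    segments = []
    for line in playlist_text.splitlines():
        line = line.strip()
        # Lines starting with '#' are tags or comments, not segments.
        if not line or line.startswith("#"):
            continue
        # Ignore a possible query string when checking the extension.
        if line.split("?")[0].endswith(".ts"):
            segments.append(urljoin(playlist_url, line))
    return segments


# Hypothetical playlist, for illustration only.
sample = """#EXTM3U
#EXT-X-TARGETDURATION:10
#EXTINF:10.0,
0000.ts
#EXTINF:10.0,
0001.ts
#EXT-X-ENDLIST
"""
segments = list_segments(sample, "https://example.com/videos/abc/index.m3u8")
print(len(segments))   # 2
print(segments[0])     # https://example.com/videos/abc/0000.ts
```

Listing the segment lines is a little more robust than counting substrings in the playlist text, because it also yields the real segment names instead of assuming they are numbered consecutively.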

The code only illustrates one way to implement the crawler; it cannot be reused as-is.

Code:

import os
import re
import requests
from bs4 import BeautifulSoup
# Pool is imported for multi-process use; the listing below does not call it
from multiprocessing import Pool
# Request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"
}
# Prefix of the m3u8 file links (all links share it, so it is factored out)
m3u8path = "https://xxx.xxx.com/"
# Prefix of every resource link (all links share it, so it is factored out)
repath = "https://xxx.xxx.de/"
# Request the home page; its HTML is parsed below to get all resource page links
response = requests.get("https://www.xxx.xx/xxx.html", headers=headers)
# Set the character encoding
response.encoding = "utf-8"
# HTML of the home page
syHtml = response.text
soup = BeautifulSoup(syHtml, "lxml")
# Parse the home page HTML (a CSS selector locates the elements)
resourceList = soup.select("a[class='video-pic loading']")
# Collect the href attribute of every matched element in the matrix list
matrix = [a.get("href") for a in resourceList]
for url in matrix:
    response1 = requests.get(repath + url, headers=headers)
    response1.encoding = "utf-8"
    # HTML of this resource page
    rehtml = response1.text
    # Parse the m3u8 link out of the resource page
    soup1 = BeautifulSoup(rehtml, "lxml")
    reinfo = soup1.select_one("#vpath")
    # Title of the video, with whitespace collapsed; used as the folder name
    title = " ".join(soup1.select_one(".player_title>h1").text.split())
    # Request the m3u8 playlist and count the .ts segments
    m3u8Url = m3u8path + reinfo.text.strip()
    response2 = requests.get(m3u8Url, headers=headers)
    response2.encoding = "utf-8"
    # Count ".ts" rather than "ts" so unrelated text is less likely to match
    count = response2.text.count(".ts")
    # URL prefix of the segments: everything up to and including the last "/"
    videoUrlPre = m3u8Url[:m3u8Url.rindex("/") + 1]
    # Create the folder for this title once
    saveDir = "D:\\爬蟲\\" + title
    os.makedirs(saveDir, exist_ok=True)
    # Download every segment of the current resource
    for i in range(0, count):
        videoUrl = videoUrlPre + "%04d.ts" % i
        response3 = requests.get(videoUrl, headers=headers)
        print("Writing segment " + str(i) + " of resource " + url)
        # The last 7 characters of the URL are the file name, e.g. 0000.ts
        with open(os.path.join(saveDir, videoUrl[-7:]), "wb") as file:
            # Write the segment itself (response3), not the home page response
            file.write(response3.content)
        print("Finished writing segment " + str(i) + "\n")
print("All videos have been downloaded!")
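
The title promises multiple processes, but the listing above fetches the resources one after another. One possible way to spread the per-resource work over worker processes is sketched below. `safe_dirname`, `segment_urls`, and `crawl_all` are hypothetical helpers introduced only for this illustration, and the `example.com` URL is a placeholder. The sketch assumes segments are named 0000.ts, 0001.ts, and so on, as in the listing.

```python
import re
from multiprocessing import Pool


def safe_dirname(title):
    """Collapse whitespace and replace characters Windows forbids in names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title)
    return " ".join(cleaned.split())


def segment_urls(prefix, count):
    """Build segment URLs, assuming the names 0000.ts, 0001.ts, ..."""
    return [prefix + "%04d.ts" % i for i in range(count)]


def crawl_all(resource_urls, worker, processes=4):
    """Run worker(url) for every resource URL in a pool of processes.

    worker must be a module-level function so that it can be pickled
    and sent to the worker processes.
    """
    with Pool(processes=processes) as pool:
        return pool.map(worker, resource_urls)


print(safe_dirname("  My: Video?  01 "))          # My_ Video_ 01
print(segment_urls("https://example.com/v/", 2))
```

Here `worker` would be a function wrapping the body of the `for url in matrix` loop, so that each process handles one resource page. The call `crawl_all(matrix, worker)` should sit under `if __name__ == "__main__":`, which `multiprocessing` requires on Windows.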