
Python Crawler: Scraping Classical Chinese Poetry

Editor: Python

Contents

Preface
I. Goal
II. Steps
- 1. Analysis
- 2. Full code
Results
Summary

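Preface

If you have a favorite poet and want all of their poems, a crawler can solve the problem: scrape every poem, save the text to a txt file, and print it out to memorize. What could be better?

Note: the main text of the article follows; the example below is for reference.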

I. Goal

We will scrape all the poems of the poet Zhang Ruoxu (張若虛), along with his biography.

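II. Steps

1. Analysis

First, get the poet's information from the author page. That page does not expose the full text of every poem, so use it only to collect the URL of each poem's detail page. Then crawl one level deeper, requesting each detail page in turn, to get the poem's content.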
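The two-level approach can be sketched offline as follows. This is a minimal illustration, not the script used in this article: it relies only on the standard library, and the sample HTML, its paths, and the names `LinkCollector` and `extract_detail_urls` are assumptions made up for the example. The real page markup differs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

SITE_ROOT = "https://www.shicimingju.com"

# Hypothetical, simplified stand-in for the author page (level 1).
# The hrefs are invented for illustration only.
SAMPLE_LIST_PAGE = """
<div id="main_left">
  <div class="card"><h3><a href="/example/poem_1.html">Poem A</a></h3></div>
  <div class="card"><h3><a href="/example/poem_2.html">Poem B</a></h3></div>
</div>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> that sits inside an <h3>."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False


def extract_detail_urls(list_page_html, site_root=SITE_ROOT):
    """Level 1: turn the relative links on the list page into absolute URLs."""
    collector = LinkCollector()
    collector.feed(list_page_html)
    return [urljoin(site_root, href) for href in collector.hrefs]


detail_urls = extract_detail_urls(SAMPLE_LIST_PAGE)
for detail_url in detail_urls:
    # Level 2 would request each detail page here and parse the poem text.
    print(detail_url)
```

Keeping the link extraction separate from the network requests makes it possible to check the parsing logic against saved HTML before hitting the site.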

2. Full code

The code is as follows (example):

import re
import time

import requests
from lxml import etree

# URL of the author page to scrape
base_url = "https://www.shicimingju.com/chaxun/zuozhe/04.html"
# Browser-like request headers, to get past basic anti-crawler checks
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    # "br" is left out: requests can only decode Brotli responses
    # when the optional brotli package is installed
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Referer": "https://www.shicimingju.com"
}
# Fetch the page source with requests
resp = requests.get(url=base_url, headers=headers)
# Decode as UTF-8, the same as for the detail pages further down
resp.encoding = 'utf-8'
# Parse the page so it can be queried with XPath
html = etree.HTML(resp.text)
# Locate the author's name with XPath
author_name = html.xpath('//*[@id="main_right"]/div[1]/div[2]/div[1]/h4/a/text()')[0]
# Regular expression that captures the author biography block
obj_introduction = re.compile(r'<div class="des">(?P<introduction>.*?)</div>', re.S)
# Run the regular expression over the page source
result_introduction = obj_introduction.finditer(resp.text)
# Author biography (stays empty if nothing matches)
author_introduction = ""
# Strip the leftover HTML tags from the matched block and keep only the text
for it in result_introduction:
    author_introduction = it.group("introduction")
    pattern = re.compile(r'<[^>]+>', re.S)
    author_introduction = pattern.sub('', author_introduction).strip()
# Locate each poem's detail-page link with XPath, for the next level of requests
poet_list = html.xpath('//*[@id="main_left"]/div[1]/div')
# Keep every second <div>, starting from index 1
poet_list = poet_list[1::2]
for poet in poet_list:
    url = poet.xpath('./div[2]/h3/a/@href')[0]
    url = "https://www.shicimingju.com" + url
    # Fetch the poem's detail page, sending the same headers as for the author page
    resp_poet = requests.get(url=url, headers=headers)
    resp_poet.encoding = 'utf-8'
    # Parse the page so it can be queried with XPath
    html_child = etree.HTML(resp_poet.text)
    # Locate the poem's title with XPath
    poet_name = html_child.xpath('//*[@id="zs_title"]/text()')[0]
    # Regular expression that captures the poem's content block
    obj_content = re.compile(r'<div class="item_content" id="zs_content">(?P<poetry_content>.*?)</div>', re.S)
    # Run the regular expression over the detail-page source
    result_content = obj_content.finditer(resp_poet.text)
    poetry_content = ""
    # Strip the HTML tags from the match and store the text in poetry_content
    for it in result_content:
        poetry_content = it.group("poetry_content")
        pattern = re.compile(r'<[^>]+>', re.S)
        poetry_content = pattern.sub('', poetry_content).strip()
    # Append the record to poet.txt and echo it to the console
    with open('poet.txt', 'a', encoding='utf-8') as file:
        file.write("Author: " + author_name + "\nBiography: " + author_introduction + "\nTitle: " + poet_name + "\nContent: " + poetry_content + "\n")
    print("Author: " + author_name + "\nBiography: " + author_introduction + "\nTitle: " + poet_name + "\nContent: " + poetry_content + "\n")
    # Pause between requests to avoid hammering the site
    time.sleep(1)
print("Done!")
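The listing above repeats the same match-then-strip-tags step for the biography and for each poem. As a minimal sketch, that step could be pulled into one helper. `extract_text`, `TAG_RE`, `CONTENT_RE`, and the sample HTML are illustrative names and data, not part of the original script. Two differences are worth noting: this version takes the first match, whereas the loop above keeps the last one, and it additionally decodes HTML entities with `html.unescape` and drops blank lines.

```python
import html
import re

# Same tag-stripping pattern as in the script above
TAG_RE = re.compile(r"<[^>]+>")
# Same content pattern as in the script above
CONTENT_RE = re.compile(
    r'<div class="item_content" id="zs_content">(?P<poetry_content>.*?)</div>',
    re.S,
)


def extract_text(page_source, block_re, group_name):
    """Return the plain text of the first block matched by block_re, or ""."""
    match = block_re.search(page_source)
    if match is None:
        return ""
    fragment = match.group(group_name)
    text = TAG_RE.sub("", fragment)   # remove leftover HTML tags
    text = html.unescape(text)        # turn entities such as &amp; into characters
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# Made-up sample markup, for illustration only
SAMPLE_DETAIL_PAGE = """
<div class="item_content" id="zs_content">
  first line,<br>
  second line &amp; more.<br>
</div>
"""

poem_text = extract_text(SAMPLE_DETAIL_PAGE, CONTENT_RE, "poetry_content")
print(poem_text)
```

Returning an empty string when nothing matches mirrors the script's behaviour of leaving `poetry_content` empty, so a changed page layout produces a blank field rather than an exception.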