I randomly found a website to crawl. Our goals are:
1. Use crawler knowledge such as requests, re and xpath to scrape the news from this official site; for each article we want five attributes: the news headline, release time, news link, reading count and news source.
2. Put the data we crawl into a csv file!
So let's start!
Tips: crawlers must not be used for illegal activities. Set a sleep interval while crawling and do not crawl too aggressively; if you bring the server down you can be held legally liable!!!
Our goal is to crawl the news data of https://www.cqwu.edu.cn/channel_23133_0310.html
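Before writing the whole crawler, it is worth a quick check that the listing page can actually be fetched. The snippet below is only a minimal sketch (the shortened User-Agent string and the timeout value are just example choices); it requests the page and prints a couple of sanity checks:

import requests

url = "https://www.cqwu.edu.cn/channel_23133_0310.html"
headers = {"User-Agent": "Mozilla/5.0"}   # any common browser UA string works as a placeholder
resp = requests.get(url, headers=headers, timeout=10)
print(resp.status_code)                   # expect 200 if the page is reachable
resp.encoding = resp.apparent_encoding    # make sure the text decodes correctly
print(len(resp.text))                     # rough check that we actually got HTML back

If this prints 200 and a non-trivial length, the full crawler below can take over.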
import re
import time
import requests
from lxml import etree
import csv
# URL to crawl
base_url = "https://www.cqwu.edu.cn/channel_23133_0310.html"
# Headers to get past basic anti-crawling checks
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9"
}
# Fetch the listing page
resp = requests.get(url=base_url, headers=headers)
# Parse the HTML with etree
html = etree.HTML(resp.text)
# xpath to locate the list of news items
news_list = html.xpath("/html/body/div/div[3]/div/div/div[2]/div/ul/li")
data_list = []
# Loop through the news list
for news in news_list:
    # Get the link of this news item
    news_url = news.xpath("./a/@href")[0]
    # Fetch the detail page of the news item
    news_resp = requests.get(url=news_url)
    # Parse the detail page HTML
    news_html = etree.HTML(news_resp.text)
    # xpath to locate the title on the detail page
    news_title = news_html.xpath("/html/body/div/div[3]/div[1]/div/div[2]/div/div/h4/text()")[0]
    # Regex to grab the release time and the source
    # (the literal labels must match the text that actually appears on the detail page)
    time_refer_obj = re.compile(r'<div class="news-date">.*?Release time:(?P<time>.*?)Browse:.*?Source:(?P<refer>.*?)</div>', re.S)
    result = time_refer_obj.finditer(news_resp.text)
    for it in result:
        # Assign the date and the source
        news_time = it.group("time")
        news_refer = it.group("refer").strip()
    # Regex to grab the URL of the view-counter script
    count_obj = re.compile(r"Browse:<Script Language='Javascript' src='(?P<count_url>.*?)'>", re.S)
    result = count_obj.finditer(news_resp.text)
    for it in result:
        count_url = "https://www.cqwu.edu.cn" + it.group("count_url")
        count_resp = requests.get(url=count_url)
        # The counter script wraps the count in single quotes, so take the text between them
        news_read = count_resp.text.split("'")[1]
    # Put the scraped fields into a dictionary
    data = {}
    data['News headline'] = news_title
    data['Release time'] = news_time
    data['News link'] = news_url
    data['Reading count'] = news_read
    data['News source'] = news_refer
    # Append the dictionary to the list so it can be written to csv later
    data_list.append(data)
    # Sleep for a second so we do not hammer the server
    time.sleep(1)
    print(data)
# 1. Open the output file; encoding='utf-8' sets the encoding and newline='' prevents blank lines between rows
f = open('news.csv', 'w', encoding='utf-8', newline='')
# 2. Create a csv writer on top of the file object
csv_write = csv.writer(f)
# 3. Write the header row
csv_write.writerow(['News headline', 'Release time', 'News link', 'Reading count', 'News source'])
for data in data_list:
    # 4. Write one row per news item
    csv_write.writerow([data['News headline'], data['Release time'], data['News link'], data['Reading count'], data['News source']])
f.close()
print("Crawling finished!")
Below is the output of the program as it runs.
And this is the news.csv file that the program produces.
The basic steps of a crawler:
1. Check whether the site has anti-crawling measures and set the usual counter-measures; the User-Agent and Referer headers are the most common ones.
2. Use xpath and re to locate and extract the data you want.
3. Use the csv library to write the data into a csv file (an alternative write-out with csv.DictWriter is sketched below).
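Since each news item is already collected as a dictionary, the write-out step could also use csv.DictWriter instead of csv.writer. The following is only an alternative sketch of the same step, reusing the data_list and the dictionary keys from the script above:

import csv

# data_list is assumed to be the list of dictionaries built by the crawler above
fieldnames = ['News headline', 'Release time', 'News link', 'Reading count', 'News source']
with open('news.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()          # header row
    writer.writerows(data_list)   # one row per news item

The with statement also closes the file automatically, so no explicit f.close() is needed.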