I've been following a few web novels lately, and whenever I had a spare moment I would check whether they had updated, so I wrote a script to do the checking for me.
The idea
First, find the website you usually read novels on, crawl the novel's table of contents, and check whether a new chapter has appeared. If it has, send yourself a notification email; otherwise keep polling.
Implementation
The website I use is this one (a pirated novel site; if you can, please support the original works).
1. First, search for your novel in the site's search bar to confirm it is available.
2. Look closely at the URL and you will find that every novel has an id in a fixed format, such as 15_15698 or 12_12366.
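That id can be pulled straight out of the URL with a regular expression. A minimal sketch (the example URL is made up to match the pattern described above):

```python
import re

def extract_novel_id(url):
    """Pull the digits_digits id out of a novel's contents-page URL."""
    match = re.search(r'\d+_\d+', url)
    return match.group(0) if match else None

# Hypothetical URL following the site's pattern
print(extract_novel_id('https://www.78zw.com/15_15698/'))  # -> 15_15698
```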
3. Extract the latest chapter number and compare it with the last chapter you have read; if it is greater, the novel has been updated.
4. Extract the chapter content and send it by email (there are plenty of tutorials online on how to send mail).
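Step 3 boils down to a regex comparison: grab the first number out of the chapter title and compare it with the last chapter number you have seen. A sketch (the sample title is invented):

```python
import re

def get_chapter_num(title):
    """Extract the first number appearing in a chapter title."""
    match = re.search(r'\d+', title)
    return int(match.group(0)) if match else None

last_seen = 1000
latest = get_chapter_num('Chapter 1003 The story continues')
if latest is not None and latest > last_seen:
    print('update found')
```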
If you read on the same site, you can use my source code with only minor changes; the full code is attached at the end of the article.
Running the script
Since the script's job is to keep checking whether the novel has updated, it has to stay running, so a cloud server (ECS) works best; after all, my own computer has to be shut down eventually. I run the code on my server. On Linux, an example run command is: nohup python3 -u listen.py -a 12_12585 -n 1000 --email xxx@qq.com > novel.log &
Command breakdown
nohup keeps the program from being killed when you disconnect from the remote terminal (ps aux shows the running processes; try it yourself)
> novel.log & runs the process in the background and redirects its output into the novel.log file
-a -n --email are the script's input parameters
python3 -u listen.py runs the listen.py script under Python 3; -u forces unbuffered output so it reaches the file in real time, otherwise novel.log may stay empty
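Those flags are parsed with the standard getopt module, the same way the full script below does it. A minimal sketch with made-up argument values:

```python
import getopt

# Parse the same flags listen.py accepts; the values here are invented
argv = ['-a', '12_12585', '-n', '1000', '--email', 'xxx@qq.com']
opts, args = getopt.getopt(argv, 'a:n:b:e:', ['email='])
params = dict(opts)
print(params)  # {'-a': '12_12585', '-n': '1000', '--email': 'xxx@qq.com'}
```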
(I was planning to write a website with Django so readers without a server could use the notification script too, but I got only halfway through and haven't had time to finish. If you don't have a server, run it on your own computer or find another way.)
I used to parse pages with the BeautifulSoup library, but while job hunting I found that interviewers often asked whether I knew XPath, so this time I used XPath instead.
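For readers coming from BeautifulSoup, here is what the XPath queries in the script do, run against a made-up HTML fragment that mimics the site's chapter-list layout (the real page structure may differ):

```python
from lxml import etree

# Invented fragment imitating the contents-page layout the script scrapes
html = etree.HTML('''
<div id="list">
  <dl>
    <dd><a href="/12_12585/1003.html">Chapter 1003</a></dd>
    <dd><a href="/12_12585/1002.html">Chapter 1002</a></dd>
  </dl>
</div>
''')

# dd[1] selects the first <dd>; text() and @href pull the title and link
titles = html.xpath('//div[@id="list"]//dd[1]/a/text()')
hrefs = html.xpath('//div[@id="list"]//dd[1]/a/@href')
print(titles[0], hrefs[0])  # -> Chapter 1003 /12_12585/1003.html
```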
Along the way you will also pick up command-line argument parsing and the nohup and ps commands. My blog posts are fairly basic (never plagiarized, strictly original), but why not learn something on top of the entertainment.
import smtplib
from email.mime.text import MIMEText   # used to build the mail body
from email.header import Header        # used to build the mail headers
from lxml import etree
import requests
import re
import time
import getopt
import sys
def send_mail(content):
    # Sender information: email address and QQ mail authorization code
    from_addr = 'XXXXXX'
    password = 'XXXXXXX'
    # Sending server
    smtp_server = 'smtp.qq.com'
    # Mail body: content, format ('plain' for plain text), encoding
    msg = MIMEText(content, 'plain', 'utf-8')
    # Header fields (to_addr is the global set in __main__)
    msg['From'] = Header(from_addr)
    msg['To'] = Header(to_addr)
    msg['Subject'] = Header('Novel updated')
    # Open the mail service over an encrypted (SSL) connection
    server = smtplib.SMTP_SSL(host=smtp_server)
    server.connect(smtp_server, 465)
    # Log in to the mailbox
    server.login(from_addr, password)
    # Send the mail
    server.sendmail(from_addr, to_addr, msg.as_string())
    # Close the connection
    server.quit()
# Crawl the chapter contents
def get_content(urls):
    global new_art
    new_art = new_art + len(urls)
    all_content = ''
    # The urls were collected newest-first, so iterate in reverse order
    for detail in reversed(urls):
        # Network hiccups sometimes make requests fail; pause before retrying
        while True:
            url = host + detail
            try:
                res = requests.get(url=url, headers=headers)
            except requests.RequestException:
                time.sleep(60)
            else:
                break
        res.encoding = 'utf-8'
        html = etree.HTML(res.text)
        title = html.xpath('//div[@class="bookname"]/h1/text()')
        all_content = all_content + title[0] + '\n' + 'Link: ' + url + '\n'
        content = html.xpath('//div[@id="content"]/text()')
        # Strip unwanted characters
        content = re.sub(r'[(\\u3000)\']', '', str(content), count=0, flags=re.M | re.I)
        all_content = all_content + content + '\n'
    return all_content
# Get the chapter number from a chapter title
def get_new_art_num(text):
    art_num = re.search(r'\d+', text[0])
    return int(art_num.group(0))
# Fetch the latest five chapters from the contents page and check for updates
def get_url(a_id):
    url = host + a_id + '/'
    res = requests.get(url=url, headers=headers)   # named res so it does not shadow the re module
    res.encoding = 'utf-8'
    tree = etree.HTML(res.text)
    new_art_url = []
    # Look at the latest five chapters only;
    # a normal novel rarely updates more than five chapters a day - adjust if yours does
    for i in range(1, 6):
        i = str(i)
        art_title = tree.xpath('//div[@id="list"]//dd[' + i + ']/a/text()')
        art_url = tree.xpath('//div[@id="list"]//dd[' + i + ']/a/@href')
        if get_new_art_num(art_title) <= new_art:
            if len(new_art_url) == 0:
                return 0
            else:
                return new_art_url
        else:
            new_art_url.append(art_url[0])
    # All five chapters are new
    return new_art_url
def useage():
    print(
        '''
        Usage:
        -a      the novel's id (required)
        -n      the latest chapter number you have read (required)
        -b      hour of day at which checking starts
        -e      hour of day at which checking ends
        --email address that receives the notification (required)
        ''')
if __name__ == '__main__':
    if len(sys.argv) < 2:
        useage()
        sys.exit(3)
    host = 'https://www.78zw.com/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
        'host': 'www.78zw.com',
    }
    # Defaults used when -b/-e are not given.
    # My novels usually update between 1 p.m. and 7 p.m.;
    # if nothing has appeared by seven, I assume there is no update today.
    begin_time = 13
    end_time = 19
    # The recipient's address and the other required parameters
    a_id = new_art = to_addr = None
    try:
        opts, args = getopt.getopt(sys.argv[1:], 'a:n:b:e:', ['email='])
        for name, value in opts:
            if name == '-a':
                a_id = value
            if name == '-n':
                new_art = int(value)
            if name == '-b':
                begin_time = int(value)
            if name == '-e':
                end_time = int(value)
            if name == '--email':
                if re.match(r'[\d\w_-]+@[\d\w_-]+(\.[\d\w_-]+)+', value) is None:
                    print('Invalid email address format')
                    sys.exit(3)
                to_addr = value
    except getopt.GetoptError:
        useage()
        sys.exit(3)
    if not (a_id and new_art and to_addr):
        print('Missing required parameters')
        useage()
        sys.exit(3)
    while True:
        if begin_time <= int(time.strftime('%H', time.localtime(time.time()))) < end_time:
            urls = get_url(a_id)
            if urls == 0:
                print(time.strftime('%m-%d %H:%M', time.localtime(time.time())), ': no update')
                # Crawl every ten minutes.
                # You can change the frequency, but do not make the interval too short:
                # it stresses the server and is likely to trigger anti-crawling measures.
                time.sleep(60 * 10)
                continue
            # If there is an update, mail it out, remember the latest chapter seen,
            # and stop crawling for the rest of the day
            send_mail(get_content(urls))
            sleep_time = begin_time + 24 - int(time.strftime('%H', time.localtime(time.time())))
            print(time.strftime('%m-%d %H:%M', time.localtime(time.time())), 'Updated today, the program is sleeping...')
            time.sleep(sleep_time * 60 * 60)
        else:
            print(time.strftime('%m-%d %H:%M', time.localtime(time.time())), 'Not in the update window yet, sleeping...')
            time.sleep(60 * 60)