Contents

Preface
One. What is a coroutine?
Two. The advantages of coroutines
Three. Code analysis
1. Import libraries
2. Get links to all timelines
3. Get all album links in a timeline
4. Get all picture links in an album and the album name
5. Download and save pictures
6. The main function
7. The main entry point
Four. Complete code
Preface

When writing a crawler, efficiency is a key concern. The most common approaches are multithreading, multiprocessing, thread pools, and process pools. This article shows how to use coroutines instead to crawl the picture galleries of a target site.
One. What is a coroutine?

A coroutine is a lighter-weight construct than a thread. Just as one process can contain many threads, one thread can contain many coroutines. Most importantly, coroutines are not managed by the operating system kernel; they are controlled entirely by the program, that is, scheduled in user space.
Two. The advantages of coroutines

The main advantage is performance: switching between coroutines does not carry the cost of a thread context switch, so the overhead of a coroutine is far smaller than that of a thread. In essence, coroutines run in a single thread: when an IO operation starts, the current task is suspended, other tasks keep running, and once the IO completes, execution resumes where the original task left off, all without consuming extra system resources.
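To make this concrete, here is a minimal, self-contained sketch (separate from the crawler below): two coroutines share a single thread, and while one is suspended on await the other runs, so the total time is close to the longest task rather than the sum.

import asyncio
import time

async def fake_io(name, delay):
    # await suspends this coroutine; the event loop runs other tasks meanwhile
    await asyncio.sleep(delay)
    return f'{name} finished'

async def demo():
    start = time.perf_counter()
    # Both coroutines run concurrently in one thread:
    # total time is about 2 seconds, not 3
    results = await asyncio.gather(fake_io('task-a', 2), fake_io('task-b', 1))
    print(results, f'elapsed: {time.perf_counter() - start:.1f}s')

asyncio.run(demo())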
Three. Code analysis

1. Import libraries

import requests   # synchronous HTTP request module
import re         # regular expression module, used to extract data
import asyncio    # creates and manages the event loop
import aiofiles   # asynchronous file operations
import aiohttp    # asynchronous HTTP requests
import os         # operating system calls (folders, paths)
2. Get links to all timelines

# A synchronous function that uses the requests library
def get_date_list():
    # Target link
    url = 'https://zhaocibaidicaiyunjian.ml/'
    # Disguise the request header as a normal browser
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    # Send the request
    r = requests.get(url=url, headers=header)
    # Get the page source
    htm = r.text
    # Use the re module to pull the timeline (archive) links out of the sidebar;
    # the WordPress archive widget emits single-quoted href attributes
    results = re.findall(r'<aside id="archives-2" class="widget widget_archive">(.*?)</aside>', htm, re.S)
    date_list = re.findall(r"<a href='(.*?)'>", results[0])
    # Return the list of all timeline links
    return date_list
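To see what the two findall calls do, here is a quick standalone test on a made-up snippet of the WordPress archive widget (the real markup may differ slightly; the single-quoted href is an assumption based on WordPress defaults):

import re

sample = """<aside id="archives-2" class="widget widget_archive">
<ul>
<li><a href='https://zhaocibaidicaiyunjian.ml/2022/01/'>2022-01</a></li>
<li><a href='https://zhaocibaidicaiyunjian.ml/2021/12/'>2021-12</a></li>
</ul></aside>"""
# First narrow to the sidebar block, then pull out each href
block = re.findall(r'<aside id="archives-2" class="widget widget_archive">(.*?)</aside>', sample, re.S)[0]
print(re.findall(r"<a href='(.*?)'>", block))
# -> ['https://zhaocibaidicaiyunjian.ml/2022/01/', 'https://zhaocibaidicaiyunjian.ml/2021/12/']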
3. Get all album links in a timeline

# Check whether a timeline page has a next page
# (called inside the coroutine get_album_urls below)
def hasnext(Responsetext):
    # Match on the stable class attribute instead of the (language-dependent) link text
    nextpage = re.findall(r'<a class="next page-numbers" href="(.*?)">', Responsetext)
    if nextpage:
        return nextpage[0]
    return None
# async defines a coroutine function; the parameter is one timeline link
async def get_album_urls(date_url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    # Create an asynchronous session object inside a context manager
    async with aiohttp.ClientSession() as session:
        # session.get() is used much like requests.get(). (Note: in requests, the
        # proxy parameter is a dict, proxies=dict, and supports both http and https;
        # in an aiohttp session, get() takes a string, proxy=str, and only supports
        # http proxies, not https)
        async with session.get(url=date_url, headers=header) as Response:
            # Fetch the page source; await suspends this coroutine on network IO,
            # other tasks run in the meantime, and execution resumes here once
            # the response arrives
            htm = await Response.text()
            # Extract the links of all albums; match on the stable more-link
            # class rather than the link text
            album_urls = re.findall(r'<a href="(.*?)" class="more-link">', htm)
            # Check whether this timeline page has a second page
            nextpage = hasnext(htm)
        # If there is a second page, collect its album links as well
        if nextpage:
            async with session.get(url=nextpage, headers=header) as Response1:
                htm1 = await Response1.text()
                # Append every album link found on the second page
                album_urls.extend(re.findall(r'<a href="(.*?)" class="more-link">', htm1))
    # Return all album links in this timeline
    return album_urls
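get_album_urls only follows one extra page. If a timeline ever spans three or more pages, a loop that keeps calling hasnext() until it returns None covers the general case. A sketch, assuming every page uses the same markup and reusing the modules and hasnext() defined above:

async def get_album_urls_all_pages(date_url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    album_urls = []
    async with aiohttp.ClientSession() as session:
        page_url = date_url
        # Follow the "next page" link until there is none left
        while page_url:
            async with session.get(url=page_url, headers=header) as Response:
                htm = await Response.text()
            album_urls.extend(re.findall(r'<a href="(.*?)" class="more-link">', htm))
            page_url = hasnext(htm)
    return album_urls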
4. Get all picture links in an album and the album name

# async defines a coroutine function; the parameter is one album link
async def get_pic_urls_and_title(album_url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    # Create an asynchronous session object inside a context manager
    async with aiohttp.ClientSession() as session:
        # (Same proxy caveat as above: aiohttp only supports http proxies)
        async with session.get(url=album_url, headers=header) as Response:
            # await suspends on network IO and resumes once the page returns
            htm = await Response.text()
            # Use regexes to extract all picture addresses and the album name
            pic_urls = re.findall(r'<img src="(.*?)" alt=".*?" border="0" />', htm, re.S)
            title = re.findall(r'<h1 class="entry-title">(.*?)</h1>', htm, re.S)[0]
    # Return all picture addresses and the album name
    return pic_urls, title
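To try this coroutine on its own, asyncio.run can drive it directly; the album URL below is made up, so substitute a real one:

# Standalone test of the coroutine above (hypothetical album URL)
pic_urls, title = asyncio.run(
    get_pic_urls_and_title('https://zhaocibaidicaiyunjian.ml/example-album/'))
print(title, len(pic_urls))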
5. Download and save pictures

# A coroutine function; the parameters are all picture addresses of one album
# and the album name
async def download(pic_urls, title):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    # One folder per album
    dir_name = title
    # If the folder already exists, skip this album; otherwise create it
    if os.path.exists(dir_name):
        print(dir_name + ' already exists, skipping')
        return False
    os.mkdir(dir_name)
    print(f'-------- Downloading: {dir_name} --------')
    # Reuse one session for all pictures in this album
    async with aiohttp.ClientSession() as session:
        # Request each picture asynchronously
        for pic_url in pic_urls:
            # Name each picture after the last segment of its URL
            pic_name = pic_url.split('/')[-1]
            async with session.get(url=pic_url, headers=header) as Response:
                # Open an asynchronous file object inside a context manager
                async with aiofiles.open(file=dir_name + '/' + pic_name, mode='wb') as f:
                    # Response.read() and Response.text() would load the whole
                    # body into memory at once, which can exhaust memory and
                    # stall the program. Stream the body instead, 4096 bytes
                    # at a time
                    while True:
                        # Suspend on IO; resume once the next chunk arrives
                        pic_stream = await Response.content.read(4096)
                        # Stop once the body has been fully read
                        if not pic_stream:
                            break
                        # Writing the file is also IO; other tasks run while
                        # this coroutine is suspended
                        await f.write(pic_stream)
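aiohttp also ships a helper for exactly this chunked-read pattern: StreamReader.iter_chunked(). The while loop above could equivalently be written as:

async with aiofiles.open(file=dir_name + '/' + pic_name, mode='wb') as f:
    # iter_chunked() yields the body in fixed-size pieces, suspending on IO
    # between chunks just like the manual read(4096) loop
    async for pic_stream in Response.content.iter_chunked(4096):
        await f.write(pic_stream)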
6. The main function

# A coroutine function; the parameter is one timeline link
async def main(date_url):
    # Get all album links in this timeline; await suspends here because
    # get_album_urls is a coroutine object with IO waits inside
    album_urls = await get_album_urls(date_url)
    # For each album, get the album name and all picture addresses
    for album_url in album_urls:
        # await again: get_pic_urls_and_title is a coroutine object with IO inside
        pic_urls, title = await get_pic_urls_and_title(album_url)
        # Download the album; download is a coroutine object with IO inside
        await download(pic_urls, title)
7. The main entry point

if __name__ == "__main__":
    # Create the task list
    tasks = []
    # Fetch all timeline addresses synchronously (everything that follows
    # depends on these links, so collect them all before moving on)
    date_list = get_date_list()
    # Create a coroutine object for each timeline link
    for date_url in date_list:
        # This does not run main() immediately; it only creates a coroutine object
        task = main(date_url)
        # Add it to the task list
        tasks.append(task)
    # Create the event loop, which schedules the tasks and tracks their state
    # (pending, running, or finished)
    loop = asyncio.get_event_loop()
    # This is where the tasks actually run; block until all of them complete.
    # gather() accepts coroutine objects directly (asyncio.wait requires Task
    # objects on Python 3.11+)
    loop.run_until_complete(asyncio.gather(*tasks))
    # Close the event loop after all tasks finish, releasing its resources
    loop.close()
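On Python 3.7+ the same program can be started with asyncio.run, which creates and closes the event loop automatically. An equivalent entry point, reusing get_date_list() and main() from above:

async def run_all():
    date_list = get_date_list()
    # gather() schedules one main() coroutine per timeline and waits for all
    await asyncio.gather(*(main(date_url) for date_url in date_list))

if __name__ == "__main__":
    asyncio.run(run_all())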
Four. Complete code

import requests   # synchronous HTTP request module
import re         # regular expression module, used to extract data
import asyncio    # creates and manages the event loop
import aiofiles   # asynchronous file operations
import aiohttp    # asynchronous HTTP requests
import os         # operating system calls (folders, paths)

# A synchronous function that uses the requests library
def get_date_list():
    # Target link
    url = 'https://zhaocibaidicaiyunjian.ml/'
    # Disguise the request header as a normal browser
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    # Send the request
    r = requests.get(url=url, headers=header)
    # Get the page source
    htm = r.text
    # Pull the timeline (archive) links out of the sidebar; the WordPress
    # archive widget emits single-quoted href attributes
    results = re.findall(r'<aside id="archives-2" class="widget widget_archive">(.*?)</aside>', htm, re.S)
    date_list = re.findall(r"<a href='(.*?)'>", results[0])
    # Return the list of all timeline links
    return date_list

# Check whether a timeline page has a next page
# (called inside the coroutine get_album_urls below)
def hasnext(Responsetext):
    # Match on the stable class attribute instead of the link text
    nextpage = re.findall(r'<a class="next page-numbers" href="(.*?)">', Responsetext)
    if nextpage:
        return nextpage[0]
    return None

# async defines a coroutine function; the parameter is one timeline link
async def get_album_urls(date_url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    # Create an asynchronous session object inside a context manager
    async with aiohttp.ClientSession() as session:
        # session.get() is used much like requests.get(); note that aiohttp
        # takes proxy=str and only supports http proxies
        async with session.get(url=date_url, headers=header) as Response:
            # await suspends on network IO; execution resumes here once the
            # response arrives
            htm = await Response.text()
            # Extract the links of all albums on the page
            album_urls = re.findall(r'<a href="(.*?)" class="more-link">', htm)
            # Check whether this timeline page has a second page
            nextpage = hasnext(htm)
        # If there is a second page, collect its album links as well
        if nextpage:
            async with session.get(url=nextpage, headers=header) as Response1:
                htm1 = await Response1.text()
                album_urls.extend(re.findall(r'<a href="(.*?)" class="more-link">', htm1))
    # Return all album links in this timeline
    return album_urls

# async defines a coroutine function; the parameter is one album link
async def get_pic_urls_and_title(album_url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    # Create an asynchronous session object inside a context manager
    async with aiohttp.ClientSession() as session:
        async with session.get(url=album_url, headers=header) as Response:
            # await suspends on network IO and resumes once the page returns
            htm = await Response.text()
            # Extract all picture addresses and the album name
            pic_urls = re.findall(r'<img src="(.*?)" alt=".*?" border="0" />', htm, re.S)
            title = re.findall(r'<h1 class="entry-title">(.*?)</h1>', htm, re.S)[0]
    # Return all picture addresses and the album name
    return pic_urls, title

# A coroutine function; the parameters are all picture addresses of one album
# and the album name
async def download(pic_urls, title):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    # One folder per album
    dir_name = title
    # If the folder already exists, skip this album; otherwise create it
    if os.path.exists(dir_name):
        print(dir_name + ' already exists, skipping')
        return False
    os.mkdir(dir_name)
    print(f'-------- Downloading: {dir_name} --------')
    # Reuse one session for all pictures in this album
    async with aiohttp.ClientSession() as session:
        for pic_url in pic_urls:
            # Name each picture after the last segment of its URL
            pic_name = pic_url.split('/')[-1]
            async with session.get(url=pic_url, headers=header) as Response:
                async with aiofiles.open(file=dir_name + '/' + pic_name, mode='wb') as f:
                    # Stream the body 4096 bytes at a time instead of loading
                    # it into memory all at once
                    while True:
                        pic_stream = await Response.content.read(4096)
                        if not pic_stream:
                            break
                        await f.write(pic_stream)

# A coroutine function; the parameter is one timeline link
async def main(date_url):
    # Get all album links in this timeline
    album_urls = await get_album_urls(date_url)
    # For each album, get the album name and all picture addresses
    for album_url in album_urls:
        pic_urls, title = await get_pic_urls_and_title(album_url)
        # Download the album
        await download(pic_urls, title)

if __name__ == "__main__":
    # Create the task list
    tasks = []
    # Fetch all timeline addresses synchronously
    date_list = get_date_list()
    # Create a coroutine object for each timeline link
    for date_url in date_list:
        task = main(date_url)
        tasks.append(task)
    # Create the event loop, run everything, and block until all tasks finish
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*tasks))
    # Close the event loop and release its resources
    loop.close()