Contents

Preface
One. What is a coroutine?
Two. The advantages of coroutines
Three. Code analysis
1. Import libraries
2. Get links to all timelines
3. Get all album links in a timeline
4. Get all picture links in an album and the album name
5. Download and save pictures
6. The main function
7. The main entry point
Four. Complete code
Preface

When writing a crawler, efficiency is a key concern. The most common approaches are multithreading, multiprocessing, thread pools, and process pools. This article shows how to use coroutines instead to crawl the picture galleries of a target site.
One. What is a coroutine?

A coroutine is a lighter-weight construct than a thread. Just as one process can contain many threads, one thread can contain many coroutines. Most importantly, coroutines are not managed by the operating system kernel; they are controlled entirely by the program, that is, scheduled in user space.
Two. The advantages of coroutines

The main advantage is performance: switching between coroutines does not carry the cost of a thread context switch, so the overhead of a coroutine is far smaller than that of a thread. In essence, coroutines run in a single thread: when an IO operation starts, the current task is suspended, other tasks keep running, and once the IO completes, execution resumes where the original task left off, all without consuming extra system resources.
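To make this concrete, here is a minimal, self-contained sketch (separate from the crawler below): two coroutines share a single thread, and while one is suspended on await the other runs, so the total time is close to the longest task rather than the sum.

import asyncio
import time

async def fake_io(name, delay):
    # await suspends this coroutine; the event loop runs other tasks meanwhile
    await asyncio.sleep(delay)
    return f'{name} finished'

async def demo():
    start = time.perf_counter()
    # Both coroutines run concurrently in one thread:
    # total time is about 2 seconds, not 3
    results = await asyncio.gather(fake_io('task-a', 2), fake_io('task-b', 1))
    print(results, f'elapsed: {time.perf_counter() - start:.1f}s')

asyncio.run(demo())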
Three. Code analysis

1. Import libraries

import requests   # synchronous HTTP request module
import re         # regular expression module, used to extract data
import asyncio    # creates and manages the event loop
import aiofiles   # asynchronous file operations
import aiohttp    # asynchronous HTTP requests
import os         # operating system calls (folders, paths)
2. Get links to all timelines

# A synchronous function that uses the requests library
def get_date_list():
    # Target link
    url = 'https://zhaocibaidicaiyunjian.ml/'
    # Disguise the request header as a normal browser
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    # Send the request
    r = requests.get(url=url, headers=header)
    # Get the page source
    htm = r.text
    # Use the re module to pull the timeline (archive) links out of the sidebar;
    # the WordPress archive widget emits single-quoted href attributes
    results = re.findall(r'<aside id="archives-2" class="widget widget_archive">(.*?)</aside>', htm, re.S)
    date_list = re.findall(r"<a href='(.*?)'>", results[0])
    # Return the list of all timeline links
    return date_list
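To see what the two findall calls do, here is a quick standalone test on a made-up snippet of the WordPress archive widget (the real markup may differ slightly; the single-quoted href is an assumption based on WordPress defaults):

import re

sample = """<aside id="archives-2" class="widget widget_archive">
<ul>
<li><a href='https://zhaocibaidicaiyunjian.ml/2022/01/'>2022-01</a></li>
<li><a href='https://zhaocibaidicaiyunjian.ml/2021/12/'>2021-12</a></li>
</ul></aside>"""
# First narrow to the sidebar block, then pull out each href
block = re.findall(r'<aside id="archives-2" class="widget widget_archive">(.*?)</aside>', sample, re.S)[0]
print(re.findall(r"<a href='(.*?)'>", block))
# -> ['https://zhaocibaidicaiyunjian.ml/2022/01/', 'https://zhaocibaidicaiyunjian.ml/2021/12/']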
3. Get all album links in a timeline

# Check whether a timeline page has a next page
# (called inside the coroutine get_album_urls below)
def hasnext(Responsetext):
    # Match on the stable class attribute instead of the (language-dependent) link text
    nextpage = re.findall(r'<a class="next page-numbers" href="(.*?)">', Responsetext)
    if nextpage:
        return nextpage[0]
    return None
# async defines a coroutine function; the parameter is one timeline link
async def get_album_urls(date_url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    # Create an asynchronous session object inside a context manager
    async with aiohttp.ClientSession() as session:
        # session.get() is used much like requests.get(). (Note: in requests, the
        # proxy parameter is a dict, proxies=dict, and supports both http and https;
        # in an aiohttp session, get() takes a string, proxy=str, and only supports
        # http proxies, not https)
        async with session.get(url=date_url, headers=header) as Response:
            # Fetch the page source; await suspends this coroutine on network IO,
            # other tasks run in the meantime, and execution resumes here once
            # the response arrives
            htm = await Response.text()
            # Extract the links of all albums; match on the stable more-link
            # class rather than the link text
            album_urls = re.findall(r'<a href="(.*?)" class="more-link">', htm)
            # Check whether this timeline page has a second page
            nextpage = hasnext(htm)
        # If there is a second page, collect its album links as well
        if nextpage:
            async with session.get(url=nextpage, headers=header) as Response1:
                htm1 = await Response1.text()
                # Append every album link found on the second page
                album_urls.extend(re.findall(r'<a href="(.*?)" class="more-link">', htm1))
    # Return all album links in this timeline
    return album_urls
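get_album_urls only follows one extra page. If a timeline ever spans three or more pages, a loop that keeps calling hasnext() until it returns None covers the general case. A sketch, assuming every page uses the same markup and reusing the modules and hasnext() defined above:

async def get_album_urls_all_pages(date_url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    album_urls = []
    async with aiohttp.ClientSession() as session:
        page_url = date_url
        # Follow the "next page" link until there is none left
        while page_url:
            async with session.get(url=page_url, headers=header) as Response:
                htm = await Response.text()
            album_urls.extend(re.findall(r'<a href="(.*?)" class="more-link">', htm))
            page_url = hasnext(htm)
    return album_urls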
4. Get all picture links in an album and the album name

# async defines a coroutine function; the parameter is one album link
async def get_pic_urls_and_title(album_url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    # Create an asynchronous session object inside a context manager
    async with aiohttp.ClientSession() as session:
        # (Same proxy caveat as above: aiohttp only supports http proxies)
        async with session.get(url=album_url, headers=header) as Response:
            # await suspends on network IO and resumes once the page returns
            htm = await Response.text()
            # Use regexes to extract all picture addresses and the album name
            pic_urls = re.findall(r'<img src="(.*?)" alt=".*?" border="0" />', htm, re.S)
            title = re.findall(r'<h1 class="entry-title">(.*?)</h1>', htm, re.S)[0]
    # Return all picture addresses and the album name
    return pic_urls, title
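To try this coroutine on its own, asyncio.run can drive it directly; the album URL below is made up, so substitute a real one:

# Standalone test of the coroutine above (hypothetical album URL)
pic_urls, title = asyncio.run(
    get_pic_urls_and_title('https://zhaocibaidicaiyunjian.ml/example-album/'))
print(title, len(pic_urls))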
5. Download and save pictures

# A coroutine function; the parameters are all picture addresses of one album
# and the album name
async def download(pic_urls, title):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    # One folder per album
    dir_name = title
    # If the folder already exists, skip this album; otherwise create it
    if os.path.exists(dir_name):
        print(dir_name + ' already exists, skipping')
        return False
    os.mkdir(dir_name)
    print(f'-------- Downloading: {dir_name} --------')
    # Reuse one session for all pictures in this album
    async with aiohttp.ClientSession() as session:
        # Request each picture asynchronously
        for pic_url in pic_urls:
            # Name each picture after the last segment of its URL
            pic_name = pic_url.split('/')[-1]
            async with session.get(url=pic_url, headers=header) as Response:
                # Open an asynchronous file object inside a context manager
                async with aiofiles.open(file=dir_name + '/' + pic_name, mode='wb') as f:
                    # Response.read() and Response.text() would load the whole
                    # body into memory at once, which can exhaust memory and
                    # stall the program. Stream the body instead, 4096 bytes
                    # at a time
                    while True:
                        # Suspend on IO; resume once the next chunk arrives
                        pic_stream = await Response.content.read(4096)
                        # Stop once the body has been fully read
                        if not pic_stream:
                            break
                        # Writing the file is also IO; other tasks run while
                        # this coroutine is suspended
                        await f.write(pic_stream)
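aiohttp also ships a helper for exactly this chunked-read pattern: StreamReader.iter_chunked(). The while loop above could equivalently be written as:

async with aiofiles.open(file=dir_name + '/' + pic_name, mode='wb') as f:
    # iter_chunked() yields the body in fixed-size pieces, suspending on IO
    # between chunks just like the manual read(4096) loop
    async for pic_stream in Response.content.iter_chunked(4096):
        await f.write(pic_stream)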
6. The main function

# A coroutine function; the parameter is one timeline link
async def main(date_url):
    # Get all album links in this timeline; await suspends here because
    # get_album_urls is a coroutine object with IO waits inside
    album_urls = await get_album_urls(date_url)
    # For each album, get the album name and all picture addresses
    for album_url in album_urls:
        # await again: get_pic_urls_and_title is a coroutine object with IO inside
        pic_urls, title = await get_pic_urls_and_title(album_url)
        # Download the album; download is a coroutine object with IO inside
        await download(pic_urls, title)
7. The main entry point

if __name__ == "__main__":
    # Create the task list
    tasks = []
    # Fetch all timeline addresses synchronously (everything that follows
    # depends on these links, so collect them all before moving on)
    date_list = get_date_list()
    # Create a coroutine object for each timeline link
    for date_url in date_list:
        # This does not run main() immediately; it only creates a coroutine object
        task = main(date_url)
        # Add it to the task list
        tasks.append(task)
    # Create the event loop, which schedules the tasks and tracks their state
    # (pending, running, or finished)
    loop = asyncio.get_event_loop()
    # This is where the tasks actually run; block until all of them complete.
    # gather() accepts coroutine objects directly (asyncio.wait requires Task
    # objects on Python 3.11+)
    loop.run_until_complete(asyncio.gather(*tasks))
    # Close the event loop after all tasks finish, releasing its resources
    loop.close()
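On Python 3.7+ the same program can be started with asyncio.run, which creates and closes the event loop automatically. An equivalent entry point, reusing get_date_list() and main() from above:

async def run_all():
    date_list = get_date_list()
    # gather() schedules one main() coroutine per timeline and waits for all
    await asyncio.gather(*(main(date_url) for date_url in date_list))

if __name__ == "__main__":
    asyncio.run(run_all())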
Four. Complete code

import requests   # synchronous HTTP request module
import re         # regular expression module, used to extract data
import asyncio    # creates and manages the event loop
import aiofiles   # asynchronous file operations
import aiohttp    # asynchronous HTTP requests
import os         # operating system calls (folders, paths)

# A synchronous function that uses the requests library
def get_date_list():
    # Target link
    url = 'https://zhaocibaidicaiyunjian.ml/'
    # Disguise the request header as a normal browser
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    # Send the request
    r = requests.get(url=url, headers=header)
    # Get the page source
    htm = r.text
    # Pull the timeline (archive) links out of the sidebar; the WordPress
    # archive widget emits single-quoted href attributes
    results = re.findall(r'<aside id="archives-2" class="widget widget_archive">(.*?)</aside>', htm, re.S)
    date_list = re.findall(r"<a href='(.*?)'>", results[0])
    # Return the list of all timeline links
    return date_list

# Check whether a timeline page has a next page
# (called inside the coroutine get_album_urls below)
def hasnext(Responsetext):
    # Match on the stable class attribute instead of the link text
    nextpage = re.findall(r'<a class="next page-numbers" href="(.*?)">', Responsetext)
    if nextpage:
        return nextpage[0]
    return None

# async defines a coroutine function; the parameter is one timeline link
async def get_album_urls(date_url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    # Create an asynchronous session object inside a context manager
    async with aiohttp.ClientSession() as session:
        # session.get() is used much like requests.get(); note that aiohttp
        # takes proxy=str and only supports http proxies
        async with session.get(url=date_url, headers=header) as Response:
            # await suspends on network IO; execution resumes here once the
            # response arrives
            htm = await Response.text()
            # Extract the links of all albums on the page
            album_urls = re.findall(r'<a href="(.*?)" class="more-link">', htm)
            # Check whether this timeline page has a second page
            nextpage = hasnext(htm)
        # If there is a second page, collect its album links as well
        if nextpage:
            async with session.get(url=nextpage, headers=header) as Response1:
                htm1 = await Response1.text()
                album_urls.extend(re.findall(r'<a href="(.*?)" class="more-link">', htm1))
    # Return all album links in this timeline
    return album_urls

# async defines a coroutine function; the parameter is one album link
async def get_pic_urls_and_title(album_url):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    # Create an asynchronous session object inside a context manager
    async with aiohttp.ClientSession() as session:
        async with session.get(url=album_url, headers=header) as Response:
            # await suspends on network IO and resumes once the page returns
            htm = await Response.text()
            # Extract all picture addresses and the album name
            pic_urls = re.findall(r'<img src="(.*?)" alt=".*?" border="0" />', htm, re.S)
            title = re.findall(r'<h1 class="entry-title">(.*?)</h1>', htm, re.S)[0]
    # Return all picture addresses and the album name
    return pic_urls, title

# A coroutine function; the parameters are all picture addresses of one album
# and the album name
async def download(pic_urls, title):
    header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36'}
    # One folder per album
    dir_name = title
    # If the folder already exists, skip this album; otherwise create it
    if os.path.exists(dir_name):
        print(dir_name + ' already exists, skipping')
        return False
    os.mkdir(dir_name)
    print(f'-------- Downloading: {dir_name} --------')
    # Reuse one session for all pictures in this album
    async with aiohttp.ClientSession() as session:
        for pic_url in pic_urls:
            # Name each picture after the last segment of its URL
            pic_name = pic_url.split('/')[-1]
            async with session.get(url=pic_url, headers=header) as Response:
                async with aiofiles.open(file=dir_name + '/' + pic_name, mode='wb') as f:
                    # Stream the body 4096 bytes at a time instead of loading
                    # it into memory all at once
                    while True:
                        pic_stream = await Response.content.read(4096)
                        if not pic_stream:
                            break
                        await f.write(pic_stream)

# A coroutine function; the parameter is one timeline link
async def main(date_url):
    # Get all album links in this timeline
    album_urls = await get_album_urls(date_url)
    # For each album, get the album name and all picture addresses
    for album_url in album_urls:
        pic_urls, title = await get_pic_urls_and_title(album_url)
        # Download the album
        await download(pic_urls, title)

if __name__ == "__main__":
    # Create the task list
    tasks = []
    # Fetch all timeline addresses synchronously
    date_list = get_date_list()
    # Create a coroutine object for each timeline link
    for date_url in date_list:
        task = main(date_url)
        tasks.append(task)
    # Create the event loop, run everything, and block until all tasks finish
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.gather(*tasks))
    # Close the event loop and release its resources
    loop.close()