I've recently been learning to scrape novels with a crawler, and one of the pages comes out garbled. The page is declared as gb2312. I tried decoding it as gb2312, gbk, and utf-8 in turn, and none of them worked. Because I crawl the text page by page, every error means a missing chapter, which is painful.
Is there a way to simply ignore the characters that can't be decoded and write the rest of the extracted content as-is?
The download code is as follows:
# download
async def download(url, name):
    async with semaphore:
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as reques:
                reques.encoding = 'gbk'
                page = bs4.BeautifulSoup(await reques.text(), 'html.parser')
                div = page.find('div', class_="read_chapterDetail")
                p = div.find_all('p')
                # Open the file in binary write mode
                with open(f'{name}.txt', mode='wb') as f:
                    for i in p:
                        text = i.text + '\n'
                        f.write(text.encode('utf-8'))
                print(f'{name} Download complete !')
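One possible way to skip the bad characters, sketched under a few assumptions: Python's `bytes.decode` accepts an `errors` argument, and `errors="ignore"` drops any bytes that are invalid in the chosen encoding instead of raising. The helper name `decode_lenient` below is made up for illustration, and defaulting to `gb18030` is my guess rather than something confirmed for this site. It is a superset of gb2312 and gbk, so it often copes with pages that are labelled gb2312 but contain characters outside that set.

```python
def decode_lenient(raw: bytes, encoding: str = "gb18030") -> str:
    """Decode raw page bytes, dropping any bytes that are invalid in `encoding`.

    `decode_lenient` is a hypothetical helper name; the gb18030 default is an
    assumption (it is a superset of gb2312 and gbk).
    """
    return raw.decode(encoding, errors="ignore")


if __name__ == "__main__":
    # A GBK-encoded chapter title with one stray invalid byte (0xFF) added.
    sample = "第一章".encode("gbk") + b"\xff" + b"\n"
    print(decode_lenient(sample))
```

In the download function this would roughly correspond to `await reques.text(encoding='gb18030', errors='ignore')`, since aiohttp's `text()` takes `encoding` and `errors` arguments. As far as I know, assigning `reques.encoding = 'gbk'` is the `requests` style and has no effect on an aiohttp response. Using `errors='replace'` instead of `'ignore'` would leave a visible marker where a character was lost, which may make missing text easier to spot.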