I've been interested in Python crawlers and machine learning for a while, and recently I finally started learning. It's not that I had no time; whenever I had time I did something else, so I could only squeeze learning in bit by bit. Today I hurried through a small example: crawling a website. It's entry-level, but I'm still excited because it's something I'm genuinely interested in.
Two days ago I took a look at Python basics. Since I already know HTML, JS, and some other languages, I just read through the basic syntax, how it differs from Java, and some theory. From Liao Xuefeng's blog and a few tutorial videos I got a preliminary grasp of Python's syntax, the differences between Python and Java, and how the two languages express the same functionality differently. Then I learned some of Python's history and the differences between versions, and chose to install Python 3.7.
I haven't taken notes on the basics for now; Liao Xuefeng's blog and some online videos are enough to understand them, though to go deeper it's better to buy books. Some fundamentals are still fuzzy; I plan to dig into individual topics as I hit them in practice.
OK, down to business. This example is still very simple; I had watched related videos before, so it was easy to follow.
The goal is to crawl some data from the homepage of Meituba.
Accessing the page
To access web pages in Python you first need to import urllib.request (the plain urllib I used before no longer works this way; it seems to be a version difference, so I had apparently been learning from outdated material).
urllib.request provides urllib.request.urlopen(str), which opens a web page and returns a response object; calling that object's read() method gives you the page source directly, the same content you see with the browser's right-click "View Source".
print(chardet.detect(htmlCode))
outputs {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}, which is the detected encoding of the crawled content.
Then output the source:
import urllib.request
import chardet

page = urllib.request.urlopen('http://www.meituba.com/tag/juesemeinv.html')  # open the page
htmlCode = page.read()                       # read the page source as bytes
print(chardet.detect(htmlCode))              # print the detected encoding
print(htmlCode.decode('utf-8'))              # decode and print the source
Note: printing print(htmlCode) directly gives raw bytes with encoding artifacts, so I went back to the original page to check the source, but when I ran htmlCode.decode("UTF-8") the following error appeared:

line 19, in <module>
    data = data.decode("UTF-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

The failure is intermittent: sometimes data = htmlCode.decode("UTF-8") followed by print(data) works fine, so the response is apparently not always plain UTF-8 bytes.
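The 0x8b byte hints at the cause: a response body beginning with the bytes 0x1f 0x8b is gzip-compressed, and has to be decompressed before decoding. A minimal sketch of the check (the helper name decode_body is my own, and the sample bytes here are compressed locally rather than fetched):

```python
import gzip

def decode_body(raw, encoding='utf-8'):
    """Decompress a gzip-compressed response body if needed, then decode it."""
    if raw[:2] == b'\x1f\x8b':          # gzip magic number: 0x1f 0x8b
        raw = gzip.decompress(raw)
    return raw.decode(encoding)

# A compressed body fails a direct .decode(), but decodes fine through the helper
body = gzip.compress('你好，世界'.encode('utf-8'))
print(decode_body(body))  # 你好，世界
```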
Saving the crawled page source to a file
import urllib.request
import chardet

page = urllib.request.urlopen('http://www.meituba.com/tag/juesemeinv.html')  # open the page
htmlCode = page.read()                       # read the page source as bytes
#print(chardet.detect(htmlCode))             # check the encoding
data = htmlCode.decode('utf-8')
#print(data)                                 # print the page source
pageFile = open('pageCode.txt', 'wb')        # open pageCode.txt for binary writing
pageFile.write(htmlCode)                     # write the raw bytes
pageFile.close()                             # remember to close what you open
This generates a pageCode.txt file in the same directory as test.py.
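A side note: using a with block instead of a manual open()/close() pair guarantees the file is closed even if writing raises an exception. A small sketch, with sample bytes standing in for the fetched page source:

```python
# Sample bytes standing in for the page source fetched above
htmlCode = b'<html><body>hello</body></html>'

with open('pageCode.txt', 'wb') as pageFile:  # file is closed automatically on exit
    pageFile.write(htmlCode)

with open('pageCode.txt', 'rb') as pageFile:
    print(pageFile.read())  # b'<html><body>hello</body></html>'
```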
Extracting other information
Open the pageCode.txt file (or just press F12 on the original page to inspect it) and look at the tags that hold the data you want.
For example, say I now want to grab the pictures. Write a regular expression for them: reg = r'src="(.+?\.jpg)"'
Explanation: this matches a string starting with src=", followed by one or more arbitrary characters (non-greedy), and ending with .jpg". In the page source, the links in double quotes after src are the matching strings.
What we need to do next is extract, from the long string of page source we fetched, every substring that satisfies this regular expression.
Python's re library provides re.findall(pattern, str), which returns a list of the matched (captured) strings.
import urllib.request
import chardet
import re

page = urllib.request.urlopen('http://www.meituba.com/tag/juesemeinv.html')  # open the page
htmlCode = page.read()                       # read the page source as bytes
#print(chardet.detect(htmlCode))             # check the encoding
data = htmlCode.decode('utf-8')
#print(data)                                 # print the page source
#pageFile = open('pageCode.txt', 'wb')       # open pageCode.txt for binary writing
#pageFile.write(htmlCode)                    # write the raw bytes
#pageFile.close()                            # remember to close what you open
reg = r'src="(.+?\.jpg)"'                    # regular expression for .jpg links
reg_img = re.compile(reg)                    # compile it once; repeated matching runs faster
imglist = reg_img.findall(data)              # find all matches
for img in imglist:
    print(img)
Output:
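To see what findall returns without hitting the network, here is the same pattern run against a hardcoded HTML snippet (the URLs are made up for illustration):

```python
import re

snippet = '''<img src="http://example.com/a.jpg" alt="one">
<img src="http://example.com/b.jpg" alt="two">
<img src="http://example.com/logo.png" alt="not matched">'''

reg_img = re.compile(r'src="(.+?\.jpg)"')   # compiled once, reused for matching
imglist = reg_img.findall(snippet)          # only the captured group is returned
print(imglist)  # ['http://example.com/a.jpg', 'http://example.com/b.jpg']
```

Note that the .png link is not in the result: the pattern requires the value to end in .jpg.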
Next, download the images locally.
The urllib library provides urllib.request.urlretrieve(url, filename), which downloads the content at the link and saves it under the name given as the second parameter. Let's try it:
x = 0
for img in imglist:
    print(img)
    urllib.request.urlretrieve(img, '%s.jpg' % x)  # download each matched link as 0.jpg, 1.jpg, ...
    x += 1
This throws an error.
The request was probably intercepted: only requests that come through a browser are allowed to download, an anti-crawler measure.
Adding a request header should fix it...
I tried another site, Baidu Images (https://image.baidu.com/), and it also refused.
Sina photos (http://photo.sina.com.cn/) finally worked.
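A sketch of the request-header idea, assuming a browser-like User-Agent is enough for the site in question (the URL here is a placeholder):

```python
import urllib.request

url = 'http://example.com/pic.jpg'  # placeholder image URL
req = urllib.request.Request(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # look like a browser
})
# urllib.request.urlopen(req) would now send the browser-like header.
# urlretrieve() takes no headers parameter, so with headers you read() the
# response yourself and write the bytes to a file.
print(req.get_header('User-agent'))  # Mozilla/5.0 (Windows NT 10.0; Win64; x64)
```

(urllib normalizes stored header names, which is why the lookup key is 'User-agent'.)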
One more try: crawling an online novel
First crawl the list of all chapters, then follow each chapter's hyperlink to fetch its body text and save it locally.
import re
import urllib.request

def getHtmlCode():
    html = urllib.request.urlopen("http://www.quanshuwang.com/book/44/44683").read()  # fetch the page source
    html = html.decode("gbk")  # decode with this site's encoding
    # Regex built from the site's markup; the (.*?) groups capture what we need
    reg = r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>'
    reg = re.compile(reg)
    urls = re.findall(reg, html)
    for url in urls:
        chapter_url = url[0]    # chapter link
        chapter_title = url[1]  # chapter title
        chapter_html = urllib.request.urlopen(chapter_url).read()  # fetch this chapter's page
        chapter_html = chapter_html.decode("gbk")
        # Match the article body; re.S lets '.' match newlines too
        chapter_reg = r'</script> .*?<br />(.*?)<script type="text/javascript">'
        chapter_reg = re.compile(chapter_reg, re.S)
        chapter_content = re.findall(chapter_reg, chapter_html)
        for content in chapter_content:
            content = content.replace("&nbsp;", "")  # strip HTML non-breaking spaces
            content = content.replace("<br />", "")  # strip <br /> tags
            print(content)
            f = open('{}.txt'.format(chapter_title), 'w')  # save to a local file
            f.write(content)
            f.close()  # close so the content is flushed to disk

getHtmlCode()
The final results :
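The re.S flag passed to compile above matters: chapter bodies span multiple lines, and without re.S the dot does not match newlines. A quick self-contained check:

```python
import re

text = "<p>line one\nline two</p>"
print(re.findall(r'<p>(.*?)</p>', text))        # [] -- '.' stops at the newline
print(re.findall(r'<p>(.*?)</p>', text, re.S))  # ['line one\nline two']
```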
.. Many websites aren't so easy to scrape (page structures aren't uniform). I had wanted to crawl Weibo data before, but that requires login or other authentication (anti-crawler mechanisms). There's also storing the crawled data in a database and displaying it according to certain rules (processing the data after crawling).. I'll keep learning; there's no end to what there is to learn.