I've been interested in Python crawlers and machine learning for a while, and recently I finally started learning. It's not that I had no time; whenever I had time I did something else, so I could only squeeze learning in bit by bit. Today I hurried through a small example: crawling a website. It's entry-level, but I'm still excited because it's something I'm genuinely interested in.
Two days ago I took a look at Python basics. Since I already know HTML, JS, and some other languages, I just read through the basic syntax, how it differs from Java, and some theory. From Liao Xuefeng's blog and a few tutorial videos I got a preliminary grasp of Python's syntax, the differences between Python and Java, and how the two languages express the same functionality differently. Then I learned some of Python's history and the differences between versions, and chose to install Python 3.7.
I haven't taken notes on the basics for now; Liao Xuefeng's blog and some online videos are enough to understand them, though to go deeper it's better to buy books. Some fundamentals are still fuzzy; I plan to dig into individual topics as I hit them in practice.
OK, down to business. This example is still very simple; I had watched related videos before, so it was easy to follow.
The goal is to crawl some data from the homepage of Meituba.
Accessing the page
To access web pages in Python you first need to import urllib.request (the plain urllib I used before no longer works this way; it seems to be a version difference, so I had apparently been learning from outdated material).
urllib.request provides urllib.request.urlopen(str), which opens a web page and returns a response object; calling that object's read() method gives you the page source directly, the same content you see with the browser's right-click "View Source".
print(chardet.detect(htmlCode))
outputs {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}, which is the detected encoding of the crawled content.
Then output the source:
import urllib.request
import chardet

page = urllib.request.urlopen('http://www.meituba.com/tag/juesemeinv.html')  # open the page
htmlCode = page.read()                       # read the page source as bytes
print(chardet.detect(htmlCode))              # print the detected encoding
print(htmlCode.decode('utf-8'))              # decode and print the source
Note: printing print(htmlCode) directly gives raw bytes with encoding artifacts, so I went back to the original page to check the source, but when I ran htmlCode.decode("UTF-8") the following error appeared:

line 19, in <module>
    data = data.decode("UTF-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

The failure is intermittent: sometimes data = htmlCode.decode("UTF-8") followed by print(data) works fine, so the response is apparently not always plain UTF-8 bytes.
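The 0x8b byte hints at the cause: a response body beginning with the bytes 0x1f 0x8b is gzip-compressed, and has to be decompressed before decoding. A minimal sketch of the check (the helper name decode_body is my own, and the sample bytes here are compressed locally rather than fetched):

```python
import gzip

def decode_body(raw, encoding='utf-8'):
    """Decompress a gzip-compressed response body if needed, then decode it."""
    if raw[:2] == b'\x1f\x8b':          # gzip magic number: 0x1f 0x8b
        raw = gzip.decompress(raw)
    return raw.decode(encoding)

# A compressed body fails a direct .decode(), but decodes fine through the helper
body = gzip.compress('你好，世界'.encode('utf-8'))
print(decode_body(body))  # 你好，世界
```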
Saving the crawled page source to a file
import urllib.request
import chardet

page = urllib.request.urlopen('http://www.meituba.com/tag/juesemeinv.html')  # open the page
htmlCode = page.read()                       # read the page source as bytes
#print(chardet.detect(htmlCode))             # check the encoding
data = htmlCode.decode('utf-8')
#print(data)                                 # print the page source
pageFile = open('pageCode.txt', 'wb')        # open pageCode.txt for binary writing
pageFile.write(htmlCode)                     # write the raw bytes
pageFile.close()                             # remember to close what you open
This generates a pageCode.txt file in the same directory as test.py.
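A side note: using a with block instead of a manual open()/close() pair guarantees the file is closed even if writing raises an exception. A small sketch, with sample bytes standing in for the fetched page source:

```python
# Sample bytes standing in for the page source fetched above
htmlCode = b'<html><body>hello</body></html>'

with open('pageCode.txt', 'wb') as pageFile:  # file is closed automatically on exit
    pageFile.write(htmlCode)

with open('pageCode.txt', 'rb') as pageFile:
    print(pageFile.read())  # b'<html><body>hello</body></html>'
```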
Extracting other information
Open the pageCode.txt file (or just press F12 on the original page to inspect it) and look at the tags that hold the data you want.
For example, say I now want to grab the pictures. Write a regular expression for them: reg = r'src="(.+?\.jpg)"'
Explanation: this matches a string starting with src=", followed by one or more arbitrary characters (non-greedy), and ending with .jpg". In the page source, the links in double quotes after src are the matching strings.
What we need to do next is extract, from the long string of page source we fetched, every substring that satisfies this regular expression.
Python's re library provides re.findall(pattern, str), which returns a list of the matched (captured) strings.
import urllib.request
import chardet
import re

page = urllib.request.urlopen('http://www.meituba.com/tag/juesemeinv.html')  # open the page
htmlCode = page.read()                       # read the page source as bytes
#print(chardet.detect(htmlCode))             # check the encoding
data = htmlCode.decode('utf-8')
#print(data)                                 # print the page source
#pageFile = open('pageCode.txt', 'wb')       # open pageCode.txt for binary writing
#pageFile.write(htmlCode)                    # write the raw bytes
#pageFile.close()                            # remember to close what you open
reg = r'src="(.+?\.jpg)"'                    # regular expression for .jpg links
reg_img = re.compile(reg)                    # compile it once; repeated matching runs faster
imglist = reg_img.findall(data)              # find all matches
for img in imglist:
    print(img)
Output:
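To see what findall returns without hitting the network, here is the same pattern run against a hardcoded HTML snippet (the URLs are made up for illustration):

```python
import re

snippet = '''<img src="http://example.com/a.jpg" alt="one">
<img src="http://example.com/b.jpg" alt="two">
<img src="http://example.com/logo.png" alt="not matched">'''

reg_img = re.compile(r'src="(.+?\.jpg)"')   # compiled once, reused for matching
imglist = reg_img.findall(snippet)          # only the captured group is returned
print(imglist)  # ['http://example.com/a.jpg', 'http://example.com/b.jpg']
```

Note that the .png link is not in the result: the pattern requires the value to end in .jpg.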
Next, download the images locally.
The urllib library provides urllib.request.urlretrieve(url, filename), which downloads the content at the link and saves it under the name given as the second parameter. Let's try it:
x = 0
for img in imglist:
    print(img)
    urllib.request.urlretrieve(img, '%s.jpg' % x)  # download each matched link as 0.jpg, 1.jpg, ...
    x += 1
This throws an error.
The request was probably intercepted: only requests that come through a browser are allowed to download, an anti-crawler measure.
Adding a request header should fix it...
I tried another site, Baidu Images (https://image.baidu.com/), and it also refused.
Sina photos (http://photo.sina.com.cn/) finally worked.
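A sketch of the request-header idea, assuming a browser-like User-Agent is enough for the site in question (the URL here is a placeholder):

```python
import urllib.request

url = 'http://example.com/pic.jpg'  # placeholder image URL
req = urllib.request.Request(url, headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # look like a browser
})
# urllib.request.urlopen(req) would now send the browser-like header.
# urlretrieve() takes no headers parameter, so with headers you read() the
# response yourself and write the bytes to a file.
print(req.get_header('User-agent'))  # Mozilla/5.0 (Windows NT 10.0; Win64; x64)
```

(urllib normalizes stored header names, which is why the lookup key is 'User-agent'.)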
One more try: crawling an online novel
First crawl the list of all chapters, then follow each chapter's hyperlink to fetch its body text and save it locally.
import re
import urllib.request

def getHtmlCode():
    html = urllib.request.urlopen("http://www.quanshuwang.com/book/44/44683").read()  # fetch the page source
    html = html.decode("gbk")  # decode with this site's encoding
    # Regex built from the site's markup; the (.*?) groups capture what we need
    reg = r'<li><a href="(.*?)" title=".*?">(.*?)</a></li>'
    reg = re.compile(reg)
    urls = re.findall(reg, html)
    for url in urls:
        chapter_url = url[0]    # chapter link
        chapter_title = url[1]  # chapter title
        chapter_html = urllib.request.urlopen(chapter_url).read()  # fetch this chapter's page
        chapter_html = chapter_html.decode("gbk")
        # Match the article body; re.S lets '.' match newlines too
        chapter_reg = r'</script> .*?<br />(.*?)<script type="text/javascript">'
        chapter_reg = re.compile(chapter_reg, re.S)
        chapter_content = re.findall(chapter_reg, chapter_html)
        for content in chapter_content:
            content = content.replace("&nbsp;", "")  # strip HTML non-breaking spaces
            content = content.replace("<br />", "")  # strip <br /> tags
            print(content)
            f = open('{}.txt'.format(chapter_title), 'w')  # save to a local file
            f.write(content)
            f.close()  # close so the content is flushed to disk

getHtmlCode()
The final results :
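The re.S flag passed to compile above matters: chapter bodies span multiple lines, and without re.S the dot does not match newlines. A quick self-contained check:

```python
import re

text = "<p>line one\nline two</p>"
print(re.findall(r'<p>(.*?)</p>', text))        # [] -- '.' stops at the newline
print(re.findall(r'<p>(.*?)</p>', text, re.S))  # ['line one\nline two']
```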
.. Many websites aren't so easy to scrape (page structures aren't uniform). I had wanted to crawl Weibo data before, but that requires login or other authentication (anti-crawler mechanisms). There's also storing the crawled data in a database and displaying it according to certain rules (processing the data after crawling).. I'll keep learning; there's no end to what there is to learn.