A web crawler automatically captures valuable information on the Internet.
A crawler is usually made up of five parts: Scheduler, URL Manager, Downloader, Parser, and Application.
Scheduler # The equivalent of the CPU; mainly responsible for coordinating the URL manager, downloader, and parser.
URL Manager # Keeps track of the URLs to be crawled and the URLs already crawled, preventing the same URL from being fetched repeatedly or in a loop. A URL manager is typically implemented in one of three ways: in memory, in a database, or in a cache database.
Web downloader # Downloads a web page given its URL and converts the page into a string. Downloaders include urllib (urllib2 in Python 2; the official standard-library module, which supports login, proxies, and cookies) and requests (a third-party package).
Parser # (html.parser, beautifulsoup, lxml) Parses the web page string and extracts the useful information as required, following the DOM tree structure.
Application # The program built from the useful data extracted from the web pages.
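As a rough illustration of how these five parts cooperate, here is a minimal, hedged sketch of a single-threaded crawler loop (the seed URL, the page limit, and the link filter are placeholders, not part of any particular project):
import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=10):
    to_crawl, crawled = {seed_url}, set()          # URL manager: pending and finished URLs
    while to_crawl and len(crawled) < max_pages:   # scheduler: decides what to fetch next
        url = to_crawl.pop()
        html = requests.get(url).text              # downloader: fetch the page as a string
        soup = BeautifulSoup(html, "html.parser")  # parser: build a DOM tree
        crawled.add(url)
        for a in soup.find_all("a", href=True):    # collect follow-up URLs
            link = a["href"]
            if link.startswith("http") and link not in crawled:
                to_crawl.add(link)
    return crawled                                 # the application would consume the extracted data

print(crawl("http://www.baidu.com"))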
pip install requests
import requests
''' View the contents of the Library '''
print(dir(requests))
apparent_encoding # The encoding guessed from the response content
encoding # The encoding used to decode r.text
headers # Returns the response headers as a dictionary
history # Returns the list of response objects that make up the request history (redirected URLs)
links # Returns the parsed header links of the response
reason # Text description of the response status, such as "OK"
request # Returns the request object that produced this response
url # Returns the URL of the response
status_code # Returns the HTTP status code, such as 404 and 200 (200 is OK, 404 is Not Found)
close() # Closes the connection to the server
content # Returns the content of the response, in bytes
cookies # Returns a CookieJar object containing the cookies sent back from the server
elapsed # Returns a timedelta object holding the time elapsed between sending the request and the arrival of the response; useful for testing response speed. For example, r.elapsed.microseconds is the number of microseconds it took for the response to arrive.
is_permanent_redirect # Returns True if the response is a permanent redirect of the url, otherwise False
is_redirect # Returns True if the response is a redirect, otherwise False
iter_content() # Iterates over the response content
iter_lines() # Iterates over the response content line by line
json() # Returns the response body parsed as a JSON object (the body must be valid JSON, otherwise an error is raised), e.g. http://www.baidu.com/ajax/demo.json
next # Returns the PreparedRequest object for the next request in a redirect chain
ok # Checks the value of status_code: returns True if it is less than 400, otherwise False
raise_for_status() # Raises an HTTPError if the request returned an error status
text # Returns the content of the response as a Unicode string
Example:
import requests
# Send a request
a = requests.get('http://www.baidu.com')
print(a.text)
# Print the HTTP status code
print(a.status_code)
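As a follow-up, a short sketch exercising a few more of the attributes listed above (the json() line is left as a comment because baidu's home page does not return JSON):
import requests

r = requests.get("http://www.baidu.com")
print(r.headers["Content-Type"])  # response headers behave like a dictionary
print(r.encoding)                 # encoding used to decode r.text
print(r.elapsed.microseconds)     # time between sending the request and receiving the response
print(r.ok)                       # True here, because status_code is less than 400
r.raise_for_status()              # would raise HTTPError for a 4xx/5xx status
# r.json() would parse a JSON body, e.g. for an endpoint such as /ajax/demo.json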
delete(url, args) # Sends a DELETE request to the specified url
get(url, params, args) # Sends a GET request to the specified url
head(url, args) # Sends a HEAD request to the specified url
patch(url, data, args) # Sends a PATCH request to the specified url
post(url, data, json, args) # Sends a POST request to the specified url
'''The post() method sends a POST request to the specified url; the general form is as follows:'''
requests.post(url, data={key: value}, json={key: value}, args)
put(url, data, args) # Sends a PUT request to the specified url
request(method, url, args) # Sends a request with the specified method to the specified url
Example:
a = requests.request('get','http://www.baidu.com/')
a = requests.post('http://www.baidu.com/xxxxx/1.txt')
requests.get(url)
requests.put(url)
requests.delete(url)
requests.head(url)
requests.options(url)
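The calls above send no payload. Here is a small, hedged sketch of post() with form data and with a JSON body, using httpbin.org purely as an example echo service (not part of the original text):
import requests

r1 = requests.post("https://httpbin.org/post", data={"key": "value"})  # sent as form data
r2 = requests.post("https://httpbin.org/post", json={"key": "value"})  # sent as a JSON body
print(r1.status_code)
print(r2.json()["json"])  # the echo service returns the JSON we posted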
import requests
info = {'frame': 'information'}
# Set the request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
# params accepts a dictionary or a string of query parameters; a dictionary is converted to URL encoding automatically, so urlencode() is not needed
response = requests.get("https://www.baidu.com/", params=info, headers=headers)
# Check the response status code
print(response.status_code)
# Check the character encoding of the response
print(response.encoding)
# See the full URL
print(response.url)
# View the response content; response.text returns Unicode-formatted data
print(response.text)
requests.Request # Represents the request object, used to prepare a request to be sent to the server
requests.Response # Represents the response object, containing the server's response to an HTTP request
requests.Session # Represents a request session, providing cookie persistence, connection pooling (a managed pool of reusable connections), and configuration
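A minimal sketch of Session usage: cookies set by the server are kept in the session and reused on later requests made through it (the URL here is just an example):
import requests

s = requests.Session()
s.headers.update({"User-Agent": "Mozilla/5.0"})  # headers set once apply to every request in the session
r1 = s.get("http://www.baidu.com/")
print(s.cookies.get_dict())                      # cookies persisted by the session
r2 = s.get("http://www.baidu.com/")              # sent with the stored cookies over a pooled connection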
Write to an existing file
To write to an existing file, you must pass an extra parameter to the open() function:
"a" - append - appends to the end of the file
"w" - write - overwrites any existing content
Create a new file
To create a new file in Python, use the open() method with one of the following parameters:
"x" - create - creates the file; returns an error if the file already exists
"a" - append - creates the file if the specified file does not exist
"w" - write - creates the file if the specified file does not exist
Example (use a crawler to write a page to a file):
import requests
f = open("C:/Users/Jiuze/Desktop/demo.txt", "w", encoding='utf-8')
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
response = requests.get("https://www.baidu.com/", headers=headers)
f.write(response.content.decode('utf-8'))
f.close()
The os module provides a rich set of methods for handling files and directories.
A reference is available in the Runoob tutorial: https://www.runoob.com/
https://www.runoob.com/python/os-file-methods.html
Example (download a file and save it to a local folder):
import requests
import os

url = "xxxxxxxxxxxxxxxxxxxxxxxxx"            # URL of the file to download
compile = "C:/Users/Jiuze/Desktop/imgs/"     # target folder
path = compile + url.split('/')[-1]          # local file name taken from the URL
try:
    if not os.path.exists(compile):
        os.mkdir(compile)
    if not os.path.exists(path):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
        }
        rr = requests.get(url=url, headers=headers)
        with open(path, 'wb') as f:
            f.write(rr.content)
        print('File saved successfully!')
    else:
        print('File already exists!')
except:
    print("Crawling failed!")
BeautifulSoup is used to extract data from XML and HTML.
Get the content of the title tag in the page source:
title.name # Get the tag name
title.string # Get the string inside the tag
title.parent.string # Get the string of the parent tag
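A short sketch of these attributes against the Baidu home page (note that parent.string can be None when the parent tag has more than one child, so parent.name is printed instead):
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.baidu.com")
soup = BeautifulSoup(r.content, "html.parser")
title = soup.title
print(title.name)         # 'title' - the tag name
print(title.string)       # the text inside <title>
print(title.parent.name)  # 'head' - the name of the parent tag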
**Example:** get the p tags of an HTML page
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.baidu.com")
n = r.content
m = BeautifulSoup(n, "html.parser")
with open('C:/Users/Jiuze/Desktop/imgs/2.txt', 'w+', encoding='utf-8') as f:
    for i in m.find_all("p"):  # find_all() gets every occurrence of the given tag in the source
        f.write(str(i))
**Example:** use a regular expression to output the tags whose names start with p
import requests
import re
from bs4 import BeautifulSoup

r = requests.get("http://www.baidu.com")
n = r.content
m = BeautifulSoup(n, "html.parser")
with open('C:/Users/Jiuze/Desktop/imgs/1.txt', 'w+', encoding='utf-8') as f:
    for tag in m.find_all(re.compile("^p")):  # tag names matching the regular expression
        f.write(str(tag))
print('Crawled successfully!')
# -*- coding: utf-8 -*-
import requests
import re
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
f = open('C:/Users/Jiuze/Desktop/imgs/dpcq.txt', 'a+')

def get_info(url):
    res = requests.get(url, headers=headers)
    if res.status_code == 200:
        # re.S lets "." also match newlines
        contents = re.findall('<p>(.*?)<p>', res.content.decode('utf-8'), re.S)
        for content in contents:
            f.write(content + '\n')
    else:
        pass

if __name__ == '__main__':
    urls = ['http://www.doupoxs.com/doupocangqiong/{}.html'.format(str(i)) for i in range(1, 200)]
    for url in urls:
        get_info(url)
        time.sleep(1)
    f.close()
import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.baidu.com")
m = r.content
n = BeautifulSoup(m, "html.parser")
print(n)
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import re
from bs4 import BeautifulSoup
html_doc = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """
# Create a BeautifulSoup parsing object (from_encoding is unnecessary here because html_doc is already a str)
soup = BeautifulSoup(html_doc, "html.parser")

# Get all the links
links = soup.find_all('a')
print("All links")
for link in links:
    print(link.name, link['href'], link.get_text())

print("Get a specific URL")
link_node = soup.find('a', href="http://example.com/elsie")
print(link_node.name, link_node['href'], link_node['class'], link_node.get_text())

print("Regular expression matching")
link_node = soup.find('a', href=re.compile(r"ti"))
print(link_node.name, link_node['href'], link_node['class'], link_node.get_text())

print("Get the text of the p paragraph")
p_node = soup.find('p', class_='story')
print(p_node.name, p_node['class'], p_node.get_text())
The urllib library is used to work with web page URLs and to fetch page content.
Syntax:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
url: # The URL address.
data: # Additional data to send to the server; defaults to None.
timeout: # Sets the access timeout.
cafile and capath: # cafile is a CA certificate and capath is the path to CA certificates; needed when using HTTPS.
cadefault: # Deprecated.
context: # An ssl.SSLContext object used to specify SSL settings.
urllib.robotparser # Parses robots.txt files
urllib.request # Opens/reads URLs
urllib.parse # Parses URLs
urllib.error # Contains the exceptions raised by urllib.request.
readline() # Reads one line from the response object
readlines() # Reads the entire content of the response and assigns it to a list variable, one line per element
Example:
from urllib import request
file = request.urlopen('http://www.baidu.com')
data = file.read()
f = open('C:/Users/Jiuze/Desktop/2.html', 'wb')
f.write(data)
f.close()
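The readline()/readlines() methods listed above work the same way, since the object returned by urlopen() is file-like; a brief sketch:
from urllib import request

with request.urlopen('http://www.baidu.com') as resp:
    first_line = resp.readline()   # one line, as bytes
    remaining = resp.readlines()   # the remaining lines, as a list of bytes objects
print(first_line.decode('utf-8', errors='ignore'))
print(len(remaining), 'more lines')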
Scrapy is an application framework implemented in Python for crawling web sites and extracting structured data.
Scrapy is often used in programs that involve data mining, information processing, or storing historical data.
Usually we can implement a crawler quite simply with the Scrapy framework to grab the content or images of a specified website.
Engine # Responsible for the communication, signals, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler
Scheduler # Receives the Request objects sent by the Engine, arranges and enqueues them in a certain order, and hands them back when the Engine asks for them
Downloader # Downloads all the Requests sent by the Engine and returns the Responses to the Engine, which passes them to the Spider for processing
Spider # Handles all Responses, analyzes and extracts data from them to fill the Item fields, and submits any follow-up URLs to the Engine, which feeds them back into the Scheduler
Item Pipeline # Handles the Items obtained by the Spider and performs post-processing (detailed analysis, filtering, storage, and so on)
Spider Middleware # A component for custom extension of the communication between the Engine and the Spider
Downloader Middleware # A component that can be customized to extend the download functionality
Steps for building a Scrapy crawler:
Create a project (scrapy startproject xxx): create a new crawler project
Define the targets (write items.py): identify the data you want to capture
Make the spider (spiders/xxspider.py): write a spider that starts crawling the pages
Store the content (pipelines.py): design a pipeline to store the crawled content
scrapy startproject mySpider # Create a new project named mySpider
scrapy genspider <name> <domain> # Generate a spider inside the project (used later)
mySpider/
scrapy.cfg
mySpider/
__init__.py
items.py
pipelines.py
settings.py
spiders/
__init__.py
...
scrapy.cfg: # The project's configuration file.
mySpider/: # The project's Python module; code is imported from here.
mySpider/items.py: # The project's item (target data) definitions.
mySpider/pipelines.py: # The project's pipeline file.
mySpider/settings.py: # The project's settings file.
mySpider/spiders/: # The directory that stores the spider code.
Open items.py in the mySpider directory.
An Item defines structured data fields used to hold the crawled data. It is somewhat like a Python dict, but provides some extra protection against errors.
You define an Item by creating a class that inherits from scrapy.Item and declaring its fields as scrapy.Field class attributes (this can be understood as something like an ORM mapping).
Next, create an ItcastItem class and build the item model:
import scrapy

class ItcastItem(scrapy.Item):
    name = scrapy.Field()
    title = scrapy.Field()
    info = scrapy.Field()
Run the following command in the project directory to create a spider named itcast under mySpider/spiders and to specify the domain it is allowed to crawl:
scrapy genspider itcast "itcast.cn"
Open itcast.py in the mySpider/spiders directory; the following code has already been generated by default:
import scrapy

class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["www.itcast.cn"]
    start_urls = (
        'http://www.itcast.cn/channel/teacher.shtml',
    )

    def parse(self, response):
        pass
Change the value of start_urls to the first URL that needs to be crawled:
start_urls = ("http://www.itcast.cn/channel/teacher.shtml",)
Modify the parse() method (response.body is bytes, so the file is opened in binary mode):
def parse(self, response):
    filename = "teacher.html"
    open(filename, 'wb').write(response.body)
Then run the spider; execute in the mySpider directory:
scrapy crawl itcast
To specify the encoding used when saving the content under Python 2, you could add the following (this is neither needed nor available in Python 3):
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
Now that the whole page can be crawled, the next step is to extract the data we want. First look at the page source:
<div class="li_txt">
<h3> xxx </h3>
<h4> xxxxx </h4>
<p> xxxxxxxx </p>
With the xpath method, we only need to supply XPath rules to locate the corresponding HTML nodes:
Rules:
/html/head/title: selects the <title> element inside the <head> of the HTML document
/html/head/title/text(): selects the text of the <title> element above
//td: selects all <td> elements
//div[@class="mine"]: selects all div elements that have the attribute class="mine"
Modify the code of itcast.py as follows:
# -*- coding: utf-8 -*-
import scrapy

class Opp2Spider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['www.itcast.cn']
    start_urls = ['http://www.itcast.cn/channel/teacher.shtml']

    def parse(self, response):
        # Get the website title
        context = response.xpath('/html/head/title/text()')
        # Extract the website title
        title = context.extract_first()
        print(title)
Execute the following command:
scrapy crawl itcast
The ItcastItem class is defined in mySpider/items.py. Import it:
from mySpider.items import ItcastItem
Then we wrap the data we obtain in ItcastItem objects, so the attributes of each entry can be saved:
import scrapy
from mySpider.items import ItcastItem

def parse(self, response):
    #open("teacher.html","wb").write(response.body).close()
    # A collection of teacher items
    items = []
    for each in response.xpath("//div[@class='li_txt']"):
        # Wrap the data we get in an `ItcastItem` object
        item = ItcastItem()
        # extract() returns a list of unicode strings
        name = each.xpath("h3/text()").extract()
        title = each.xpath("h4/text()").extract()
        info = each.xpath("p/text()").extract()
        # xpath returns a list, so take the first element
        item['name'] = name[0]
        item['title'] = title[0]
        item['info'] = info[0]
        items.append(item)
    # Return the collected data directly
    return items
Scrapy provides four simple ways to save the scraped information; the -o option outputs a file in the specified format. The commands are as follows:
scrapy crawl itcast -o teachers.json
JSON Lines format, Unicode-encoded by default:
scrapy crawl itcast -o teachers.jsonl
CSV (comma-separated values), which can be opened with Excel:
scrapy crawl itcast -o teachers.csv
XML format:
scrapy crawl itcast -o teachers.xml
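The -o switch covers these simple exports. The pipelines.py file mentioned in the project layout is the place for custom storage; as a rough, hedged sketch (the class name, output file name, and priority value are arbitrary choices, not from the original walkthrough), a minimal pipeline that writes each item as one JSON line could look like this:
# mySpider/pipelines.py
import json

class ItcastJsonLinesPipeline:
    def open_spider(self, spider):
        self.file = open("teachers_pipeline.jsonl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # every item returned/yielded by the spider passes through here
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()

To enable it, register the class in settings.py, for example:
ITEM_PIPELINES = {"mySpider.pipelines.ItcastJsonLinesPipeline": 300}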
name = "" : The identifying name of this crawler , Must be unique , Different names must be defined in different crawlers .
allow_domains = [] Is the domain name range of the search , That's the constraint area of the crawler , The crawler only crawls the web page under this domain name , There is no the URL Will be ignored .
start_urls = () : The crawl URL Yuan Zu / list . This is where the crawler starts to grab data , therefore , The first download of data will come from these urls Start . Other children URL It will come from these start URL In inheritance generation .
parse(self, response) : The method of analysis , Each initial URL When the download is complete, it will be called , Call when passed in from each URL Back to the Response Object as the only parameter , The main functions are as follows :
1. Responsible for parsing the returned web page data (response.body), Extract structured data ( Generate item)
2. Generate... That requires the next page URL request .
(The attempt above failed and nothing concrete was crawled, so we switch to another example.)
We will crawl the title and verse of a poem from https://so.gushiwen.cn/shiwenv_4c5705b99143.aspx and save them to our folder.
scrapy startproject poems
scrapy genspider verse www.xxx.com
Open the spider file 'verse' and change the URL that needs to be crawled:
import scrapy

class VerseSpider(scrapy.Spider):
    name = 'verse'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['www.xxx.com']
Next, change the parse method to analyze the response data. The parsing here uses XPath, and the approach is similar to parsing a page fetched with requests: first find the part we need, then fill in the corresponding code. The difference from the requests workflow is that we add the extract method, plus join, to get the text:
title = response.xpath('//div[@class="cont"]/h1/text()').extract()
content = response.xpath('//div[@id="contson4c5705b99143"]/text()').extract()
content = ''.join(content)
To save the data, the parse method needs a return value. First create an empty list data, then put title and content into a dictionary and append it to the list:
import scrapy

class VerseSpider(scrapy.Spider):
    name = 'verse'
    allowed_domains = ['https://so.gushiwen.cn/']
    start_urls = ['https://so.gushiwen.cn/shiwenv_4c5705b99143.aspx/']

    def parse(self, response):
        data = []
        title = response.xpath('//*[@id="sonsyuanwen"]/div[1]/h1').extract()
        content = response.xpath('//div[@id="contson4c5705b99143"]/text()').extract()
        title = ''.join(title)
        content = ''.join(content)
        dic = {'title': title, 'content': content}
        data.append(dic)
        return data
Save with a command as before: scrapy crawl <spider file name> -o <save path>. The final command is as follows:
scrapy crawl verse -o ./verse.csv