http://scrapy-chs.readthedocs.io/zh_CN/latest/intro/tutorial.html#id5
Reference book: 《Master Scrapy Web Crawler》
A web crawler is a program that automatically fetches website content from the Internet; it is also known as a web spider or web robot.
The basic crawling process is:
Brief introduction:
Scrapy is an open-source web crawling framework written in Python on top of the Twisted framework. It currently supports Python 2.7 and Python 3.4+.
install
pip install scrapy
# If the installer complains about missing dependencies:
pip install wheel
# Install the Twisted dependency from a downloaded wheel:
pip install C:\Users\10338\Downloads\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
http://books.toscrape.com/
Create the project from the command line (shell):
# scrapy startproject <project name>
scrapy startproject book_spider
from scrapy import cmdline
cmdline.execute("scrapy crawl LG_Spider -o LG_Spider.csv".split())
Write the code (add the Scrapy project in PyCharm).
import scrapy

class BooksSpider(scrapy.Spider):
    # Name of the spider
    name = "books"
    # Starting point for the crawl
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Each book's information sits inside an <article class="product_pod"> element.
        # Use a CSS selector to find all of them and iterate.
        for book in response.css('article.product_pod'):
            book_name = book.xpath('./h3/a/@title').extract_first()
            book_price = book.css('p.price_color::text').extract_first()
            yield {
                'book_name': book_name,
                'book_price': book_price,
            }
        # Extract the link to the next page
        next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)
extract_first() returns a string; extract() returns a list.
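A minimal sketch of the difference, assuming the books.toscrape.com page structure used above:

    prices = response.css('p.price_color::text').extract()            # list of all matches, e.g. ['£51.77', '£53.74', ...]
    first_price = response.css('p.price_color::text').extract_first() # first match only, e.g. '£51.77'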
name attribute: a Scrapy project may contain multiple spiders; the name attribute is what uniquely distinguishes them.
start_urls attribute: the point(s) from which the spider starts crawling, i.e. the initial crawl URLs.
parse: after a page has been downloaded successfully, Scrapy calls the page-parsing function we specify (the parse method by default).
::attr() (similar to jQuery's attr()): returns the attribute value of the selected element.
scrapy crawl books -o books.csv
The scraped data is saved to the books.csv file.
Engine: coordinates the data flow among all the components of the system and triggers events.
Scheduler: accepts requests from the engine, pushes them into a queue, and returns them when the engine asks again. It can be thought of as a priority queue of URLs: it decides which URL to crawl next and removes duplicate URLs.
Downloader: downloads web page content and returns it to the spider. The downloader is built on Twisted, an efficient asynchronous model.
Spiders: do the actual work of extracting the needed information (the so-called items) from the Internet. Users can also extract links from a page so the spider continues on to the next page.
Item Pipeline: processes the items extracted from pages by the spiders. Its main functions are persisting items, validating them, and clearing out unwanted information. When a page has been parsed by the spider, the items are sent to the item pipeline and processed in several specific steps in order.
Downloader Middleware: sits between the Scrapy engine and the downloader and mainly handles the requests and responses passing between them.
Spider Middleware: sits between the Scrapy engine and the spiders and mainly handles the requests and responses passing between them.
Scheduler Middleware: sits between the Scrapy engine and the scheduler and handles the requests and responses sent between the engine and the scheduler.
1. The engine takes a URL from the scheduler for the next page to crawl.
2. The engine wraps the URL in a Request and hands it to the downloader.
3. The downloader downloads the resource and wraps it in a Response.
4. The spider parses the Response.
5. If an entity (Item) is parsed out, it is sent to the item pipeline for further processing.
6. If a link (URL) is parsed out, the URL is handed to the scheduler to be crawled.
Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
~url: the request address.
~callback: the page-parsing function (callable). After the page requested by the Request object has been downloaded, the function specified by this parameter is called; if it is not specified, the spider's parse method is called by default.
~method: the HTTP request method, 'GET' by default.
~headers: the HTTP request headers, dict type, e.g. {'A': 'a', 'B': 'b'}. If an entry's value is None, that header is not sent; e.g. {'C': None} disables sending header C.
~body: the HTTP request body, bytes or str type.
~cookies: the cookie information dictionary, dict type.
~meta: the metadata dictionary of the Request, dict type; it carries extra information along with the request so that it can be read later (e.g. in the callback through response.meta) or by other components.
~encoding: the encoding format.
~priority: sets the priority of the request.
~dont_filter: defaults to False; setting it to True keeps the request from being filtered out by the duplicate filter, forcing the download.
~errback: the callback invoked on error.
import scrapy
request = scrapy.Request('<url>')
request = scrapy.Request('<url>', callback=self.parseItem)
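As a rough sketch (placed inside a spider; the URL, header value, meta key and the parse_book callback name are placeholders), several of the parameters above can be combined:

    import scrapy

    request = scrapy.Request(
        'http://books.toscrape.com/',          # url: the request address
        callback=self.parse_book,              # hypothetical page-parsing method of the spider
        method='GET',
        headers={'User-Agent': 'my-crawler'},  # placeholder header value
        meta={'page': 1},                      # extra data carried with the request
        priority=10,
        dont_filter=True,                      # skip the duplicate filter
    )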
After a page has been downloaded, you get an object of a Response subclass. The subclasses are TextResponse, HtmlResponse and XmlResponse; since we usually deal with web pages, HtmlResponse is the most common. Note that TextResponse is the parent class of HtmlResponse and XmlResponse.
~url: the URL address of the HTTP response, str type.
~status: the status code of the HTTP response.
~headers: the HTTP response headers, dict-like type.
~body: the HTTP response body, bytes type.
~text: the response body as text (str).
~encoding: the encoding format.
~request: the Request object that produced this HTTP response.
~meta: i.e. response.request.meta; the meta information set when constructing the Request object can be read back through response.meta.
~selector: a Selector object used to extract data from the response.
~xpath(query): extracts data from the response using XPath; it is actually a shortcut for response.selector.xpath.
~css(query): extracts data from the response using CSS selectors; it is actually a shortcut for response.selector.css.
~urljoin(url): constructs an absolute URL. When the url argument is a relative address, the absolute URL is computed against response.url.
The most commonly used members are the css and xpath methods for extracting data, and urljoin for constructing absolute URLs.
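A short sketch of these Response members inside a spider's parse callback (the selectors assume the books.toscrape.com structure used earlier):

    def parse(self, response):
        # status code and encoding of the response
        print(response.status)        # e.g. 200
        print(response.encoding)      # e.g. 'utf-8'
        # xpath() and css() are shortcuts for response.selector.xpath / response.selector.css
        titles = response.xpath('//h3/a/@title').extract()
        prices = response.css('p.price_color::text').extract()
        # urljoin() resolves a relative href against response.url
        next_href = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_href:
            next_page = response.urljoin(next_href)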
The four steps of development:
1. Inherit from scrapy.Spider.
2. Give the spider a name.
3. Set the initial crawl point(s).
4. Implement the page-parsing function.
The Scrapy framework provides a Spider base class.
The Spider base class implements:
1. The interface called by the Scrapy engine.
2. Utility functions for users.
3. Properties for users to access.
A project can implement multiple spiders; the name attribute is the only thing that distinguishes them. This identifier is used when running scrapy crawl <name>.
start_urls is the spider's starting point; it is usually a list holding all the initial URLs to crawl.
The page-parsing function is the callback specified through the callback parameter when constructing a Request object (or the parse method by default). It needs to:
1. Use selectors to extract data from the page, encapsulate it (Item or dict), and submit it to the Scrapy engine.
2. Use selectors or LinkExtractor to extract links from the page, construct new Request objects from them, and submit those to the Scrapy engine.
Commonly used modules for parsing HTML pages:
~BeautifulSoup
~LXML
Scrapy builds on these ideas in its Selector class: first select parts of the page with XPath or CSS selectors, then extract the data.
When creating a Selector object, the HTML document string can be passed to the Selector constructor through the text parameter, or a Response object can be passed through the response parameter.
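A minimal sketch of both ways of constructing a Selector (the HTML string here is a made-up example):

    from scrapy.selector import Selector
    from scrapy.http import HtmlResponse

    body = '<html><body><h1>Hello world</h1></body></html>'

    # 1. pass the HTML document string through the text parameter
    selector = Selector(text=body)

    # 2. pass a Response object through the response parameter
    response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf8')
    selector = Selector(response=response)
    print(selector.xpath('//h1/text()').extract_first())   # 'Hello world'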
Using the xpath and css methods
Because Selenium's XPath location traverses the page, CSS selectors perform better. Although XPath scores worse on performance, it has good plug-in support in browsers and makes locating elements convenient; when performance requirements are strict, CSS selectors can be used instead.
XPath:
XPath uses path expressions to navigate in XML documents.
XPath contains a standard function library.
XPath is a major element in XSLT (XSL Transformations; XSL is the Extensible Stylesheet Language).
XPath is a W3C standard (a web technology standard).
In XPath there are seven kinds of nodes: element, attribute, text, namespace, processing instruction, comment, and document (root) node. An XML document is treated as a tree, and the root of the tree is called the document node or root node.
The xpath and css methods return a SelectorList object, which contains a Selector object for each selected part. SelectorList supports the list interface, so a for statement can be used to access each Selector object:
for sel in selector_list:
    print(sel.xpath('./text()'))
SelectorList objects also have xpath and css methods. Calling them calls the corresponding xpath or css method of each contained Selector object with the received arguments, and collects all the results into a new SelectorList object that is returned to the user, e.g.:
>>> selector_list.xpath('./text()')
[<Selector xpath='./text()' data='hello world'>, <Selector xpath='./text()' data='Hello world'>]
Methods of Selector and SelectorList objects: extract(), re(), extract_first(), re_first() (the last two are specific to SelectorList).
1.extract() Method
Calling a Selector object's extract method returns the selected content as a Unicode string. As with the xpath and css methods, a SelectorList object's extract method internally calls extract on each contained Selector object and collects all the results into a list that is returned to the user.
2.extract_first()
This method returns the result of calling extract on the first Selector object in the SelectorList. It is handy when the SelectorList contains only one Selector object: a Unicode string is extracted directly instead of a list.
3.re and re_first
To extract part of the content with a regular expression, use the re method; re_first likewise returns the result of calling re on the first Selector object.
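A hedged sketch extracting only the number from price strings such as '£51.77':

    # re() applies the regular expression to each selected text and returns a list of strings;
    # re_first() returns only the first match
    prices = response.css('p.price_color::text').re(r'\d+\.\d+')       # e.g. ['51.77', '53.74', ...]
    first = response.css('p.price_color::text').re_first(r'\d+\.\d+')  # e.g. '51.77'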
In practice there is rarely a need to create Selector objects manually: the first time a Response object's selector attribute is accessed, the Response automatically creates a Selector object from itself and caches it for later use.
Supplementary knowledge: the @property decorator
Example :
class Student(object):
    def __init__(self, name, score):
        self.name = name
        self.score = score

    @property
    def score(self):
        return self.__score

    @score.setter
    def score(self, score):
        if score < 0 or score > 1000:
            raise ValueError('invalid score')
        self.__score = score
Note: the first score(self) is the getter, decorated with @property; the second score(self, score) is the setter, decorated with @score.setter. @score.setter is a by-product of the preceding @property decoration.
Setting the score property:
>>> s = Student('Bob', 59)
>>> s.score = 60
>>> print(s.score)
60
>>> s.score = 1000
Traceback (most recent call last):
  ...
ValueError: invalid score
XPath, i.e. XML Path Language.
Common basic XPath syntax
<html>
  <head>
    <base href='http://example.com/'/>
    <title>Example website</title>
  </head>
  <body>
    <div id='images'>
      <a href='image1.html'>Name:Image 1<br/><img src='image1.jpg'></a>
      <a href='image1.html'>Name:Image 2<br/><img src='image2.jpg'></a>
      <a href='image1.html'>Name:Image 3<br/><img src='image3.jpg'></a>
      <a href='image1.html'>Name:Image 4<br/><img src='image4.jpg'></a>
      <a href='image1.html'>Name:Image 5<br/><img src='image5.jpg'></a>
      <a href='image1.html'>Name:Image 6<br/><img src='image6.jpg'></a>
    </div>
  </body>
</html>
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

# body is the HTML document string shown above
response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf8')
response.xpath('/html')
response.xpath('/html/head')
response.xpath('/html/body/div/a')
response.xpath('//a')
(This section is to be filled in.)
XPath provides many functions, e.g. for numbers, strings, time, dates, statistics and so on.
string(arg): returns the string value of the argument.
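For example, a small sketch (the anchor markup is made up, modeled on the Example website above):

    from scrapy.selector import Selector

    sel = Selector(text='<a href="image1.html">Name: Image 1 <br/><img src="image1.jpg"/></a>')
    # string(arg) returns the string value of its argument,
    # here the concatenated text content of the <a> element
    print(sel.xpath('string(//a)').extract_first())   # 'Name: Image 1 ' (possibly with trailing whitespace)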
CSS, i.e. Cascading Style Sheets; its selector syntax is a language for specifying the location of parts of an HTML document. CSS selectors are simpler than XPath but less powerful.
When the css method is actually used, the Python library cssselect translates the CSS selector expression into an XPath expression, and the Selector object's xpath method is then called.
CSS Selectors
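A few common CSS selector expressions, shown as a sketch against the Example website document above (the exact matches depend on that page's structure):

    response.css('img')                 # all <img> elements
    response.css('div#images')          # the <div> whose id is "images"
    response.css('div#images > a')      # <a> elements that are direct children of that div
    response.css('a::attr(href)')       # the href attribute of <a> elements
    response.css('a::text')             # the text inside <a> elements
    response.css('a:first-child')       # <a> elements that are the first child of their parent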
Item base class: the base for custom data classes (supports the dict interface).
Field class: used to describe which fields the custom data class contains.
To define a custom data class, simply inherit from Item and create a series of class attributes that are Field objects.
Assigning a value to a field that has not been declared raises an exception.
Field is actually a subclass of the Python dict, so metadata in a Field object can be accessed by key.
Rewrite the previous code
from scrapy import Item, Field

class BookSpiderItem(Item):
    name = Field()
    price = Field()
Then modify the earlier BooksSpider to use BookSpiderItem instead of a Python dict:
from ..items import BookSpiderItem

class BooksSpider(scrapy.Spider):
    def parse(self, response):
        for sel in response.css('article.product_pod'):
            book = BookSpiderItem()
            book['name'] = sel.xpath('./h3/a/@title').extract_first()
            book['price'] = sel.css('p.price_color::text').extract_first()
            yield book
To add a new field to the Item, simply add another class attribute that is a Field() object.
After a data item is submitted by the Spider to the Scrapy engine, it may be handed to other components (Item Pipeline, Exporter) for processing. If you want to pass extra information to the component that processes the data (for example, how that component should handle the data), you can use Field metadata.
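As a hedged sketch of attaching metadata through Field (the serializer function below is made up; the 'serializer' key is one that the built-in exporters read):

    from scrapy import Item, Field

    # Hypothetical serializer function: prefix the exported value with a currency mark
    def price_with_currency(value):
        return '¥ %s' % str(value)

    class BookSpiderItem(Item):
        name = Field()
        # the keyword arguments are simply stored as metadata in the Field dict
        price = Field(serializer=price_with_currency)

    # Field is a dict subclass, so metadata can be read back by key
    print(BookSpiderItem.fields['price'].get('serializer'))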
Data processing: multiple Item Pipelines can be enabled in a project. Typical applications are:
Data cleaning
Validate data
Filter duplicate data
Store data in a database
A pipelines.py file is generated when the project is created.
(1) An Item Pipeline does not need to inherit from a specific base class; it only needs to implement certain methods, e.g. process_item, open_spider, close_spider.
(2) An Item Pipeline must implement a process_item(item, spider) method, which processes each item of data crawled by the spider. Its two parameters are:
item: an item of crawled data (an Item or a dict).
spider: the Spider object that crawled this data.
(3) If process_item returns an item of data (an Item or a dict) when processing an item, the returned data is sent to the next Item Pipeline for further processing.
(4) If process_item raises a DropItem exception (scrapy.exceptions.DropItem) when processing an item, that item is discarded: it is not passed to the following Item Pipelines, nor is it exported to a file. Usually we raise DropItem when invalid data is detected or when data should be filtered out.
(5) open_spider(self, spider)
Called back when the Spider is opened (before data processing starts). It is usually used for initialization before data processing begins, such as connecting to a database.
(6) close_spider(self, spider)
Called back when the Spider is closed (after data processing finishes). It is usually used for cleanup after all the data has been processed, such as closing the database.
(7) from_crawler(cls, crawler)
Called back when the Item Pipeline object is created. Usually the configuration is read from crawler.settings in this method, and the Item Pipeline object is created from that configuration.
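Putting the methods above together, a skeleton Item Pipeline might look like this (the settings key MY_SETTING and the name check are made-up examples):

    from scrapy.exceptions import DropItem

    class SkeletonPipeline(object):
        def __init__(self, my_setting):
            self.my_setting = my_setting

        @classmethod
        def from_crawler(cls, crawler):
            # read configuration from crawler.settings and build the pipeline object
            return cls(my_setting=crawler.settings.get('MY_SETTING', 'default'))

        def open_spider(self, spider):
            # initialization before data processing starts, e.g. open a database connection
            pass

        def close_spider(self, spider):
            # cleanup after all data has been processed, e.g. close the connection
            pass

        def process_item(self, item, spider):
            if not item.get('name'):
                # invalid data: drop it so it is not passed on or exported
                raise DropItem('missing name: %s' % item)
            return item  # pass the item on to the next pipeline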
In Scrapy, to enable an Item Pipeline it must be configured in the settings.py configuration file:
ITEM_PIPELINES = {
    'example.pipelines.PriceConverterPipeline': 300,
}
ITEM_PIPELINES is a dict. We add the Item Pipelines to enable to this dict; each value is a number from 0 to 1000, and the smaller the number, the earlier that pipeline runs.
Implementation:
class PriceConverterPipeline(object):
    # exchange rate: pounds sterling to RMB
    exchange_rate = 8.5209

    def process_item(self, item, spider):
        # strip the leading '£', convert to float, and apply the exchange rate
        price = float(item['price'][1:]) * self.exchange_rate
        item['price'] = '¥%.2f' % price
        return item
Code interpretation: an Item Pipeline does not need to inherit from a specific base class; it only needs to implement certain methods such as process_item, open_spider, close_spider.
The implementation above is simple: convert the book's price in pounds sterling to a float, multiply by the exchange rate, keep two decimal places, assign the result back to the item's price field, and finally return the processed item.
As can be seen, when process_item returns an item of data (an Item or a dict), the returned data is delivered to the next pipeline's process_item (if there is one) for further processing.
Filtering duplicate data: handle duplicate data with the following code:
from scrapy.exceptions import DropItem

class DuplicationPipeline(object):
    def __init__(self):
        self.book_set = set()

    def process_item(self, item, spider):
        name = item['name']
        if name in self.book_set:
            raise DropItem("Duplicate book found: %s" % item)
        self.book_set.add(name)
        return item
A constructor is added to initialize the set used for de-duplicating book titles.
In process_item, the item's name field is taken out first and checked against the set book_set. If the title is already in the set, the item is duplicate data: a DropItem exception is raised and the item is discarded. Otherwise, the item's name field is added to the set and the item is returned.
Storing data in a database can also be done by implementing an Item Pipeline.
Example :
from scrapy.item import Item
import pymongo

class MongoDBPipeline(object):
    DB_URI = 'mongodb://localhost:27017/'
    DB_NAME = 'scrapy_data'

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.DB_URI)
        self.db = self.client[self.DB_NAME]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        collection = self.db[spider.name]
        # insert_one expects a dict, so convert Item objects first
        post = dict(item) if isinstance(item, Item) else item
        collection.insert_one(post)
        return item
Code interpretation :
Two constants are defined as class attributes: DB_URI, the URI of the database, and
DB_NAME, the name of the database.
The database connection only needs to be established once for the whole crawl: connect before data processing starts and close after it finishes. This is why open_spider(spider) and close_spider(spider) are implemented.
process_item implements the MongoDB write. self.db and spider.name are used to get the collection, and the data is then inserted into it. The collection's insert_one method must be passed a dict rather than an Item object, so the item's type is checked first and Item objects are converted to dicts.
Enable MongoDBPipeline in the settings.py file:
ITEM_PIPELINES = {
    'example.pipelines.PriceConverterPipeline': 300,
    'example.pipelines.MongoDBPipeline': 400,
}
When crawling, a page usually contains links to other pages, and sometimes we need to extract them. The two commonly used extraction approaches are Selector and LinkExtractor.
1. Selector: since links are also data in the page, they can be extracted the same way as other data. When only a small number of links needs to be extracted and the rules are simple, a Selector is enough.
2. LinkExtractor: Scrapy provides a class dedicated to extracting links, LinkExtractor. When extracting a large number of links or when the extraction rules are complex, LinkExtractor is more convenient.
class BooksSpider(scrapy.Spider):
    def parse(self, response):
        # Extract the next-page link; it lives in ul.pager > li.next > a
        next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_url:
            # Build the absolute URL of the next page and submit a new Request
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)
from scrapy.linkextractors import LinkExtractor

class BooksSpider(scrapy.Spider):
    def parse(self, response):
        le = LinkExtractor(restrict_css='ul.pager li.next')
        links = le.extract_links(response)
        if links:
            next_url = links[0].url
            yield scrapy.Request(next_url, callback=self.parse)
Import LinkExtractor; it lives in the scrapy.linkextractors module.
Create a LinkExtractor object and describe the extraction rule with one or more constructor parameters; here a CSS selector expression is passed to the restrict_css parameter, describing the region in which the next-page link is found.
Call the LinkExtractor object's extract_links method with a Response object; the method extracts links from the page contained in the response according to the rules described when the extractor was created and returns a list in which each element is a Link object, i.e. an extracted link.
Because each page has only one next-page link, links[0] gives the Link object. A Link object's url attribute is the absolute URL of the linked page (there is no need to call response.urljoin), so it can be used directly to construct and submit the Request.
Learning to describe extraction rules with LinkExtractor's constructor parameters.
<!-- example1.html -->
<html>
  <body>
    <div id="top">
      <a class="internal" href="/intro/install.html">Installation guide</a>
      <a class="internal" href="/intro/install.html">Tutorial</a>
      <a class="internal" href="/intro/install.html">Examples</a>
    </div>
    <div id="bottom">
      <p>Here are some off-site links</p>
      <a href="http://stackoverflow.com/tags/scrapy/info">StackOverflow</a>
      <a href="http://github.com/scrapy/scrapy">Fork on</a>
    </div>
  </body>
</html>
<!-- example2.html -->
<html>
  <head>
    <script type='text/javascript' src='/js/app1.js'/>
    <script type='text/javascript' src='/js/app2.js'/>
  </head>
  <body>
    <a href="/home.html">Home page</a>
    <a href="javascript:goToPage('/doc.html'); return false">Documentation</a>
    <a href="javascript:goToPage('/example.html'); return false">Examples</a>
  </body>
</html>
Construct two Response objects from the two HTML documents above:
from scrapy.http import HtmlResponse

# Construct a Response object for each of the two HTML documents
html1 = open('example1.html').read()
html2 = open('example2.html').read()
response1 = HtmlResponse(url='http://example.com', body=html1, encoding='utf8')
response2 = HtmlResponse(url='http://example.com', body=html2, encoding='utf8')
Note: every LinkExtractor constructor parameter has a default value; when a parameter is not given, its default value is used.
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor()
links = le.extract_links(response1)
print([link.url for link in links])
LinkExtractor parameters:
——allow: accepts a regular expression or a list of regular expressions and extracts the links whose absolute URL matches; if the parameter is empty, all links are extracted.
# Extract the links in example1.html whose path starts with /intro
from scrapy.linkextractors import LinkExtractor

pattern = '/intro/.+\.html$'
le = LinkExtractor(allow=pattern)
links = le.extract_links(response1)
print([link.url for link in links])
——deny: accepts a regular expression or a list of regular expressions; the opposite of allow, it excludes the links whose absolute URL matches.
# Extract all off-site links in example1.html (i.e. exclude on-site links)
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urlparse

pattern = '^' + urlparse(response1.url).geturl()
le = LinkExtractor(deny=pattern)
links = le.extract_links(response1)
print([link.url for link in links])
——allow_domains: accepts a domain name or a list of domain names; only links to the specified domains are extracted.
# Extract the links in example1.html pointing to github.com and stackoverflow.com
from scrapy.linkextractors import LinkExtractor

domains = ['github.com', 'stackoverflow.com']
le = LinkExtractor(allow_domains=domains)
links = le.extract_links(response1)
print([link.url for link in links])
——deny_domains: accepts a domain name or a list of domain names; the opposite of allow_domains, it excludes links to the specified domains.
# Extract the links in example1.html except those in the github.com domain
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(deny_domains='github.com')
links = le.extract_links(response1)
print([link.url for link in links])
——restrict_xpaths: accepts an XPath expression or a list of XPath expressions; only links inside the regions selected by the XPath expressions are extracted.
# Extract the links under the <div id="top"> element in example1.html
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(restrict_xpaths='//div[@id="top"]')
links = le.extract_links(response1)
print([link.url for link in links])
——restrict_css: accepts a CSS expression or a list of CSS expressions; only links inside the regions selected by the CSS expressions are extracted.
# Extract the links under the <div id="bottom"> element in example1.html
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(restrict_css='div#bottom')
links = le.extract_links(response1)
print([link.url for link in links])
——tags: accepts a tag (string) or a list of tags; links are extracted only from within the specified tags. The default is ['a', 'area'].
——attrs: accepts an attribute (string) or a list of attributes; links are extracted from the specified attributes. The default is ['href'].
# Extract the links to the JavaScript files referenced in example2.html
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(tags='script', attrs='src')
links = le.extract_links(response2)
print([link.url for link in links])
——process_value: accepts a callback function of the form func(value). When this parameter is given, LinkExtractor calls the function back for each extracted link value (such as an a element's href) to process it. The callback should normally return a string, i.e. the processed result; to discard the link being processed, return None.
import re

def process(value):
    # Pull the page path out of links of the form javascript:goToPage('...')
    m = re.search("javascript:goToPage\('(.*?)'\)", value)
    if m:
        value = m.group(1)
    return value

from scrapy.linkextractors import LinkExtractor
le = LinkExtractor(process_value=process)
links = le.extract_links(response2)
print([link.url for link in links])
The component responsible for exporting data in Scrapy is called an Exporter. Scrapy contains multiple exporters, each corresponding to one export data format. The supported formats are as follows:
(1) JSON —— JsonItemExporter
(2) JSON Lines —— JsonLinesItemExporter
(3) CSV —— CsvItemExporter
(4) XML —— XmlItemExporter
(5) Pickle —— PickleItemExporter
(6) Marshal —— MarshalItemExporter
When exporting data, we need to tell Scrapy two things: the path of the export file and the export data format.
These can be specified through command-line parameters or the configuration file.
When running scrapy crawl, the -o and -t parameters specify the path of the export file and the format of the exported data.
scrapy crawl books -o books.csv
Here -o books.csv specifies the export path of the file. Although the -t parameter is not used, Scrapy infers from the file suffix that the data should be exported in csv format; similarly, -o books.json would export the data in json format. Use the -t parameter when the export format needs to be specified explicitly:
scrapy crawl books -t csv -o book1.data
scrapy crawl books -t json -o book2.data
scrapy crawl books -t xml -o book3.data
The exporter is looked up in the configuration dictionary FEED_EXPORTERS, which is the merged content of FEED_EXPORTERS_BASE in the default configuration and FEED_EXPORTERS in the user configuration.
The former contains the internally supported export data formats; the latter contains user-defined export data formats. If a user adds a new export data format (i.e. implements a new exporter), FEED_EXPORTERS can be defined in the configuration file settings.py, for example:
FEED_EXPORTERS = {'excel': 'my_project.my_exporters.ExcelItemExporter'}
When specifying the file path, the two special variables %(name)s and %(time)s can also be used: %(name)s is replaced by the Spider's name, and %(time)s is replaced by the file creation time.
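For example, a sketch in settings.py (the directory name is arbitrary):

    # a "books" spider run would be exported to something like
    # export_data/books/<timestamp>.csv (the exact timestamp format may differ)
    FEED_URI = 'export_data/%(name)s/%(time)s.csv'
    FEED_FORMAT = 'csv'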
The configuration file: how data export is configured in the configuration file.
FEED_URI: the export path.
FEED_URI = 'exporter_data/%(name)s.data'
FEED_FORMAT: the export format.
FEED_FORMAT = 'csv'
FEED_EXPORT_ENCODING: the encoding of the export file. Note that by default JSON files are written in ASCII encoding, while the other formats use utf-8.
FEED_EXPORT_ENCODING = 'gbk'
FEED_EXPORT_FIELDS: the fields included in the exported data (all fields are exported by default), in the specified order.
FEED_EXPORT_FIELDS = ['name', 'author', 'price']
FEED_EXPORTERS: user-defined export data formats.
FEED_EXPORTERS = {'excel': 'my_project.my_exporters.ExcelItemExporter'}
export_item(self, item): responsible for exporting each item of crawled data; the parameter item is one item of crawled data. Every subclass must implement this method.
start_exporting(self): called when the export starts; initialization can be performed in this method.
finish_exporting(self): called when the export finishes; cleanup work can be performed in this method.
Take JsonItemExporter as an example:
So that the final export is a json list, b"[\n" and b"\n]" are written to the file in start_exporting and finish_exporting respectively.
In export_item, self.encoder.encode is called to convert the item to a json string, which is then written to the file.
# Create my_exporters.py in the project (in the same directory as settings.py)
# and implement ExcelItemExporter in it
from scrapy.exporters import BaseItemExporter
import xlwt

class ExcelItemExporter(BaseItemExporter):
    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        self.wbook = xlwt.Workbook()
        self.wsheet = self.wbook.add_sheet('scrapy')
        self.row = 0

    def finish_exporting(self):
        self.wbook.save(self.file)

    def export_item(self, item):
        fields = self._get_serialized_fields(item)
        for col, v in enumerate(x for _, x in fields):
            self.wsheet.write(self.row, col, v)
        self.row += 1
Code interpretation :
The third-party library xlwt is used to write data to an Excel file.
The constructor creates the Workbook and Worksheet objects and initializes self.row.
export_item calls the base class's _get_serialized_fields method to get an iterator over all of the item's fields, then calls self.wsheet.write to write each field into the Excel sheet.
finish_exporting is called after all the data has been written to the Excel sheet; in this method self.wbook.save is called to save the workbook to the Excel file.
After writing the exporter, add the configuration to the settings.py file:
FEED_EXPORTERS = {'excel': 'example.my_exporters.ExcelItemExporter'}
Run the crawler from the command line
scrapy crawl books -t excel -o books.xls
Requirement: crawl the book information on the website http://books.toscrape.com/. The information includes: title, price, rating (stars), number of reviews, product code and stock. See GitHub for the full source code. Finally, the crawled data is exported to the book.csv file.
Scrapy provides two Item Pipelines dedicated to downloading files and images: FilesPipeline and ImagesPipeline. They can be thought of as special downloaders: the user passes the URLs of the files or images to be downloaded in a special field of the item, the pipelines download them to the local disk automatically, and the download result information is stored in another field of the item so that the user can reference it in the exported data.
FilesPipeline usage:
1. Enable FilesPipeline in the configuration file; it is usually placed before the other Item Pipelines.
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
2. Use FILES_STORE in the configuration file to specify the directory for downloaded files. For example:
FILES_STORE = '/home/lxs/Dowload/scrapy'
3. When the spider parses a page containing file download links, it collects the URLs of all the files to download into a list and assigns it to the item's file_urls field (item['file_urls']). When FilesPipeline processes each item, it reads item['file_urls'] and downloads every URL in it.
class DownloadBookSpider(scrapy.Spider):
    pass

    def parse(self, response):
        item = {}
        # Download list
        item['file_urls'] = []
        for url in response.xpath('//a/@href').extract():
            download_url = response.urljoin(url)
            # Add the absolute URL to the list
            item['file_urls'].append(download_url)
        yield item
After FilesPipeline has downloaded all the files in item['file_urls'], it collects the download result information of each file into a list and assigns it to the item's files field (item['files']). The download results include: path, the relative path where the file was saved locally; checksum, the checksum of the file; and url, the URL the file was downloaded from.
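As an illustration only (the path, checksum and URL values below are made up; the field names follow the description above):

    # what item['files'] might look like after FilesPipeline has run
    item['files'] = [
        {
            'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.pdf',  # relative path under FILES_STORE
            'checksum': '2b00042f7481c7b056c4b410d28f33cf',               # checksum of the file
            'url': 'http://example.com/files/book1.pdf',                  # URL the file was downloaded from
        },
    ]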