http://scrapy-chs.readthedocs.io/zh_CN/latest/intro/tutorial.html#id5
Reference book: 《Master Scrapy Web Crawler》
A web crawler is a program that automatically fetches website content from the Internet; it is also known as a web spider or web robot.
The basic crawling process is:
Brief introduction:
Scrapy is an open-source web crawling framework written in Python on top of the Twisted framework. It currently supports Python 2.7 and Python 3.4+.
install
pip install scrapy
# If the installer complains about missing dependencies:
pip install wheel
# Install the Twisted dependency from a downloaded wheel:
pip install C:\Users\10338\Downloads\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
http://books.toscrape.com/
Create the project from the command line (shell):
# scrapy startproject <project name>
scrapy startproject book_spider
from scrapy import cmdline
cmdline.execute("scrapy crawl LG_Spider -o LG_Spider.csv".split())
Write the code (add the Scrapy project in PyCharm).
import scrapy

class BooksSpider(scrapy.Spider):
    # Name of the spider
    name = "books"
    # Starting point for the crawl
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        # Each book's information sits inside an <article class="product_pod"> element.
        # Use a CSS selector to find all of them and iterate.
        for book in response.css('article.product_pod'):
            book_name = book.xpath('./h3/a/@title').extract_first()
            book_price = book.css('p.price_color::text').extract_first()
            yield {
                'book_name': book_name,
                'book_price': book_price,
            }
        # Extract the link to the next page
        next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_url:
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)
extract_first() returns a string; extract() returns a list.
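A minimal sketch of the difference, assuming the books.toscrape.com page structure used above:

    prices = response.css('p.price_color::text').extract()            # list of all matches, e.g. ['£51.77', '£53.74', ...]
    first_price = response.css('p.price_color::text').extract_first() # first match only, e.g. '£51.77'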
name attribute: a Scrapy project may contain multiple spiders; the name attribute is what uniquely distinguishes them.
start_urls attribute: the point(s) from which the spider starts crawling, i.e. the initial crawl URLs.
parse: after a page has been downloaded successfully, Scrapy calls the page-parsing function we specify (the parse method by default).
::attr() (similar to jQuery's attr()): returns the attribute value of the selected element.
scrapy crawl books -o books.csv
The scraped data is saved to the books.csv file.
Engine: coordinates the data flow among all the components of the system and triggers events.
Scheduler: accepts requests from the engine, pushes them into a queue, and returns them when the engine asks again. It can be thought of as a priority queue of URLs: it decides which URL to crawl next and removes duplicate URLs.
Downloader: downloads web page content and returns it to the spider. The downloader is built on Twisted, an efficient asynchronous model.
Spiders: do the actual work of extracting the needed information (the so-called items) from the Internet. Users can also extract links from a page so the spider continues on to the next page.
Item Pipeline: processes the items extracted from pages by the spiders. Its main functions are persisting items, validating them, and clearing out unwanted information. When a page has been parsed by the spider, the items are sent to the item pipeline and processed in several specific steps in order.
Downloader Middleware: sits between the Scrapy engine and the downloader and mainly handles the requests and responses passing between them.
Spider Middleware: sits between the Scrapy engine and the spiders and mainly handles the requests and responses passing between them.
Scheduler Middleware: sits between the Scrapy engine and the scheduler and handles the requests and responses sent between the engine and the scheduler.
1. The engine takes a URL from the scheduler for the next page to crawl.
2. The engine wraps the URL in a Request and hands it to the downloader.
3. The downloader downloads the resource and wraps it in a Response.
4. The spider parses the Response.
5. If an entity (Item) is parsed out, it is sent to the item pipeline for further processing.
6. If a link (URL) is parsed out, the URL is handed to the scheduler to be crawled.
Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])
~url: the request address.
~callback: the page-parsing function (callable). After the page requested by the Request object has been downloaded, the function specified by this parameter is called; if it is not specified, the spider's parse method is called by default.
~method: the HTTP request method, 'GET' by default.
~headers: the HTTP request headers, dict type, e.g. {'A': 'a', 'B': 'b'}. If an entry's value is None, that header is not sent; e.g. {'C': None} disables sending header C.
~body: the HTTP request body, bytes or str type.
~cookies: the cookie information dictionary, dict type.
~meta: the metadata dictionary of the Request, dict type; it carries extra information along with the request so that it can be read later (e.g. in the callback through response.meta) or by other components.
~encoding: the encoding format.
~priority: sets the priority of the request.
~dont_filter: defaults to False; setting it to True keeps the request from being filtered out by the duplicate filter, forcing the download.
~errback: the callback invoked on error.
import scrapy
request = scrapy.Request('<url>')
request = scrapy.Request('<url>', callback=self.parseItem)
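As a rough sketch (placed inside a spider; the URL, header value, meta key and the parse_book callback name are placeholders), several of the parameters above can be combined:

    import scrapy

    request = scrapy.Request(
        'http://books.toscrape.com/',          # url: the request address
        callback=self.parse_book,              # hypothetical page-parsing method of the spider
        method='GET',
        headers={'User-Agent': 'my-crawler'},  # placeholder header value
        meta={'page': 1},                      # extra data carried with the request
        priority=10,
        dont_filter=True,                      # skip the duplicate filter
    )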
After a page has been downloaded, you get an object of a Response subclass. The subclasses are TextResponse, HtmlResponse and XmlResponse; since we usually deal with web pages, HtmlResponse is the most common. Note that TextResponse is the parent class of HtmlResponse and XmlResponse.
~url: the URL address of the HTTP response, str type.
~status: the status code of the HTTP response.
~headers: the HTTP response headers, dict-like type.
~body: the HTTP response body, bytes type.
~text: the response body as text (str).
~encoding: the encoding format.
~request: the Request object that produced this HTTP response.
~meta: i.e. response.request.meta; the meta information set when constructing the Request object can be read back through response.meta.
~selector: a Selector object used to extract data from the response.
~xpath(query): extracts data from the response using XPath; it is actually a shortcut for response.selector.xpath.
~css(query): extracts data from the response using CSS selectors; it is actually a shortcut for response.selector.css.
~urljoin(url): constructs an absolute URL. When the url argument is a relative address, the absolute URL is computed against response.url.
The most commonly used members are the css and xpath methods for extracting data, and urljoin for constructing absolute URLs.
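A short sketch of these Response members inside a spider's parse callback (the selectors assume the books.toscrape.com structure used earlier):

    def parse(self, response):
        # status code and encoding of the response
        print(response.status)        # e.g. 200
        print(response.encoding)      # e.g. 'utf-8'
        # xpath() and css() are shortcuts for response.selector.xpath / response.selector.css
        titles = response.xpath('//h3/a/@title').extract()
        prices = response.css('p.price_color::text').extract()
        # urljoin() resolves a relative href against response.url
        next_href = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_href:
            next_page = response.urljoin(next_href)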
The four steps of development:
1. Inherit from scrapy.Spider.
2. Give the spider a name.
3. Set the initial crawl point(s).
4. Implement the page-parsing function.
The Scrapy framework provides a Spider base class.
The Spider base class implements:
1. The interface called by the Scrapy engine.
2. Utility functions for users.
3. Properties for users to access.
A project can implement multiple spiders; the name attribute is the only thing that distinguishes them. This identifier is used when running scrapy crawl <name>.
start_urls is the spider's starting point; it is usually a list holding all the initial URLs to crawl.
The page-parsing function is the callback specified through the callback parameter when constructing a Request object (or the parse method by default). It needs to:
1. Use selectors to extract data from the page, encapsulate it (Item or dict), and submit it to the Scrapy engine.
2. Use selectors or LinkExtractor to extract links from the page, construct new Request objects from them, and submit those to the Scrapy engine.
Commonly used modules for parsing HTML pages:
~BeautifulSoup
~LXML
Scrapy builds on these ideas in its Selector class: first select parts of the page with XPath or CSS selectors, then extract the data.
When creating a Selector object, the HTML document string can be passed to the Selector constructor through the text parameter, or a Response object can be passed through the response parameter.
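A minimal sketch of both ways of constructing a Selector (the HTML string here is a made-up example):

    from scrapy.selector import Selector
    from scrapy.http import HtmlResponse

    body = '<html><body><h1>Hello world</h1></body></html>'

    # 1. pass the HTML document string through the text parameter
    selector = Selector(text=body)

    # 2. pass a Response object through the response parameter
    response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf8')
    selector = Selector(response=response)
    print(selector.xpath('//h1/text()').extract_first())   # 'Hello world'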
Using the xpath and css methods
Because Selenium's XPath location traverses the page, CSS selectors perform better. Although XPath scores worse on performance, it has good plug-in support in browsers and makes locating elements convenient; when performance requirements are strict, CSS selectors can be used instead.
XPath:
XPath uses path expressions to navigate in XML documents.
XPath contains a standard function library.
XPath is a major element in XSLT (XSL Transformations; XSL is the Extensible Stylesheet Language).
XPath is a W3C standard (a web technology standard).
In XPath there are seven kinds of nodes: element, attribute, text, namespace, processing instruction, comment, and document (root) node. An XML document is treated as a tree, and the root of the tree is called the document node or root node.
The xpath and css methods return a SelectorList object, which contains a Selector object for each selected part. SelectorList supports the list interface, so a for statement can be used to access each Selector object:
for sel in selector_list:
    print(sel.xpath('./text()'))
SelectorList objects also have xpath and css methods. Calling them calls the corresponding xpath or css method of each contained Selector object with the received arguments, and collects all the results into a new SelectorList object that is returned to the user, e.g.:
>>> selector_list.xpath('./text()')
[<Selector xpath='./text()' data='hello world'>, <Selector xpath='./text()' data='Hello world'>]
Methods of Selector and SelectorList objects: extract(), re(), extract_first(), re_first() (the last two are specific to SelectorList).
1.extract() Method
Calling a Selector object's extract method returns the selected content as a Unicode string. As with the xpath and css methods, a SelectorList object's extract method internally calls extract on each contained Selector object and collects all the results into a list that is returned to the user.
2.extract_first()
This method returns the result of calling extract on the first Selector object in the SelectorList. It is handy when the SelectorList contains only one Selector object: a Unicode string is extracted directly instead of a list.
3.re and re_first
To extract part of the content with a regular expression, use the re method; re_first likewise returns the result of calling re on the first Selector object.
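A hedged sketch extracting only the number from price strings such as '£51.77':

    # re() applies the regular expression to each selected text and returns a list of strings;
    # re_first() returns only the first match
    prices = response.css('p.price_color::text').re(r'\d+\.\d+')       # e.g. ['51.77', '53.74', ...]
    first = response.css('p.price_color::text').re_first(r'\d+\.\d+')  # e.g. '51.77'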
In practice there is rarely a need to create Selector objects manually: the first time a Response object's selector attribute is accessed, the Response automatically creates a Selector object from itself and caches it for later use.
Supplementary knowledge: the @property decorator
Example :
class Student(object):
    def __init__(self, name, score):
        self.name = name
        self.score = score

    @property
    def score(self):
        return self.__score

    @score.setter
    def score(self, score):
        if score < 0 or score > 1000:
            raise ValueError('invalid score')
        self.__score = score
Note: the first score(self) is the getter, decorated with @property; the second score(self, score) is the setter, decorated with @score.setter. @score.setter is a by-product of the preceding @property decoration.
Setting the score property:
>>> s = Student('Bob', 59)
>>> s.score = 60
>>> print(s.score)
60
>>> s.score = 1000
Traceback (most recent call last):
  ...
ValueError: invalid score
XPath, i.e. XML Path Language.
Common basic XPath syntax
<html>
  <head>
    <base href='http://example.com/'/>
    <title>Example website</title>
  </head>
  <body>
    <div id='images'>
      <a href='image1.html'>Name:Image 1<br/><img src='image1.jpg'></a>
      <a href='image1.html'>Name:Image 2<br/><img src='image2.jpg'></a>
      <a href='image1.html'>Name:Image 3<br/><img src='image3.jpg'></a>
      <a href='image1.html'>Name:Image 4<br/><img src='image4.jpg'></a>
      <a href='image1.html'>Name:Image 5<br/><img src='image5.jpg'></a>
      <a href='image1.html'>Name:Image 6<br/><img src='image6.jpg'></a>
    </div>
  </body>
</html>
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

# body is the HTML document string shown above
response = HtmlResponse(url='http://www.example.com', body=body, encoding='utf8')
response.xpath('/html')
response.xpath('/html/head')
response.xpath('/html/body/div/a')
response.xpath('//a')
(This section is to be filled in.)
XPath provides many functions, e.g. for numbers, strings, time, dates, statistics and so on.
string(arg): returns the string value of the argument.
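For example, a small sketch (the anchor markup is made up, modeled on the Example website above):

    from scrapy.selector import Selector

    sel = Selector(text='<a href="image1.html">Name: Image 1 <br/><img src="image1.jpg"/></a>')
    # string(arg) returns the string value of its argument,
    # here the concatenated text content of the <a> element
    print(sel.xpath('string(//a)').extract_first())   # 'Name: Image 1 ' (possibly with trailing whitespace)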
CSS, i.e. Cascading Style Sheets; its selector syntax is a language for specifying the location of parts of an HTML document. CSS selectors are simpler than XPath but less powerful.
When the css method is actually used, the Python library cssselect translates the CSS selector expression into an XPath expression, and the Selector object's xpath method is then called.
CSS Selectors
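A few common CSS selector expressions, shown as a sketch against the Example website document above (the exact matches depend on that page's structure):

    response.css('img')                 # all <img> elements
    response.css('div#images')          # the <div> whose id is "images"
    response.css('div#images > a')      # <a> elements that are direct children of that div
    response.css('a::attr(href)')       # the href attribute of <a> elements
    response.css('a::text')             # the text inside <a> elements
    response.css('a:first-child')       # <a> elements that are the first child of their parent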
Item base class: the base for custom data classes (supports the dict interface).
Field class: used to describe which fields the custom data class contains.
To define a custom data class, simply inherit from Item and create a series of class attributes that are Field objects.
Assigning a value to a field that has not been declared raises an exception.
Field is actually a subclass of the Python dict, so metadata in a Field object can be accessed by key.
Rewrite the previous code
from scrapy import Item, Field

class BookSpiderItem(Item):
    name = Field()
    price = Field()
Then modify the earlier BooksSpider to use BookSpiderItem instead of a Python dict:
from ..items import BookSpiderItem

class BooksSpider(scrapy.Spider):
    def parse(self, response):
        for sel in response.css('article.product_pod'):
            book = BookSpiderItem()
            book['name'] = sel.xpath('./h3/a/@title').extract_first()
            book['price'] = sel.css('p.price_color::text').extract_first()
            yield book
To add a new field to the Item, simply add another class attribute that is a Field() object.
After a data item is submitted by the Spider to the Scrapy engine, it may be handed to other components (Item Pipeline, Exporter) for processing. If you want to pass extra information to the component that processes the data (for example, how that component should handle the data), you can use Field metadata.
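As a hedged sketch of attaching metadata through Field (the serializer function below is made up; the 'serializer' key is one that the built-in exporters read):

    from scrapy import Item, Field

    # Hypothetical serializer function: prefix the exported value with a currency mark
    def price_with_currency(value):
        return '¥ %s' % str(value)

    class BookSpiderItem(Item):
        name = Field()
        # the keyword arguments are simply stored as metadata in the Field dict
        price = Field(serializer=price_with_currency)

    # Field is a dict subclass, so metadata can be read back by key
    print(BookSpiderItem.fields['price'].get('serializer'))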
Data processing: multiple Item Pipelines can be enabled in a project. Typical applications are:
Data cleaning
Validate data
Filter duplicate data
Store data in a database
A pipelines.py file is generated when the project is created.
(1) An Item Pipeline does not need to inherit from a specific base class; it only needs to implement certain methods, e.g. process_item, open_spider, close_spider.
(2) An Item Pipeline must implement a process_item(item, spider) method, which processes each item of data crawled by the spider. Its two parameters are:
item: an item of crawled data (an Item or a dict).
spider: the Spider object that crawled this data.
(3) If process_item returns an item of data (an Item or a dict) when processing an item, the returned data is sent to the next Item Pipeline for further processing.
(4) If process_item raises a DropItem exception (scrapy.exceptions.DropItem) when processing an item, that item is discarded: it is not passed to the following Item Pipelines, nor is it exported to a file. Usually we raise DropItem when invalid data is detected or when data should be filtered out.
(5) open_spider(self, spider)
Called back when the Spider is opened (before data processing starts). It is usually used for initialization before data processing begins, such as connecting to a database.
(6) close_spider(self, spider)
Called back when the Spider is closed (after data processing finishes). It is usually used for cleanup after all the data has been processed, such as closing the database.
(7) from_crawler(cls, crawler)
Called back when the Item Pipeline object is created. Usually the configuration is read from crawler.settings in this method, and the Item Pipeline object is created from that configuration.
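Putting the methods above together, a skeleton Item Pipeline might look like this (the settings key MY_SETTING and the name check are made-up examples):

    from scrapy.exceptions import DropItem

    class SkeletonPipeline(object):
        def __init__(self, my_setting):
            self.my_setting = my_setting

        @classmethod
        def from_crawler(cls, crawler):
            # read configuration from crawler.settings and build the pipeline object
            return cls(my_setting=crawler.settings.get('MY_SETTING', 'default'))

        def open_spider(self, spider):
            # initialization before data processing starts, e.g. open a database connection
            pass

        def close_spider(self, spider):
            # cleanup after all data has been processed, e.g. close the connection
            pass

        def process_item(self, item, spider):
            if not item.get('name'):
                # invalid data: drop it so it is not passed on or exported
                raise DropItem('missing name: %s' % item)
            return item  # pass the item on to the next pipeline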
In Scrapy, to enable an Item Pipeline it must be configured in the settings.py configuration file:
ITEM_PIPELINES = {
    'example.pipelines.PriceConverterPipeline': 300,
}
ITEM_PIPELINES is a dict. We add the Item Pipelines to enable to this dict; each value is a number from 0 to 1000, and the smaller the number, the earlier that pipeline runs.
Implementation:
class PriceConverterPipeline(object):
    # exchange rate: pounds sterling to RMB
    exchange_rate = 8.5209

    def process_item(self, item, spider):
        # strip the leading '£', convert to float, and apply the exchange rate
        price = float(item['price'][1:]) * self.exchange_rate
        item['price'] = '¥%.2f' % price
        return item
Code interpretation: an Item Pipeline does not need to inherit from a specific base class; it only needs to implement certain methods such as process_item, open_spider, close_spider.
The implementation above is simple: convert the book's price in pounds sterling to a float, multiply by the exchange rate, keep two decimal places, assign the result back to the item's price field, and finally return the processed item.
As can be seen, when process_item returns an item of data (an Item or a dict), the returned data is delivered to the next pipeline's process_item (if there is one) for further processing.
Filtering duplicate data: handle duplicate data with the following code:
from scrapy.exceptions import DropItem

class DuplicationPipeline(object):
    def __init__(self):
        self.book_set = set()

    def process_item(self, item, spider):
        name = item['name']
        if name in self.book_set:
            raise DropItem("Duplicate book found: %s" % item)
        self.book_set.add(name)
        return item
A constructor is added to initialize the set used for de-duplicating book titles.
In process_item, the item's name field is taken out first and checked against the set book_set. If the title is already in the set, the item is duplicate data: a DropItem exception is raised and the item is discarded. Otherwise, the item's name field is added to the set and the item is returned.
Storing data in a database can also be done by implementing an Item Pipeline.
Example :
from scrapy.item import Item
import pymongo

class MongoDBPipeline(object):
    DB_URI = 'mongodb://localhost:27017/'
    DB_NAME = 'scrapy_data'

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.DB_URI)
        self.db = self.client[self.DB_NAME]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        collection = self.db[spider.name]
        # insert_one expects a dict, so convert Item objects first
        post = dict(item) if isinstance(item, Item) else item
        collection.insert_one(post)
        return item
Code interpretation :
Two constants are defined as class attributes: DB_URI, the URI of the database, and
DB_NAME, the name of the database.
The database connection only needs to be established once for the whole crawl: connect before data processing starts and close after it finishes. This is why open_spider(spider) and close_spider(spider) are implemented.
process_item implements the MongoDB write. self.db and spider.name are used to get the collection, and the data is then inserted into it. The collection's insert_one method must be passed a dict rather than an Item object, so the item's type is checked first and Item objects are converted to dicts.
Enable MongoDBPipeline in the settings.py file:
ITEM_PIPELINES = {
    'example.pipelines.PriceConverterPipeline': 300,
    'example.pipelines.MongoDBPipeline': 400,
}
When crawling, a page usually contains links to other pages, and sometimes we need to extract them. The two commonly used extraction approaches are Selector and LinkExtractor.
1. Selector: since links are also data in the page, they can be extracted the same way as other data. When only a small number of links needs to be extracted and the rules are simple, a Selector is enough.
2. LinkExtractor: Scrapy provides a class dedicated to extracting links, LinkExtractor. When extracting a large number of links or when the extraction rules are complex, LinkExtractor is more convenient.
class BooksSpider(scrapy.Spider):
    def parse(self, response):
        # Extract the next-page link; it lives in ul.pager > li.next > a
        next_url = response.css('ul.pager li.next a::attr(href)').extract_first()
        if next_url:
            # Build the absolute URL of the next page and submit a new Request
            next_url = response.urljoin(next_url)
            yield scrapy.Request(next_url, callback=self.parse)
from scrapy.linkextractors import LinkExtractor

class BooksSpider(scrapy.Spider):
    def parse(self, response):
        le = LinkExtractor(restrict_css='ul.pager li.next')
        links = le.extract_links(response)
        if links:
            next_url = links[0].url
            yield scrapy.Request(next_url, callback=self.parse)
Import LinkExtractor; it lives in the scrapy.linkextractors module.
Create a LinkExtractor object and describe the extraction rule with one or more constructor parameters; here a CSS selector expression is passed to the restrict_css parameter, describing the region in which the next-page link is found.
Call the LinkExtractor object's extract_links method with a Response object; the method extracts links from the page contained in the response according to the rules described when the extractor was created and returns a list in which each element is a Link object, i.e. an extracted link.
Because each page has only one next-page link, links[0] gives the Link object. A Link object's url attribute is the absolute URL of the linked page (there is no need to call response.urljoin), so it can be used directly to construct and submit the Request.
Learning to describe extraction rules with LinkExtractor's constructor parameters.
<!-- example1.html -->
<html>
  <body>
    <div id="top">
      <a class="internal" href="/intro/install.html">Installation guide</a>
      <a class="internal" href="/intro/install.html">Tutorial</a>
      <a class="internal" href="/intro/install.html">Examples</a>
    </div>
    <div id="bottom">
      <p>Here are some off-site links</p>
      <a href="http://stackoverflow.com/tags/scrapy/info">StackOverflow</a>
      <a href="http://github.com/scrapy/scrapy">Fork on</a>
    </div>
  </body>
</html>
<!-- example2.html -->
<html>
  <head>
    <script type='text/javascript' src='/js/app1.js'/>
    <script type='text/javascript' src='/js/app2.js'/>
  </head>
  <body>
    <a href="/home.html">Home page</a>
    <a href="javascript:goToPage('/doc.html'); return false">Documentation</a>
    <a href="javascript:goToPage('/example.html'); return false">Examples</a>
  </body>
</html>
Construct two Response objects from the two HTML documents above:
from scrapy.http import HtmlResponse

# Construct a Response object for each of the two HTML documents
html1 = open('example1.html').read()
html2 = open('example2.html').read()
response1 = HtmlResponse(url='http://example.com', body=html1, encoding='utf8')
response2 = HtmlResponse(url='http://example.com', body=html2, encoding='utf8')
Note: every LinkExtractor constructor parameter has a default value; when a parameter is not given, its default value is used.
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor()
links = le.extract_links(response1)
print([link.url for link in links])
LinkExtractor parameters:
——allow: accepts a regular expression or a list of regular expressions and extracts the links whose absolute URL matches; if the parameter is empty, all links are extracted.
# Extract the links in example1.html whose path starts with /intro
from scrapy.linkextractors import LinkExtractor

pattern = '/intro/.+\.html$'
le = LinkExtractor(allow=pattern)
links = le.extract_links(response1)
print([link.url for link in links])
——deny: accepts a regular expression or a list of regular expressions; the opposite of allow, it excludes the links whose absolute URL matches.
# Extract all off-site links in example1.html (i.e. exclude on-site links)
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urlparse

pattern = '^' + urlparse(response1.url).geturl()
le = LinkExtractor(deny=pattern)
links = le.extract_links(response1)
print([link.url for link in links])
——allow_domains: accepts a domain name or a list of domain names; only links to the specified domains are extracted.
# Extract the links in example1.html pointing to github.com and stackoverflow.com
from scrapy.linkextractors import LinkExtractor

domains = ['github.com', 'stackoverflow.com']
le = LinkExtractor(allow_domains=domains)
links = le.extract_links(response1)
print([link.url for link in links])
——deny_domains: accepts a domain name or a list of domain names; the opposite of allow_domains, it excludes links to the specified domains.
# Extract the links in example1.html except those in the github.com domain
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(deny_domains='github.com')
links = le.extract_links(response1)
print([link.url for link in links])
——restrict_xpaths: accepts an XPath expression or a list of XPath expressions; only links inside the regions selected by the XPath expressions are extracted.
# Extract the links under the <div id="top"> element in example1.html
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(restrict_xpaths='//div[@id="top"]')
links = le.extract_links(response1)
print([link.url for link in links])
——restrict_css: accepts a CSS expression or a list of CSS expressions; only links inside the regions selected by the CSS expressions are extracted.
# Extract the links under the <div id="bottom"> element in example1.html
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(restrict_css='div#bottom')
links = le.extract_links(response1)
print([link.url for link in links])
——tags: accepts a tag (string) or a list of tags; links are extracted only from within the specified tags. The default is ['a', 'area'].
——attrs: accepts an attribute (string) or a list of attributes; links are extracted from the specified attributes. The default is ['href'].
# Extract the links to the JavaScript files referenced in example2.html
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(tags='script', attrs='src')
links = le.extract_links(response2)
print([link.url for link in links])
——process_value: accepts a callback function of the form func(value). When this parameter is given, LinkExtractor calls the function back for each extracted link value (such as an a element's href) to process it. The callback should normally return a string, i.e. the processed result; to discard the link being processed, return None.
import re

def process(value):
    # Pull the page path out of links of the form javascript:goToPage('...')
    m = re.search("javascript:goToPage\('(.*?)'\)", value)
    if m:
        value = m.group(1)
    return value

from scrapy.linkextractors import LinkExtractor
le = LinkExtractor(process_value=process)
links = le.extract_links(response2)
print([link.url for link in links])
The component responsible for exporting data in Scrapy is called an Exporter. Scrapy contains multiple exporters, each corresponding to one export data format. The supported formats are as follows:
(1) JSON —— JsonItemExporter
(2) JSON Lines —— JsonLinesItemExporter
(3) CSV —— CsvItemExporter
(4) XML —— XmlItemExporter
(5) Pickle —— PickleItemExporter
(6) Marshal —— MarshalItemExporter
When exporting data, we need to tell Scrapy two things: the path of the export file and the export data format.
These can be specified through command-line parameters or the configuration file.
When running scrapy crawl, the -o and -t parameters specify the path of the export file and the format of the exported data.
scrapy crawl books -o books.csv
Here -o books.csv specifies the export path of the file. Although the -t parameter is not used, Scrapy infers from the file suffix that the data should be exported in csv format; similarly, -o books.json would export the data in json format. Use the -t parameter when the export format needs to be specified explicitly:
scrapy crawl books -t csv -o book1.data
scrapy crawl books -t json -o book2.data
scrapy crawl books -t xml -o book3.data
The exporter is looked up in the configuration dictionary FEED_EXPORTERS, which is the merged content of FEED_EXPORTERS_BASE in the default configuration and FEED_EXPORTERS in the user configuration.
The former contains the internally supported export data formats; the latter contains user-defined export data formats. If a user adds a new export data format (i.e. implements a new exporter), FEED_EXPORTERS can be defined in the configuration file settings.py, for example:
FEED_EXPORTERS = {'excel': 'my_project.my_exporters.ExcelItemExporter'}
When specifying the file path, the two special variables %(name)s and %(time)s can also be used: %(name)s is replaced by the Spider's name, and %(time)s is replaced by the file creation time.
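For example, a sketch in settings.py (the directory name is arbitrary):

    # a "books" spider run would be exported to something like
    # export_data/books/<timestamp>.csv (the exact timestamp format may differ)
    FEED_URI = 'export_data/%(name)s/%(time)s.csv'
    FEED_FORMAT = 'csv'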
The configuration file: how data export is configured in the configuration file.
FEED_URI: the export path.
FEED_URI = 'exporter_data/%(name)s.data'
FEED_FORMAT: the export format.
FEED_FORMAT = 'csv'
FEED_EXPORT_ENCODING: the encoding of the export file. Note that by default JSON files are written in ASCII encoding, while the other formats use utf-8.
FEED_EXPORT_ENCODING = 'gbk'
FEED_EXPORT_FIELDS: the fields included in the exported data (all fields are exported by default), in the specified order.
FEED_EXPORT_FIELDS = ['name', 'author', 'price']
FEED_EXPORTERS: user-defined export data formats.
FEED_EXPORTERS = {'excel': 'my_project.my_exporters.ExcelItemExporter'}
export_item(self, item): responsible for exporting each item of crawled data; the parameter item is one item of crawled data. Every subclass must implement this method.
start_exporting(self): called when the export starts; initialization can be performed in this method.
finish_exporting(self): called when the export finishes; cleanup work can be performed in this method.
Take JsonItemExporter as an example:
So that the final export is a json list, b"[\n" and b"\n]" are written to the file in start_exporting and finish_exporting respectively.
In export_item, self.encoder.encode is called to convert the item to a json string, which is then written to the file.
# Create my_exporters.py in the project (in the same directory as settings.py)
# and implement ExcelItemExporter in it
from scrapy.exporters import BaseItemExporter
import xlwt

class ExcelItemExporter(BaseItemExporter):
    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        self.wbook = xlwt.Workbook()
        self.wsheet = self.wbook.add_sheet('scrapy')
        self.row = 0

    def finish_exporting(self):
        self.wbook.save(self.file)

    def export_item(self, item):
        fields = self._get_serialized_fields(item)
        for col, v in enumerate(x for _, x in fields):
            self.wsheet.write(self.row, col, v)
        self.row += 1
Code interpretation :
The third-party library xlwt is used to write data to an Excel file.
The constructor creates the Workbook and Worksheet objects and initializes self.row.
export_item calls the base class's _get_serialized_fields method to get an iterator over all of the item's fields, then calls self.wsheet.write to write each field into the Excel sheet.
finish_exporting is called after all the data has been written to the Excel sheet; in this method self.wbook.save is called to save the workbook to the Excel file.
After writing the exporter, add the configuration to the settings.py file:
FEED_EXPORTERS = {'excel': 'example.my_exporters.ExcelItemExporter'}
Run the crawler from the command line
scrapy crawl books -t excel -o books.xls
Requirement: crawl the book information on the website http://books.toscrape.com/. The information includes: title, price, rating (stars), number of reviews, product code and stock. See GitHub for the full source code. Finally, the crawled data is exported to the book.csv file.
Scrapy provides two Item Pipelines dedicated to downloading files and images: FilesPipeline and ImagesPipeline. They can be thought of as special downloaders: the user passes the URLs of the files or images to be downloaded in a special field of the item, the pipelines download them to the local disk automatically, and the download result information is stored in another field of the item so that the user can reference it in the exported data.
FilesPipeline usage:
1. Enable FilesPipeline in the configuration file; it is usually placed before the other Item Pipelines.
ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}
2. Use FILES_STORE in the configuration file to specify the directory for downloaded files. For example:
FILES_STORE = '/home/lxs/Dowload/scrapy'
3. When the spider parses a page containing file download links, it collects the URLs of all the files to download into a list and assigns it to the item's file_urls field (item['file_urls']). When FilesPipeline processes each item, it reads item['file_urls'] and downloads every URL in it.
class DownloadBookSpider(scrapy.Spider):
    pass

    def parse(self, response):
        item = {}
        # Download list
        item['file_urls'] = []
        for url in response.xpath('//a/@href').extract():
            download_url = response.urljoin(url)
            # Add the absolute URL to the list
            item['file_urls'].append(download_url)
        yield item
After FilesPipeline has downloaded all the files in item['file_urls'], it collects the download result information of each file into a list and assigns it to the item's files field (item['files']). The download results include: path, the relative path where the file was saved locally; checksum, the checksum of the file; and url, the URL the file was downloaded from.
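As an illustration only (the path, checksum and URL values below are made up; the field names follow the description above):

    # what item['files'] might look like after FilesPipeline has run
    item['files'] = [
        {
            'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.pdf',  # relative path under FILES_STORE
            'checksum': '2b00042f7481c7b056c4b410d28f33cf',               # checksum of the file
            'url': 'http://example.com/files/book1.pdf',                  # URL the file was downloaded from
        },
    ]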