
Python crawler: Scrapy (I)

Editor: Python

Python crawlers: Scrapy

- I. Scrapy overview
- II. 58.com project case
- III. Autohome case
- IV. scrapy shell

I. Scrapy overview

Scrapy is an application framework for crawling websites and extracting structured data. It can be used in a range of programs, such as data mining, information processing, and archiving historical data.

(I) Installation

pip install scrapy -i https://pypi.douban.com/simple

If pip reports the following warning:

WARNING: You are using pip version 21.3.1; however, version 22.1.2 is available.
You should consider upgrading via the 'D:\PythonCode\venv\Scripts\python.exe -m pip install --upgrade pip' command.

Solution: run python -m pip install --upgrade pip

(II) Basic usage

Create a crawler project: scrapy startproject scrapy_baidu_01
Notes:
(1) Run the command from the folder where scrapy.exe is installed;
(2) The project name must not start with a digit and must not contain Chinese characters;
(3) Make sure the folder containing scrapy.exe (here D:\PythonCode\venv\Scripts) is on the PATH environment variable, and restart the computer afterwards so the change takes effect.
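The naming rule in note (2) can be checked before running startproject. The sketch below is a hypothetical helper (is_valid_project_name is not part of Scrapy) and assumes that a valid name consists only of ASCII letters, digits and underscores and does not start with a digit. It illustrates the rule as stated here, not Scrapy's own validation code.

```python
import re

# Assumed rule: ASCII letters, digits and underscores only, not starting with a digit.
_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_project_name(name):
    """Return True if `name` follows the naming rule described in note (2)."""
    return _NAME_PATTERN.fullmatch(name) is not None


print(is_valid_project_name("scrapy_baidu_01"))  # True
print(is_valid_project_name("01_baidu"))         # False: starts with a digit
print(is_valid_project_name("百度_spider"))       # False: contains Chinese characters
```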
Create the spider file inside the project's spiders folder: D:\PythonCode\venv\Scripts\scrapy_baidu_01\scrapy_baidu_01\spiders>
Command: scrapy genspider <spider name> <domain to crawl>
Run the spider: scrapy crawl <spider name>, where the spider name is the value of the name attribute in the spider class (for example, name = 'baidu')

import scrapy


class BaiduSpider(scrapy.Spider):
    # The spider's name; this is the value passed to `scrapy crawl`
    name = 'baidu'
    # Domains the spider is allowed to visit
    allowed_domains = ['www.baidu.com']
    # Starting URL: the first address that is requested
    # start_urls is the allowed domain with http:// in front and / at the end
    start_urls = ['http://www.baidu.com/']

    # Called after each URL in start_urls has been fetched; `response` is the
    # returned object, comparable to
    # response = urllib.request.urlopen()
    # response = requests.get()
    def parse(self, response):
        print('ssssss')

II. 58.com project case

1. Scrapy project structure

2. Properties and methods of response
response.text Gets the response body as a string
response.body Gets the response body as binary data (bytes)
response.xpath() Parses the response content directly with an XPath expression and returns a list of selector objects
response.xpath(...).extract() Extracts the data values of the selector objects
response.xpath(...).extract_first() Extracts the first value from the selector list

III. Autohome case

import scrapy


class CarSpider(scrapy.Spider):
    name = 'car'
    # allowed_domains takes domain names only, not full page paths
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/price/brand-15.html']

    def parse(self, response):
        # Model names and their prices, selected in page order
        name_list = response.xpath('//div[@class="main-title"]/a/text()')
        price_list = response.xpath('//div[@class="main-lever"]//span/span/text()')
        for i in range(len(name_list)):
            name = name_list[i].extract()
            price = price_list[i].extract()
            print(name, price)

How Scrapy works [very important!!]

1. The engine asks the spiders for URLs.
2. The engine passes the URLs to be requested to the scheduler.
3. The scheduler turns each URL into a request object and places it in a queue.
4. A request is dequeued from the queue.
5. The engine passes the request to the downloader.
6. The downloader sends the request and fetches the data from the internet.
7. The downloader returns the data to the engine.
8. The engine hands the data to the spiders.
9. The spiders parse the data with XPath, obtaining either data or URLs.
10. The spiders hand the parsing results to the engine.
11. If a result is data, the engine hands it to the pipeline; if it is a URL, the engine hands it to the scheduler and the next cycle begins.
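The cycle above can be sketched as a toy simulation in plain Python. This is not Scrapy's implementation: FAKE_WEB, download, parse and run_engine are hypothetical names, the "internet" is a dictionary, and the check that skips already-seen URLs is an assumption added so the loop terminates.

```python
from collections import deque

# Hypothetical in-memory "internet": URL -> (data on the page, links on the page)
FAKE_WEB = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b"]),
    "http://example.com/b": ("page b", []),
}


def download(url):
    """Downloader: fetch the page for a URL (here, a dictionary lookup)."""
    return FAKE_WEB[url]


def parse(page):
    """Spider: split a downloaded page into data and follow-up URLs."""
    data, links = page
    yield ("data", data)
    for link in links:
        yield ("url", link)


def run_engine(start_urls):
    """Engine: move requests and results between the other components."""
    scheduler = deque()  # the scheduler's request queue
    seen = set()         # URLs already scheduled (assumed duplicate filter)
    pipeline = []        # the pipeline collects the extracted data

    def schedule(url):
        if url not in seen:
            seen.add(url)
            scheduler.append(url)

    for url in start_urls:           # start URLs go to the scheduler
        schedule(url)

    while scheduler:
        url = scheduler.popleft()    # dequeue a request
        page = download(url)         # downloader fetches, engine receives
        for kind, value in parse(page):  # spider parses the data
            if kind == "data":
                pipeline.append(value)   # data goes to the pipeline
            else:
                schedule(value)          # URLs go back to the scheduler
    return pipeline


print(run_engine(["http://example.com/"]))  # ['home', 'page a', 'page b']
```

The loop ends when the scheduler's queue is empty, which corresponds to the point where no result yields a new URL.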