
Integrating a scrapy crawler into a Django blog


Motivation

The idea came from a Douban plugin written in PHP. I wanted to synchronize my Douban movie list and book list to my blog as well, but Douban does not provide an interface for fetching your own movie and book list data. The only option is to crawl the data yourself, and since I have done some work with crawlers, this is not hard to implement. My Douban movie list | My Douban book list

Why integrate scrapy

Strictly speaking, scrapy is not required for this crawl; plain Python would do. But I insisted on integrating scrapy into the Django project, for three reasons:

  • It keeps the whole project self-contained: everything lives under the current Django project, which makes later migration easier.
  • With scrapy-djangoitem, data can be stored through Django's ORM. The blog backend already reads data through its models, and this also avoids hand-writing SQL insert statements (see the sketch after this list).
  • With scrapy as the crawler framework, any site I need to crawl later can simply be added to the framework as a new spider.
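To illustrate the second point, here is a minimal sketch of how scrapy-djangoitem ties a spider item to a Django model. The Book model, its fields, and the pipeline name are hypothetical and not taken from the actual project:

    # items.py in the scrapy project -- a hypothetical Book model is assumed
    from scrapy_djangoitem import DjangoItem
    from blog.models import Book  # assumed Django model; adjust to the real app/model

    class BookItem(DjangoItem):
        django_model = Book  # the item now exposes the model's fields

    # pipelines.py -- saving goes through the Django ORM, no hand-written SQL
    class DjangoSavePipeline:
        def process_item(self, item, spider):
            item.save()  # DjangoItem.save() persists the record via the model
            return item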

How to schedule the synchronization

How do I keep the lists synchronized on a schedule? My plan was to have celery run the crawler periodically. At first I simply wrapped scrapy crawl xxx in a function and scheduled that function, but it did not work: it complained that the scrapy command was not found, probably because my scrapy project sits inside the root of my blog project. I then looked for other ways to start a scrapy crawler; Baidu turned up plenty of material, but none of it seemed to work well together with Django. Finally I found scrapyd: its API can also run scrapy spiders, and scrapyd can deploy crawler projects too, so if my blog needs to crawl something else later, deployment will be convenient.

A brief introduction to scrapyd

scrapyd is a service that manages the deployment and running of scrapy projects. Through a simple JSON API, scrapyd lets you run, stop, cancel, or delete scrapy projects, and it can manage several crawlers at the same time. Once a scrapy project is deployed to scrapyd, it is convenient to control the crawlers and view their logs.
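As a quick sanity check once the service is running, the status endpoint (covered in the API list further below) can be hit from Python. A minimal sketch, assuming scrapyd is listening on its default port 6800:

    import requests

    # Quick health check against a locally running scrapyd (default port 6800).
    resp = requests.get("http://127.0.0.1:6800/daemonstatus.json")
    print(resp.json())  # e.g. {"status": "ok", "pending": 0, "running": 0, "finished": 0, ...}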

scrapyd Installation

pip install scrapyd
pip install scrapyd-client

Notes for Windows

After installing scrapyd as above there is nothing more to do on Linux, but on Windows you also need to write a bat script. The Scripts directory of the Python installation contains a file called scrapyd-deploy (no extension); it is the launcher, but it only works on Linux, not on Windows. In the same directory as scrapyd-deploy, create a file named scrapyd-deploy.bat with the following content:

@echo off
"F:\develop\Python\envs\blog\Scripts\python.exe" "F:\develop\Python\envs\blog\Scripts\scrapyd-deploy" %1 %2 %3 %4 %5 %6 %7 %8 %9

Note the following points

  • Both paths must be wrapped in double quotes, with a space between them;
  • The first path is the path to python.exe;
  • The second path is the path to the scrapyd-deploy file mentioned above.

Deploying the scrapy project

  • Create a new scrapyd folder under the current blog root directory (used for deploying the scrapy project).

  • Edit the scrapy.cfg file in the scrapy project:

    [settings]
    default = blog.settings
    [deploy:test]
    url = http://localhost:6800/
    project = blog
    

    The name after deploy: is the deployment name; pick whatever you like. project is the project name; it does not have to be the scrapy project's name, any other name is fine.

  • In a terminal, switch to the scrapyd directory and run the command scrapyd.
    Remember: only if scrapyd is started from inside this directory will the deployed content be stored there.

  • Publish the project to scrapyd (the arguments must match the configuration in the cfg file):
    scrapyd-deploy test -p blog

    • test: the deployment name (the target configured as deploy:test)
    • blog: the project name (passed with -p)
  • Three new folders will appear under the scrapyd directory.

  • Enter http://127.0.0.1:6800/ in the browser address bar.
    You will see one available project, blog, which means the deployment succeeded.

  • Run the crawler

    curl http://localhost:6800/schedule.json -d project=blog -d spider=book
    
  • Click jobs in the interface above; it shows the following:
    You can see that a crawler has finished running; it is the spider that fetches my Douban book list. Clicking logs at the end of that row shows the following:
    The content itself is not garbled, because what was saved to the database looks normal. The log page only displays garbled text because scrapyd sets no encoding in the response headers for the logs page. There are several fixes for this online; here I simply use a Chrome plugin that switches the page encoding to utf-8, with this result:
    Alternatively, requesting the log through a REST client such as Postman shows the content without garbling, as sketched below.
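Along the same lines, a short sketch that fetches a log over scrapyd's HTTP interface and forces utf-8 decoding. The log path format and job id below are illustrative, taken from the links shown on the jobs page rather than from the original article:

    import requests

    # Log path assumed to follow the links on the jobs page: /logs/<project>/<spider>/<job_id>.log
    log_url = "http://127.0.0.1:6800/logs/blog/book/0123456789abcdef.log"  # illustrative job id
    resp = requests.get(log_url)
    resp.encoding = "utf-8"  # force utf-8 since the response header declares no charset
    print(resp.text)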

scrapyd management API

  • Get the service status
    http://127.0.0.1:6800/daemonstatus.json

  • Get the list of projects
    http://127.0.0.1:6800/listprojects.json

  • Get the list of deployed spiders under a project
    http://127.0.0.1:6800/listspiders.json?project=myproject

  • Get the list of deployed versions of a project
    http://127.0.0.1:6800/listversions.json?project=myproject

  • Get the running status of a project's jobs (a small polling sketch based on this endpoint follows this list)
    http://127.0.0.1:6800/listjobs.json?project=myproject

  • Start a spider on the server (the spider must already be deployed to the server)
    curl:

    curl http://localhost:6800/schedule.json -d project=<project name> -d spider=<spider name>
    

    python:

    import requests
    req = requests.post(url="http://localhost:6800/schedule.json",
                        data={"project": "blog", "spider": "book"})
    
  • Delete a version of a crawler project
    curl:

    curl http://localhost:6800/delversion.json -d project=<scrapy project name> -d version=<version number>
    

    python:

    import requests
    req = requests.post(url="http://localhost:6800/delversion.json",
                        data={"project": "<project name>", "version": "<version number>"})
    
  • Delete a project, including all versions of crawlers under it
    curl:

    curl http://localhost:6800/delproject.json -d project=<scrapy project name>
    

    python:

    import requests
    req = requests.post(url="http://localhost:6800/delproject.json",
                        data={"project": "<project name>"})
    
  • Cancel a running crawler job
    curl:

    curl http://localhost:6800/cancel.json -d project=<scrapy project name> -d job=<job id>
    

    python:

    import requests
    req = requests.post(url="http://localhost:6800/cancel.json",
                        data={"project": "<project name>", "job": "<job id>"})
    
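For the job-status endpoint mentioned above, here is a small polling sketch. It assumes the usual listjobs.json response shape with pending, running, and finished lists, and reuses the blog project name from earlier; none of this code comes from the original project:

    import time
    import requests

    def wait_for_jobs(project="blog", base="http://localhost:6800"):
        """Poll listjobs.json until the project has no pending or running jobs."""
        while True:
            data = requests.get(f"{base}/listjobs.json", params={"project": project}).json()
            # Assumed response shape: {"status": "ok", "pending": [...], "running": [...], "finished": [...]}
            if not data.get("pending") and not data.get("running"):
                return data.get("finished", [])
            time.sleep(5)  # wait a few seconds before polling again

    finished = wait_for_jobs()
    print(len(finished), "job(s) finished")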

Running the crawler on a schedule with celery

  • Create a new tasks.py file in any app directory under the project root:

    from __future__ import absolute_import
    from celery import shared_task
    import requests

    @shared_task
    def main():
        req = requests.post(url="http://localhost:6800/schedule.json",
                            data={"project": "blog", "spider": "movie"})
        print(req.text)
    
  • Switch to the scrapyd deployment directory: cd scrapyd

  • Start scrapyd: scrapyd

  • Then set up a scheduled task that runs the main function in the tasks.py above (a minimal beat-schedule sketch follows this list). For how to configure celery's scheduled tasks, see my other article, django integrates celery.
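A minimal sketch of what the beat schedule could look like in the Django settings, assuming the tasks.py above lives in an app called myapp and that celery is configured from the settings with the usual namespace="CELERY" setup; the app name and interval are placeholders, not the project's real configuration:

    # settings.py (sketch) -- register a periodic task for the crawler
    from celery.schedules import crontab

    CELERY_BEAT_SCHEDULE = {
        "sync-douban-movie": {
            "task": "myapp.tasks.main",             # dotted path to the shared_task above; app name assumed
            "schedule": crontab(hour=3, minute=0),  # run once a day at 03:00
        },
    }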

One more thing: the setup above only covers running everything locally, and the pieces must be started in a certain order. You need several terminals running celery worker, celery beat, and scrapyd respectively; if redis is used, another one for redis as well. My blog is already deployed and live, and synchronizing the Douban movie and book lists works in production; supervisor makes it convenient to manage all of these processes. Also, I have not opened scrapyd to the public internet, for security reasons, which means crawler projects cannot be deployed into the scrapyd directory over the internet. That is not a real problem: if I add another crawler later, I can deploy it locally, upload the result to overwrite the contents of the scrapyd directory on the server, and restart scrapyd.

