Search engines (Baidu, Google, 360 Search, etc.).
Bole Online.
Huihui Shopping Assistant.
Data analysis and research (e.g., the Data Iceberg column).
Ticket-grabbing software, etc.
What is a web crawler:
In plain terms: a crawler is a program that simulates a human requesting a website. It can request web pages automatically, capture their data, and then extract the valuable parts according to certain rules (a minimal sketch follows below).
Formal definition: see the Baidu Encyclopedia entry.
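To make the "request a page, then extract with rules" idea concrete, here is a minimal sketch using only the standard library; the target URL and the title-extracting regex are illustrative assumptions, not part of the original text.

```python
# A minimal sketch of request-then-extract, standard library only.
# The URL and the regex rule are illustrative assumptions.
import re
from urllib.request import urlopen

# Step 1: request the page like a browser would and download the raw HTML.
html = urlopen("https://example.com").read().decode("utf-8")

# Step 2: apply a rule (here a regex) to extract the valuable data.
match = re.search(r"<title>(.*?)</title>", html, re.S)
if match:
    print(match.group(1))
```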
General-purpose crawlers and focused crawlers:
General-purpose crawler: an important part of a search engine's crawling system (Baidu, Google, Sogou, etc.). It mainly downloads web pages from the Internet to local storage, forming a mirror backup of Internet content.
Focused crawler: a web crawler written for a specific need. It differs from a general-purpose crawler in that it filters and processes content while crawling, trying to ensure that only page information relevant to the need is captured (see the sketch after this list).
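To show what "filter while crawling" means in practice, here is a hedged sketch that keeps only links whose URLs contain a topic keyword and discards the rest; the start URL and the "python" keyword are illustrative assumptions.

```python
# A sketch of the focused-crawler idea: keep only topic-relevant links.
# The start URL and the "python" keyword filter are illustrative assumptions.
import re
from urllib.request import urlopen

start_url = "https://example.com"
html = urlopen(start_url).read().decode("utf-8")

# Extract every absolute link, then keep only those matching our interest.
links = re.findall(r'href="(http[^"]+)"', html)
relevant = [link for link in links if "python" in link.lower()]

for link in relevant:
    print("would crawl:", link)
```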
Why write crawler programs in Python:
PHP: often called "the best language in the world", but it was not born for this job. Its support for multithreading and asynchronous execution is weak, so its concurrency is poor, while crawlers are utility programs with high demands on speed and efficiency.
Java: the ecosystem is mature, and it is Python's biggest competitor for crawlers. However, the language itself is heavyweight and verbose; refactoring is costly, and any modification leads to large code changes, while crawler collection code has to be changed often.
C/C++: runtime efficiency is unbeatable, but the learning and development costs are high; even a small crawler can take half a day or more to write.
Python: elegant syntax, concise code, high development efficiency, and many supporting modules. The HTTP request and HTML parsing modules it can rely on are very rich, and the Scrapy and Scrapy-redis frameworks make developing crawlers extremely easy (a short example follows this list).
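As a taste of how rich these modules are, here is a hedged sketch of the same fetch-and-parse task using two popular third-party packages, requests and BeautifulSoup (pip install requests beautifulsoup4); the URL is an illustrative assumption.

```python
# A sketch of fetch-and-parse with popular third-party modules
# (pip install requests beautifulsoup4). The URL is an assumption.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Declarative extraction instead of hand-written regexes.
print(soup.title.string)
for a in soup.find_all("a", href=True):
    print(a["href"])
```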
Preparing the tools:
A Python 3.6 development environment.
PyCharm 2017 Professional Edition.
A virtual environment: `virtualenv`/`virtualenvwrapper` (a sketch of an alternative follows this list).
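If you prefer to stay inside Python, here is a minimal sketch that creates an isolated environment with the standard-library venv module instead of the virtualenv/virtualenvwrapper tools named above; the environment name "crawler-env" is an illustrative assumption.

```python
# A minimal sketch using the standard-library venv module, an alternative
# to the virtualenv/virtualenvwrapper tools named above. The environment
# name "crawler-env" is an illustrative assumption.
import venv

# Create ./crawler-env with pip available, ready for crawler dependencies
# such as requests or Scrapy.
venv.create("crawler-env", with_pip=True)
```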