Hello everyone.
Today I will introduce SmartScraper, a simple, automatic, and fast Python web scraping tool. SmartScraper makes it easy to grab page data: you no longer need to learn element-locating packages such as pyquery or beautifulsoup. You only need to provide a URL and a few sample values, and it learns the page's locating rules for you.
pip install smartscraper
For example, suppose we want to get the titles and publication information of the 20 books on the "Novels" tag page of Douban Books.
We use the page 1 (P1) link to train two fields: title and publication information.
from smartscraper import SmartScraper

# URL of the web page to train on
url = 'https://book.douban.com/tag/小说?start=0&type=T'

# Define the desired fields, giving one sample value per field as it
# appears on the page (shown here in English translation)
wanted_dict = {
    "title": ["Alive"],
    "pub": ["Yu Hua / Writers Press / 2012-8-1 / 20.00 yuan"],
}

# Training: search the page at url for the rules that locate the wanted_dict values
scraper = SmartScraper()
results = scraper.build(url, wanted_dict=wanted_dict)
print(results)
Run the code. The collected results are as follows:
{'title': ['Alive', "Fang Siqi's First Love Paradise", 'White Night Line', 'Solaris', 'Despise', ...],
 'pub': ['Yu Hua / Writers Press / 2012-8-1 / 20.00 yuan',
         'Lin Yihan / Beijing Joint Publishing Company / 2018-2 / 45.00 yuan',
         '[Japan] Keigo Higashino / Liu Zijun / Nanhai Publishing Company / 2013-1-1 / CNY 39.50',
         '[Poland] Stanisław Lem / Jing Zhenzhong / Yilin Translation Publishing House / 2021-8 / 49.00 yuan',
         '[Italy] Alberto Moravia / Shen Emei, Liu Xirong / Jiangsu Phoenix Literature and Art Press / 2021-7 / 62.00',
         ...]}
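The result is a dict of parallel lists, so a natural next step is to pair each title with its publication string. The following is only a minimal sketch, not part of SmartScraper: the helper names split_pub and to_records are hypothetical, the sample data is placeholder text in the same shape as the output above, and it assumes that the two lists line up by position and that the last three "/"-separated parts are always publisher, date, and price.

```python
def split_pub(pub):
    """Split 'author [/ translator] / publisher / date / price' from the right.

    Assumption: the last three '/'-separated parts are publisher, date, price;
    everything before them (author, optional translator) is kept together.
    """
    parts = [p.strip() for p in pub.rsplit("/", 3)]
    if len(parts) != 4:
        # Unexpected layout: keep the raw string instead of guessing
        return {"raw": pub}
    author, publisher, date, price = parts
    return {"author": author, "publisher": publisher, "date": date, "price": price}


def to_records(results):
    """Pair titles and publication strings by position into one dict per book."""
    records = []
    for title, pub in zip(results.get("title", []), results.get("pub", [])):
        record = {"title": title.strip()}
        record.update(split_pub(pub))
        records.append(record)
    return records


# Placeholder data in the same shape as the scraper output (not real results)
sample = {
    "title": ["Book A", "Book B"],
    "pub": [
        "Author A / Press A / 2012-8-1 / 20.00",
        "Author B / Translator B / Press B / 2013-1-1 / CNY 39.50",
    ],
}

if __name__ == "__main__":
    for record in to_records(sample):
        print(record)
```

Splitting from the right keeps an optional translator attached to the author field, which avoids guessing how many people are listed before the publisher.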
Now use the scraper we just trained to get the titles and publication information from the page 2 (P2) link:
scraper.get_result_similar('https://book.douban.com/tag/小说?start=20&type=T')
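The P1 and P2 links differ only in the start parameter (0, then 20), so further page links can be generated rather than typed by hand. This is a rough sketch using only the standard library. The helper name page_url is hypothetical, and the step of 20 per page is an assumption inferred from the two links shown here.

```python
from urllib.parse import quote, urlencode

BASE = "https://book.douban.com/tag/"


def page_url(tag, page, page_size=20):
    """Build the tag-page URL for a zero-based page index.

    Assumption: 'start' advances by page_size (20) per page, as in P1 and P2.
    """
    query = urlencode({"start": page * page_size, "type": "T"})
    # quote() percent-encodes non-ASCII tag names so the URL is safe to send
    return f"{BASE}{quote(tag)}?{query}"


if __name__ == "__main__":
    for page in range(3):
        print(page_url("小说", page))
```

Each generated link could then be passed to scraper.get_result_similar in a loop, ideally with a pause between requests.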
A trained SmartScraper model can be saved and then called directly later:
scraper.save('douban_Book.pkl')
Code to load the saved model:
scraper.load('douban_Book.pkl')
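Saving and loading can be combined so that training runs only when no saved model exists yet. The sketch below is an illustration, not a SmartScraper feature: load_or_train is a hypothetical helper that relies only on the build, save, and load calls shown in this post, and it assumes save writes a file at the given path.

```python
from pathlib import Path


def load_or_train(scraper, model_path, url, wanted_dict):
    """Reuse a saved model if one exists, otherwise train and save it.

    'scraper' is any object exposing build/save/load as used in this post.
    Returns "loaded" or "trained" so the caller can tell which path ran.
    """
    if Path(model_path).exists():
        scraper.load(model_path)
        return "loaded"
    scraper.build(url, wanted_dict=wanted_dict)
    scraper.save(model_path)
    return "trained"


# Usage, assuming smartscraper is installed:
#   from smartscraper import SmartScraper
#   scraper = SmartScraper()
#   load_or_train(scraper, 'douban_Book.pkl', url, wanted_dict)
```

Passing the scraper in as an argument keeps the helper independent of how the object is constructed.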