A Simple, Fast Python Scraping Tool: SmartScraper

Author: Alex是大佬 | Published: 2022-01-13 17:10

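Hello, everyone.

Today I'd like to introduce SmartScraper, a simple, automatic, and fast Python scraping tool. SmartScraper makes it easy to scrape data from a page: there is no need to learn locator libraries such as pyquery or beautifulsoup. We only need to give it a url and a few sample values, and it learns the rules for locating that data in the page.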
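To make the idea of "learning a locating rule from an example" more concrete, here is a minimal sketch that uses only the standard library. It is not SmartScraper's actual algorithm: the class `PathCollector`, the function `learn_and_extract`, and the HTML snippet are all made up for illustration. The sketch records the tag path of every text node, looks up the path of the sample value, and returns every text that shares that path.

```python
from html.parser import HTMLParser


class PathCollector(HTMLParser):
    """Record a (tag path, text) pair for every non-empty text node.

    Toy version: assumes well-formed HTML without void tags such as <br>.
    """

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.items = []   # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def learn_and_extract(html, sample):
    """Find the tag path of `sample`, then return all texts on that path."""
    collector = PathCollector()
    collector.feed(html)
    rule = next((path for path, text in collector.items if text == sample), None)
    if rule is None:
        return []
    return [text for path, text in collector.items if path == rule]


# Hypothetical page fragment, not Douban's real markup.
html = """
<ul>
  <li class="item"><h2>活着</h2><div class="pub">余华 / 作家出版社</div></li>
  <li class="item"><h2>白夜行</h2><div class="pub">东野圭吾 / 南海出版公司</div></li>
</ul>
"""

titles = learn_and_extract(html, "活着")
print(titles)
```

One example value is enough here because both titles sit on the same tag path. Real pages are messier, which is why a dedicated tool is convenient.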

1. Installation

pip install smartscraper

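2. Quick Start

2.1 Getting similar results

For example, suppose we want the titles and publication info of the 20 books listed on a page of the Douban Books "小说" (fiction) tag. The first two pages are: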

P1 https://book.douban.com/tag/小说?start=0&type=T

P2 https://book.douban.com/tag/小说?start=20&type=T
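P1 and P2 differ only in the `start` parameter, which moves from 0 to 20. A minimal sketch of generating such page links follows. It assumes later pages keep advancing `start` by 20, which the two links suggest but do not guarantee. The helper `page_url` and its parameters are hypothetical names introduced here.

```python
from urllib.parse import quote

# URL pattern taken from P1 and P2 above
BASE = "https://book.douban.com/tag/{tag}?start={start}&type=T"


def page_url(page, tag="小说", page_size=20, percent_encode=False):
    """Build the link for a 1-based page number.

    Assumption: every page holds `page_size` books, so page n starts at
    (n - 1) * page_size.
    """
    if page < 1:
        raise ValueError("page must be >= 1")
    # Optionally percent-encode the non-ASCII tag for clients that require it
    tag_part = quote(tag) if percent_encode else tag
    return BASE.format(tag=tag_part, start=(page - 1) * page_size)


p1 = page_url(1)
p2 = page_url(2)
print(p1)
print(p2)
```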

We use the P1 link to train two fields: the title and the publication info.

from smartscraper import SmartScraper

# URL of the page to train on
url = 'https://book.douban.com/tag/小说?start=0&type=T'

# Fields we want, each with one example value copied from the page
wanted_dict = {
    "title": ["活着"],
    "pub": ["余华 / 作家出版社 / 2012-8-1 / 20.00元"],
}

# Train: find the rules in the page at url that locate the wanted_dict examples
scraper = SmartScraper()
results = scraper.build(url, wanted_dict=wanted_dict)
print(results)

Running the code produces the following results:

{'title': ['活着',
           '房思琪的初恋乐园',
           '白夜行',
           '索拉里斯星',
           '鄙视',
           ...],
 'pub': ['余华 / 作家出版社 / 2012-8-1 / 20.00元',
         '林奕含 / 北京联合出版公司 / 2018-2 / 45.00元',
         '[日] 东野圭吾 / 刘姿君 / 南海出版公司 / 2013-1-1 / CNY 39.50',
         '[波] 斯坦尼斯瓦夫·莱姆 / 靖振忠 / 译林出版社 / 2021-8 / 49.00元',
         '[意] 阿尔贝托·莫拉维亚 / 沈萼梅、刘锡荣 / 江苏凤凰文艺出版社 / 2021-7 / 62.00',
         ...]
}
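The result is a dict of parallel lists, so a natural next step is to pair each title with its publication string and split that string into fields. The sketch below does this on three of the rows shown above. The helper `parse_pub` is a hypothetical name, and it assumes the last three slash-separated parts are always publisher, date, and price. That holds for the rows shown here but may not hold for every book.

```python
def parse_pub(pub):
    """Split a publication string into people, publisher, date, and price.

    Assumption: the last three parts are publisher, date, price; whatever
    comes before them is the author followed by any translators.
    """
    parts = [part.strip() for part in pub.split("/")]
    if len(parts) < 4:
        raise ValueError(f"unexpected pub format: {pub!r}")
    *people, publisher, date, price = parts
    return {"people": people, "publisher": publisher, "date": date, "price": price}


# Three rows copied from the results above
results = {
    "title": ["活着", "房思琪的初恋乐园", "白夜行"],
    "pub": [
        "余华 / 作家出版社 / 2012-8-1 / 20.00元",
        "林奕含 / 北京联合出版公司 / 2018-2 / 45.00元",
        "[日] 东野圭吾 / 刘姿君 / 南海出版公司 / 2013-1-1 / CNY 39.50",
    ],
}

# Pair the two lists by position and merge each pair into one record
books = [
    {"title": title, **parse_pub(pub)}
    for title, pub in zip(results["title"], results["pub"])
]

for book in books:
    print(book)
```

Note that `zip` pairs items purely by position, so it is worth checking that both lists have the same length before trusting the pairing.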