If you want to scrape data from a website quickly without writing crawler code yourself, Web Scraper is the tool for the job.
Web Scraper is a Chrome extension that lets you pick the page data you want to extract in a WYSIWYG fashion and save the selection as a template (a sitemap). You can re-run that template at any time, and the results can be exported as CSV. It is extremely handy and powerful; once you try it you will see why.
Download: https://www.webscraper.io/
Under the hood, Web Scraper generates a sitemap file in JSON that records the selection path (a CSS selector) for each data node. When you click Scrape, Web Scraper opens the page the way a user would, waits for it to finish loading, and then extracts the data.
I gave it a quick try: 58.com, Zhihu, Amazon, Lianjia, Weibo, and other sites could all be scraped. Weibo is the trickiest of these, because you have to keep scrolling down to load all of the content, but even that can be handled with Web Scraper.
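As an aside, the scrolling that Weibo needs is exactly what the Element Scroll Down selector (listed below) automates. Reproduced by hand with Selenium, the idea looks roughly like the sketch below; the loop and the 2-second pause are illustrative assumptions, not Web Scraper's actual implementation:
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://weibo.com')  # any infinite-scroll page works the same way

# Keep scrolling to the bottom until the page height stops growing,
# i.e. until no more content gets loaded.
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(2)  # give the page time to load the next batch
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height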
Selectors
A sitemap can define 11 kinds of selectors in total:
Text (plain text content)
Link (a hyperlink)
Popup Link (a link that opens in a popup window)
Image (an image)
Table (a table)
Element Attribute (the value of an HTML attribute, e.g. src or href)
HTML (an element's HTML source)
Element (a plain container; marks the region to scrape)
Element Scroll Down (content loaded by scrolling, like the Weibo feed)
Element Click (content loaded by clicking, e.g. irregular pagination such as Weibo comments)
Grouped (?)
Element + X Selector
The usual pattern is to define an Element selector first (to mark the region to scrape) and then define the selectors for the actual data inside it, as the sample below shows.
Sample Sitemap
{"_id":"fangchan","startUrl":["http://sh.58.com/ershoufang/?key=%E8%A3%95%E9%B8%BF%E4%BD%B3%E8%8B%91"],
"selectors":
[
{"id":"housesource","type":"SelectorElement","selector":"div.list-info","parentSelectors": ["_root"],"multiple":false,"delay":0},
{"id":"desc","type":"SelectorText","selector":"h2.title a","parentSelectors":["housesource"],"multiple":false,"regex":"","delay":0},
{"id":"area","type":"SelectorText","selector":"p.baseinfo span:nth-of-type(2)","parentSelectors":["housesource"],"multiple":false,"regex":"","delay":0},
{"id":"detail","type":"SelectorLink","selector":"h2.title a","parentSelectors":["housesource"],"multiple":false,"delay":0},
{"id":"agent","type":"SelectorText","selector":"span.anxuan-qiye-text:nth-of-type(2)","parentSelectors":["detail"],"multiple":false,"regex":"","delay":"1000"},
{"id":"name","type":"SelectorText","selector":"div.agent-name a.c_000","parentSelectors":["detail"],"multiple":false,"regex":"","delay":0},
{"id":"telclick","type":"SelectorElementClick","selector":"p.phone-num","parentSelectors":["detail"],"multiple":false,"delay":"1000","clickElementSelector":"div.chat-phone-layer show-phone","clickType":"clickOnce","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},
{"id":"phone","type":"SelectorText","selector":"p.phone-num","parentSelectors":["detail"],"multiple":false,"regex":"","delay":0}
]
}
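To make the parent/child structure concrete, here is a rough hand-written Selenium equivalent of the list-page part of this sitemap. The CSS selectors are taken from the sitemap itself; whether 58.com's markup still matches them is an assumption:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://sh.58.com/ershoufang/?key=%E8%A3%95%E9%B8%BF%E4%BD%B3%E8%8B%91')

# "housesource" is the Element selector: each div.list-info is one record region.
for house in driver.find_elements(By.CSS_SELECTOR, 'div.list-info'):
    # "desc" and "area" are Text selectors scoped to that parent Element.
    desc = house.find_element(By.CSS_SELECTOR, 'h2.title a').text
    area = house.find_element(By.CSS_SELECTOR, 'p.baseinfo span:nth-of-type(2)').text
    print(desc, area)

driver.quit()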
Automation
The one drawback is that scraping and exporting have to be triggered by hand every time. The Web Scraper site does offer a cloud scraping service, but it is paid.
Could the scrape-and-export workflow be automated? Certainly. Python can drive Chrome, so the work boils down to parsing the sitemap and scraping according to each selector type it defines. Implementing all 11 selector types well would still take a lot of thought and time; a minimal sketch of the idea follows the example below.
Example code for driving Chrome from Python:
You need to download ChromeDriver first: http://chromedriver.chromium.org/
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
options.add_argument('--window-size=800,600')
# options.add_argument('--headless')  # uncomment to run without a visible window

driver = webdriver.Chrome(service=Service('/usr/local/bin/chromedriver'), options=options)
driver.get('https://reddit.com')
time.sleep(2)  # crude wait for the page to render

# The long class name is specific to Reddit's markup at the time of writing.
top_links = driver.find_elements(By.XPATH, "//div/span/a[contains(@class, 'SQnoC3ObvgnGjWt90zD9Z')]")
for link in top_links:
    print('Title:', link.text)

driver.quit()
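Going one step further, here is a minimal sketch of the sitemap-driven approach mentioned above. It is an assumption-laden illustration, not a full implementation: it only handles SelectorText selectors nested under a SelectorElement parent (as in the "fangchan" sample), it ignores the other nine selector types as well as link following, and the scrape_sitemap name is my own:
import json
from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_sitemap(driver, sitemap):
    # Open the start URL, then apply every SelectorText selector that is
    # nested under a SelectorElement parent. All other types are ignored.
    driver.get(sitemap['startUrl'][0])
    selectors = sitemap['selectors']
    rows = []
    for parent in [s for s in selectors if s['type'] == 'SelectorElement']:
        children = [s for s in selectors
                    if s['type'] == 'SelectorText'
                    and parent['id'] in s['parentSelectors']]
        for element in driver.find_elements(By.CSS_SELECTOR, parent['selector']):
            row = {}
            for child in children:
                matches = element.find_elements(By.CSS_SELECTOR, child['selector'])
                row[child['id']] = matches[0].text if matches else ''
            rows.append(row)
    return rows

driver = webdriver.Chrome()
with open('sitemap.json') as f:  # e.g. the sample sitemap above, saved to disk
    print(scrape_sitemap(driver, json.load(f)))
driver.quit()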