If you want to scrape data from a website quickly without writing crawler code yourself, Web Scraper is the tool for the job.
Web Scraper is a Chrome extension that lets you pick the page data you want to extract in a WYSIWYG fashion and save the selection as a template (a sitemap). You can re-run that template at any time, and the results can be exported as CSV. It is extremely handy and powerful; once you try it you will see why.
Download: https://www.webscraper.io/
Under the hood, Web Scraper generates a sitemap file in JSON that records the selection path (a CSS selector) for each data node. When you click Scrape, Web Scraper opens the page the way a user would, waits for it to finish loading, and then extracts the data.
I gave it a quick try: 58.com, Zhihu, Amazon, Lianjia, Weibo, and other sites could all be scraped. Weibo is the trickiest of these, because you have to keep scrolling down to load all of the content, but even that can be handled with Web Scraper.
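As an aside, the scrolling that Weibo needs is exactly what the Element Scroll Down selector (listed below) automates. Reproduced by hand with Selenium, the idea looks roughly like the sketch below; the loop and the 2-second pause are illustrative assumptions, not Web Scraper's actual implementation:
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://weibo.com')  # any infinite-scroll page works the same way

# Keep scrolling to the bottom until the page height stops growing,
# i.e. until no more content gets loaded.
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    time.sleep(2)  # give the page time to load the next batch
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height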
Selectors
A sitemap can define 11 kinds of selectors in total:
Text (plain text content)
Link (a hyperlink)
Popup Link (a link that opens in a popup window)
Image (an image)
Table (a table)
Element Attribute (the value of an HTML attribute, e.g. src or href)
HTML (an element's HTML source)
Element (a plain container; marks the region to scrape)
Element Scroll Down (content loaded by scrolling, like the Weibo feed)
Element Click (content loaded by clicking, e.g. irregular pagination such as Weibo comments)
Grouped (?)
Element + X Selector
The usual pattern is to define an Element selector first (to mark the region to scrape) and then define the selectors for the actual data inside it, as the sample below shows.
Sample Sitemap
{"_id":"fangchan","startUrl":["http://sh.58.com/ershoufang/?key=%E8%A3%95%E9%B8%BF%E4%BD%B3%E8%8B%91"],
"selectors":
[
{"id":"housesource","type":"SelectorElement","selector":"div.list-info","parentSelectors": ["_root"],"multiple":false,"delay":0},
{"id":"desc","type":"SelectorText","selector":"h2.title a","parentSelectors":["housesource"],"multiple":false,"regex":"","delay":0},
{"id":"area","type":"SelectorText","selector":"p.baseinfo span:nth-of-type(2)","parentSelectors":["housesource"],"multiple":false,"regex":"","delay":0},
{"id":"detail","type":"SelectorLink","selector":"h2.title a","parentSelectors":["housesource"],"multiple":false,"delay":0},
{"id":"agent","type":"SelectorText","selector":"span.anxuan-qiye-text:nth-of-type(2)","parentSelectors":["detail"],"multiple":false,"regex":"","delay":"1000"},
{"id":"name","type":"SelectorText","selector":"div.agent-name a.c_000","parentSelectors":["detail"],"multiple":false,"regex":"","delay":0},
{"id":"telclick","type":"SelectorElementClick","selector":"p.phone-num","parentSelectors":["detail"],"multiple":false,"delay":"1000","clickElementSelector":"div.chat-phone-layer show-phone","clickType":"clickOnce","discardInitialElements":false,"clickElementUniquenessType":"uniqueText"},
{"id":"phone","type":"SelectorText","selector":"p.phone-num","parentSelectors":["detail"],"multiple":false,"regex":"","delay":0}
]
}
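To make the parent/child structure concrete, here is a rough hand-written Selenium equivalent of the list-page part of this sitemap. The CSS selectors are taken from the sitemap itself; whether 58.com's markup still matches them is an assumption:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('http://sh.58.com/ershoufang/?key=%E8%A3%95%E9%B8%BF%E4%BD%B3%E8%8B%91')

# "housesource" is the Element selector: each div.list-info is one record region.
for house in driver.find_elements(By.CSS_SELECTOR, 'div.list-info'):
    # "desc" and "area" are Text selectors scoped to that parent Element.
    desc = house.find_element(By.CSS_SELECTOR, 'h2.title a').text
    area = house.find_element(By.CSS_SELECTOR, 'p.baseinfo span:nth-of-type(2)').text
    print(desc, area)

driver.quit()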
Automation
The one drawback is that scraping and exporting have to be triggered by hand every time. The Web Scraper site does offer a cloud scraping service, but it is paid.
Could the scrape-and-export workflow be automated? Certainly. Python can drive Chrome, so the work boils down to parsing the sitemap and scraping according to each selector type it defines. Implementing all 11 selector types well would still take a lot of thought and time; a minimal sketch of the idea follows the example below.
Example code for driving Chrome from Python:
You need to download ChromeDriver first: http://chromedriver.chromium.org/
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
options.add_argument('--window-size=800,600')
# options.add_argument('--headless')  # uncomment to run without a visible window

driver = webdriver.Chrome(service=Service('/usr/local/bin/chromedriver'), options=options)
driver.get('https://reddit.com')
time.sleep(2)  # crude wait for the page to render

# The long class name is specific to Reddit's markup at the time of writing.
top_links = driver.find_elements(By.XPATH, "//div/span/a[contains(@class, 'SQnoC3ObvgnGjWt90zD9Z')]")
for link in top_links:
    print('Title:', link.text)

driver.quit()
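Going one step further, here is a minimal sketch of the sitemap-driven approach mentioned above. It is an assumption-laden illustration, not a full implementation: it only handles SelectorText selectors nested under a SelectorElement parent (as in the "fangchan" sample), it ignores the other nine selector types as well as link following, and the scrape_sitemap name is my own:
import json
from selenium import webdriver
from selenium.webdriver.common.by import By

def scrape_sitemap(driver, sitemap):
    # Open the start URL, then apply every SelectorText selector that is
    # nested under a SelectorElement parent. All other types are ignored.
    driver.get(sitemap['startUrl'][0])
    selectors = sitemap['selectors']
    rows = []
    for parent in [s for s in selectors if s['type'] == 'SelectorElement']:
        children = [s for s in selectors
                    if s['type'] == 'SelectorText'
                    and parent['id'] in s['parentSelectors']]
        for element in driver.find_elements(By.CSS_SELECTOR, parent['selector']):
            row = {}
            for child in children:
                matches = element.find_elements(By.CSS_SELECTOR, child['selector'])
                row[child['id']] = matches[0].text if matches else ''
            rows.append(row)
    return rows

driver = webdriver.Chrome()
with open('sitemap.json') as f:  # e.g. the sample sitemap above, saved to disk
    print(scrape_sitemap(driver, json.load(f)))
driver.quit()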