Python Web Crawler

Author: 志明S | Published 2017-02-08 15:35

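I have recently been scraping company data from Tianyancha (天眼查). Its pages are rendered with JavaScript, so fetching them with requests alone no longer returns the data. I came up with two approaches.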

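1. Use selenium + PhantomJS to simulate a browser
This approach did retrieve the data I wanted. Its drawback is speed: each record takes tens of seconds on average. The code is below.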

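from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Copy the default PhantomJS capabilities and override the user agent
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Mobile Safari/537.36"
)

# webdriver.PhantomJS exists in Selenium 3.x and earlier; it was removed in Selenium 4
# url is the page to scrape and must be defined before this point
driver = webdriver.PhantomJS(desired_capabilities=dcap)
try:
    driver.get(url)
    # print(driver.page_source)
    soup = BeautifulSoup(driver.page_source, 'lxml')  # parse the rendered HTML
finally:
    driver.quit()  # always release the browser process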
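
The snippet above starts and quits a browser for a single page. If part of the per-record cost comes from launching the browser each time (an assumption, not something measured here), one possible variation is to start the driver once and reuse it for a list of URLs. The sketch below is illustrative only: `fetch_rendered_pages`, `StubDriver`, and the example.com URLs are hypothetical names, and the stub merely stands in for a real driver so the example runs without a browser installed.

```python
from typing import Callable, Dict, Iterable


def fetch_rendered_pages(urls: Iterable[str],
                         make_driver: Callable[[], object]) -> Dict[str, str]:
    """Load every URL in one shared browser session; return {url: rendered HTML}."""
    driver = make_driver()  # start the browser once, not once per page
    pages = {}
    try:
        for url in urls:
            driver.get(url)
            pages[url] = driver.page_source
    finally:
        driver.quit()  # release the browser even if a page fails to load
    return pages


class StubDriver:
    """Hypothetical stand-in with the same get/page_source/quit members used above."""

    def __init__(self):
        self.page_source = ""
        self.closed = False

    def get(self, url):
        # A real driver would load and render the page here.
        self.page_source = "<html><body>rendered: {}</body></html>".format(url)

    def quit(self):
        self.closed = True


pages = fetch_rendered_pages(
    ["http://example.com/a", "http://example.com/b"], StubDriver
)
print(len(pages))
```

With the real driver, the factory argument could be something like `lambda: webdriver.PhantomJS(desired_capabilities=dcap)`, and each returned HTML string can then be passed to `BeautifulSoup(html, 'lxml')` as before. Whether this noticeably shortens the time per record depends on how much of it is page rendering rather than startup.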