Information Retrieval Tools

Author: 诺之林 | Published 2020-03-01 22:42

    Tools

    Requests

    Requests: HTTP for Humans

    import requests
    
    # Fetch the Baidu homepage and print the raw HTML of the response
    res = requests.get(url='https://www.baidu.com/')
    txt = res.text
    print(txt)
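
Beyond a plain GET, Requests can also encode query parameters into the URL for you. A minimal sketch, preparing a request without actually sending it so the encoded URL can be inspected offline (the `/s` path and `wd` search parameter are illustrative assumptions, not from the original):

```python
import requests

# Prepare (but do not send) a GET request to inspect the encoded URL;
# the 'wd' search parameter here is a hypothetical example.
req = requests.Request('GET', 'https://www.baidu.com/s', params={'wd': 'python'})
prepared = req.prepare()
print(prepared.url)  # https://www.baidu.com/s?wd=python
```

Sending it for real would be `requests.get('https://www.baidu.com/s', params={'wd': 'python'})`, which performs the same encoding.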
    

    Selenium

    Selenium automates browsers

    from selenium import webdriver
    
    # Launch Chrome, load Baidu, dump the rendered HTML, then close the window
    browser = webdriver.Chrome()
    browser.get('https://www.baidu.com')
    print(browser.page_source)
    browser.close()
    

    For the Chrome browser, install ChromeDriver; for Firefox, install geckodriver.

    Pyppeteer

    Unofficial Python port of Puppeteer, the JavaScript (headless) Chrome/Chromium browser automation library

    import asyncio
    from pyppeteer import launch
    
    async def main():
        # Launch a headless Chromium, open Baidu, save a screenshot, then quit
        browser = await launch()
        page = await browser.newPage()
        await page.goto('https://www.baidu.com')
        await page.screenshot({'path': 'baidu.png'})
        await browser.close()
    
    asyncio.run(main())
    

    Splash

    Lightweight, scriptable browser as a service with an HTTP API

    docker run --name py-splash -p 8050:8050 -d scrapinghub/splash
    
    pipenv run scrapy startproject splash_demo
    
    cd splash_demo
    
    vim splash_demo/settings.py
    
    ROBOTSTXT_OBEY = False
    
    SPLASH_URL = 'http://localhost:8050'
    
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
    
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
    
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
    
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
    
    vim splash_demo/spiders/TaobaoSpider.py
    
    import scrapy
    from scrapy_splash import SplashRequest
    
    class TaobaoSpider(scrapy.Spider):
        name = "taobao"
        allowed_domains = ["www.taobao.com"]
        start_urls = ['https://s.taobao.com/search?q=坚果&s=880&sort=sale-desc']
    
        def start_requests(self):
            for url in self.start_urls:
                yield SplashRequest(url, self.parse, args={'wait': 0.5})
    
        def parse(self, response):
            print(response.text)
    
    pipenv run scrapy crawl taobao
    
