Requests, parsing, and JS rendering in one: the requests-html library


Author: 越大大雨天 | Published 2019-06-04 13:49 | Read 289 times

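    Overview of common methods

    I. Requesting a URL to get the basic response object

    Unlike the familiar requests library, requests-html sends its requests through a persistent session by default, and what it returns is an object with a rich set of methods.

    1. Basic way to request a URL: HTMLSession.get()
    import requests_html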
    
    session = requests_html.HTMLSession()
    r = session.get('https://python.org/')
    

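    The returned r is a response object, and many methods can be called on it to work with the page.
    Two classes do most of the work: HTML, which r.html is an instance of, and Element, which the parsing methods return.

    2. Getting a User-Agent automatically
      No need to copy a User-Agent into the request headers by hand each time
    # Generate a User-Agent string automatically; the default is Chrome style
    user_agent = requests_html.user_agent()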
    

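    II. Basic operations on the response object

    HTML is probably the class you will use most; it offers several ways to parse the page source.

    • r.html.links: extracts and returns every URL link in the response source as a collection (a set in current releases of the library)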
    >>> r.html.links
    ['//docs.python.org/3/tutorial/', '/about/apps/', 'https://github.com/python/pythondotorg/issues', '/accounts/login/', '/dev/peps/', '/about/legal/', '//docs.python.org/3/tutorial/introduction.html#lists', '/download/alternatives', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', '/download/other/', '/downloads/windows/', 'https://mail.python.org/mailman/listinfo/python-dev', '/doc/av', 'https://devguide.python.org/', '/about/success/#engineering', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', '/about/gettingstarted/', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', '/success-stories/industrial-light-magic-runs-python/', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', '/', 'http://pyfound.blogspot.com/', '/events/python-events/past/', '/downloads/release/python-2714/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://status.python.org/', '/community/workshops/', '/community/lists/', 'http://buildbot.net/', '/community/awards', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', '/psf/donations/', 'http://wiki.python.org/moin/Languages', '/dev/', '/events/python-user-group/', 'https://wiki.qt.io/PySide', '/community/sigs/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'http://planetpython.org/', '/events/python-events', '/about/help/', '/events/python-user-group/past/', '/about/success/', '/psf-landing/', '/about/apps', '/about/', 'http://www.wxpython.org/', '/events/python-user-group/665/', 'https://www.python.org/psf/codeofconduct/', '/dev/peps/peps.rss', '/downloads/source/', '/psf/sponsorship/sponsors/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 
'https://bugs.python.org/', '/community/merchandise/', 'http://tornadoweb.org', '/events/python-user-group/650/', 'http://flask.pocoo.org/', '/downloads/release/python-364/', '/events/python-user-group/660/', '/events/python-user-group/638/', '/psf/', '/doc/', 'http://blog.python.org', '/events/python-events/604/', '/about/success/#government', 'http://python.org/dev/peps/', 'https://docs.python.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', '/users/membership/', '/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', '/downloads/', '/jobs/', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', '/privacy/', 'https://pypi.python.org/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'http://www.scipy.org', '/community/forums/', '/about/success/#scientific', '/about/success/#software-development', '/shell/', '/accounts/signup/', 'http://www.facebook.com/pythonlang?fref=ts', '/community/', 'https://kivy.org/', '/about/quotes/', 'http://www.web2py.com/', '/community/logos/', '/community/diversity/', '/events/calendars/', 'https://wiki.python.org/moin/BeginnersGuide', '/success-stories/', '/doc/essays/', '/dev/core-mentorship/', 'http://ipython.org', '/events/', '//docs.python.org/3/tutorial/controlflow.html', '/about/success/#education', '/blogs/', '/community/irc/', 'http://pycon.blogspot.com/', '//jobs.python.org', 'http://www.pylonsproject.org/', 'http://www.djangoproject.com/', '/downloads/mac-osx/', '/about/success/#business', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://docs.python.org/faq/', '//docs.python.org/3/tutorial/controlflow.html#defining-functions']
    
    • r.html.absolute_links: same as the previous one, except that the returned links are converted to absolute URLs
    >>> r.html.absolute_links
    ['https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', 'https://www.python.org/about/success/', 'http://feedproxy.google.com/~r/PythonInsider/~3/kihd2DW98YY/python-370a4-is-available-for-testing.html', 'https://www.python.org/dev/peps/', 'https://mail.python.org/mailman/listinfo/python-dev', 'https://www.python.org/doc/', 'https://www.python.org/', 'https://www.python.org/about/', 'https://www.python.org/events/python-events/past/', 'https://devguide.python.org/', 'https://wiki.python.org/moin/PythonEventsCalendar#Submitting_an_Event', 'https://www.openstack.org', 'http://feedproxy.google.com/~r/PythonInsider/~3/AMoBel8b8Mc/python-3.html', 'https://docs.python.org/3/tutorial/introduction.html#lists', 'http://docs.python.org/3/tutorial/introduction.html#using-python-as-a-calculator', 'http://pyfound.blogspot.com/', 'https://wiki.python.org/moin/PythonBooks', 'http://plus.google.com/+Python', 'https://wiki.python.org/moin/', 'https://www.python.org/events/python-events', 'https://status.python.org/', 'https://www.python.org/about/apps', 'https://www.python.org/downloads/release/python-2714/', 'https://www.python.org/psf/donations/', 'http://buildbot.net/', 'http://twitter.com/ThePSF', 'https://docs.python.org/3/license.html', 'http://wiki.python.org/moin/Languages', 'https://docs.python.org/faq/', 'https://jobs.python.org', 'https://www.python.org/about/success/#software-development', 'https://www.python.org/about/success/#education', 'https://www.python.org/community/logos/', 'https://www.python.org/doc/av', 'https://wiki.qt.io/PySide', 'https://www.python.org/events/python-user-group/660/', 'https://wiki.gnome.org/Projects/PyGObject', 'http://www.ansible.com', 'http://www.saltstack.com', 'https://www.python.org/dev/peps/peps.rss', 'http://planetpython.org/', 'https://www.python.org/events/python-user-group/past/', 'https://docs.python.org/3/tutorial/controlflow.html#defining-functions', 
'https://www.python.org/community/diversity/', 'https://docs.python.org/3/tutorial/controlflow.html', 'https://www.python.org/community/awards', 'https://www.python.org/events/python-user-group/638/', 'https://www.python.org/about/legal/', 'https://www.python.org/dev/', 'https://www.python.org/download/alternatives', 'https://www.python.org/downloads/', 'https://www.python.org/community/lists/', 'http://www.wxpython.org/', 'https://www.python.org/about/success/#government', 'https://www.python.org/psf/', 'https://www.python.org/psf/codeofconduct/', 'http://bottlepy.org', 'http://roundup.sourceforge.net/', 'http://pandas.pydata.org/', 'http://brochure.getpython.info/', 'https://www.python.org/downloads/source/', 'https://bugs.python.org/', 'https://www.python.org/downloads/mac-osx/', 'https://www.python.org/about/help/', 'http://tornadoweb.org', 'http://flask.pocoo.org/', 'https://www.python.org/users/membership/', 'http://blog.python.org', 'https://www.python.org/privacy/', 'https://www.python.org/about/gettingstarted/', 'http://python.org/dev/peps/', 'https://www.python.org/about/apps/', 'https://docs.python.org', 'https://www.python.org/success-stories/', 'https://www.python.org/community/forums/', 'http://feedproxy.google.com/~r/PythonInsider/~3/zVC80sq9s00/python-364-is-now-available.html', 'https://www.python.org/community/merchandise/', 'https://www.python.org/about/success/#arts', 'https://wiki.python.org/moin/Python2orPython3', 'http://trac.edgewall.org/', 'http://feedproxy.google.com/~r/PythonInsider/~3/wh73_1A-N7Q/python-355rc1-and-python-348rc1-are-now.html', 'https://pypi.python.org/', 'https://www.python.org/events/python-user-group/650/', 'http://www.riverbankcomputing.co.uk/software/pyqt/intro', 'https://www.python.org/about/quotes/', 'https://www.python.org/downloads/windows/', 'https://www.python.org/events/calendars/', 'http://www.scipy.org', 'https://www.python.org/community/workshops/', 'https://www.python.org/blogs/', 
'https://www.python.org/accounts/signup/', 'https://www.python.org/events/', 'https://kivy.org/', 'http://www.facebook.com/pythonlang?fref=ts', 'http://www.web2py.com/', 'https://www.python.org/psf/sponsorship/sponsors/', 'https://www.python.org/community/', 'https://www.python.org/download/other/', 'https://www.python.org/psf-landing/', 'https://www.python.org/events/python-user-group/665/', 'https://wiki.python.org/moin/BeginnersGuide', 'https://www.python.org/accounts/login/', 'https://www.python.org/downloads/release/python-364/', 'https://www.python.org/dev/core-mentorship/', 'https://www.python.org/about/success/#business', 'https://www.python.org/community/sigs/', 'https://www.python.org/events/python-user-group/', 'http://ipython.org', 'https://www.python.org/shell/', 'https://www.python.org/community/irc/', 'https://www.python.org/about/success/#engineering', 'http://www.pylonsproject.org/', 'http://pycon.blogspot.com/', 'https://www.python.org/about/success/#scientific', 'https://www.python.org/doc/essays/', 'http://www.djangoproject.com/', 'https://www.python.org/success-stories/industrial-light-magic-runs-python/', 'http://feedproxy.google.com/~r/PythonInsider/~3/x_c9D0S-4C4/python-370b1-is-now-available-for.html', 'http://wiki.python.org/moin/TkInter', 'https://www.python.org/jobs/', 'https://www.python.org/events/python-events/604/']
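
To make the difference between links and absolute_links concrete, here is a minimal sketch of the conversion using only the standard library. It is an illustration of the idea, not the library's own implementation; the helper name to_absolute and the three sample links are assumptions chosen for the example.

```python
from urllib.parse import urljoin


def to_absolute(base_url, links):
    """Resolve every link against base_url, the way a browser would."""
    return {urljoin(base_url, link) for link in links}


base = "https://www.python.org/"
links = {
    "/about/apps/",                   # path relative to the site root
    "//docs.python.org/3/tutorial/",  # protocol-relative link
    "https://status.python.org/",     # already absolute, left unchanged
}

for url in sorted(to_absolute(base, links)):
    print(url)
```

Root-relative paths pick up the scheme and host of the page, protocol-relative links pick up only the scheme, and absolute links pass through untouched.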
    
    • r.html.find(): select elements with a CSS selector
    >>> about = r.html.find('#about', first=True)  # the first element matching the CSS selector
    >>> about
    <Element 'li' aria-haspopup='true' class=('tier-1', 'element-1') id='about'>
    

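    find takes a CSS selector and first, which controls whether only the first match is returned.
    An element returned by a selector supports the same parsing methods as its parent, so queries can be nested.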
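
The nesting described above can be tried offline by building an HTML object from a string instead of a live response. This is a sketch under assumptions: the helper name menu_links and the SAMPLE markup are made up for illustration, and the import sits inside the function only so the snippet loads on machines where requests-html is not installed.

```python
SAMPLE = """
<ul>
  <li id="about"><a href="/about/">About</a>
    <ul><li><a href="/about/apps/">Applications</a></li></ul>
  </li>
  <li id="downloads"><a href="/downloads/">Downloads</a></li>
</ul>
"""


def menu_links(source):
    """Return (text, href) pairs for every <a> inside the element with id="about"."""
    # Lazy import: only needed when the function is actually called.
    from requests_html import HTML

    page = HTML(html=source)
    about = page.find('#about', first=True)  # first query returns an Element
    # Second query runs on that Element, so links outside #about are ignored.
    return [(a.text, a.attrs['href']) for a in about.find('a')]
```

Calling menu_links(SAMPLE) returns pairs only for the links under #about; the Downloads link is not included because the second find is scoped to the about element.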

    • r.html.xpath(): select elements with an XPath expression
    >>> r.html.xpath('a')
    [<Element 'a' class='btn' href='https://help.github.com/articles/supported-browsers'>]
    
    
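    • Further operations on an Element object
      .text: the text content
      .attrs: all attributes, as a dict
      .html: the HTML source of the element
      .links / .absolute_links: all links, as found or converted to absolute URLs
      .search / .search_all: match the text at the {} placeholder and return the first match or all matches, much like re.search and re.findall
      .find / .xpath: query one level further down
    >>> about = r.html.find('#about', first=True)
    
    >>> about.text  # all text inside the about element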
    'About\nApplications\nQuotes\nGetting Started\nHelp\nPython Brochure'
    
    >>> about.attrs  # all attribute key/value pairs of the about element, as a dict
    {'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}
    
    >>> about.html  # the HTML source of the Element
    '<li aria-haspopup="true" class="tier-1 element-1 " id="about">\n<a class="" href="/about/" title="">About</a>\n<ul aria-hidden="true" class="subnav menu" role="menu">\n<li class="tier-2 element-1" role="treeitem"><a href="/about/apps/" title="">Applications</a></li>\n<li class="tier-2 element-2" role="treeitem"><a href="/about/quotes/" title="">Quotes</a></li>\n<li class="tier-2 element-3" role="treeitem"><a href="/about/gettingstarted/" title="">Getting Started</a></li>\n<li class="tier-2 element-4" role="treeitem"><a href="/about/help/" title="">Help</a></li>\n<li class="tier-2 element-5" role="treeitem"><a href="http://brochure.getpython.info/" title="">Python Brochure</a></li>\n</ul>\n</li>'
    
    >>> about.absolute_links  # the links inside the Element, as absolute URLs
    {'http://brochure.getpython.info/', 'https://www.python.org/about/gettingstarted/', 'https://www.python.org/about/', 'https://www.python.org/about/quotes/', 'https://www.python.org/about/help/', 'https://www.python.org/about/apps/'}
    
    >>> r.html.search('Python is a {} language')[0]  # find the text that fills the {} slot in the page
    'programming'
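
For readers who think in regular expressions, the {} template matching can be approximated with the re module. requests-html itself delegates to the parse package, so the following is only a rough standard-library stand-in; the function names and the sample sentence are assumptions for illustration, and templates that end in {} are not handled well by this simple version.

```python
import re


def template_to_regex(template):
    """Compile a template with {} slots into a regex with one group per slot."""
    parts = template.split("{}")
    # Escape the literal text and join the pieces with non-greedy capture groups.
    pattern = "(.+?)".join(re.escape(part) for part in parts)
    return re.compile(pattern, re.DOTALL)


def search(template, text):
    """Return the captured slots of the first match, or None."""
    match = template_to_regex(template).search(text)
    return match.groups() if match else None


def search_all(template, text):
    """Return the captured slots of every non-overlapping match."""
    return [m.groups() for m in template_to_regex(template).finditer(text)]


text = "Python is a programming language. Python is a popular language."
print(search("Python is a {} language", text)[0])
print(search_all("Python is a {} language", text))
```

The first call prints the word in the first matching sentence, and the second collects one tuple per matching sentence.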
    

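    JavaScript rendering support

    This is a headline feature of requests-html: JavaScript rendering is built in, so pages whose source previously had to be fetched with Selenium can be handled without bringing in Selenium.

    • Usage:
      Parameters accepted by render():
      • retries - number of times to retry loading the page in Chromium
      • script - JavaScript to execute on the page (optional)
      • wait - seconds to wait before loading the page, to prevent timeouts (optional)
      • scrolldown - an integer n; if given, the page is paged down n times
      • sleep - an integer n; if given, the program pauses n seconds after the initial render
      • reload - if False, content is not reloaded from the browser but read from memory
      • keep_page - if True, lets you interact with the browser page through r.html.page

    If both scrolldown and sleep are given, the page is paged down the given number of times with a pause of sleep seconds at each step (e.g. scrolldown=10, sleep=1)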
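
Putting a few of these parameters together might look like the sketch below. The function name render_with_scroll and its defaults are assumptions for illustration; calling it needs network access and a downloaded Chromium, and the import is placed inside the function only so the snippet can be loaded where requests-html is not installed.

```python
def render_with_scroll(url, pages=10, pause=1):
    """Fetch url, render it in Chromium while paging down, and return the rendered HTML."""
    # Lazy import: only needed when the function is actually called.
    from requests_html import HTMLSession

    session = HTMLSession()
    r = session.get(url)
    # scrolldown pages down `pages` times; sleep pauses `pause` seconds at each step.
    r.html.render(retries=3, scrolldown=pages, sleep=pause)
    return r.html.html
```

A page that loads more items as you scroll would be a typical target, for example render_with_scroll(some_url, pages=5), where some_url stands for whatever page you are scraping.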

    >>> from requests_html import HTMLSession
    
    >>> session = HTMLSession()
    >>> r = session.get('http://python-requests.org/')
    
    >>> r.html.render()
    
    

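    render() loads the page in Chromium, runs its JavaScript, and replaces the content of r.html with the rendered version. If a script argument is passed, render() returns that script's return value.
    Two things to note:

    • The first time you use this feature, the Chromium support package is downloaded automatically. From mainland China the download usually fails because of network restrictions, so set up a proxy before the first run.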


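      [Figure: render downloading its support package automatically on first run]
    • The render and parsing methods can also be used without a session request: just import the HTML class, much as the Selector class can be used outside of Scrapy:
    >>> from requests_html import HTML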
    >>> doc = """<a href='https://httpbin.org'>"""
    
    >>> html = HTML(html=doc)
    >>> html.links
    {'https://httpbin.org'}
    
    >>> script = """
            () => {
                return {
                    width: document.documentElement.clientWidth,
                    height: document.documentElement.clientHeight,
                    deviceScaleFactor: window.devicePixelRatio,
                }
            }
        """
    >>> val = html.render(script=script, reload=False)
    
    >>> print(val)
    {'width': 800, 'height': 600, 'deviceScaleFactor': 1}
    
    >>> print(html.html)
    <html><head></head><body><a href="https://httpbin.org"></a></body></html>
    

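    Hands-on example

    Taking the Jianshu (简书) home page as the target, the follow-up article uses only requests-html to build request headers, send requests, nest parsing calls, extract links, render JavaScript, and scroll down through pages.
    See the follow-up article, "Requests, parsing, and JS rendering in one: a requests-html example": https://www.jianshu.com/p/d23741341795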
