requests-html: The Latest Scraping Library

Author: 何苦_python_java | Published: 2018-07-16 19:25

Preface: to help readers who are not comfortable with English, parts of this text have been translated.


This library aims to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.

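When using this library you automatically get:

Full JavaScript support!

CSS Selectors (a.k.a. jQuery-style, thanks to PyQuery).
XPath Selectors, for the faint of heart.
Mocked user-agent (like a real web browser).
Automatic following of redirects.
Connection pooling and cookie persistence.
The Requests experience you know and love, with magical parsing abilities.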

Tutorial & Usage

Make a GET request to 'python.org', using Requests:

>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('https://python.org/')

Grab a list of all links on the page, as-is (anchors excluded):

>>> r.html.links
{'//docs.python.org/3/tutorial/', '/about/apps...
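To make "as-is, anchors excluded" concrete, here is a minimal standard-library sketch that collects `href` values exactly as written and skips pure `#...` anchors. It is only an illustration of the idea, not how requests-html implements `links`; the `LinkCollector` class, the `collect_links` helper, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, skipping pure anchors like '#top'."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Keep the link as written; drop missing values and '#...' anchors.
        if href and not href.startswith("#"):
            self.links.add(href.strip())


def collect_links(markup):
    """Return the set of non-anchor links found in the markup (hypothetical helper)."""
    parser = LinkCollector()
    parser.feed(markup)
    return parser.links


# Made-up sample markup, loosely modelled on the links shown above.
sample = """
<a href="/about/apps/">Applications</a>
<a href="//docs.python.org/3/tutorial/">Tutorial</a>
<a href="#content">Skip to content</a>
"""
print(sorted(collect_links(sample)))
```

Note that the relative and protocol-relative links come back untouched, which is why a separate absolute form is useful.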


Grab a list of all links on the page, in absolute form (anchors excluded):

>>> r.html.absolute_links
{'https://github.com/python/pythondotorg/issues', 'https://docs.python.org/3/tutorial/', ...}
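What "absolute form" means can be sketched with `urllib.parse.urljoin` from the standard library: each raw link is resolved against the page's URL. This is an illustration under the assumption that the base URL is `https://www.python.org/`; it is not a claim about the library's internals.

```python
from urllib.parse import urljoin

# Assumed base URL: the address of the page the links were found on.
base_url = "https://www.python.org/"

raw_links = [
    "//docs.python.org/3/tutorial/",                  # protocol-relative
    "/about/apps/",                                   # site-relative
    "https://github.com/python/pythondotorg/issues",  # already absolute
]

# urljoin leaves absolute URLs alone and resolves the others against base_url.
absolute = {urljoin(base_url, link) for link in raw_links}
for link in sorted(absolute):
    print(link)
```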

Select an element with a CSS Selector:

>>> about = r.html.find('#about', first=True)

Grab an element's text content:

>>> print(about.text)
About
Applications
Quotes
Getting Started
Help
Python Brochure

Introspect an Element's attributes:

>>> about.attrs
{'id': 'about', 'class': ('tier-1', 'element-1'), 'aria-haspopup': 'true'}

Render out an Element's HTML:

>>> about.html
'<li aria-haspopup="true" class="tier-1 element-1" id="about">\n<a class="" href="/about/...

Select Elements within Elements:

>>> about.find('a')
[<Element 'a' href='/about/' title='' class=''>, <Element 'a' href='/about/apps/' title=''>, <Element 'a' href='/about/quotes/' title=''>, <Element 'a' href='/about/gettingstarted/' title=''>, <Element 'a' href='/about/help/' title=''>, <Element 'a' href='http://brochure.getpython.info/' title=''>]

Search for links within an element:

>>> about.absolute_links
{'http://brochure.getpython.info/', 'https://www.python.org/about/gettingstarted/', 'https://www.python.org/about/', 'https://www.python.org/about/quotes/', 'https://www.python.org/about/help/', 'https://www.python.org/about/apps/'}

Search for text on the page:

>>> r.html.search('Python is a {} language')[0]
programming
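The `{}` in the template acts as a placeholder that captures whatever text sits in that position. requests-html hands this matching to the `parse` package; the sketch below is only a rough standard-library approximation using `re`, handling positional `{}` placeholders and nothing else. The helper names and the sample sentence are hypothetical.

```python
import re


def template_to_regex(template):
    """Turn 'Python is a {} language' into a regex with one group per '{}'."""
    # Escape the literal pieces so characters like '.' or '?' match literally.
    parts = [re.escape(part) for part in template.split("{}")]
    # A lazy group stands in for each placeholder.
    return re.compile("(.+?)".join(parts))


def search_template(template, text):
    """Return the captured placeholder values, or None if nothing matches."""
    match = template_to_regex(template).search(text)
    return match.groups() if match else None


# Made-up sample text standing in for the page content.
text = "Python is a programming language that lets you work quickly."
result = search_template("Python is a {} language", text)
print(result[0])
```

Named placeholders such as `{months}` and type conversions are deliberately left out of this sketch.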

More complex CSS Selector example (copied from Chrome dev tools):

>>> r = session.get('https://github.com/')
>>> sel = 'body > div.application-main > div.jumbotron.jumbotron-codelines > div > div > div.col-md-7.text-center.text-md-left > p'
>>> print(r.html.find(sel, first=True).text)
GitHub is a development platform inspired by the way you work. From open source to business, you can host and review code, manage projects, and build software alongside millions of other developers.

XPath is also supported:

>>> r.html.xpath('/html/body/div[1]/a')
[<Element 'a' class=('px-2', 'py-4', 'show-on-focus', 'js-skip-to-content') href='#start-of-content' tabindex='1'>]
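To see what a path like `/html/body/div[1]/a` selects, the following sketch runs an equivalent query with `xml.etree.ElementTree`, which supports a limited XPath subset. It assumes a small, well-formed, made-up document (the `/pricing` link is hypothetical) and is not how requests-html evaluates XPath.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup; real pages usually need a forgiving HTML parser.
doc = """<html><body>
<div><a href="#start-of-content" tabindex="1">Skip to content</a></div>
<div><a href="/pricing">Pricing</a></div>
</body></html>"""

root = ET.fromstring(doc)  # root is the <html> element

# Equivalent of '/html/body/div[1]/a', written relative to the root element:
# the first <div> under <body>, then every <a> directly inside it.
matches = root.findall("./body/div[1]/a")
print([a.attrib for a in matches])
```

The `[1]` predicate is 1-based, so only the link inside the first `<div>` is returned.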

JavaScript Support

Let's grab some text that's rendered by JavaScript:

>>> r = session.get('http://python-requests.org')
>>> r.html.render()
>>> r.html.search('Python 2 will retire in only {months} months!')['months']
'<time>25</time>'

Note that the first time you run the render() method, it downloads Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once.

Using without Requests

You can also use this library without Requests:

>>> from requests_html import HTML
>>> doc = """<a href='https://httpbin.org'>"""
>>> html = HTML(html=doc)
>>> html.links
{'https://httpbin.org'}

Article link: https://www.haomeiwen.com/subject/aizbpftx.html