Python web scraping with requests-html

Author: 逍遥_yjz | Published 2021-06-07 16:22


I. Introduction

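Anyone who develops in Python has probably heard of the Requests library, which is used for sending HTTP requests. For example, when testing HTTP-based APIs in Python, Requests is usually the first choice because it is both simple and powerful. Its author, Kenneth Reitz, has since developed requests-html for web scraping.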

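According to the official documentation, it builds on the original requests module and adds several new features:

JavaScript support
CSS selectors (a.k.a. jQuery-style, thanks to PyQuery)
XPath selectors
Mocked user-agent (to look more like a real web browser)
Automatic following of redirects
Connection pooling and cookie persistence
Async support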

II. Installation

GitHub project:

https://github.com/kennethreitz/requests-html

Installing requests-html is simple and takes a single command. Note that requests-html only supports Python 3.6 or later, so anyone on an older Python version will need to upgrade first.

pip install requests-html

III. How to use requests-html

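The biggest difference between requests-html and other HTML parsing libraries is that those libraries are usually parsers only: you first have to download the page with a separate HTTP library and then pass it to the parser. requests-html handles the download itself, which makes it very convenient for scraping web pages.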

1. Basic usage

from requests_html import HTMLSession

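# Create a session object
session = HTMLSession()

# Send a GET request to the Sina News home page
sina = session.get('https://news.sina.com.cn/')
# print(sina.status_code)
sina.encoding = 'utf-8'

# Get the response text, exactly as with requests
print(sina.text)

The printed data is the page's HTML source, the same as what requests returns.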

2. Getting links (links and absolute_links)

from requests_html import HTMLSession

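# Create a session object
session = HTMLSession()

# Send a GET request to the JD.com home page
jd = session.get('https://jd.com/')

# Get every link on the JD.com home page; the result is a set
print(jd.html.links)
print('*' * 1000)

# If some of the links are relative, absolute_links returns them all in absolute form
print(jd.html.absolute_links)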

Output:

{'https://www.jd.com/hprm/737f6420ddea33b9f11.html', '//jdwx.jd.com', '//shangling.jd.com/', '//book.jd.com/', '//channel.jd.com/1672-2599.html', '//www.jd.com/sptopic/117296ca19d46b24dfeb3.html', 'https://www.jd.com/zxnews/1230cc31cb8193a1.html', '//channel.jd.com/1315-1345.html', 'https://www.healthjd.com/', 'https://www.jd.hk/', '//jipiao.jd.com/', 'https://www.jd.com/jxinfo/b02d72a3a86c6482.html', 'https://www.jd.com/nrjs/d38e83915c8abe8d.html', 'https://jiayouka.jd.com/', 'https://m.healthjd.com', '//phat.jd.com/10-184.html', 'https://yp.jd.com/737b483efcc87a9d40f.html', 'https://www.jd.com/book/73733a8f60691c31ae4.html', '//beauty.jd.com/', '//channel.jd.com/pet.html', 'https://fresh.jd.com/', '//diannao.jd.com/', '//jzt.jd.com/school/marketing/calendar', 'https://www.jd.com/zuozhe/7377a05de5fb71894c5.html', '//home.jd.com/', '//jos.jd.com/', '//vip.jd.com/', '//channel.jd.com/kitchenware.html', '//car.jd.com/', 'https://www.jd.com/tupian/7372701d36eaef71c49.html', 'https://www.qingzhouip.com/', 'https://www.jdcloud.com/cn/activity/618?utm_source=MO_jd618&utm_medium=banner&utm_campaign=ggw&utm_term=NA', '//health.jd.com', '//www.jd.com', '//cleanclean.jd.com', 'https://www.jd.com/hotitem/737c2bd91581b9e08ad.html', '//www.jd.com/hprm/62331efefe1affa158ff.html', '//union.jd.com', 'https://jdx.com', '//shuma.jd.com/', 'https://www.jd.com/jiage/737c6f7cc59aa07bc53.html', 'https://chaoshi.jd.com/', '//fresh.jd.com', 'https://www.jd.com/phb/737ec3ec0faa7fd8bf0.html', '//channel.jd.com/furniture.html', '//phat.jd.com/10-603.html', 'https://jr.jd.com/', '//www.jd.com/brand/9987019f0bd7d403e3de.html', '//baitiao.jd.com', '//game.jd.com/', '//china.jd.com', '//mvd.jd.com/', 'https://jiadian.jd.com/', 'https://o.jd.com/market/index.action', '//www.jd.com/sptopic/1316b578fe1de22368e4.html', '//www.jd.com/cppf/9847333cd3d99d6886d9.html', '//mro.jd.com/', 'https://a.jd.com/', '//jzjc.jd.com/', '//www.jd.com/hotitem/9855fbd5a67b591890f1.html', 
'https://licai.jd.com/?from=jrscyn_20161', '//channel.jd.com/home.html', '//www.jd.com/nrjs/3246ed949ba174ea.html', 'https://red.jd.com/', 'https://b.jd.com/', '//phat.jd.com/10-507.html', 'https://www.jd.com/phb/key_7371687f795f334fa89.html', 'https://www.jd.com/xinghao/737b323dec8be82ac47.html', 'https://paimai.jd.com/', '//shouji.jd.com/', 'https://www.jd.com/xinkuan/7379ebf25db31baf2d4.html', 'https://www.jd.com/phb/zhishi/1287a7973c0c0950.html', 'https://www.jd.com/jxinfo/8fe2ee88e406dd56.html', '//phat.jd.com/10-156.html', 'https://baitiao.jd.com/?from=jrscyn_20160', '//2.jd.com/', '//toy.jd.com/', '//jiu.jd.com', '//e.jd.com/ebook.html', '//art.jd.com', '//order.jd.com/center/list.action', '//anzhuang.jd.com', '//channel.jd.com/jewellery.html', '//passport.jd.com/uc/login?ReturnUrl=https%3A%2F%2Fwww.jd.com%2F', 'https://plus.jd.com/index?flow_system=appicon&flow_entrance=appicon11&flow_channel=pc', '//food.jd.com/', '//nong.jd.com', '//channel.jd.com/9192-9196.html', '//phat.jd.com/10-183.html', '//channel.jd.com/watch.html', '//phat.jd.com/10-272.html', 'https://miaosha.jd.com/', '//trip.jd.com/', '//wt.jd.com', '//che.jd.com/', '//www.jd.com/zuozhe/7378d855fa5f85d59a5.html', '//hotel.jd.com/', '//itb2b.jd.com/', '//b.jd.com/', '//phat.jd.com/10-185.html', '//ish.jd.com/', '//xinfang.jd.com/', 'https://movie.jd.com/index.html', 'https://www.jd.com/cppf/737f395415e20bb7cf7.html', '//channel.jd.com/beauty.html', '//phat.jd.com/10-109.html', 'https://train.jd.com/', '//bg.jd.com', '//baby.jd.com', '//jiadian.jd.com', '//z.jd.com/', '//chongzhi.jd.com/', '//reg.jd.com/reg/person?ReturnUrl=https%3A//www.jd.com/', '//licai.jd.com/', '//www.jd.com/book/737280eea8ac7dfea03.html', '//education.jd.com', 'https://www.jd.com/phb/zhishi/05cf807c181d474a.html', '//fresh.jd.com/shengxian/12218e48f879c700b44c1.html', '//cobrand.jd.com/', 'https://www.jd.com/sptopic/737e06d122fde17d1a5.html', 'https://www.jd.com/brand/737355d688c2bfc7f4a.html', '//bao.jd.com/', 
'//cart.jd.com/cart.action'}
****************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************************
{'https://channel.jd.com/1315-1345.html', 'https://phat.jd.com/10-507.html', 'https://phat.jd.com/10-109.html', 'https://www.jd.com/hprm/737f6420ddea33b9f11.html', 'https://www.jd.com/cppf/9847333cd3d99d6886d9.html', 'https://www.jd.com/nrjs/3246ed949ba174ea.html', 'https://book.jd.com/', 'https://nong.jd.com', 'https://www.jd.com/zxnews/1230cc31cb8193a1.html', 'https://home.jd.com/', 'https://www.healthjd.com/', 'https://www.jd.hk/', 'https://shuma.jd.com/', 'https://www.jd.com/jxinfo/b02d72a3a86c6482.html', 'https://jiadian.jd.com', 'https://www.jd.com/nrjs/d38e83915c8abe8d.html', 'https://m.healthjd.com', 'https://jiayouka.jd.com/', 'https://che.jd.com/', 'https://cart.jd.com/cart.action', 'https://passport.jd.com/uc/login?ReturnUrl=https%3A%2F%2Fwww.jd.com%2F', 'https://yp.jd.com/737b483efcc87a9d40f.html', 'https://jos.jd.com/', 'https://www.jd.com/book/73733a8f60691c31ae4.html', 'https://z.jd.com/', 'https://jiu.jd.com', 'https://licai.jd.com/', 'https://health.jd.com', 'https://fresh.jd.com/', 'https://www.jd.com/zuozhe/7377a05de5fb71894c5.html', 'https://ish.jd.com/', 'https://e.jd.com/ebook.html', 'https://phat.jd.com/10-156.html', 'https://www.jd.com/brand/9987019f0bd7d403e3de.html', 'https://www.jd.com/tupian/7372701d36eaef71c49.html', 'https://www.qingzhouip.com/', 'https://cobrand.jd.com/', 'https://www.jdcloud.com/cn/activity/618?utm_source=MO_jd618&utm_medium=banner&utm_campaign=ggw&utm_term=NA', 'https://art.jd.com', 'https://www.jd.com/hotitem/737c2bd91581b9e08ad.html', 'https://jdwx.jd.com', 'https://channel.jd.com/beauty.html', 'https://jdx.com', 'https://shouji.jd.com/', 'https://www.jd.com/jiage/737c6f7cc59aa07bc53.html', 'https://chaoshi.jd.com/', 'https://www.jd.com/phb/737ec3ec0faa7fd8bf0.html', 'https://jr.jd.com/', 'https://cleanclean.jd.com', 'https://anzhuang.jd.com', 'https://channel.jd.com/home.html', 'https://phat.jd.com/10-272.html', 'https://o.jd.com/market/index.action', 'https://jiadian.jd.com/', 'https://toy.jd.com/', 
'https://phat.jd.com/10-183.html', 'https://www.jd.com/zuozhe/7378d855fa5f85d59a5.html', 'https://jzt.jd.com/school/marketing/calendar', 'https://phat.jd.com/10-184.html', 'https://a.jd.com/', 'https://www.jd.com/book/737280eea8ac7dfea03.html', 'https://licai.jd.com/?from=jrscyn_20161', 'https://red.jd.com/', 'https://www.jd.com/sptopic/1316b578fe1de22368e4.html', 'https://fresh.jd.com', 'https://trip.jd.com/', 'https://chongzhi.jd.com/', 'https://jipiao.jd.com/', 'https://www.jd.com/hprm/62331efefe1affa158ff.html', 'https://b.jd.com/', 'https://order.jd.com/center/list.action', 'https://www.jd.com/phb/key_7371687f795f334fa89.html', 'https://www.jd.com/xinghao/737b323dec8be82ac47.html', 'https://beauty.jd.com/', 'https://paimai.jd.com/', 'https://www.jd.com/sptopic/117296ca19d46b24dfeb3.html', 'https://channel.jd.com/kitchenware.html', 'https://diannao.jd.com/', 'https://phat.jd.com/10-185.html', 'https://channel.jd.com/furniture.html', 'https://www.jd.com/xinkuan/7379ebf25db31baf2d4.html', 'https://www.jd.com/phb/zhishi/1287a7973c0c0950.html', 'https://www.jd.com/hotitem/9855fbd5a67b591890f1.html', 'https://www.jd.com/jxinfo/8fe2ee88e406dd56.html', 'https://education.jd.com', 'https://car.jd.com/', 'https://baitiao.jd.com/?from=jrscyn_20160', 'https://fresh.jd.com/shengxian/12218e48f879c700b44c1.html', 'https://channel.jd.com/jewellery.html', 'https://baitiao.jd.com', 'https://china.jd.com', 'https://channel.jd.com/watch.html', 'https://xinfang.jd.com/', 'https://plus.jd.com/index?flow_system=appicon&flow_entrance=appicon11&flow_channel=pc', 'https://union.jd.com', 'https://shangling.jd.com/', 'https://bg.jd.com', 'https://itb2b.jd.com/', 'https://miaosha.jd.com/', 'https://bao.jd.com/', 'https://baby.jd.com', 'https://vip.jd.com/', 'https://channel.jd.com/9192-9196.html', 'https://mvd.jd.com/', 'https://movie.jd.com/index.html', 'https://mro.jd.com/', 'https://www.jd.com/cppf/737f395415e20bb7cf7.html', 'https://train.jd.com/', 'https://hotel.jd.com/', 
'https://www.jd.com', 'https://2.jd.com/', 'https://wt.jd.com', 'https://reg.jd.com/reg/person?ReturnUrl=https%3A//www.jd.com/', 'https://game.jd.com/', 'https://phat.jd.com/10-603.html', 'https://jzjc.jd.com/', 'https://www.jd.com/phb/zhishi/05cf807c181d474a.html', 'https://www.jd.com/sptopic/737e06d122fde17d1a5.html', 'https://food.jd.com/', 'https://www.jd.com/brand/737355d688c2bfc7f4a.html', 'https://channel.jd.com/pet.html', 'https://channel.jd.com/1672-2599.html'}
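
The difference between the two sets above comes down to URL resolution: entries such as '//jdwx.jd.com' are scheme-relative and have to be completed against the page URL. The following is a minimal sketch of that idea using only the standard library's urllib.parse.urljoin. It is an illustration, not requests-html's own implementation, and the helper name to_absolute is hypothetical.

```python
from urllib.parse import urljoin


def to_absolute(base_url, links):
    """Resolve each link against base_url and return a set of absolute URLs."""
    return {urljoin(base_url, link) for link in links}


# Sample values taken from the output above
base = 'https://jd.com/'
raw_links = {'//jdwx.jd.com', '//cart.jd.com/cart.action', 'https://jr.jd.com/'}

for link in sorted(to_absolute(base, raw_links)):
    print(link)
```

Links that are already absolute pass through unchanged, while scheme-relative ones inherit the scheme of the base URL.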

3. Smart pagination (still being improved)

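This is the most eye-catching feature I have come across. It still has problems in practice, but I want to cover it here anyway. Before writing a scraper for static pages, we normally have to work out the URL pattern, for example:

Page 1 https://book.douban.com/tag/小说
Page 2 https://book.douban.com/tag/小说?start=20&type=T
Page 3 https://book.douban.com/tag/小说?start=40&type=T
Page 4 https://book.douban.com/tag/小说?start=60&type=T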

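from requests_html import HTMLSession

session = HTMLSession()

# Content pages are usually paginated and cannot be fetched in one go;
# this library can retrieve the pagination information:
r = session.get('https://book.douban.com/tag/小说')

print(r.html)
# For comparison, iterate over the pages
for page in r.html:
    print(page)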

Output:

<HTML url='https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4'>
<HTML url='https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4'>

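In practice, however, this approach did not work: iterating only returned the first page again. kennethreitz also notes in the documentation: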
There’s also intelligent pagination support (always improving)
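
Since automatic pagination did not help here, the page URLs can instead be built by hand from the pattern listed earlier. The sketch below assumes that the step of 20 per page seen in those four URLs continues for later pages; page_url, BASE_URL and PAGE_SIZE are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'https://book.douban.com/tag/小说'
PAGE_SIZE = 20  # assumed from the start=20, 40, 60 steps in the listed URLs


def page_url(page):
    """Return the listing URL for a 1-based page number."""
    if page < 1:
        raise ValueError('page numbers start at 1')
    if page == 1:
        # The first page in the listed pattern has no query string
        return BASE_URL
    query = urlencode({'start': (page - 1) * PAGE_SIZE, 'type': 'T'})
    return f'{BASE_URL}?{query}'


for n in range(1, 5):
    print(page_url(n))
```

Each generated URL can then be passed to session.get() in a loop.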

4. CSS selectors and XPath

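requests-html supports two syntaxes for selecting HTML elements: CSS selectors and XPath. Let's look at CSS selectors first; they are used through the find method of the HTML object.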

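'''
CSS selectors and XPath
    1. Select an Element object with a CSS selector
    2. Get the text content of an Element object
    3. Get all attributes of an Element object
    4. Render the HTML content of an Element object
    5. Get specific child Element objects within an Element object (returns a list)
    6. Search the fetched page for text with search
    7. XPath support
    8. Get only the Element objects that contain certain text

Basic CSS rules
    Tag name: h1
    id: written as #id
    class: written as .class_name
    Attribute predicate: h1[prop=value]
'''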
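
To make the basic CSS rules above concrete, here is a small sketch that assembles a selector string from a tag name, an id, a class and an attribute predicate. The function css_selector is a hypothetical helper written for this illustration and is not part of requests-html; the resulting strings are the kind of value passed to find(). The attribute value is quoted, which is also valid CSS.

```python
def css_selector(tag='', id_=None, class_=None, attr=None):
    """Compose a CSS selector string from the simple rules listed above.

    attr is an optional (name, value) pair for the [name="value"] form.
    """
    selector = tag
    if id_:
        selector += f'#{id_}'
    if class_:
        selector += f'.{class_}'
    if attr:
        name, value = attr
        selector += f'[{name}="{value}"]'
    return selector


print(css_selector('div', id_='content'))            # div#content
print(css_selector('h2', class_='news_entry'))       # h2.news_entry
print(css_selector('a', attr=('target', '_blank')))  # a[target="_blank"]
```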

Example 1

from requests_html import HTMLSession

session = HTMLSession()
url = "https://www.qiushibaike.com/text/"

# Get the response object
obj = session.get(url)

# 1. Select an Element object with a CSS selector
# Get the div whose id is content; first=True returns a single Element
content = obj.html.find('div#content', first=True)

# 2. Get the text content of an Element object
# Get all the text inside content
print(content.text)

# 3. Get all attributes of an Element object
# Get all the attributes of content
print(content.attrs)

# 4. Get the complete HTML of an Element object
html = content.html
print(html)

The example above is only meant to give a general idea.

Example 2

from requests_html import HTMLSession
session = HTMLSession()
r = session.get("https://news.cnblogs.com/n/recommend")
# Find the news links with a CSS selector
news = r.html.find('h2.news_entry > a')
for new in news:
    print(new)
    print(new.text)  # news title
    print(new.absolute_links)  # news link

Result:

<Element 'a' href='/n/695164/' target='_blank'>
“中了一个亿”的支付宝锦鲤信小呆谈现状：没钱没工作、查出抑郁症
{'https://news.cnblogs.com/n/695164/'}
<Element 'a' href='/n/695110/' target='_blank'>
GCC 9.4发布：不再强制要求代码贡献版权转让给FSF
{'https://news.cnblogs.com/n/695110/'}
<Element 'a' href='/n/695095/' target='_blank'>
清华大学迎来中国首个原创虚拟学生：智商和情商双高颜值出众
{'https://news.cnblogs.com/n/695095/'}
<Element 'a' href='/n/695021/' target='_blank'>
今年618，我选择“躺平”
{'https://news.cnblogs.com/n/695021/'}
<Element 'a' href='/n/695001/' target='_blank'>
阿里平头哥发力RISC-V！三款开发板齐发：搭载玄铁910、906处理器
{'https://news.cnblogs.com/n/695001/'}
<Element 'a' href='/n/694949/' target='_blank'>

北大数学大神手提馒头矿泉水接受采访走红：连拿两届国际奥赛冠军
{'https://news.cnblogs.com/n/694949/'}
<Element 'a' href='/n/694943/' target='_blank'>
被拒40次，用行李箱装卫星中国商业航天的6年蜕变
{'https://news.cnblogs.com/n/694943/'}
<Element 'a' href='/n/694911/' target='_blank'>
我国“天舟二号”货运飞船发射成功！为空间站运送生活物资
{'https://news.cnblogs.com/n/694911/'}
<Element 'a' href='/n/694823/' target='_blank'>
阿里云：飞天操作系统正全面兼容X86、ARM、RISC-V
{'https://news.cnblogs.com/n/694823/'}
<Element 'a' href='/n/694821/' target='_blank'>
蚂蚁集团自研数据库OceanBase决定开放源码：测评性能10倍于微软
{'https://news.cnblogs.com/n/694821/'}
<Element 'a' href='/n/694747/' target='_blank'>
全球Top 1000计算机科学家h指数公布：中国53位学者
{'https://news.cnblogs.com/n/694747/'}
<Element 'a' href='/n/694701/' target='_blank'>
杨振宁先生捐赠清华大学“杨振宁资料室”
{'https://news.cnblogs.com/n/694701/'}
<Element 'a' href='/n/694696/' target='_blank'>
华为：首台HI版极狐阿尔法S生产线验证样车下线
{'https://news.cnblogs.com/n/694696/'}
<Element 'a' href='/n/694649/' target='_blank'>
微软Build大会：怎么“淘汰”程序员、怎么让人长期宅家
{'https://news.cnblogs.com/n/694649/'}
<Element 'a' href='/n/694642/' target='_blank'>
快看有你的机型没？华为大批机型开启鸿蒙 2.0消费者尝鲜
{'https://news.cnblogs.com/n/694642/'}
<Element 'a' href='/n/694614/' target='_blank'>
2021款理想ONE发布：续航突破1000km，辅助驾驶大升级，理想的「理想」还有多远？
{'https://news.cnblogs.com/n/694614/'}
<Element 'a' href='/n/694590/' target='_blank'>
互联网行业，再卷就卷没了…
{'https://news.cnblogs.com/n/694590/'}
<Element 'a' href='/n/694535/' target='_blank'>
鸿蒙手机操作系统6月2日见！华为EMUI微博更名HarmonyOS
{'https://news.cnblogs.com/n/694535/'}

Example 3:

from requests_html import HTMLSession

session = HTMLSession()

def parse():
    r = session.get('http://www.qdaily.com/')
    # Print the text of two blocks on the home page, located by XPath
    print(r.html.xpath('/html/body/div[2]/div[1]/div[2]/div/div[3]')[0].text)
    print(r.html.xpath('/html/body/div[2]/div[1]/div[2]/div/div[2]')[0].text)
    # Collect the tag, image, title and publication time of each home-page article
    for x in r.html.find('.packery-item'):
        yield {
            'tag': x.find('.category')[0].text,
            'image': x.find('.lazyload')[0].attrs['data-src'],
            'title': x.find('.smart-dotdotdot')[0].text if x.find('.smart-dotdotdot') else x.find('.smart-lines')[0].text,
            'addtime': x.find('.smart-date')[0].attrs['data-origindate'][:-6]
        }
def main():
    for x in parse():
        print(x)
main()

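With just a few lines of code we can scrape every article on the home page. The methods used in this example are described below.

find() is called here with two arguments:

The first argument is a CSS selector, such as a class name or an ID
With the second argument first=True, only the first match is returned

text returns the element's text content
attrs returns the element's attributes as a dictionary, for example: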

{'class': ('smart-date',), 'data-origindate': '2018-11-02 10:27:10 +0800'}

html returns the element's HTML content
XPath selectors are also supported and are just as easy to use:

r.html.xpath('/html/body/div[2]/div[1]/div[2]/div/div[3]')[0].text

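The XPath expression can be obtained with the Copy XPath option in the browser's developer tools. The expression above evaluates to: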
'登录\n登录查看你的好奇心指数'
Result:

登录
登录查看你的好奇心指数

{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/20201112192124OZTGPbJDB1rNUjVk.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：拼多多年活跃买家接近阿里；三星发布新款移动处理器，抢占 5G 芯片份额', 'addtime': '2020-11-13 10:36:44'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/20201111155800kE3GdO71iXhnuDTs.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：天猫“双十一”四天的成交额达到 4982 亿；宝马公布纯电动计划，两年后发新车', 'addtime': '2020-11-12 16:22:24'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/202011101810550pmc2EUtTqNxazhC.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：中国拟将《反垄断法》的基本制度适用于互联网平台；苹果发布三款自研芯片 Mac 产品，主打长续航', 'addtime': '2020-11-11 09:57:56'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/20201109150456fywOSTt5X3qC7egY.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：食品价格继续跌，中国十月物价指数涨幅收窄至 0.5%；受辉瑞新疫苗刺激，资本市场风险偏好聚集', 'addtime': '2020-11-10 11:09:26'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/20201109073657apXhW6vHMEixQCBZ.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：恒大终止借壳重组，回 A 股计划失败；拜登宣布胜选，征税策略引发关注', 'addtime': '2020-11-09 14:15:44'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/20201105134913Z2gJwVe0ivAHNMfy.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：美股持续大涨，新能源汽车又成亮点；快手上半年收入 253 亿元，每天 3 亿人用', 'addtime': '2020-11-06 15:20:55'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/20201104132407Fnot8v4ipEQfkZyu.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：高通可能恢复对华为供货；网易严选说要“退出”双 11', 'addtime': '2020-11-05 09:53:00'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/20201103161413OkY2ZQbtuHmv0Pgf.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：蚂蚁两地上市暂缓，重启时间未定；美国总统选举的投票环节正式开始', 'addtime': '2020-11-04 10:20:38'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/20201102162643RU5AzNDbTqjcI9aX.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：蚂蚁上市前夕，央行拟增加小贷公司放贷限制；苹果下周开发布会，可能发布自研芯片 MacBook', 'addtime': '2020-11-03 13:44:17'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/20201102071245H0tpxi9gMklqjOEy.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：《英雄联盟》S10 总决赛落幕；恒大出售广汇能源股权，账面盈利 3.5 亿', 'addtime': '2020-11-02 11:11:25'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/20201029173255GvCTwouqhIr69F0m.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：科技巨头发布财报，苹果和 Google 遭遇两重天；星巴克同店销售额仍同比下降，但降幅收窄', 'addtime': '2020-10-30 13:30:58'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/202010290656548LeztXrnxwvCfGFJ.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：波音继续裁员，737 Max 复飞在即也无济于事；随着欧洲疫情加深，全球股市跌宕起伏', 'addtime': '2020-10-29 10:13:18'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/20201027112507LHE0ovROIuN3eXy2.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：SpacX 开始公测卫星上网，月费 99 美元；LV 和 Tiffany 重启并购谈判', 'addtime': '2020-10-28 11:49:11'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/20201026214059nu0cJjqL9dHEkZbT.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：前三季度工业利润降幅收窄，外资率先转正；各手机浏览器将设总编辑负责制', 'addtime': '2020-10-27 10:32:12'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/20201025195814KWd65nZgCLOE3XcH.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：三星集团会长李健熙去世，继任者未定；喀什现有 138 例新冠感染者', 'addtime': '2020-10-26 11:47:07'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/20201022214319VLhoFvGT7sZPcNyn.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：可口可乐营收下滑，跌幅环比收窄；奢侈品品牌营收回暖，中国需求强劲', 'addtime': '2020-10-23 12:08:40'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/20201021230033gvo5NT0nQxRdZ37t.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：阿里将认购蚂蚁 20% 新股；特斯拉季度利润创纪录，上海工厂越发关键', 'addtime': '2020-10-22 11:14:35'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/20201021101320w2F0yqkifncNDHKY.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：国泰航空裁员，停运旗下港龙航空；美国司法部对 Google 提起反垄断诉讼', 'addtime': '2020-10-21 10:23:14'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/20201019204831RfeL1Mt2CTwzN6ZQ.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：蚂蚁港股 IPO 获证监会批准，A 股仍在等；英特尔 90 亿美元出售闪存业务', 'addtime': '2020-10-20 11:33:05'}
{'tag': '商业', 'image': 'http://img.qdaily.com/article/article_show/20201019070336WFdwQyt7hqCS0XOx.jpg?imageMogr2/auto-orient/thumbnail/!500x185r/gravity/Center/crop/500x185/ignore-error/1', 'title': '大公司头条：阿里巴巴拟 280 亿港元增持高鑫零售；中国第三季度经济增长 4.9%', 'addtime': '2020-10-19 15:35:18'}
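
Example 3 trims the timezone from data-origindate by slicing off the last six characters. As an alternative sketch, the same string can be parsed with the standard library's datetime, which keeps the UTC offset available. The sample value is the one from the attrs dictionary shown earlier; parse_origindate is a hypothetical helper name, and the format string assumes every value follows that same layout.

```python
from datetime import datetime


def parse_origindate(value):
    """Parse a 'YYYY-MM-DD HH:MM:SS +ZZZZ' string into a timezone-aware datetime."""
    return datetime.strptime(value, '%Y-%m-%d %H:%M:%S %z')


raw = '2018-11-02 10:27:10 +0800'  # sample attrs value shown earlier
parsed = parse_origindate(raw)

# Formatting without the offset gives the same text as raw[:-6]
print(parsed.strftime('%Y-%m-%d %H:%M:%S'))
print(parsed.utcoffset())
```

Parsing is more tolerant of a change in offset width than fixed slicing, and the result can be compared or sorted as a real timestamp.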

IV. Element object methods

r = session.get('https://github.com/')
htmlObj = r.html
htmlObj.xpath('a',first=True)

Output:

<Element 'a' class=('btn', 'ml-2') href='https://help.github.com/articles/supported-browsers'>

V. JavaScript support

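JavaScript support is, in my view, the most impressive part of the author's update. The first time render is executed, however, it needs to download Chromium, which is then used to run the JavaScript code.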

1. Using render

from requests_html import HTMLSession

session = HTMLSession()

url = 'http://www.win4000.com/'

obj = session.get(url)

obj.encoding = 'utf-8'

obj.html.render()

Note: the first time you run the render() method, it downloads Chromium into your home directory (for example ~/.pyppeteer/). This only happens once.

2. Problems downloading Chromium

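Because Chromium is downloaded from a site outside China, the download was extremely slow, reaching only 3% after several minutes. The workaround is to download it from a mirror in China, which takes the following steps:

- Download Chromium manually
  First download the version you need from a domestic mirror: https://npm.taobao.org/mirrors/chromium-browser-snapshots/

After downloading, unzip the archive.
Go to the pyppeteer package under the Python installation directory, D:\Python36\Lib\site-packages\pyppeteer, and open the chromium_downloader.py file.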

from pyppeteer import __chromium_revision__, __pyppeteer_home__
import os
from pathlib import Path
DOWNLOADS_FOLDER = Path(__pyppeteer_home__) / 'local-chromium'
REVISION = os.environ.get('PYPPETEER_CHROMIUM_REVISION', __chromium_revision__)
chromiumExecutable = {
    'linux': DOWNLOADS_FOLDER / REVISION / 'chrome-linux' / 'chrome',
    'mac': (DOWNLOADS_FOLDER / REVISION / 'chrome-mac' / 'Chromium.app' /
            'Contents' / 'MacOS' / 'Chromium'),
    'win32': DOWNLOADS_FOLDER / REVISION / 'chrome-win32' / 'chrome.exe',
    'win64': DOWNLOADS_FOLDER / REVISION / 'chrome-win32' / 'chrome.exe',
}
# Printing these values shows exactly where the browser executable is expected
print(DOWNLOADS_FOLDER)
print(REVISION)
print(chromiumExecutable['win64'])

Output:

C:\Users\yuhai\AppData\Local\pyppeteer\pyppeteer\local-chromium
588429
C:\Users\yuhai\AppData\Local\pyppeteer\pyppeteer\local-chromium\588429\chrome-win32\chrome.exe

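Extract all files from the archive and place them in this path: C:\Users\yuhai\AppData\Local\pyppeteer\pyppeteer\local-chromium\588429\chrome-win32\

On my machine only C:\Users\yuhai\AppData\Local\pyppeteer\pyppeteer existed, so I created local-chromium\588429\chrome-win32 under it and then moved the files in. The directory name 588429 may change as versions are updated, so watch out for that.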
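
The directory creation step described above can also be scripted. The following is a minimal sketch using pathlib; chromium_dir is a hypothetical helper, the revision and folder name are the ones from the output shown earlier, and the demonstration runs inside a temporary directory so that nothing on the real system is touched.

```python
from pathlib import Path
import tempfile


def chromium_dir(pyppeteer_home, revision, folder='chrome-win32'):
    """Create (if needed) and return the folder where Chromium is expected."""
    target = Path(pyppeteer_home) / 'local-chromium' / str(revision) / folder
    target.mkdir(parents=True, exist_ok=True)
    return target


# In practice pyppeteer_home would be the pyppeteer directory found on your machine
with tempfile.TemporaryDirectory() as home:
    target = chromium_dir(home, '588429')
    created = target.is_dir()
    relative = target.relative_to(home).as_posix()

print(created)
print(relative)
```

The extracted Chromium files would then be moved into the returned folder.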
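The render function also takes several parameters, briefly described here (some have default values; see the parameter list in the source code):

retries: number of times to retry loading the page
script: JavaScript to execute on the page (optional)
wait: number of seconds to wait before loading the page, to prevent timeouts (optional)
scrolldown: number of times to scroll the page down
sleep: number of seconds to wait after the initial render
reload: if False, the content is loaded from memory rather than from the browser
keep_page: if True, lets you interact with the browser page through r.html.page
For example, the article list on a Jianshu user page is loaded asynchronously: initially only the most recent articles are shown. To scrape all of them, use scrolldown together with sleep to simulate scrolling down the page so that the JavaScript loads every article.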
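
The scrolldown and sleep combination can be sketched as a small helper. The function below takes the r.html object as a parameter, so it does not import anything itself; load_all_items is a hypothetical name, and the default of 10 scrolls with a 1 second pause is an arbitrary assumption that would need tuning for a real page.

```python
def load_all_items(html, selector, scrolls=10, pause=1.0):
    """Render a page, scrolling down so that lazily loaded items appear.

    html is expected to be the r.html object of a requests-html response;
    selector is a CSS selector for the items to collect.
    """
    # scrolldown = number of scroll steps, sleep = pause in seconds
    html.render(scrolldown=scrolls, sleep=pause)
    return html.find(selector)


# Hypothetical usage (needs network access and a Chromium download);
# the URL is a placeholder, not a real user page:
#   from requests_html import HTMLSession
#   r = HTMLSession().get('https://www.jianshu.com/u/<user-id>')
#   links = load_all_items(r.html, 'a')
```

If the list is long, more scroll steps or a longer pause may be needed before all items are present.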

VI. Custom User-Agent

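Some websites use the User-Agent header to identify the client type, and sometimes the UA has to be spoofed to make certain operations work. The documentation shows that many request methods on HTMLSession accept an extra **kwargs parameter, which passes additional arguments to the underlying request. Let's first send a plain request and look at what the site returns.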

from requests_html import HTMLSession
# pprint prints the data in a more readable layout
from pprint import pprint
import json
get_url = 'http://httpbin.org/get'

session = HTMLSession()

# Returns the headers that were sent by default
res = session.get(get_url)
pprint(json.loads(res.html.html))
print('*'*20)
# The user-agent can be replaced when sending the request
ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0'
res = session.get(get_url, headers={'user-agent': ua})
pprint(json.loads(res.html.html))
# Other header fields can be changed in the same way if needed.

VII. Simulating form submission (POST)

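HTMLSession wraps a full set of HTTP methods, including get, post and delete, each corresponding to the HTTP method of the same name.

import json
from requests_html import HTMLSession

session = HTMLSession()
# Form login
r = session.post('http://httpbin.org/post', data={'username': 'tank_jam', 'password': 'tank9527'})
print(json.loads(r.html.html))
''' # Printed result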
{'args': {},
 'data': '',
 'files': {},
 'form': {'password': 'tank9527', 'username': 'tank_jam'},
 'headers': {'Accept': '*/*',
             'Accept-Encoding': 'gzip, deflate',
             'Content-Length': '35',
             'Content-Type': 'application/x-www-form-urlencoded',
             'Host': 'httpbin.org',
             'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) '
                           'AppleWebKit/603.3.8 (KHTML, like Gecko) '
                           'Version/10.1.2 Safari/603.3.8'},
 'json': None,
 'origin': '112.65.61.109, 112.65.61.109',
 'url': 'https://httpbin.org/post'}
'''
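
The printed result reports a Content-Type of application/x-www-form-urlencoded and a Content-Length of 35. As a rough illustration of what that encoding looks like, the sketch below encodes the same form data with the standard library's urlencode. It shows the format only; it is not the code path the library itself uses.

```python
from urllib.parse import urlencode

form = {'username': 'tank_jam', 'password': 'tank9527'}

# application/x-www-form-urlencoded body: key=value pairs joined by '&'
body = urlencode(form)
print(body)
print(len(body.encode('utf-8')))  # number of bytes in the body
```

The byte count of this body is 35, which matches the Content-Length in the response.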

VIII. Async support

requests-html has asynchronous (async) requests built in, which can make our scrapers more efficient.

from requests_html import AsyncHTMLSession
from requests_html import HTMLSession
import time

# Send requests asynchronously
async_session = AsyncHTMLSession()


async def get_baidu():
    url = 'https://www.baidu.com/'
    res = await async_session.get(url)
    print(res.html.absolute_links)


async def get_sougou():
    url = 'https://www.sogou.com/'
    res = await async_session.get(url)
    print(res.html.links)


start_time = time.time()
async_session.run(get_baidu, get_sougou)
print('Elapsed:', time.time() - start_time)


# Send requests synchronously
session = HTMLSession()

start_time = time.time()
res = session.get('https://www.baidu.com/')
print(res.html.links)
res = session.get('https://www.sogou.com/')
print(res.html.absolute_links)
print('Elapsed:', time.time() - start_time)
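
To see why the asynchronous version can finish sooner without depending on network conditions, here is a self-contained sketch that uses plain asyncio, with asyncio.sleep standing in for the time spent waiting on a response. The function names and the 0.2 second delays are assumptions made for the illustration, and asyncio.run requires Python 3.7 or later.

```python
import asyncio
import time


async def fake_request(name, delay):
    """Stand-in for a network request: just waits for `delay` seconds."""
    await asyncio.sleep(delay)
    return name


async def fetch_concurrently():
    # Both coroutines wait at the same time, so the total is about one delay
    return await asyncio.gather(fake_request('baidu', 0.2),
                                fake_request('sogou', 0.2))


async def fetch_sequentially():
    # Each request finishes before the next starts, so the delays add up
    return [await fake_request('baidu', 0.2), await fake_request('sogou', 0.2)]


def timed(coro_func):
    """Run a coroutine function to completion and return (result, seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_func())
    return result, time.perf_counter() - start


concurrent_result, concurrent_time = timed(fetch_concurrently)
sequential_result, sequential_time = timed(fetch_sequentially)
print('concurrent:', concurrent_result, round(concurrent_time, 2))
print('sequential:', sequential_result, round(sequential_time, 2))
```

The gain comes from overlapping the waiting time, so it grows with the number of requests and with how slow each response is.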