Scraping jandan.net jokes with requests and XPath in Python 2.7

Author: suntio | Published 2017-11-26 19:22


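Here I scrape the jokes (duanzi) section of jandan.net. Some jokes on the site are blocked and shown behind a notice, so the scraper has to treat those entries separately.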

Handling blocked jokes

Everything else is handled in the usual way. The full code follows:

import requests
from lxml import etree

url = 'http://jandan.net/duan'

# Browser-like request headers
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Host': 'jandan.net',
    'Pragma': 'no-cache',
    'Referer': 'http://jandan.net/qa',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36',
}

html = requests.get(url, headers=headers)
html.encoding = 'utf-8'

root = etree.HTML(html.text)
# Each joke sits in a div with class "row"
result = root.xpath("//div[@class='row']")

for row in result:
    author = row.xpath(".//div[@class='author']/strong/text()")
    text = row.xpath(".//div[@class='text']")[0]
    if text.xpath("./p[@class='bad_content']"):
        # Blocked joke: the first <p> is the notice, the content is in the second <p>
        text = row.xpath(".//div[@class='text']/p[2]/text()")
    else:
        text = row.xpath(".//div[@class='text']/p/text()")
    print 'Author:', author[0], 'Content:', text[0]
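The blocked-joke branch can be tried without touching the network. The following is only a sketch: the sample markup is hypothetical, modelled on the XPath expressions in the code above rather than on the real page, and `extract_joke` is an illustrative helper name, not part of the original script. It also returns None for rows where the author or text is missing, instead of failing on an empty list.

```python
from lxml import etree

# Hypothetical sample markup, modelled only on the XPath expressions used
# in the article; the real page structure on jandan.net may differ.
SAMPLE_HTML = """
<div class="row">
  <div class="author"><strong>alice</strong></div>
  <div class="text"><p>A visible joke.</p></div>
</div>
<div class="row">
  <div class="author"><strong>bob</strong></div>
  <div class="text">
    <p class="bad_content">This post was hidden.</p>
    <p>The joke behind the notice.</p>
  </div>
</div>
"""


def extract_joke(row):
    """Return (author, text) for one row, or None if either part is missing."""
    authors = row.xpath(".//div[@class='author']/strong/text()")
    if row.xpath(".//div[@class='text']/p[@class='bad_content']"):
        # Blocked entry: skip the notice in the first <p> and read the second.
        texts = row.xpath(".//div[@class='text']/p[2]/text()")
    else:
        texts = row.xpath(".//div[@class='text']/p/text()")
    if not authors or not texts:
        return None
    return authors[0].strip(), texts[0].strip()


root = etree.HTML(SAMPLE_HTML)
jokes = [extract_joke(row) for row in root.xpath("//div[@class='row']")]
jokes = [joke for joke in jokes if joke is not None]
```

Keeping the per-row logic in one function makes it easy to check the XPath expressions against a saved copy of the page before running them against the live site.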

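In the code above, the XPath expression .//div[@class='author']/strong/text() means: starting from the current div with class row, find the descendant div with class author, then take its strong child element and return the text inside that tag.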
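The leading dot is what keeps the search inside the current row. A small sketch of the difference, using a hypothetical two-row document (the markup is an assumption made for illustration, not taken from the site):

```python
from lxml import etree

# Hypothetical two-row document; only the class names come from the article.
html = """
<div class="row"><div class="author"><strong>alice</strong></div></div>
<div class="row"><div class="author"><strong>bob</strong></div></div>
"""
rows = etree.HTML(html).xpath("//div[@class='row']")
second = rows[1]

# Relative: the leading dot restricts the search to descendants of `second`.
relative = second.xpath(".//div[@class='author']/strong/text()")

# Absolute: without the dot, // starts from the document root,
# so the authors of every row are returned.
absolute = second.xpath("//div[@class='author']/strong/text()")
```

Here `relative` holds only the author of the second row, while `absolute` holds the authors of both rows, which is why the loop uses the dotted form.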


Article link: https://www.haomeiwen.com/subject/ogdnvxtx.html
