Parsing Web Page Content

Author: 唯师默蓝 | Published: 2018-01-21 19:59

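First, in a Python 3.6 environment, parse the page source from a local file (pay attention to the encoding):
from bs4 import BeautifulSoup
# Open the saved page with an explicit encoding so non-ASCII text is decoded correctly.
with open("C:/Users/Administrator/Desktop/award/index.html", 'r', encoding='utf-8') as web_data:
    soup = BeautifulSoup(web_data, 'lxml')  # the 'lxml' parser requires the lxml package
print(soup)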
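
To illustrate why the encoding argument matters, here is a minimal sketch that writes a small UTF-8 page to a temporary file and parses it back. The page content is made up for the illustration, and the built-in html.parser is used instead of lxml only so that the example has no extra dependency beyond bs4.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical page content, used only for this illustration.
html = "<html><head><title>奖项 Awards</title></head><body><p>示例</p></body></html>"

# Write the page to a temporary file as UTF-8.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", encoding="utf-8", delete=False
) as f:
    f.write(html)
    path = f.name

# Read it back with the same encoding, then parse the open file handle.
with open(path, "r", encoding="utf-8") as web_data:
    soup = BeautifulSoup(web_data, "html.parser")

os.remove(path)  # clean up the temporary file

title_text = soup.title.get_text()
print(title_text)
```

If the file were opened with a different encoding than the one it was saved in, the Chinese characters would either be garbled or raise a decoding error, which is why the encoding is passed explicitly.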

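Scraping element content from a live page:
If select() fails with an error that mentions nth-of-type (older BeautifulSoup versions do not support nth-child), replace nth-child with nth-of-type in the selector copied from the browser. Change
titles = soup.select('#serOnline-container320 > div > div:nth-child(1) > a')
to
titles = soup.select('#serOnline-container320 > div > div:nth-of-type(1) > a')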
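
The two pseudo-classes are not always interchangeable, so it is worth checking the result after the swap. The sketch below uses made-up markup (the id box and the link texts are assumptions) in which the container's first child is an h2 rather than a div, and shows what nth-of-type(1) selects in that case.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first child of the container is an <h2>, not a <div>.
html = """
<div id="box">
  <h2>Heading</h2>
  <div><a href="/a">first</a></div>
  <div><a href="/b">second</a></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# nth-of-type(1) counts only siblings with the same tag name,
# so it matches the first <div> even though an <h2> precedes it.
by_type = soup.select("#box > div:nth-of-type(1) > a")
first_links = [a.get_text() for a in by_type]
print(first_links)
```

By contrast, div:nth-child(1) would require the div to be the very first child of its parent, so on markup like this the two selectors would not pick the same element.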

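Scraping titles:
Inspect the chosen title in the browser's developer tools to see which div, which class, and which tag it sits under, then select it with a selector of the form div.<class> > <tag>. The code is as follows:
from bs4 import BeautifulSoup
import requests
url = 'http://www.hsdjixie.com/'
web_data = requests.get(url)
soup = BeautifulSoup(web_data.text, 'lxml')
# Select every <a> that is a direct child of <div class="serOnline-list-v">.
titles = soup.select('div.serOnline-list-v > a')
print(titles)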
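
select() returns a list of Tag objects rather than plain strings. As a rough sketch of how the text and link target could be pulled out of each one, the example below runs the same selector against a small hand-written fragment. The product names and href values are invented for the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the selector used in the text.
html = """
<div class="serOnline-list-v"><a href="/product/1"> Excavator </a></div>
<div class="serOnline-list-v"><a href="/product/2">Loader</a></div>
"""
soup = BeautifulSoup(html, "html.parser")
titles = soup.select("div.serOnline-list-v > a")

# get_text(strip=True) returns the tag's text without surrounding whitespace;
# get("href") reads an attribute and returns None if it is missing.
links = [(a.get_text(strip=True), a.get("href")) for a in titles]
print(links)
```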

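Scraping images:
Inspect the chosen image in the browser to find the width or height set on its img tag, then select it with an attribute selector of the form img[width="90"]. The code is as follows:
# Select every <img> whose width attribute is exactly "90".
imgs = soup.select('img[width="90"]')
print(imgs)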
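
The attribute selector can be tried out on a small fragment before pointing it at the real page. In this sketch the image paths and the extra 200-pixel logo are assumptions added to show that only the images with width="90" are matched.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one wide logo and two 90-pixel thumbnails.
html = """
<img src="/img/logo.png" width="200">
<img src="/img/p1.jpg" width="90">
<img src="/img/p2.jpg" width="90">
"""
soup = BeautifulSoup(html, "html.parser")

# Only <img> tags whose width attribute equals "90" are selected.
imgs = soup.select('img[width="90"]')

# Read the src attribute of each matched image.
srcs = [img.get("src") for img in imgs]
print(srcs)
```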

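Scraping labels, such as image names or short text descriptions: again check which div contains them. That div is the parent element, and it is selected as div.<class>. The code is as follows:
cates = soup.select('div.parametersDiv')
print(cates)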

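Building dictionaries from the scraped data:
# Use cate (not cates) as the loop variable so the cates list is not overwritten.
for title, cate in zip(titles, cates):
    data = {
        'title': title.get_text(),
        'cate': list(cate.stripped_strings),  # all non-empty text fragments inside the div
    }
    print(data)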
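
Printing each dictionary is fine for checking the output, but collecting the records in a list makes them easier to reuse later. The following is a minimal sketch of that idea on invented markup. The product names and parameter values are assumptions, and only the two class names come from the text above.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: each title is followed by its parameters div.
html = """
<div class="serOnline-list-v"><a href="/p/1">Excavator</a></div>
<div class="parametersDiv"><span>Weight</span> <span>20 t</span></div>
<div class="serOnline-list-v"><a href="/p/2">Loader</a></div>
<div class="parametersDiv"><span>Weight</span> <span>5 t</span></div>
"""
soup = BeautifulSoup(html, "html.parser")
titles = soup.select("div.serOnline-list-v > a")
cates = soup.select("div.parametersDiv")

# zip() pairs the n-th title with the n-th parameters div;
# stripped_strings yields each text fragment with whitespace removed.
records = [
    {"title": title.get_text(), "cate": list(cate.stripped_strings)}
    for title, cate in zip(titles, cates)
]
print(records)
```

Note that zip() stops at the shorter list, so if a title on the real page has no matching parameters div, the pairs after it would silently shift out of alignment.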

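headers holds the request header fields that are sent to the website with each request:

headers = {
    'User-Agent': ' ',  # fill in your browser's User-Agent string
    'Cookie': ' ',      # fill in the session cookie, if the page needs one
}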
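
As a sketch of how such a dictionary might be attached to a request, the example below builds a prepared request with the requests library and inspects the headers without sending anything over the network. The User-Agent and Cookie values are placeholders, not real credentials.

```python
import requests

url = "http://www.hsdjixie.com/"

# Placeholder values; copy the real ones from the browser's developer tools.
headers = {
    "User-Agent": "Mozilla/5.0 (example placeholder)",
    "Cookie": "session=placeholder",
}

# Build the request without sending it, to inspect what would be sent.
prepared = requests.Request("GET", url, headers=headers).prepare()
sent_headers = dict(prepared.headers)
print(sent_headers)

# To actually fetch the page, pass the same dictionary to requests.get:
# web_data = requests.get(url, headers=headers, timeout=10)
```

Sending a browser-like User-Agent can help with sites that reject the default one used by requests, and the Cookie header carries a logged-in session when the page requires it.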
