Chapter 2: Complex HTML Parsing

Author: VB过得VB | Published: 2017-02-04 22:12

1. Getting the content of specific tags

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, 'lxml')
# Collect every <span class="green"> tag on the page
namelist = bsObj.find_all('span', {'class': 'green'})
for name in namelist:
    # get_text() strips the markup and returns only the text content
    print(name.get_text())
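
The same lookup can be tried without a network connection. The following is a minimal sketch, assuming a small hand-written HTML string (`SAMPLE_HTML` is a hypothetical stand-in for the real page) and the built-in `html.parser` backend instead of `lxml`. It also shows that the attribute-dictionary form and the `class_` keyword select the same tags.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the remote page, so the example runs offline.
SAMPLE_HTML = (
    '<p><span class="red">"Well, Prince,"</span> said '
    '<span class="green">Anna Pavlovna</span> to '
    '<span class="green">Prince Vasili</span>.</p>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Two equivalent ways to filter on the class attribute.
by_dict = soup.find_all("span", {"class": "green"})
by_keyword = soup.find_all("span", class_="green")

names = [tag.get_text() for tag in by_dict]
names_from_keyword = [tag.get_text() for tag in by_keyword]
print(names)
```

`class_` carries a trailing underscore because `class` is a reserved word in Python.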

2. Working with child tags

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, 'lxml')

# .children iterates over the direct children of the table only
for child in bsObj.find('table', {'id': 'giftList'}).children:
    print(child)
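
To make "direct children only" concrete, here is a small offline sketch. It assumes a hand-written table (`SAMPLE_HTML` is hypothetical, not the real page) and contrasts `.children` with `.descendants`, a related BeautifulSoup attribute that the text above does not cover and that walks the entire subtree.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical miniature of the gift table; no whitespace between tags,
# so the only non-tag nodes are the cell texts.
SAMPLE_HTML = (
    '<table id="giftList">'
    "<tr><th>Item</th><th>Cost</th></tr>"
    "<tr><td>Vegetable Basket</td><td>$15.00</td></tr>"
    "</table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", {"id": "giftList"})

# Direct children: just the two rows.
child_names = [c.name for c in table.children if isinstance(c, Tag)]

# All descendants: rows and the cells nested inside them, in document order.
descendant_names = [d.name for d in table.descendants if isinstance(d, Tag)]

print(child_names)
print(descendant_names)
```

The `isinstance(..., Tag)` check skips text nodes, which both iterators also yield.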

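3. Working with sibling tags

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, 'lxml')
# Start at the first row (the header) and walk every row that follows it
for sibling in bsObj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)
# previous_siblings: all preceding siblings
# next_siblings: all following siblings
# previous_sibling: the single preceding sibling
# next_sibling: the single following sibling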
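
The four sibling attributes can be compared side by side on a small offline example. This is only a sketch: `SAMPLE_HTML` and the row `id` values are made up for illustration and do not come from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical three-row table; the ids exist only to label the rows.
SAMPLE_HTML = (
    '<table id="giftList">'
    '<tr id="header"><th>Item</th></tr>'
    '<tr id="gift1"><td>Vegetable Basket</td></tr>'
    '<tr id="gift2"><td>Russian Nesting Dolls</td></tr>'
    "</table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
header = soup.find("table", {"id": "giftList"}).tr
last_row = soup.find("tr", {"id": "gift2"})

# Plural forms are iterators over every sibling in that direction.
after_header = [row["id"] for row in header.next_siblings]
before_last = [row["id"] for row in last_row.previous_siblings]  # nearest first

# Singular forms return one node, or None when there is no such sibling.
first_after_header = header.next_sibling["id"]
nothing_before_header = header.previous_sibling

print(after_header, before_last, first_after_header, nothing_before_header)
```

Note that a tag is never its own sibling, which is why the header row does not appear in `after_header`. On a real page the iterators also yield whitespace text nodes between tags.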

4. Working with parent tags

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, 'lxml')
# Find the image, go up to its <td>, step back one cell, and read its text
print(bsObj.find("img", {"src": "../img/gifts/img1.jpg"
                         }).parent.previous_sibling.get_text()
      )
---------------------------------------------------------------------------
# Printed output
$15.00
---------------------------------------------------------------------------
(1) Select the image tag with src="../img/gifts/img1.jpg";
(2) Select the parent of that image tag (a <td> tag in this example);
(3) Select the previous sibling of that <td> tag with previous_sibling (in this example, the <td> tag that contains the dollar price);
(4) Select the text inside that tag, "$15.00".
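
The four steps above can be written out one per line. The sketch below assumes a single hand-written table row (`SAMPLE_HTML` is a hypothetical stand-in for the real page) and uses the built-in `html.parser` so that it runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row version of the gift table, with no whitespace
# between cells so previous_sibling lands directly on the price cell.
SAMPLE_HTML = (
    '<table id="giftList"><tr>'
    "<td>Vegetable Basket</td>"
    "<td>$15.00</td>"
    '<td><img src="../img/gifts/img1.jpg"></td>'
    "</tr></table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

img = soup.find("img", {"src": "../img/gifts/img1.jpg"})  # step (1)
cell = img.parent                                         # step (2): the <td>
price_cell = cell.previous_sibling                        # step (3)
price = price_cell.get_text()                             # step (4)

print(price)
```

Splitting the chain into named steps also makes it easier to see which step returned `None` when a page does not have the expected structure.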

5. Regular expressions

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, 'lxml')
# Escape the literal dots so that "." does not match arbitrary characters
images = bsObj.find_all("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])
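
In a regular expression an unescaped dot matches any character, so whether the dots in the path are escaped changes what the pattern accepts. The sketch below uses only the standard `re` module and a made-up list of candidate strings to compare an unescaped pattern with an escaped one; it applies `search` directly rather than going through BeautifulSoup.

```python
import re

# Unescaped dots: each "." matches any single character.
loose = re.compile(r"../img/gifts/img.*.jpg")
# Escaped dots: "\." matches only a literal period.
strict = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Hypothetical src values, chosen to show the difference.
candidates = [
    "../img/gifts/img1.jpg",
    "../img/gifts/img6.jpg",
    "xx/img/gifts/img1xjpg",   # no real dots at all
    "../img/gifts/logo.jpg",   # file name does not start with "img"
]

loose_hits = [s for s in candidates if loose.search(s)]
strict_hits = [s for s in candidates if strict.search(s)]

print(loose_hits)
print(strict_hits)
```

The unescaped pattern also accepts the third string, because "xx" satisfies ".." and "x" satisfies the dot before "jpg"; the escaped pattern rejects it.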

Article link: https://www.haomeiwen.com/subject/pdjsittx.html
