HTML parsing with BeautifulSoup

Author: 非鱼2018 | Published: 2019-11-30 11:58

BeautifulSoup is a widely used HTML parsing library.
Installation: pip install beautifulsoup4
Four commonly used parsers:
1. html.parser (built into Python)
2. xml (provided by lxml)
3. lxml
4. html5lib
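Since lxml and html5lib are separate installs, a script may want to fall back to whichever parser is available. The following is a minimal sketch, not part of the original post; the helper name parse_with_fallback and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

html = "<p>Hello<b>world"  # deliberately unclosed tags

def parse_with_fallback(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first installed parser."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("no usable parser found")

soup, used = parse_with_fallback(html)
print(used)               # name of the parser that was actually used
print(soup.b.get_text())  # text inside the <b> tag
```

html.parser ships with Python, so the loop should always find at least one parser.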

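from bs4 import BeautifulSoup as BS
# Pass an open file (or the markup itself); a bare file name would be parsed as text
with open('result.html') as f:
    soup = BS(f, 'lxml')
# prettify() returns a string, so do not assign it back to soup
print(soup.prettify())
print(soup.title.string)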

DOM operations
p.contents, p.children, p.next_sibling, p.previous_sibling, p.parent, p.descendants, p.parents (the ancestors), and so on

print(soup.body.contents)
for i in soup.div.children:
    print(i)
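To make the navigation attributes concrete, here is a small self-contained sketch on an inline document. The markup and variable names are invented for illustration; html.parser is used so nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

html = "<div><p id='first'>one</p><p id='second'>two <b>bold</b></p></div>"
soup = BeautifulSoup(html, "html.parser")

div = soup.div
first = soup.find(id="first")

# .children yields direct children only
child_ids = [child["id"] for child in div.children]
# .descendants walks every level; text nodes have name None, so filter them out
descendant_tags = [d.name for d in div.descendants if d.name]
# .parent is the enclosing tag, .next_sibling the next node at the same level
parent_name = first.parent.name
sibling_id = first.next_sibling["id"]
# .parents walks upward to the document root
ancestor_names = [p.name for p in soup.b.parents]

print(child_ids, descendant_tags, parent_name, sibling_id, ancestor_names)
```

In real pages the whitespace between tags shows up as text nodes, so .next_sibling often returns a string such as "\n" rather than the next tag; find_next_sibling() skips those.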

Getting attributes and text

print(soup.a.get_text())
print(soup.a.attrs['href'])
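A short runnable sketch of reading text and attributes follows. The link markup is made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<a href="https://example.com/page" class="nav main">Go <b>there</b></a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

text = link.get_text()       # all text inside the tag, nested tags included
href = link["href"]          # shorthand for link.attrs["href"]
missing = link.get("title")  # .get() returns None instead of raising KeyError
classes = link["class"]      # class is multi-valued, so this is a list

print(text, href, missing, classes)
```

Using .get() is usually the safer choice when an attribute may be absent on some tags.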

The most commonly used feature is the CSS selector: soup.select()
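The original gives no example for select(), so here is a minimal sketch on invented markup. select() returns a list of all matches, while select_one() returns the first match or None.

```python
from bs4 import BeautifulSoup

html = """
<div id="menu">
  <a class="item" href="/a">A</a>
  <a class="item external" href="https://example.com/b" target="_blank">B</a>
</div>
<p class="item">not a link</p>
"""
soup = BeautifulSoup(html, "html.parser")

# id + descendant + class selector
items = soup.select("#menu a.item")
# attribute selector
externals = soup.select('a[target="_blank"]')
# first match only, or None when nothing matches
first = soup.select_one("a")
nothing = soup.select_one("#missing")

print([a["href"] for a in items])
print([a.get_text() for a in externals])
```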

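Modifying an HTML file with BeautifulSoup

with open('result.html') as f:
    soup = BS(f, 'lxml')
print(soup.title.string)
# select_one() returns the first match (or None); select() returns a list
tag = soup.select_one('#id')
# CSS declarations use "property: value"; min-width is assumed to be the intended property
tag['style'] = 'min-width: 100px'
# The with block closes the file, so no explicit f.close() is needed
with open('result.html', 'w') as f:
    f.write(soup.prettify())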
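The same kind of modification can be tried in memory, without touching a file. This is a sketch under assumptions: the element id "box" and the style value are invented for illustration.

```python
from bs4 import BeautifulSoup

html = '<div id="box">content</div>'
soup = BeautifulSoup(html, "html.parser")

box = soup.select_one("#box")
if box is not None:                    # guard against a selector that matches nothing
    box["style"] = "min-width: 100px"  # add or overwrite an attribute
    box.string = "updated"             # replace the tag's text content

result = str(soup)  # serialize the modified tree back to markup
print(result)
```

Checking for None first avoids an exception when the selector matches nothing.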

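Parsing only part of a document

from bs4 import BeautifulSoup as BS, SoupStrainer
# Keep only <a> tags while parsing; html_doc is a string holding the markup
only_a_tags = SoupStrainer("a")

print(BS(html_doc, "html.parser", parse_only=only_a_tags).prettify())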
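Because html_doc is not defined in the snippet above, here is a self-contained sketch with an invented document. Note that parse_only is not supported by the html5lib parser.

```python
from bs4 import BeautifulSoup, SoupStrainer

html_doc = """
<html><head><title>Demo</title></head>
<body><p>Intro</p><a href="/one">One</a><p><a href="/two">Two</a></p></body></html>
"""

only_a_tags = SoupStrainer("a")
soup = BeautifulSoup(html_doc, "html.parser", parse_only=only_a_tags)

hrefs = [a["href"] for a in soup.find_all("a")]
has_title = soup.title is not None  # everything except <a> tags was discarded

print(hrefs, has_title)
```

Filtering at parse time can save memory and time on large pages when only one kind of tag is needed.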

Using the find_all method

Common methods: find() and find_all()

# Print the text of the first link (index 0)
print(soup.find_all('a')[0].text)
# Find all elements with the attribute {'target':'_blank'}
print(soup.find_all(attrs={'target':'_blank'}))

soup = BS(html_doc, 'lxml')
links = soup.find_all('a')
for link in links:
    print(link['href'])
# Find several kinds of tags at once
links_ps = soup.find_all(['a', 'p'])
print(links_ps)
# Pass a function to find_all as the filter
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

links_ps = soup.find_all(has_class_but_no_id)
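The filters above can be checked against a small document. This is a sketch with invented markup; findAll is the older spelling of find_all, and the newer name is used here.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="intro">Intro</p>
<p class="note" id="n1">Note</p>
<a href="/x" target="_blank">X</a>
<a href="/y">Y</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

def has_class_but_no_id(tag):
    # Called once per tag; the tag is kept when this returns True
    return tag.has_attr("class") and not tag.has_attr("id")

matched = soup.find_all(has_class_but_no_id)      # function filter
blank = soup.find_all(attrs={"target": "_blank"})  # attribute filter
both = soup.find_all(["a", "p"])                   # list of tag names
first_link = soup.find("a")                        # first match only, or None

print(matched, blank, len(both), first_link.text)
```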

Article link: https://www.haomeiwen.com/subject/cotuectx.html
