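What BeautifulSoup does:
It converts a web page (unstructured content) into structured content
.text returns the text content of a bs object (with the HTML tags removed)
Now define a new string: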
html_sample = '
Hello World!
Convert the string into a bs object:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_sample)
![](https://img.haomeiwen.com/i6661013/67da7e971483533a.png)
How to get rid of the warning:
Cause: no parser was specified
Fix:
soup = BeautifulSoup(html_sample, 'html.parser')
![](https://img.haomeiwen.com/i6661013/11e1b76e600e32be.png)
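Since the markup of the sample string is not visible above, here is a minimal sketch using a stand-in document. The tag layout, the `id` and `class` values, and the `href` targets in `html_sample` are assumptions picked to match the selectors used in this article; they are not the original sample.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the sample page (assumed markup, not the original).
html_sample = """
<html>
  <body>
    <h1 id="title">Hello World!</h1>
    <a href="#" class="link">This is link1</a>
    <a href="#link2" class="link">This is link2</a>
  </body>
</html>
"""

# Naming the parser explicitly avoids the "no parser was explicitly specified" warning.
soup = BeautifulSoup(html_sample, "html.parser")

# .text drops the tags and keeps only the text between them.
page_text = soup.text
print(page_text)
```

`html.parser` ships with Python, so this should run with only the `bs4` package installed.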
Finding specific elements:
# Use select to find the elements with the h1 tag
alink = soup.select('h1')
print(alink)
![](https://img.haomeiwen.com/i6661013/586fa0ee66a8995f.png)
print(alink[0])
The output is the element itself, without the square brackets of the list
![](https://img.haomeiwen.com/i6661013/aeda8c3bc88038b8.png)
print(alink[0].text)
This outputs the text inside the tag
![](https://img.haomeiwen.com/i6661013/86081bd63221f657.png)
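The three print calls above can be put side by side in one small sketch. The one-line `html` string here is an assumed stand-in for the sample page, and the variable names `headings` and `first` are only for illustration.

```python
from bs4 import BeautifulSoup

# Assumed markup: a single h1 element.
html = '<h1 id="title">Hello World!</h1>'
soup = BeautifulSoup(html, "html.parser")

headings = soup.select("h1")  # list-like collection of matching tags
first = headings[0]           # a single Tag object

print(headings)    # the whole collection, shown with square brackets
print(first)       # just the element, tags included
print(first.text)  # only the text inside the tag
```

When only the first match is needed, `soup.select_one("h1")` returns the tag directly (or `None` if nothing matches), which saves the `[0]` indexing.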
# Use select to find the elements with the a tag
alink = soup.select('a')
print(alink)
![](https://img.haomeiwen.com/i6661013/fa99117f7ca34ec9.png)
# Use select to find the element with id="title" (prefix the id with #); this works much like CSS selector syntax
alink = soup.select('#title')
print(alink)
![](https://img.haomeiwen.com/i6661013/e6d41d63defca532.png)
# Use select to find the elements with class="link" (prefix the class with .)
alink = soup.select('.link')
print(alink)
![](https://img.haomeiwen.com/i6661013/ae2e4be7e4776795.png)
We can see that several results are returned, stored as a list
Hopefully this makes things clearer:
![](https://img.haomeiwen.com/i6661013/a3a0ec5d5864141e.png)
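To compare the id and class selectors directly, here is a rough sketch. The markup is an assumption (including the extra link without a class, added only to show the difference), and `a.link` is an additional CSS-style selector that combines a tag name with a class.

```python
from bs4 import BeautifulSoup

# Assumed markup; the last link has no class on purpose.
html = """
<h1 id="title">Hello World!</h1>
<a href="#" class="link">This is link1</a>
<a href="#link2" class="link">This is link2</a>
<a href="#other">Unclassified link</a>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("#title")            # '#' selects by id
by_class = soup.select(".link")          # '.' selects by class
by_tag_and_class = soup.select("a.link") # a tags that also carry class "link"
all_links = soup.select("a")             # every a tag, with or without a class

print(by_id)
print(by_class)
print(len(all_links), len(by_tag_and_class))
```

Each call returns a list, even when only one element matches, which is why the id lookup still needs `[0]` to reach the tag.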
Looping over the results:
for link in alink:
    print(link)
![](https://img.haomeiwen.com/i6661013/54c1d29c0e1ca115.png)
Using .text
![](https://img.haomeiwen.com/i6661013/52a2c4eee14c8335.png)
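The loop and `.text` can be combined to print only the link labels. This is a small sketch over assumed markup; the list `texts` is a hypothetical name used to collect the results.

```python
from bs4 import BeautifulSoup

# Assumed markup: two links sharing the class "link".
html = """
<a href="#" class="link">This is link1</a>
<a href="#link2" class="link">This is link2</a>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.select(".link")

# Print the text of each link, without the surrounding tags.
for link in links:
    print(link.text)

# The same idea as a list comprehension, handy when the texts are needed later.
texts = [link.text for link in links]
```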
Getting the href attribute of the a tags:
for link in alink:
    print(link['href'])
![](https://img.haomeiwen.com/i6661013/87fd5075e32dffb8.png)
BeautifulSoup wraps attributes such as href in a dictionary, so their values are easy to retrieve!
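The dictionary behind a tag is available as `.attrs`. The sketch below, on assumed markup, shows it next to the two usual ways of reading one attribute: square brackets raise `KeyError` when the attribute is missing, while `.get()` returns `None`.

```python
from bs4 import BeautifulSoup

# Assumed markup: one link with an href and a class.
html = '<a href="#" class="link">This is link1</a>'
soup = BeautifulSoup(html, "html.parser")

link = soup.select("a")[0]

print(link.attrs)         # all attributes of the tag as a dictionary
print(link["href"])       # dictionary-style access to a single attribute
print(link.get("title"))  # missing attribute: .get() gives None instead of raising

href = link["href"]
missing = link.get("title")
classes = link.attrs["class"]  # class is multi-valued, so it comes back as a list
```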
Getting attribute values:
html_sample2 = ' hello world! hello world2!'
soup2 = BeautifulSoup(html_sample2,'html.parser')
print(soup2.select('a')[0]['id'])
print(soup2.select('a')[1]['id'])
![](https://img.haomeiwen.com/i6661013/ce69d03afecaa053.png)
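The tags inside `html_sample2` are not visible above, so the following is only a guess at its shape: two a tags, each with an `id`. The id values `link1` and `link2` and the `href` values are assumptions made up for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical reconstruction: the ids and hrefs are assumed, only the texts come from the article.
html_sample2 = (
    '<a href="#" id="link1">hello world!</a> '
    '<a href="#" id="link2">hello world2!</a>'
)
soup2 = BeautifulSoup(html_sample2, "html.parser")

anchors = soup2.select("a")
print(anchors[0]["id"])
print(anchors[1]["id"])

# Collect every id in one pass.
ids = [a["id"] for a in anchors]
```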
Note: this is an original article; when reposting, please cite this article's address!
Author QQ: 1099718640
CSDN blog homepage: http://blog.csdn.net/dyboy2017
GitHub open-source project: https://github.com/dyboy2017/spider