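Beautiful Soup 4
I've been learning the Python modules for working with web pages, so let's take a look at this "beautiful soup" together.
Overview
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document tree.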
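To make "navigating, searching, and modifying" concrete, here is a minimal sketch. The HTML snippet is made up for illustration, and it uses Python's built-in 'html.parser' so it runs without any extra parser installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for this illustration.
html = '<div><p class="intro">Hello <b>world</b></p><p>Second</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# Navigate: attribute access walks down to the first matching tag.
bold = soup.div.p.b
# Search: find_all returns every matching tag in document order.
paragraphs = soup.find_all('p')
# Modify: assigning to .string replaces the tag's text content.
bold.string = 'soup'

print(bold.parent['class'])  # ['intro'] (class is a multi-valued attribute)
print(len(paragraphs))       # 2
print(soup.p.get_text())     # Hello soup
```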
Install Beautiful Soup:
$ easy_install beautifulsoup4
$ pip install beautifulsoup4
Install a parser:
$ easy_install lxml
$ pip install lxml
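lxml is optional: Beautiful Soup can also use the 'html.parser' that ships with Python. The sketch below prefers lxml and falls back when it is not installed. The helper name soup_with_best_parser is an assumption made for this example, not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def soup_with_best_parser(markup):
    """Parse with lxml when it is installed, else use the built-in parser.

    Hypothetical helper for illustration only.
    """
    try:
        return BeautifulSoup(markup, 'lxml')
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not available.
        return BeautifulSoup(markup, 'html.parser')


soup = soup_with_best_parser('<p>hi</p>')
print(soup.p.string)  # hi
```

The two parsers can build slightly different trees from invalid markup, so it is worth settling on one parser per project.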
Workflow:
1. Fetch the page with the requests library -> 2. Create a soup object with BeautifulSoup -> 3. Use bs4 to parse it and extract the content you need.
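The three steps can be sketched as three small functions. The function names, the timeout value, and the choice to extract links are all assumptions made for illustration. Steps 2 and 3 are run on an inline document so the sketch works without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Step 1: download the page and return its raw bytes."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    return response.content


def make_soup(html):
    """Step 2: build the soup object."""
    return BeautifulSoup(html, 'html.parser')


def extract_links(soup):
    """Step 3: pull out what you need, here every href on the page."""
    return [a['href'] for a in soup.find_all('a', href=True)]


# Steps 2 and 3 on a made-up inline document (no network needed).
sample = b'<a href="/a">A</a><a>no href</a><a href="/b">B</a>'
links = extract_links(make_soup(sample))
print(links)  # ['/a', '/b']
```

With network access, the full chain would be extract_links(make_soup(fetch_html(url))) for whatever URL you want to inspect.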
Example
#coding:utf-8
from bs4 import BeautifulSoup
doc = ['<html><head><title>Page title</title></head>',
'<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
'<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
'</html>']
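soup = BeautifulSoup(''.join(doc), 'lxml') # ''.join(doc) joins the list doc into one string; naming the parser avoids the "no parser was explicitly specified" warning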
print (soup.title)
print (soup.title.name)
print (soup.title.string)
print (soup.p)
print (soup.p['id'])
print (soup.find_all('p'))
print (soup.find_all(id = "secondpara" ))
print (soup.get_text())
Output:
C:\Python\Python36\python.exe D:/2.codes/PycharmProjects/PyReptilian/beautysoap.py
<title>Page title</title>
title
Page title
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
firstpara
[<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>, <p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>]
[<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>]
Page titleThis is paragraph one.This is paragraph two.
Process finished with exit code 0
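A few behaviours the output above does not show are easy to trip over: .string is None for a tag with more than one child, find returns None when nothing matches, and find_all returns an empty list. A small sketch on the same document (with the closing tags written out, and 'html.parser' assumed so no extra install is needed):

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>Page title</title></head>'
        '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.</p>'
        '<p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>'
        '</body></html>')
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('p')              # same tag as soup.p
print(first.string)                 # None: <p> has text, <b>, and more text
print(first.get_text())             # This is paragraph one.
print(soup.find(id='missing'))      # None
print(soup.find_all(id='missing'))  # []

# select() takes a CSS selector and always returns a list.
blah_ids = [p['id'] for p in soup.select('p[align="blah"]')]
print(blah_ids)                     # ['secondpara']
```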
An example that pulls information from a live web page (compare the results with the page's actual content):
#coding:utf-8
from bs4 import BeautifulSoup
import requests

class Html():
    def __init__(self):
        url = 'http://news.baidu.com/'
        html = requests.get(url).content  # fetch the home page HTML
        self.soup = BeautifulSoup(html, 'lxml')  # build the soup object

    def getTitle(self):
        # title = self.soup.title  # would return the whole tag: <title> </title>
        title = self.soup.title.string
        return title

    def getH2(self):
        h2 = self.soup.select("h2")  # list of <h2> tags; the results include the tags
        if not h2:
            # select() returns an empty list instead of raising when nothing matches
            return "no h2 found"
        if len(h2) > 1:
            # print("Uh-oh: %d h2 tags, bad for SEO" % len(h2))
            print("%d h2 tags in total" % len(h2))
        return h2

demo = Html()
print("Title: %s\n" % demo.getTitle())
print("h2:\n%s" % demo.getH2())
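The class above needs a live network connection and a page that has a title. As a rough offline variant, the sketch below takes the HTML as an argument and guards against a missing title. The function name summarize and the sample markup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def summarize(html):
    """Return the page title (or None) and the text of every <h2>.

    Hypothetical helper for illustration only.
    """
    soup = BeautifulSoup(html, 'html.parser')
    # soup.title is None when the page has no <title>, so guard before .string
    title = soup.title.string if soup.title else None
    # get_text(strip=True) drops the surrounding whitespace and the tags
    headings = [h.get_text(strip=True) for h in soup.select('h2')]
    return title, headings


print(summarize('<title>News</title><h2> Top </h2><h2>World</h2>'))
# ('News', ['Top', 'World'])
print(summarize('<p>no title, no headings</p>'))
# (None, [])
```

Passing the HTML in, rather than fetching it inside the function, keeps the parsing logic easy to try out on saved pages.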