Beautiful Soup 4

I have been learning the Python modules for working with web pages, so let's get to know this "beautiful soup" together.

Overview

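Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice and gives you idiomatic ways to navigate, search, and modify the parsed document.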
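To make "navigate, search, modify" a bit more concrete, here is a minimal sketch. The sample markup is made up for illustration, and it uses Python's built-in html.parser so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration
html = '<html><body><p class="intro">Hello <b>world</b></p><a href="/a">link</a></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Navigate: walk down the tree by tag name
bold_text = soup.body.p.b.string

# Search: find a tag by name and attribute
link = soup.find("a", href=True)
link_target = link["href"]

# Modify: change an attribute and a tag's text in place
link["href"] = "/b"
soup.p.b.string = "soup"

print(bold_text)
print(link_target)
print(soup.p)
```

The same soup object serves all three purposes, so reading and editing a document can be mixed freely.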

Install Beautiful Soup (either command works; pip is the usual choice, since easy_install is deprecated):

$ easy_install beautifulsoup4

$ pip install beautifulsoup4

Install a parser:

$ easy_install lxml

$ pip install lxml
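If you are not sure whether lxml is installed on a given machine, one possible approach is to fall back to the parser that ships with Python. This is only a sketch: make_soup is a helper name invented here, not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise use the built-in html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello</p>")
print(soup.p.string)
```

Keep in mind that different parsers can build slightly different trees from invalid HTML, so results may vary between machines when the fallback kicks in.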

Workflow:

1. Fetch the page with the requests library -> 2. Create a soup object with BeautifulSoup -> 3. Use bs4 to parse out the content you need.
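The three steps above can be sketched as three small functions. The function names, the 10-second timeout, and the choice of "collect every link" as the thing to extract are all assumptions made for this illustration. The demonstration at the bottom runs steps 2 and 3 on a hard-coded string so it works without a network connection.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url):
    """Step 1: download the page and return its raw bytes."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.content


def make_soup(html):
    """Step 2: build the soup object."""
    return BeautifulSoup(html, "html.parser")


def extract_links(soup):
    """Step 3: pull out what you need, here the target of every link."""
    return [a["href"] for a in soup.find_all("a", href=True)]


# Offline demonstration of steps 2 and 3 on made-up markup
sample = '<a href="/one">1</a><a>no href</a><a href="/two">2</a>'
links = extract_links(make_soup(sample))
print(links)
```

For a real page you would call extract_links(make_soup(fetch_html(url))) with a URL of your own.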

Example

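# coding: utf-8
from bs4 import BeautifulSoup

doc = ['<html><head><title>Page title</title></head>',
       '<body><p id="firstpara" align="center">This is paragraph <b>one</b>.',
       '<p id="secondpara" align="blah">This is paragraph <b>two</b>.',
       '</html>']
# ''.join(doc) joins the list into one string; naming the parser explicitly
# avoids the "No parser was explicitly specified" warning
soup = BeautifulSoup(''.join(doc), 'lxml')

print(soup.title)                        # the whole <title> tag
print(soup.title.name)                   # the tag's name
print(soup.title.string)                 # the text inside the tag
print(soup.p)                            # the first <p> tag
print(soup.p['id'])                      # an attribute of that tag
print(soup.find_all('p'))                # every <p> tag, as a list
print(soup.find_all(id="secondpara"))    # tags whose id is "secondpara"
print(soup.get_text())                   # all text, with the tags stripped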

Output:

C:\Python\Python36\python.exe D:/2.codes/PycharmProjects/PyReptilian/beautysoap.py
<title>Page title</title>
title
Page title
<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>
firstpara
[<p align="center" id="firstpara">This is paragraph <b>one</b>.</p>, <p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>]
[<p align="blah" id="secondpara">This is paragraph <b>two</b>.</p>]
Page titleThis is paragraph one.This is paragraph two.

Process finished with exit code 0
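Notice that get_text() runs the title and both paragraphs together on one line. If you would rather have one line per element, a possible approach is to collect the text tag by tag and join it yourself. The sketch below uses a properly closed version of the sample document and the built-in html.parser; both are choices made for this illustration.

```python
from bs4 import BeautifulSoup

html = ('<html><head><title>Page title</title></head><body>'
        '<p id="firstpara">This is paragraph <b>one</b>.</p>'
        '<p id="secondpara">This is paragraph <b>two</b>.</p>'
        '</body></html>')
soup = BeautifulSoup(html, "html.parser")

# get_text() on a single tag also includes the text of its children (<b> here)
paragraphs = [p.get_text() for p in soup.find_all("p")]

# Put the title first, then one paragraph per line
text = "\n".join([soup.title.string] + paragraphs)
print(text)

# find() returns only the first match, or None when nothing matches
second_bold = soup.find(id="secondpara").b.string
print(second_bold)
```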

An example that pulls information out of a live web page, based on the page's content:

# coding: utf-8
from bs4 import BeautifulSoup
import requests


class Html:
    def __init__(self):
        url = 'http://news.baidu.com/'
        html = requests.get(url, timeout=10).content  # fetch the home page's HTML as bytes
        self.soup = BeautifulSoup(html, 'lxml')  # build the soup object

    def getTitle(self):
        # self.soup.title would return the whole tag, <title>...</title>;
        # .string returns only the text inside it
        title = self.soup.title.string
        return title

    def getH2(self):
        # select() always returns a list (empty when nothing matches),
        # so test for emptiness instead of catching AttributeError
        h2 = self.soup.select("h2")  # each item still includes the <h2> tags
        if not h2:
            return "no h2 found"
        if len(h2) > 1:
            # print(''.join(["Uh-oh, ", str(len(h2)), " h2 tags, bad for SEO"]))  # list to str
            print("%d h2 tags in total" % len(h2))
        return h2


demo = Html()
print("Title: %s\n" % demo.getTitle())
print("h2:\n%s" % demo.getH2())
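Two things can go wrong on a real page: soup.title is None when the page has no title tag, and select() returns whole tags rather than plain text. Here is a minimal sketch of helpers that cope with both. The names get_title and get_heading_texts are invented for this illustration, and the sample markup is made up so the example runs offline.

```python
from bs4 import BeautifulSoup


def get_title(soup):
    """Return the page title text, or None when there is no <title> tag."""
    if soup.title is None:
        return None
    return soup.title.get_text(strip=True)


def get_heading_texts(soup, tag="h2"):
    """Return the text of every matching heading, without the surrounding tags."""
    return [h.get_text(strip=True) for h in soup.select(tag)]


page = BeautifulSoup("<title> News </title><h2>First</h2><h2> Second </h2>", "html.parser")
empty = BeautifulSoup("<p>no title here</p>", "html.parser")

print(get_title(page), get_heading_texts(page))
print(get_title(empty), get_heading_texts(empty))
```

Returning None and an empty list, rather than an error string, lets the caller decide how to report a missing element.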