Python Beautiful Soup 4

Introduction to bs4

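Install: pip install bs4, then pip install lxml (the bs4 package on PyPI is a thin wrapper that installs beautifulsoup4)
Beautiful Soup is a Python library for extracting data from HTML and XML files.
Parsers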

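Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Fast; tolerant of malformed documents	Requires an external C library
lxml XML parser	`BeautifulSoup(markup, "lxml-xml")` or `BeautifulSoup(markup, "xml")`	Fast; the only parser that supports XML	Requires an external C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Most tolerant; parses the document the way a web browser does; produces HTML5	Slow; a separate pure-Python package (no C extension involved)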
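
The choice of parser in the table can be made at runtime. The following is a minimal sketch, assuming you prefer lxml but want the code to keep working where it is not installed. The helper name make_soup is hypothetical and not part of bs4; FeatureNotFound is the exception bs4 raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: try the preferred parser, else use html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; fall back to the built-in one.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.string)  # world
```

Different parsers can build slightly different trees from malformed markup, so it is worth fixing one parser per project when results must be reproducible.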

Using bs4

Import the package, parse the markup with BeautifulSoup, then extract data according to the structure of the source.

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# Parse with BeautifulSoup: pass the HTML document and the parser name
soup = BeautifulSoup(html_doc, 'lxml')
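
A minimal, self-contained sketch of the import, parse, extract flow. It assumes a shortened copy of the sample document and uses the built-in html.parser so it runs without lxml; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string                     # text inside <title>
first_link = soup.a                                # first <a> in the document
hrefs = [a["href"] for a in soup.find_all("a")]    # href of every <a>

print(title_text)        # The Dormouse's story
print(first_link["id"])  # link1
print(hrefs)             # ['http://example.com/elsie', 'http://example.com/lacie']
```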

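The four kinds of objects in bs4

Tag object
- Its two most important attributes are name and attributes (tag.attrs)
- tag.name gives the name of the node
- tag['class'] gives the value of the node's class attribute (returned as a list, because class is multi-valued)
BeautifulSoup object
- A BeautifulSoup object represents the whole document. Most of the time it can be treated as a Tag object
- The object returned by parsing HTML with bs4 is a BeautifulSoup object
NavigableString object, a traversable string
- Strings are usually contained in tags. Beautiful Soup uses the NavigableString class to wrap the strings inside a tag
- tag.string gives the NavigableString contained in a node
Comment object, a special kind of NavigableString
- When a tag's only child is a comment, tag.string returns it as a Comment object; tag.prettify() shows the comment in its <!-- --> form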
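
The four object types can be checked directly. This is a small sketch on a made-up snippet (the markup and variable names are assumptions for illustration), parsed with html.parser.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

markup = '<p class="title"><b>Bold text</b><i><!--a hidden note--></i></p>'
soup = BeautifulSoup(markup, "html.parser")

tag = soup.p
print(type(soup).__name__)       # BeautifulSoup
print(isinstance(soup, Tag))     # True: the soup behaves like a Tag
print(tag.name)                  # p
print(tag["class"])              # ['title']

text = soup.b.string             # the string wrapped inside <b>
print(isinstance(text, NavigableString))  # True

note = soup.i.string             # the only child of <i> is a comment
print(isinstance(note, Comment))  # True
print(note)                       # a hidden note
```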

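Tag attributes

.string: the content of the current tag; it is a property, not a method, and is None when the tag has more than one child
.strings: when a tag contains several strings, iterate over .strings to get each of them
.stripped_strings: like .strings, but with surrounding whitespace stripped and whitespace-only strings skipped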
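
A short sketch comparing the three attributes on a made-up snippet (the markup is an assumption chosen to include extra whitespace):

```python
from bs4 import BeautifulSoup

markup = "<div><p>  First  </p>\n<p>Second</p></div>"
soup = BeautifulSoup(markup, "html.parser")

single = soup.p.string                       # <p> has exactly one string child
nothing = soup.div.string                    # <div> has several children
all_strings = list(soup.div.strings)
clean_strings = list(soup.div.stripped_strings)

print(repr(single))    # '  First  '
print(nothing)         # None
print(all_strings)     # ['  First  ', '\n', 'Second']
print(clean_strings)   # ['First', 'Second']
```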

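Nodes

Navigating between nodes
Child nodes: contents, children; descendant nodes: descendants
- Nodes contained within the node: contents is a list of the direct children, children iterates over them, and descendants iterates over every nested level
Parent nodes: nodes that contain the node
- parent: the node that directly contains the node
- parents: all ancestor nodes
Sibling nodes: nodes at the same level
- next_sibling: the next sibling node
- previous_sibling: the previous sibling node
- next_siblings: all following sibling nodes
- previous_siblings: all preceding sibling nodes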

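Searching the document tree

Filters
- String filter: pass a tag name, for example name='a'
- Regular expression filter: name=re.compile(r'\w{5}'); a pattern compiled with re.compile() selects the tags whose name matches it
- List filter: name=['p', 'a'] matches any tag named in the list
- True filter: find(True) matches any tag
- Function filter: lambda tag: tag.has_attr('class') and not tag.has_attr('id'); the function receives a tag, inspects its attributes, and returns a bool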
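
Each filter type can be passed to find_all. The sketch below uses a made-up snippet (an assumption for illustration) and collects the names of the matched tags. Note that a regular expression is applied with search(), so r'\w{5}' matches any tag name containing at least five word characters.

```python
import re

from bs4 import BeautifulSoup

markup = (
    '<div><p class="story" id="intro">text</p>'
    '<a class="sister" id="link1">Elsie</a>'
    '<b class="bold">bold</b><strong>strong</strong></div>'
)
soup = BeautifulSoup(markup, "html.parser")

by_string = [t.name for t in soup.find_all("a")]
by_regex = [t.name for t in soup.find_all(re.compile(r"\w{5}"))]
by_list = [t.name for t in soup.find_all(["p", "a"])]
by_true = [t.name for t in soup.find_all(True)]
by_function = [
    t.name
    for t in soup.find_all(lambda tag: tag.has_attr("class") and not tag.has_attr("id"))
]

print(by_string)    # ['a']
print(by_regex)     # ['strong']
print(by_list)      # ['p', 'a']
print(by_true)      # ['div', 'p', 'a', 'b', 'strong']
print(by_function)  # ['b']
```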
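find(name=None, attrs={}, recursive=True, string=None, **kwargs) returns the first matching tag, or None if nothing matches
find_all(name=None, attrs={}, recursive=True, string=None, limit=None, **kwargs) returns all matching tags
- Parameters
  - name: a filter on the tag name
  - attrs={}: attribute filters passed as a dictionary
    - When an attribute is passed as a keyword argument instead (attribute='str'), class must be written class_
  - limit: the maximum number of results to return; must be greater than 0; by default everything is returned
  - string=str (named text before Beautiful Soup 4.4.0): finds the NavigableString objects equal to str and returns them as a list
  - kwargs: attribute filters such as id='', class_=''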
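
A sketch of how the parameters combine, on a made-up snippet (the markup, hrefs, and variable names are assumptions). It uses string=, the spelling used since Beautiful Soup 4.4.0 for the argument these notes call text.

```python
from bs4 import BeautifulSoup

markup = (
    '<div id="box">'
    '<a class="sister" id="link1" href="/elsie">Elsie</a>'
    '<a class="sister" id="link2" href="/lacie">Lacie</a>'
    '<p><a class="brother" id="link3" href="/tom">Tom</a></p>'
    "</div>"
)
soup = BeautifulSoup(markup, "html.parser")


def ids(tags):
    """Illustrative helper: collect the id attribute of each tag."""
    return [t["id"] for t in tags]


first = soup.find("a")                                   # first match only
sisters = ids(soup.find_all("a", class_="sister"))       # keyword form, class_
by_attrs = ids(soup.find_all("a", attrs={"class": "sister", "href": "/lacie"}))
limited = ids(soup.find_all("a", limit=2))               # stop after two results
direct = ids(soup.div.find_all("a", recursive=False))    # direct children only
texts = [str(s) for s in soup.find_all(string="Elsie")]  # exact string match
missing = soup.find("table")                             # no match

print(first["id"])  # link1
print(sisters)      # ['link1', 'link2']
print(by_attrs)     # ['link2']
print(limited)      # ['link1', 'link2']
print(direct)       # ['link1', 'link2']
print(texts)        # ['Elsie']
print(missing)      # None
```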
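Other search methods
- find_parent(), find_parents(): search for the nearest matching ancestor and for all matching ancestors
- find_next_siblings(), find_next_sibling(): search for all matching following siblings and for the first one
- find_previous_siblings(), find_previous_sibling(): search for all matching preceding siblings and for the first one
- find_all_next(), find_next(): search the nodes that come later in the document; the former returns all matches, the latter the first match
- find_all_previous(), find_previous(): search the nodes that come earlier in the document; the former returns all matches, the latter the first match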
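
These methods all search relative to a starting node. A minimal sketch, assuming a made-up snippet and starting from the middle paragraph:

```python
from bs4 import BeautifulSoup

markup = (
    '<div><p id="p1">one</p><p id="p2">two</p><p id="p3">three</p></div>'
    '<span id="after">tail</span>'
)
soup = BeautifulSoup(markup, "html.parser")
middle = soup.find(id="p2")

nearest_div = middle.find_parent("div").name
next_one = middle.find_next_sibling("p")["id"]
next_all = [t["id"] for t in middle.find_next_siblings("p")]
prev_one = middle.find_previous_sibling("p")["id"]
prev_all = [t["id"] for t in middle.find_previous_siblings("p")]
later_span = middle.find_next("span")["id"]        # not a sibling, but later in the document
later_ps = [t["id"] for t in middle.find_all_next("p")]
earlier_ps = [t["id"] for t in middle.find_all_previous("p")]

print(nearest_div)         # div
print(next_one, prev_one)  # p3 p1
print(later_span)          # after
print(later_ps)            # ['p3']
print(earlier_ps)          # ['p1']
```

The sibling methods stay at the same nesting level, while find_next and find_previous follow document order and can cross into other parts of the tree.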

Modifying the document tree

Changing a tag's name and attribute values
- tag.name = str, tag['attr'] = str
Changing the string
- tag.string = str
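
A short sketch of these assignments on a made-up tag (the markup and the new values are assumptions for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

tag.name = "blockquote"    # rename the tag
tag["class"] = "verybold"  # overwrite an existing attribute
tag["id"] = "1"            # add a new attribute
tag.string = "New text"    # replace the tag's contents with one string
after_edit = str(soup)

del tag["id"]              # attributes can be removed like dict keys
after_delete = str(soup)

print(after_edit)    # <blockquote class="verybold" id="1">New text</blockquote>
print(after_delete)  # <blockquote class="verybold">New text</blockquote>
```

Assigning to tag.string discards whatever the tag contained before, including any child tags.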

Appending content to a tag

tag.append(str)
from bs4 import NavigableString
s = NavigableString(str)
tag.append(s)
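
Both forms of append shown above, as a runnable sketch on a made-up tag (the markup and strings are assumptions):

```python
from bs4 import BeautifulSoup, NavigableString

soup = BeautifulSoup("<a>Foo</a>", "html.parser")
tag = soup.a

tag.append("Bar")                   # a plain str is wrapped automatically
tag.append(NavigableString("Baz"))  # or build the NavigableString yourself

rendered = str(tag)
pieces = [str(s) for s in tag.contents]

print(rendered)  # <a>FooBarBaz</a>
print(pieces)    # ['Foo', 'Bar', 'Baz']
```

The appended strings stay separate entries in tag.contents even though they render as continuous text.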

Adding a comment

from bs4 import Comment
new_comment = soup.new_string("Nice to see you.", Comment)
tag.append(new_comment)

Creating a tag inside a tag and giving it attributes

soup = BeautifulSoup("<b></b>", "lxml")
original_tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>

new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>