Guohe Backend Refactoring, Part 2: Learning and Using BeautifulSoup

Author: Hansybx | Published 2019-07-26 14:08

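Motivation

When pulling data from the academic affairs website, we have to scrape much of it ourselves, which makes a Python crawler an excellent choice. I picked the BeautifulSoup library for the job, for a simple reason: it is quick to learn. Below I give a brief tour of how to use it. (Beautiful Soup 3 is no longer developed, so everything here uses Beautiful Soup 4.)

Let's drink up this bowl of beautiful soup together

Introduction

Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document tree.

Installation

If you are on a recent version of Debian or Ubuntu, you can install it with the system package manager: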

$ apt-get install python3-bs4

Alternatively, install it from PyPI with easy_install or pip (pip is the recommended tool today; easy_install is deprecated):

$ easy_install beautifulsoup4
$ pip install beautifulsoup4

Quick Start

We will use the following document as our running example, referred to below as the Alice document.

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

Instantiating the soup object

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

html_doc is the HTML of the scraped page; the second argument names the parser to use.
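To make the instantiation step concrete, here is a minimal sketch. The `sample_html` string is a shortened stand-in for a fetched page and is an assumption for illustration only; in real use it would be the HTML returned by the site you are scraping.

```python
from bs4 import BeautifulSoup

# sample_html is a hypothetical, shortened stand-in for a downloaded page.
sample_html = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class='title'><b>The Dormouse's story</b></p></body></html>"
)

# 'html.parser' is the parser bundled with Python's standard library,
# so no extra install is needed beyond beautifulsoup4 itself.
soup = BeautifulSoup(sample_html, "html.parser")

title_tag = soup.title          # first <title> tag in the document
title_text = soup.title.string  # the text inside that tag

print(title_tag.name)  # title
print(title_text)      # The Dormouse's story
```

Other parsers such as lxml can be passed as the second argument instead, provided they are installed separately.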

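Searching the Document Tree

The find_all method

In most cases we use the find_all method to filter the document.

The name argument

It looks for tags whose name matches name and returns all of them.

A simple example: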

soup.find_all("title")
# [<title>The Dormouse's story</title>]
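A few variations on the name argument may help here. The sketch below assumes a shortened copy of the Alice document (`sample_html` is a name introduced for this example) and shows a single tag name, a list of names, the `limit` argument, and `find`, which returns only the first match.

```python
from bs4 import BeautifulSoup

# Shortened copy of the Alice document, used only for this illustration.
sample_html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

all_links = soup.find_all("a")              # every <a> tag
links_and_bold = soup.find_all(["a", "b"])  # a list matches any of the names
first_two = soup.find_all("a", limit=2)     # stop after two matches
first_link = soup.find("a")                 # first match only, or None
missing = soup.find_all("table")            # no match gives an empty list

print(len(all_links))                          # 3
print([tag.name for tag in links_and_bold])    # ['b', 'a', 'a', 'a']
print(first_link["id"])                        # link1
```

Because find_all always returns a list-like result, an empty result can be checked with a plain `if not missing:` before indexing into it.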

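Keyword arguments

A keyword argument is treated as a filter on the tag attribute of the same name, and every matching tag is returned.

For example, in <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>, the attributes href, class, and id can all be used as keyword arguments.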

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Note that class is a reserved word in Python, so append an underscore when using it as a keyword argument:

soup.find_all(class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
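Keyword filters accept more than exact strings. The following sketch, again on a shortened copy of the Alice document (`sample_html` is an illustrative name), shows a regular expression on href, `True` to match any tag that has the attribute at all, and several filters combined in one call.

```python
import re
from bs4 import BeautifulSoup

# Shortened copy of the Alice document, used only for this illustration.
sample_html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# A compiled regex is matched against the attribute value.
by_href = soup.find_all(href=re.compile("elsie"))

# True matches every tag that has an id attribute, whatever its value.
with_id = soup.find_all(id=True)

# Several filters can be combined; a tag must satisfy all of them.
combined = soup.find_all("a", class_="sister", id="link3")

# Attributes of a found tag are read like dictionary entries.
link = soup.find(id="link2")
href_value = link["href"]        # raises KeyError if the attribute is absent
title_value = link.get("title")  # returns None if the attribute is absent

print(len(by_href), len(with_id), len(combined))  # 1 3 1
print(href_value)                                  # http://example.com/lacie
```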

Searching by CSS class

Search for tags by CSS class name and return every tag that matches.

soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
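The attrs dictionary and the class_ keyword are two spellings of the same filter. A small sketch follows; the markup in `css_html` is made up for this example (it is not part of the Alice document) so that it can show a tag carrying two classes and an attribute name that cannot be written as a Python keyword.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one tag with two CSS classes and a data-* attribute.
css_html = (
    '<p class="body strikeout" data-foo="value">crossed out</p>'
    '<p class="body">plain</p>'
)
css_soup = BeautifulSoup(css_html, "html.parser")

# Matching on a class name checks each of a tag's classes individually.
by_class_kw = css_soup.find_all("p", class_="body")
by_attrs = css_soup.find_all("p", attrs={"class": "strikeout"})

# data-foo is not a valid Python identifier, so it has to go through attrs.
by_data = css_soup.find_all(attrs={"data-foo": "value"})

# The class attribute is returned as a list of class names.
classes = by_data[0]["class"]

print(len(by_class_kw), len(by_attrs))  # 2 1
print(classes)                           # ['body', 'strikeout']
```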

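The select method

The select method is simple to use: pass it a CSS selector string, writing tag names in order, with #id to match an id and .class to match a class name.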

soup.select('a#link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
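Whitespace is significant in CSS selectors, which is easy to trip over. This sketch, on a shortened copy of the Alice document (`sample_html` is an illustrative name), contrasts `a#link1` (an <a> tag whose id is link1) with `a #link1` (an element with id link1 nested inside an <a> tag), and shows a few other common selectors.

```python
from bs4 import BeautifulSoup

# Shortened copy of the Alice document, used only for this illustration.
sample_html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

direct = soup.select("a#link1")       # <a> that itself has id="link1"
descendant = soup.select("a #link1")  # id="link1" inside an <a>: none here
in_story = soup.select("p.story a")   # <a> tags inside <p class="story">
by_class = soup.select(".sister")     # any tag with class "sister"
first = soup.select_one("#link2")     # first match only, or None

print(len(direct), len(descendant))   # 1 0
print(len(in_story), len(by_class))   # 3 3
print(first.text)                     # Lacie
```

select always returns a list, even for a single match, so use select_one when only one tag is wanted.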

Output

To output the text of a tag, use its .text attribute directly.
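As a brief sketch of the output step, the code below pulls the link names and URLs out of a shortened copy of the Alice document (`sample_html` is an illustrative name). Note that .text belongs to an individual tag, so a find_all result has to be iterated over.

```python
from bs4 import BeautifulSoup

# Shortened copy of the Alice document, used only for this illustration.
sample_html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# .text returns all the text inside a tag, including nested tags.
title_text = soup.find("p", class_="title").text

# find_all returns a list, so read .text from each tag in turn.
names = [a.text for a in soup.find_all("a")]
hrefs = [a["href"] for a in soup.find_all("a")]

print(title_text)  # The Dormouse's story
print(names)       # ['Elsie', 'Lacie', 'Tillie']
```

The get_text() method returns the same text as .text and additionally accepts options such as strip=True for trimming whitespace.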

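Summary

I have covered very little here, only some simple uses of BeautifulSoup. But getting started is always the hardest part, and once you have these basics down, I trust you will be ready to move on to more advanced material. For everything else, see the official BeautifulSoup documentation.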

Article link: https://www.haomeiwen.com/subject/mvnrrctx.html
