The BeautifulSoup Module in Python

Author: DarknessShadow | Published 2020-04-18 20:41

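The BeautifulSoup module

Brief introduction
BeautifulSoup is a tool for extracting the data we need from HTML source. For this job it is generally easier to use and less fragile than hand-written regular expressions.
The current major version, Beautiful Soup 4, is imported under the package name bs4.
Installation
pip install beautifulsoup4
A simple example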

import requests
import bs4

url = 'https://www.lagou.com/'
res = requests.get(url)
res.raise_for_status()
no = bs4.BeautifulSoup(res.text)
print(type(no))

bs4.BeautifulSoup('string containing the HTML content'): returns a BeautifulSoup object
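
To see the constructor in isolation, here is a minimal sketch that parses a small hand-written HTML string instead of a downloaded page. The markup and variable names are made up for illustration, and the sketch assumes Python's built-in 'html.parser' so that nothing beyond bs4 needs to be installed.

```python
import bs4

# Hypothetical HTML snippet standing in for a downloaded page.
html = '<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>'

# The second argument names the parser; 'html.parser' ships with Python.
soup = bs4.BeautifulSoup(html, 'html.parser')

print(type(soup))             # <class 'bs4.BeautifulSoup'>
print(soup.title.getText())   # Demo page
```

Because the input is a plain string, this snippet runs without any network access.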
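Running the requests example above, which calls bs4.BeautifulSoup(res.text) without naming a parser, produces a warning: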

D:/JavaSoft/pycharm-professional-2019.3/WorkSpace/python_learning/python_base/webcrawle/webcrawle_demo3.py:11: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 11 of the file D:/JavaSoft/pycharm-professional-2019.3/WorkSpace/python_learning/python_base/webcrawle/webcrawle_demo3.py. To get rid of this warning, pass the additional argument 'features="lxml"' to the BeautifulSoup constructor.

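In plain terms, the warning says that no parser was specified explicitly, so Beautiful Soup picked the best HTML parser available on this system ("lxml"). That is usually harmless, but on another system or in a different virtual environment a different parser may be chosen and the code may behave differently.
Summary: the parser is not missing, it simply was not named. Make sure the lxml module is installed (pip install lxml), then pass 'lxml' as the second argument when creating the BeautifulSoup object and the warning goes away: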

import requests
import bs4

url = 'https://www.lagou.com/'
res = requests.get(url)
res.raise_for_status()
no = bs4.BeautifulSoup(res.text, 'lxml')
print(type(no))
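
If the code may run on a machine where lxml is not installed, one possible approach is to fall back to the built-in parser. The sketch below is an assumption layered on top of the example, not part of the original code: make_soup is a hypothetical helper name, and it relies on bs4 raising bs4.FeatureNotFound when the requested parser is unavailable.

```python
import bs4


def make_soup(markup):
    """Parse markup with lxml if it is installed, otherwise with html.parser."""
    try:
        return bs4.BeautifulSoup(markup, 'lxml')
    except bs4.FeatureNotFound:
        # bs4 raises FeatureNotFound when the named parser is not installed.
        return bs4.BeautifulSoup(markup, 'html.parser')


# Hypothetical one-line document used only to exercise the helper.
soup = make_soup('<p id="greeting">Hello</p>')
print(soup.p.getText())
```

Note that a fallback brings back the situation the warning describes: the two parsers can build slightly different trees from malformed HTML, so pinning a single parser is the safer choice when results must be identical everywhere.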

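The select() method
Purpose: after bs4 has loaded the whole HTML source into a BeautifulSoup object, call select() with a CSS selector to find the elements and data we need.
How it works: according to the source code, each call to select() returns a ResultSet object defined in the bs4.element module. ResultSet inherits from list, so what you get back is effectively a list of Tag objects. A Tag can be passed to str() to get its HTML, and it has an attrs attribute that holds all of the tag's HTML attributes as a dictionary.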

import requests
import bs4

# Download the page from lagou.com and save it to a local file (written in binary mode)
# url = 'https://www.lagou.com/'
# res = requests.get(url)
# res.raise_for_status()
# with open('lagou.txt', 'wb') as op:
#     for line in res.iter_content(1000):
#         op.write(line)
# Parse the saved copy; the with block closes the file once parsing is done
with open('lagou.txt', 'r', encoding='utf-8') as file:
    soup = bs4.BeautifulSoup(file, 'lxml')
print(type(soup))
elems = soup.select('#search_input')
print(elems)
print(type(elems))
print(len(elems))
print(elems[0])
print(type(elems[0]))
print(elems[0].getText())
print(elems[0].attrs)
print(elems[0].get('placeholder'))

Output after running the code

<class 'bs4.BeautifulSoup'>
[<input autocomplete="off" class="search_input" id="search_input" maxlength="64" placeholder="搜索职位、公司或地点" tabindex="1" type="text" value=""/>]
<class 'bs4.element.ResultSet'>
1
<input autocomplete="off" class="search_input" id="search_input" maxlength="64" placeholder="搜索职位、公司或地点" tabindex="1" type="text" value=""/>
<class 'bs4.element.Tag'>

{'maxlength': '64', 'placeholder': '搜索职位、公司或地点', 'type': 'text', 'id': 'search_input', 'class': ['search_input'], 'autocomplete': 'off', 'tabindex': '1', 'value': ''}
搜索职位、公司或地点
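
A few details of this output are easy to miss: the blank line comes from getText() on an input element, which has no text content, and the class attribute is stored as a list rather than a string. The sketch below reproduces these behaviours on hypothetical markup modelled on the search box above. The HTML, the extra "big" class, and the span element are invented for illustration, and 'html.parser' is assumed so that only bs4 is required.

```python
import bs4

# Hypothetical markup modelled on the search box in the output above.
html = (
    '<div id="box">'
    '<input id="search_input" class="search_input big" type="text" value=""/>'
    '<span class="hint">Type a keyword</span>'
    '</div>'
)
soup = bs4.BeautifulSoup(html, 'html.parser')

elems = soup.select('#search_input')
tag = elems[0]

# ResultSet is a list subclass, so normal list operations work on it.
print(isinstance(elems, list), len(elems))

# An <input> element has no text content, so getText() is an empty string.
print(repr(tag.getText()))

# class is a multi-valued attribute, so bs4 stores it as a list.
print(tag.attrs['class'])

# get() returns None for a missing attribute instead of raising KeyError.
print(tag.get('placeholder'))

# A selector that matches nothing yields an empty result, not an error.
print(len(soup.select('#does_not_exist')))

# Elements that do contain text return it from getText().
print(soup.select('span.hint')[0].getText())
```

Checking the length of the result before indexing with [0] avoids an IndexError when a page does not contain the expected element.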