Python Web Scraping 2: BeautifulSoup, a First Look at Scraping

Author: Iphone60Plus | Published: 2020-04-30 17:06


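What is BeautifulSoup?

BeautifulSoup is a Python library (imported from the bs4 package) for parsing HTML and extracting data from it.

How do you use BeautifulSoup?

Parsing data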

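The BeautifulSoup() constructor is called here with two arguments. The first (index 0) is the text to parse, passed as a string; the second names the parser, here html.parser from Python's standard library.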

import requests
from bs4 import BeautifulSoup
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
soup = BeautifulSoup(res.text, 'html.parser')
print(type(soup))  # check the type of soup
print(soup)  # print soup
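
To try the parsing step without a network request, a minimal sketch can pass a hard-coded HTML string to BeautifulSoup. The sample markup and variable names below are made up for illustration and are not taken from the page above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for res.text
sample_html = "<html><body><h1>Hello</h1><p>First paragraph</p></body></html>"

# Same call shape as above: text to parse first, parser name second
soup = BeautifulSoup(sample_html, "html.parser")

print(type(sample_html))  # the input is an ordinary str
print(type(soup))         # the result is a BeautifulSoup object
print(soup.h1.text)       # text inside the first <h1>
```

Because res.text is also a plain string, the same call works unchanged once the string comes from a real response.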

Extracting data


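1. In a call such as find_all(class_='books'), note the trailing underscore in class_. It is there because class is a reserved keyword in Python (used to define classes), so the underscore avoids a syntax conflict. Besides class, other attributes such as style can also be used for matching.
2. The arguments in the parentheses, a tag name and an attribute, can be used on their own or together, depending on what we want to extract from the page.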
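
The two points above can be sketched on a small inline document. The markup, class names, and variable names here are assumptions chosen for illustration; only find(), find_all(), class_, and the attrs dictionary are actual BeautifulSoup features.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
sample_html = """
<div class="books" style="color:red">Book block</div>
<div class="music">Music block</div>
<p class="books">Book paragraph</p>
"""
soup = BeautifulSoup(sample_html, "html.parser")

by_tag = soup.find("div")                            # tag only: first <div>
by_class = soup.find_all(class_="books")             # attribute only: any tag with class "books"
by_both = soup.find("p", class_="books")             # tag and attribute together
by_style = soup.find(attrs={"style": "color:red"})   # a different attribute, via the attrs dict

print(by_tag.text)
print(len(by_class))
print(by_both.text)
print(by_style.text)
```

Matching on the attribute alone returns both the div and the p, which is why combining it with a tag name is useful when several kinds of elements share a class.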

import requests
from bs4 import BeautifulSoup
url = 'https://localprod.pandateacher.com/python-manuscript/crawler-html/spder-men0.0.html'
res = requests.get(url)
print(res.status_code)
soup = BeautifulSoup(res.text, 'html.parser')
item = soup.find('div')  # use find() to extract the first <div> element and store it in item
print(type(item))  # print the data type of item
print(item)        # print item
# Output:
#200
#<class 'bs4.element.Tag'>
#<div>大家好，我是一个块</div>

import requests
from bs4 import BeautifulSoup
url = 'https://localprod.pandateacher.com/python-manuscript/crawler-html/spder-men0.0.html'
res = requests.get(url)
print(res.status_code)
soup = BeautifulSoup(res.text, 'html.parser')
items = soup.find_all('div')  # use find_all() to extract every matching element and store them in items
print(type(items))  # print the data type of items
print(items)        # print items
# Output:
#200
#<class 'bs4.element.ResultSet'>
#[<div>大家好，我是一个块</div>, <div>我也是一个块</div>, <div>我还是一个块</div>]
# The ResultSet holds the matches like a list
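
The difference between the two return types matters most when nothing matches. The following is a minimal offline sketch, assuming a made-up two-div document: find() gives back a single Tag or None, while find_all() always gives back a list-like ResultSet, possibly empty.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
sample_html = "<div>one</div><div>two</div>"
soup = BeautifulSoup(sample_html, "html.parser")

first_div = soup.find("div")        # a single Tag
all_divs = soup.find_all("div")     # a ResultSet, usable like a list
texts = [div.text for div in all_divs]

missing_one = soup.find("span")      # no match: find() returns None
missing_all = soup.find_all("span")  # no match: find_all() returns an empty ResultSet

print(first_div.text)
print(texts)
print(missing_one is None, len(missing_all))
```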


import requests
# import the requests library
from bs4 import BeautifulSoup
# import BeautifulSoup from the bs4 library
res = requests.get('https://localprod.pandateacher.com/python-manuscript/crawler-html/spider-men5.0.html')
# returns a Response object, assigned to res
html = res.text
# get the response body as a string
soup = BeautifulSoup(html, 'html.parser')
# parse the page into a BeautifulSoup object
items = soup.find_all(class_='books')
# extract the elements we want by matching the class attribute
for item in items:
    kind = item.find('h2')  # within each element, match the h2 tag
    title = item.find(class_='title')  # within each element, match class_='title'
    brief = item.find(class_='info')  # within each element, match class_='info'
    print(kind.text, '\n', title.text, '\n', brief.text)  # print the extracted text
    print(type(kind), type(title), type(brief))  # print the types of the extracted data
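
The loop above assumes every block contains an h2, a title, and an info element; if one is missing, find() returns None and accessing .text raises AttributeError. A possible guard is sketched below. The helper safe_text and the sample markup are hypothetical and are not part of BeautifulSoup or of the page being scraped.

```python
from bs4 import BeautifulSoup

def safe_text(tag, default=""):
    """Return the stripped text of a Tag, or default when tag is None."""
    return tag.get_text(strip=True) if tag is not None else default

# Hypothetical markup: the second block has no element with class "info"
sample_html = """
<div class="books"><h2>Science</h2><p class="title">Book A</p><p class="info">About A</p></div>
<div class="books"><h2>History</h2><p class="title">Book B</p></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for item in soup.find_all(class_="books"):
    kind = safe_text(item.find("h2"))
    title = safe_text(item.find(class_="title"))
    brief = safe_text(item.find(class_="info"), default="(no description)")
    rows.append((kind, title, brief))

print(rows)
```

Collecting the results into a list of tuples, rather than printing inside the loop, also makes it easier to save or process them later.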