Scraping Web Data with BeautifulSoup (1)

Author: 查德笔记 | Published: 2018-02-18 14:29

0. Preface

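Before introducing the BeautifulSoup module, let's look at how the page we want to scrape is structured. Most web pages are styled with Cascading Style Sheets (CSS), and the tags that carry those styles are what a scraper targets. Chrome or Firefox makes it quick to inspect a page's structure. For example, in Chrome, right-click anywhere on the Baidu home page and choose 'Inspect' to see the page structure and how the tags are nested.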

[Image: page structure shown in Chrome's Inspect panel]
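
The nesting that the Inspect panel shows can also be printed from code. The following is only a minimal sketch using the standard library's html.parser (not BeautifulSoup); SAMPLE_HTML, OutlinePrinter, and outline are hypothetical names, and the HTML fragment is made up for illustration rather than taken from a real page.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, made up for illustration.
SAMPLE_HTML = """
<html>
  <body>
    <div id="content">
      <h1>Title</h1>
      <p>Some <span class="green">green</span> text.</p>
    </div>
  </body>
</html>
"""


class OutlinePrinter(HTMLParser):
    """Records one indented line per opening tag to show the nesting.

    Limitation: void tags written without a slash (e.g. <br>) have no
    closing tag and would throw the depth counter off; the sample has none.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Rebuild the attribute text so the outline resembles the Inspect panel.
        attr_text = "".join(f' {key}="{value}"' for key, value in attrs)
        self.lines.append("  " * self.depth + f"<{tag}{attr_text}>")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


def outline(html_text):
    """Return the indented list of opening tags found in html_text."""
    parser = OutlinePrinter()
    parser.feed(html_text)
    parser.close()
    return parser.lines


for line in outline(SAMPLE_HTML):
    print(line)
```

Each level of indentation in the output corresponds to one level of nesting in the page, which is the same parent/child relationship the browser's inspector displays.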

1. Building a Scraper to Fetch a Web Page

Target page: url = www.pythonscraping.com/pages/warandpeace.html

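The page, shown in the figure, contains red and green text. Right-click any green text and choose "Inspect" to examine the tag structure: all of the green text is wrapped in span tags with class="green", that is, <span class="green">GreenText</span>.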

[Image: the target page with its red and green text]
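
To make the tag structure concrete without touching the network, here is a small sketch that parses an inline fragment. SAMPLE_HTML and green_texts are hypothetical names, and the fragment is invented to mimic the red/green span layout described above; it is not the real page content.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the layout: red spans and green spans.
SAMPLE_HTML = """
<div id="text">
  <span class="red">Some red narration.</span>
  <span class="green">Alice</span> spoke to
  <span class="green">Bob</span>.
</div>
"""


def green_texts(html_text):
    """Return the text of every <span class="green"> in html_text."""
    soup = BeautifulSoup(html_text, "html.parser")
    spans = soup.find_all("span", {"class": "green"})
    return [span.get_text() for span in spans]


print(green_texts(SAMPLE_HTML))
```

Only the spans whose class is green are returned; the red span and the untagged text are skipped.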

1.1 Fetching the Page


from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://www.pythonscraping.com/pages/warandpeace.html'
html = urlopen(url)  # fetch the page at this URL
soup = BeautifulSoup(html, 'html.parser')  # parse the page; naming the parser explicitly avoids a warning
name_list = soup.find_all('span', {'class': 'green'})  # find_all collects every green-text span and returns a list-like ResultSet

for name in name_list:
    print(name.get_text())  # get_text() drops the tag markup and keeps only the text inside the tag
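
The effect of get_text() is easiest to see on a tag that contains another tag. The sketch below uses a made-up one-line fragment (SAMPLE_HTML is a hypothetical name, not content from the target page) and compares the tag's full markup with the text get_text() returns. It also shows the class_ keyword, which is another way of writing the same class filter.

```python
from bs4 import BeautifulSoup

# Made-up fragment: a green span with a nested <b> tag inside it.
SAMPLE_HTML = '<p>Said <span class="green">Alice <b>Smith</b></span> quietly.</p>'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
span = soup.find("span", {"class": "green"})

print(str(span))        # the tag including its markup and the nested <b>
print(span.get_text())  # only the text, with all tag markup removed

# class_ is a keyword form of the same filter (class is reserved in Python).
by_dict = [s.get_text() for s in soup.find_all("span", {"class": "green"})]
by_keyword = [s.get_text() for s in soup.find_all("span", class_="green")]
print(by_dict == by_keyword)
```

The nested <b> tag disappears from the get_text() result while its text stays in place.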