Python Crawler Group Homework - Week 3 - BeautifulSoup

Author: 只是不在意 | Published 2017-05-07 20:46


After a not especially strenuous effort, I got a BeautifulSoup crawler working today as well.
Below is teacher 向右's Qiushibaike example:

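import requests
from bs4 import BeautifulSoup

# Download the raw HTML of the text-jokes page
html = requests.get('http://www.qiushibaike.com/text/').content
soup = BeautifulSoup(html, 'lxml')
# Select every <span> inside a <div> that is a direct child of <a class="contentHerf">
links = soup.select('a.contentHerf > div > span')

for link in links:
    print(link.get_text())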
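
To see what the `a.contentHerf > div > span` selector actually matches without hitting the live site, here is a minimal sketch that runs it against a small inline document. The markup in `SAMPLE_HTML` and the helper name `extract_jokes` are made up for illustration and only imitate the nesting the selector expects; the real page may be structured differently. It also uses the built-in `html.parser` instead of `lxml` so that nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: it only imitates the a > div > span nesting
# that the selector looks for, not the real page.
SAMPLE_HTML = """
<div class="article">
  <a class="contentHerf" href="/article/1">
    <div class="content"><span>First joke</span></div>
  </a>
  <a class="contentHerf" href="/article/2">
    <div class="content"><span>Second joke</span></div>
  </a>
  <a class="other" href="/x"><div><span>Not a joke</span></div></a>
</div>
"""


def extract_jokes(html):
    """Return the text of every span matched by the CSS selector."""
    soup = BeautifulSoup(html, 'html.parser')
    # '>' means "direct child", so the span must sit directly inside a div
    # that sits directly inside an <a> carrying the class contentHerf.
    spans = soup.select('a.contentHerf > div > span')
    return [span.get_text(strip=True) for span in spans]


jokes = extract_jokes(SAMPLE_HTML)
for joke in jokes:
    print(joke)
```

The third link is skipped because its class is `other`, which shows that the class filter on the `<a>` tag does the real narrowing here.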

Below is my own scraper for the jokes on jandan.net (煎蛋网).

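import requests
from bs4 import BeautifulSoup

# Download the raw HTML of the jokes page
html = requests.get('http://jandan.net/duan/').content

soup = BeautifulSoup(html, 'lxml')
# Every joke lives in a <div class="text">
links = soup.find_all('div', class_="text")

for link in links:
    # link.p is the first <p> inside the div
    print(link.p.get_text())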

Because the jokes on this site are wrapped in <p> tags, the important part is the p.get_text() call on the last line.
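
One detail worth checking is that `tag.p` returns only the first `<p>` inside a tag, and returns `None` when there is none. The sketch below compares that with collecting every `<p>` via `find_all('p')`. The inline `SAMPLE_HTML` and the function names `first_paragraphs` and `all_paragraphs` are assumptions for illustration, not the real jandan.net markup, and `html.parser` stands in for `lxml` to keep the example dependency-light.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: some blocks have several <p> tags, one has none.
SAMPLE_HTML = """
<div class="text"><p>Joke one, line one</p><p>Joke one, line two</p></div>
<div class="text"><p>Joke two</p></div>
<div class="text"></div>
<div class="author"><p>Not a joke</p></div>
"""


def first_paragraphs(html):
    """Mimic link.p.get_text(): keep only the first <p> of each block."""
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    for block in soup.find_all('div', class_='text'):
        if block.p is None:
            # Without this guard, None.get_text() raises AttributeError.
            continue
        result.append(block.p.get_text(strip=True))
    return result


def all_paragraphs(html):
    """Join every <p> of each block so multi-paragraph jokes stay whole."""
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    for block in soup.find_all('div', class_='text'):
        paragraphs = block.find_all('p')
        if paragraphs:
            result.append('\n'.join(p.get_text(strip=True) for p in paragraphs))
    return result


first = first_paragraphs(SAMPLE_HTML)
full = all_paragraphs(SAMPLE_HTML)
print(first)
print(full)
```

If a joke on the real page spans several paragraphs, the `find_all('p')` variant would keep the whole text, at the cost of a slightly longer loop.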


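Takeaway 1: BeautifulSoup feels somewhat easier to use than XPath.
Takeaway 2: I still need to get familiar with page structure so I stop wasting time on trial-and-error combinations of selectors. I should study HTML properly again.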
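<blockquote>Plan for next week: 1. Learn HTML
2. Keep working through the book and the videos
3. Keep studying teacher 向右's three articles until I understand them well, and try to scrape a page on my own
4. If I have energy left, try to learn how to write the data to an Excel spreadsheet.</blockquote>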

And with that, this week's homework is more or less done~