Scraping Information from a Local Web Page with a Crawler

Author: 许山山 | Published: 2016-07-20 21:46

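1. Assignment Description

snapshot1.png

Scrape the image URL, title, price, view count, and star rating of each product on the local web page shown above.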

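2. Assignment Goals

Read a local web page with the with statement
Parse the local page with bs4
Locate page content with CSS selectors
Extract the required content with the .select method
Organize the scraped data into dictionaries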

3. Assignment Code

from bs4 import BeautifulSoup

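# The with statement closes the file automatically once its block finishes.
# The path below is the author's local copy of the page; adjust it to your own.
with open('/home/xss/Plan-for-combating/week1/1_2/1_2answer_of_homework/index.html', 'r') as wb_data:
    soup = BeautifulSoup(wb_data, 'lxml')
    images  = soup.select('div.thumbnail > img')             # product images
    prices  = soup.select('div.caption > h4.pull-right')     # prices
    titles  = soup.select('div.caption > h4 > a')            # titles
    amounts = soup.select('p.pull-right')                    # view counts
    ratings = soup.select('div.ratings > p:nth-of-type(2)')  # star containers

info = []

# zip pairs up the n-th element of every list, giving one product per iteration.
for title, image, price, amount, rating in zip(titles, images, prices, amounts, ratings):
    data = {
        'title': title.get_text(),
        'image': image.get('src'),
        'price': price.get_text(),
        'amount': amount.get_text(),
        # Star rating = number of filled-star spans inside the container.
        'rating': len(rating.find_all('span', class_='glyphicon glyphicon-star')),
    }
    info.append(data)

# Print the title and price of every product rated above 3 stars.
# rating is already an int, so no float() conversion is needed.
for i in info:
    if i['rating'] > 3:
        print(i['title'], i['price'])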
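
To try the same steps without the original index.html, here is a minimal sketch that runs on an inline HTML string. The markup, product names, prices, and the helper parse_products are all hypothetical, written only to imitate the class names used by the selectors above; the real page may be structured differently. It also assumes the built-in 'html.parser' instead of 'lxml', so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating two product cards; not the real index.html.
STAR = '<span class="glyphicon glyphicon-star"></span>'
EMPTY = '<span class="glyphicon glyphicon-star-empty"></span>'

CARD = """
<div class="thumbnail">
  <img src="{src}" alt="">
  <div class="caption">
    <h4 class="pull-right">{price}</h4>
    <h4><a href="#">{title}</a></h4>
  </div>
  <div class="ratings">
    <p class="pull-right">{views}</p>
    <p>{stars}</p>
  </div>
</div>
"""

HTML = (
    CARD.format(src="img/a.jpg", price="$24.99", title="EarPod",
                views="65 reviews", stars=STAR * 5)
    + CARD.format(src="img/b.jpg", price="$64.99", title="New Pocket",
                  views="12 reviews", stars=STAR * 2 + EMPTY * 3)
)


def parse_products(html):
    """Return one dict per product card found in html (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')  # assumption: stdlib parser
    images = soup.select('div.thumbnail > img')
    prices = soup.select('div.caption > h4.pull-right')
    titles = soup.select('div.caption > h4 > a')
    amounts = soup.select('p.pull-right')
    ratings = soup.select('div.ratings > p:nth-of-type(2)')

    products = []
    for title, image, price, amount, rating in zip(
            titles, images, prices, amounts, ratings):
        products.append({
            'title': title.get_text(),
            'image': image.get('src'),
            'price': price.get_text(),
            'amount': amount.get_text(),
            # Only spans whose class attribute is exactly this string count.
            'rating': len(rating.find_all(
                'span', class_='glyphicon glyphicon-star')),
        })
    return products


products = parse_products(HTML)
for item in products:
    if item['rating'] > 3:
        print(item['title'], item['price'])
```

Wrapping the logic in a function that takes a string keeps parsing separate from file handling, so the same code can be fed the contents of a local file or a test fragment.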

Assignment summary:

A CSS selector does not have to be the full path; it only needs to identify the target content uniquely.
Usage of the zip function

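Definition: zip([iterable, ...])
zip() is a Python built-in function. It takes any number of iterables and packs their corresponding elements into tuples. In Python 2 it returns a list of these tuples; in Python 3, which the code above uses, it returns a lazy iterator, so wrap it in list() to see the contents. If the arguments differ in length, the result is as long as the shortest one. The * operator can be used to unzip a zipped sequence.
For example:

>>> a = [1, 2, 3]
>>> b = [4, 5, 6]
>>> zipped = zip(a, b)
>>> list(zipped)
[(1, 4), (2, 5), (3, 6)]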
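
The truncation and unzip behaviour described above can be sketched as follows. This assumes Python 3, where zip returns an iterator that can be consumed only once; the variable names are illustrative.

```python
a = [1, 2, 3]
b = [4, 5, 6]

# Materialize the iterator into a list of tuples.
pairs = list(zip(a, b))

# With inputs of unequal length, zip stops at the shortest one.
truncated = list(zip(a, ['x', 'y']))

# Unzip: * spreads the tuples back out as separate arguments to zip.
a_again, b_again = zip(*pairs)

# A zip object is exhausted after one pass.
z = zip(a, b)
first_pass = list(z)
second_pass = list(z)

print(pairs)
print(truncated)
print(a_again, b_again)
print(second_pass)
```

The truncation matters for the scraper: if one selector matches fewer elements than the others, zip silently drops the trailing products instead of raising an error, so it is worth comparing the list lengths when results look incomplete.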

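find_all collects every matching tag into a list. The star rating is obtained by collecting all span tags whose class attribute is 'glyphicon glyphicon-star' and taking the length of that list with len.
get_text() returns a tag's text content as a string. get() returns the value of the named attribute, which is a string for attributes such as src or href; for multi-valued attributes such as class it returns a list, and it returns None when the attribute is absent.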
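
The points about selectors, find_all, get_text(), and get() can be checked on a small fragment. The markup below is hypothetical (it only borrows the class names from the assignment page), and the built-in 'html.parser' is assumed in place of 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: three filled stars, two empty ones, and a link.
HTML = """
<div class="ratings">
  <p class="pull-right">18 reviews</p>
  <p>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
  </p>
  <a href="detail.html" class="btn btn-link">Details</a>
</div>
"""

soup = BeautifulSoup(HTML, 'html.parser')

# A short selector is enough when it already identifies the element uniquely.
short_match = soup.select('p.pull-right')
full_match = soup.select('div.ratings > p.pull-right')
same_result = (len(short_match) == len(full_match) == 1
               and short_match[0] == full_match[0])

# find_all returns a list-like result; its length is the star rating.
stars_container = soup.select('div.ratings > p:nth-of-type(2)')[0]
filled = stars_container.find_all('span', class_='glyphicon glyphicon-star')
rating = len(filled)

link = soup.find('a')
text = link.get_text()        # text content, a string
href = link.get('href')       # single-valued attribute, a string
classes = link.get('class')   # multi-valued attribute, a list
missing = link.get('title')   # absent attribute, None

print(same_result, rating, text, href, classes, missing)
```

Passing the whole class string to class_ matches only tags whose class attribute is exactly that string, which is why the 'glyphicon glyphicon-star-empty' spans are not counted.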