Check-in: 1-2 Scraping information from my own web page

Author: 早禾 | Published: 2016-07-18 17:39 | Read 0 times
    Source of the information to scrape

    Scraped results

    image : img/pic_0000_073a9256d9624c92a05dc680fc28865f.jpg
    price : $24.99
    view : 65 reviews
    describe : See more snippets like this online store item at web store 
    score : 5
    title : EarPod
    
    
    image : img/pic_0005_828148335519990171_c234285520ff.jpg
    price : $64.99
    view : 12 reviews
    describe : This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    score : 4
    title : New Pocket
    
    
    image : img/pic_0006_949802399717918904_339a16e02268.jpg
    price : $74.99
    view : 31 reviews
    describe : This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    score : 4
    title : New sunglasses
    
    
    image : img/pic_0008_975641865984412951_ade7a767cfc8.jpg
    price : $84.99
    view : 6 reviews
    describe : This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    score : 3
    title : Art Cup
    
    
    image : img/pic_0001_160243060888837960_1c3bcd26f5fe.jpg
    price : $94.99
    view : 18 reviews
    describe : This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    score : 4
    title : iphone gamepad
    
    
    image : img/pic_0002_556261037783915561_bf22b24b9e4e.jpg
    price : $214.5
    view : 18 reviews
    describe : This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    score : 4
    title : Best Bed
    
    
    image : img/pic_0011_1032030741401174813_4e43d182fce7.jpg
    price : $500
    view : 35 reviews
    describe : This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    score : 4
    title : iWatch
    
    
    image : img/pic_0010_1027323963916688311_09cc2d7648d9.jpg
    price : $15.5
    view : 8 reviews
    describe : This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    score : 4
    title : Park tickets
    

    Source code

    from bs4 import BeautifulSoup

    # Parse the local page; the 'lxml' parser needs the lxml package installed.
    with open('./index.html', 'r') as wbdata:
        soup = BeautifulSoup(wbdata, 'lxml')
        images = soup.select('div > div.col-md-9 > div > div > div > img')
        titles = soup.select('div.caption > h4:nth-of-type(2) > a')
        prices = soup.select('div.caption > h4.pull-right')
        describes = soup.select('div.caption > p')
        views = soup.select('div.ratings > p.pull-right')
        scores = soup.select('div > div.ratings > p:nth-of-type(2)')

    # Build one dict per product; zip() stops at the shortest list.
    info = []
    for title, image, price, describe, view, score in zip(titles, images, prices, describes, views, scores):
        data = {
            'title': title.get_text(),
            'image': image.get('src'),
            'price': price.get_text(),
            'describe': describe.get_text(),
            'view': view.get_text(),
            # The score is the number of filled-star spans.
            'score': len(score.find_all('span', 'glyphicon glyphicon-star'))
        }
        info.append(data)

    # Print every record as "key : value" lines, separated by blank lines.
    for item in info:
        for key, value in item.items():
            print(key, ':', value)
        print('\n')
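The script above keeps `price` and `view` as raw strings such as `$24.99` and `65 reviews`. If numeric values are wanted later (for sorting, say), a small post-processing step could look like the sketch below. The helper names `parse_price` and `parse_review_count` are hypothetical and not part of the original script, and the sketch assumes prices always carry a leading `$` as in the output shown earlier.

```python
import re


def parse_price(text):
    """Convert a price string such as '$24.99' to a float."""
    return float(text.strip().lstrip('$').replace(',', ''))


def parse_review_count(text):
    """Return the first integer found in a string such as '65 reviews'."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else 0


# One record in the same shape as the scraped output (values copied from it).
record = {'price': '$24.99', 'view': '65 reviews'}
cleaned = {
    'price': parse_price(record['price']),
    'view': parse_review_count(record['view']),
}
print(cleaned)
```

Keeping the cleanup in separate functions leaves the scraping loop unchanged, so the raw text is still available if the page format ever differs from this assumption.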
    
    

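    Notes

    1. The Beautiful Soup version used here does not support the :nth-child selector, so replace it with :nth-of-type (or simply drop that part of the selector). Releases from 4.7 onward, which rely on the soupsieve package, do accept :nth-child.
    2. Avoid passing the full copied selector to soup.select(); a shorter selector is easier to read and less fragile.
    3. Learn to consult the list of common mistakes and the documentation on your own.
    4. Read the debug messages patiently.
    5. Use get() to read an attribute of a tag, for example image.get('src'); use find_all() to collect the tags nested inside it, for example to count the star spans.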
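Notes 1, 2 and 5 can be tried on a small inline snippet. This is only a sketch: `SAMPLE_HTML` is a made-up fragment that imitates one product card and is not the author's actual index.html, and `html.parser` is used instead of `lxml` so that nothing beyond Beautiful Soup needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating a single product card.
SAMPLE_HTML = """
<img src="img/example.jpg">
<div class="caption">
  <h4 class="pull-right">$24.99</h4>
  <h4><a href="#">EarPod</a></h4>
  <p>A short description.</p>
</div>
<div class="ratings">
  <p class="pull-right">65 reviews</p>
  <p>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star"></span>
    <span class="glyphicon glyphicon-star-empty"></span>
  </p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# Note 1: :nth-of-type(2) picks the second <h4> inside div.caption.
title = soup.select('div.caption > h4:nth-of-type(2) > a')[0].get_text()

# Note 2: a short selector is enough to reach the image.
image = soup.select('img')[0]

# Note 5: get() reads an attribute value from a tag ...
image_src = image.get('src')

# ... while find_all() returns the matching tags nested inside a tag.
stars = soup.select('div.ratings > p:nth-of-type(2)')[0]
score = len(stars.find_all('span', 'glyphicon glyphicon-star'))

print(title, image_src, score)
```

The class string passed to find_all() is matched against the whole class attribute, so the `glyphicon-star-empty` span is not counted.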

          Link to this post: https://www.haomeiwen.com/subject/rcuyjttx.html