Python Crawler: Scraping the Douban Reviews of 杀破狼 and Doing a Bit of Analysis


By Fitz916 | Published 2017-08-19 09:25

    A few days ago I came across an article pushed by a WeChat public account that scraped the reviews of 战狼 (Wolf Warrior), so today I'll give it a try myself.
    The film I picked is 杀破狼.


    Next, open the film's short-review page and inspect the source: each review sits in a div with the class comment-item.
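
    As a quick sanity check (this snippet is my own addition, not from the original post), you can fetch the first page and count the comment-item blocks; each page should hold 20 reviews, though Douban may nowadays require login to show them:

    import requests
    from bs4 import BeautifulSoup

    # Fetch the first page of short reviews and count the comment blocks.
    url = 'https://movie.douban.com/subject/26826398/comments?status=P'
    resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(resp.text, 'html.parser')
    print(len(soup.select('div.comment-item')))  # expect 20 per page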


    Now we have what we want, but only for the first page, and there are more than six thousand reviews in total. To reach the rest, scroll down to the next-page (后页) link at the bottom; the address bar changes to an address like this:

    https://movie.douban.com/subject/26826398/comments?start=20&limit=20&sort=new_score&status=P

    So limit is the number of records per page and start is the offset of the first record. With 6,317 reviews at 20 per page, we can generate every page's URL up front:

    url_list = ['https://movie.douban.com/subject/26826398/comments?'
                'start={}&limit=20&sort=new_score&status=P'.format(x) for x in range(0, 6317, 20)]
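
    As a sanity check, the list covers 316 pages and the first URL looks like this:

    print(len(url_list))  # 316
    print(url_list[0])
    # https://movie.douban.com/subject/26826398/comments?start=0&limit=20&sort=new_score&status=P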
    

    The scraping itself is just a matter of using bs4 to pull out what we need:

    response = requests.get(url=url, headers=header)
    response.encoding = 'utf-8'
    html = BeautifulSoup(response.text, 'html.parser')
    comment_items = html.select('div.comment-item')
    for item in comment_items:
        comment = item.find('p')
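
    One practical caveat (my own addition, not in the original post): Douban tends to block clients that fire off hundreds of requests back to back, so it is worth pausing between pages. A minimal sketch, with delay values that are my own guess:

    import random
    import time

    for url in url_list:  # url_list as defined above
        # ... fetch and parse the page as shown ...
        time.sleep(random.uniform(1, 3))  # wait 1-3 s between requests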
    

    Then write the scraped text into a txt file, to be used for the analysis at the end.


    For the analysis, first grab a stop-word list from the web, then use jieba to extract keywords. The code is below (I also drew on 罗罗攀's article: http://www.jianshu.com/p/b277199346ae):

    def fenci():
        path = '/Users/mocokoo/Documents/shapolang.txt'
        with open(path, mode='r', encoding='utf-8') as f:
            content = f.read()
            analyse.set_stop_words('/Users/mocokoo/Documents/tycibiao.txt')
            tags = analyse.extract_tags(content, topK=100, withWeight=True)
            for item in tags:
                print(item[0] + '\t' + str(int(item[1] * 1000)))
    
    extract_tags returns (keyword, TF-IDF weight) pairs; multiplying the weight by 1000 and truncating it to an integer gives a size value that can be pasted straight into https://wordart.com/create to render the result as a word cloud.
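
    If you would rather render the cloud locally instead of pasting into WordArt, the third-party wordcloud package can do it too. This is a sketch of my own, not part of the original post; note that font_path must point at a font that actually contains Chinese glyphs:

    import jieba.analyse as analyse
    from wordcloud import WordCloud

    with open('/Users/mocokoo/Documents/shapolang.txt', encoding='utf-8') as f:
        content = f.read()

    analyse.set_stop_words('/Users/mocokoo/Documents/tycibiao.txt')
    tags = analyse.extract_tags(content, topK=100, withWeight=True)

    # font_path below is a placeholder -- substitute any .ttf/.otf with CJK coverage.
    wc = WordCloud(font_path='/path/to/a-chinese-font.ttf',
                   width=800, height=600, background_color='white')
    wc.generate_from_frequencies(dict(tags))
    wc.to_file('shapolang_wordcloud.png')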

    Finally, the complete code:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    
    import requests
    from bs4 import BeautifulSoup
    import jieba.analyse as analyse
    
    header = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    
    
    url_list = ['https://movie.douban.com/subject/26826398/comments?'
                'start={}&limit=20&sort=new_score&status=P'.format(x) for x in range(0, 6317, 20)]
    
    # Scrape all the short reviews and write them to a file
    
    
    def get_comments():
        with open(file='/Users/mocokoo/Documents/shapolang.txt', mode='w', encoding='utf-8') as f:
            i = 1
            for url in url_list:
                print('Scraping page %d of the 杀破狼 reviews' % i)
                response = requests.get(url=url, headers=header)
                response.encoding = 'utf-8'
                html = BeautifulSoup(response.text, 'html.parser')
                comment_items = html.select('div.comment-item')
                for item in comment_items:
                    comment = item.find('p')
                    f.write(comment.get_text().strip() + '\n')
                print('Page %d done' % i)
                i += 1
    # Keyword extraction (word segmentation)
    
    
    def fenci():
        path = '/Users/mocokoo/Documents/shapolang.txt'
        with open(path, mode='r', encoding='utf-8') as f:
            content = f.read()
            analyse.set_stop_words('/Users/mocokoo/Documents/tycibiao.txt')
            tags = analyse.extract_tags(content, topK=100, withWeight=True)
            for item in tags:
                print(item[0] + '\t' + str(int(item[1] * 1000)))
    
    if __name__ == '__main__':
        get_comments()  # write all the reviews to the file
        # fenci()       # then run this to print the keyword list
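
    Usage note: run the script once as-is to build shapolang.txt, then comment out get_comments() and uncomment fenci() to print the top 100 keywords.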
    
