Scraping Reviews of 《一出好戏》 (The Island)

Author: NJUNLP | Published: 2018-08-18 17:15

    I. Summary

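    The box office of 《一出好戏》 (The Island) has been climbing steadily of late. I like Huang Bo (黄渤), a genuinely skilled actor, so in my spare time I wanted to see how satisfied viewers are with the film. It is the first film Huang Bo has directed, and I expect audience impressions to be fairly good. This article scrapes each comment's nickname, date, avatar, and text.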


    d7ede3640a1394d181c3e3afaa9625a.png

    II. Environment

    1.PyCharm
    2.Python 3.6
    3.requests
    4.lxml

    III. Approach

    (1) The main page is https://movie.douban.com/subject/26985127/comments?status=P. Each comment is stored under a <div class="comment-item"> element, and each page holds 20 comments.

    0b67aff969b9ebd723478e98b3d1fa5.png
    (2) By extracting the "href" of the "后页" (Next) button on each page, we can quickly build the URL of the next page and so page through the comments.
    621f0cc588c343b66d902d789b9ec39.png
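
The two steps above can be tried offline before touching the live site. The sketch below is only an illustration: `SAMPLE_PAGE` is a hypothetical, heavily simplified stand-in for a comments page, and `count_comments` and `next_page_url` are names invented for this example. It assumes the "后页" link carries a query-only href such as `?start=40&limit=20`, which `urljoin` resolves against the comments URL.

```python
from urllib.parse import urljoin

from lxml import etree

# Hypothetical, simplified stand-in for one comments page (not real Douban markup).
SAMPLE_PAGE = """
<div id="comments">
  <div class="comment-item"><p><span>Great film</span></p></div>
  <div class="comment-item"><p><span>Worth watching</span></p></div>
</div>
<div id="paginator">
  <a href="?start=0&amp;limit=20">前页</a>
  <a href="?start=40&amp;limit=20">后页</a>
</div>
"""

BASE_URL = "https://movie.douban.com/subject/26985127/comments"


def count_comments(tree):
    """Step (1): count the comment blocks found on the page."""
    return len(tree.xpath("//div[@class='comment-item']"))


def next_page_url(tree, base_url=BASE_URL):
    """Step (2): return the absolute URL behind the '后页' link, or None."""
    hrefs = tree.xpath("//*[@id='paginator']/a[contains(text(), '后页')]/@href")
    if not hrefs:
        return None
    # urljoin keeps the base path and swaps in the query string from the href.
    return urljoin(base_url, hrefs[0])


tree = etree.HTML(SAMPLE_PAGE)
print(count_comments(tree))
print(next_page_url(tree))
```

Selecting the link by its label rather than by position means the function returns `None` once the page no longer offers a "后页" link, which gives the crawl loop a clear stopping condition.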

    IV. Implementation

    import requests
    import random
    from lxml import etree
    
    UA_LIST = [
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
      "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
      "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
      "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    
    headers = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
      'Accept-Encoding': 'gzip, deflate',  # 'br' omitted: requests decodes Brotli only when an extra package is installed
      'Accept-Language': 'zh-CN,zh;q=0.9',
      'Connection': 'keep-alive',
      'Host': 'movie.douban.com',
      'User-Agent': random.choice(UA_LIST)
    }
    
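    def downloadHtml(url):
        """Fetch a page and return its HTML text, or "" if the request fails."""
        try:
            r = requests.get(url, headers=headers, timeout=10)
            r.raise_for_status()
            r.encoding = r.apparent_encoding
            return r.text
        except requests.RequestException:
            # Network errors and non-2xx responses both end up here.
            return ""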
    
    def parse(url):
        """Print the avatar, nickname, date, and text of every comment on one page."""
        response = downloadHtml(url)
        html = etree.HTML(response)
        try:
            # Avatar image URLs
            photos = html.xpath("//*[@id='comments']/div/div[1]/a/img/@src")
            # Nicknames
            names = html.xpath("//*[@id='comments']/div/div[2]/h3/span[2]/a/text()")
            # Comment dates
            times = html.xpath("//*[@id='comments']/div/div[2]/h3/span[2]/span[3]/text()")
            # Comment text
            introduce = html.xpath("//*[@id='comments']/div/div[2]/p/span/text()")
            for photos_i, names_i, time_i, introduce_i in zip(photos, names, times, introduce):
                content = {
                    'photos': photos_i,
                    'names': names_i,
                    'time': time_i.replace('\n', '').replace(' ', ''),
                    'introduce': introduce_i
                }
                print(content)
        except Exception as exc:
            print("Failed to parse page:", exc)
    
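    def URL(url):
        """Return the href of the "后页" (Next) link, or None when there is none."""
        try:
            response = downloadHtml(url)
            html = etree.HTML(response)
            # Pick the link by its label: the last <a> in the paginator is not
            # necessarily the Next link (on the final page it may point backwards).
            new_url = html.xpath("//*[@id='paginator']/a[contains(text(), '后页')]/@href")[0]
            return new_url
        except Exception:
            print("Finished")
            return None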
    
    if __name__ == '__main__':
        base_url = "https://movie.douban.com/subject/26985127/comments"
        start_url = base_url + "?start=0&limit=20"
        # Keep following the "后页" (Next) link until there is none
        while True:
            try:
                parse(start_url)
                next_href = URL(start_url)
            except Exception:
                break
            if not next_href:
                break
            start_url = base_url + next_href
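
One fragile point in the listing is that it collects four separate lists and pairs them with `zip`. If any comment lacks a field, `zip` stops at the shortest list and may pair values that belong to different comments. A possible alternative, sketched here under assumptions, is to loop over each `comment-item` node and query it with relative XPath. The markup in `SAMPLE`, its class names other than `comment-item`, and the helpers `first` and `extract_comments` are hypothetical and only serve the illustration.

```python
from lxml import etree

# Hypothetical markup for illustration: the second comment has no avatar image.
SAMPLE = """
<div id="comments">
  <div class="comment-item">
    <div class="avatar"><a><img src="https://example.com/a.jpg"/></a></div>
    <div class="comment">
      <h3><span class="comment-info"><a>alice</a>
        <span class="comment-time"> 2018-08-18 </span></span></h3>
      <p><span class="short">Great film</span></p>
    </div>
  </div>
  <div class="comment-item">
    <div class="avatar"></div>
    <div class="comment">
      <h3><span class="comment-info"><a>bob</a>
        <span class="comment-time"> 2018-08-17 </span></span></h3>
      <p><span class="short">Worth watching</span></p>
    </div>
  </div>
</div>
"""


def first(node, path):
    """Return the first stripped result of a relative XPath, or None if empty."""
    found = node.xpath(path)
    return found[0].strip() if found else None


def extract_comments(tree):
    """Build one dict per comment-item so fields stay tied to their comment."""
    rows = []
    for item in tree.xpath("//div[@class='comment-item']"):
        rows.append({
            "photos": first(item, ".//img/@src"),
            "names": first(item, ".//h3//a/text()"),
            "time": first(item, ".//span[@class='comment-time']/text()"),
            "introduce": first(item, ".//p/span/text()"),
        })
    return rows


rows = extract_comments(etree.HTML(SAMPLE))
for row in rows:
    print(row)
```

A missing field shows up as `None` in that comment's own row instead of silently shifting every later value.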
    

    V. Results

    88bac8288cdc03e910c7b33d817441c.png


          Article link: https://www.haomeiwen.com/subject/dnpfiftx.html