Scraping Reviews of 《一出好戏》 (The Island)


Author: NJUNLP | Published: 2018-08-18 17:15

I. Abstract

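The box office of 《一出好戏》 has been climbing steadily of late. I am a fan of Huang Bo, an actor of real ability, so in my spare time I wanted to see how satisfied viewers are with the film. It is the first film Huang Bo has directed, and I expect the general impression is fairly good. This article scrapes four fields from each comment: the nickname, the date, the avatar, and the comment text.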



II. Environment

1. PyCharm
2. Python 3.6
3. requests
4. lxml

III. Approach

(1) The main page is https://movie.douban.com/subject/26985127/comments?status=P. Each comment is stored under a <div class="comment-item"> element, and each page holds 20 comments.

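(2) On each page, the "后页" (next page) button carries an "href" attribute. Extracting it lets us build the URL of the next page quickly and so page through the comments.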
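The two steps above can be sketched on a small hand-written HTML fragment. The markup is an assumption that only imitates the structure described here (one `comment-item` div per comment, and a paginator whose "后页" link carries the `href`), so the real Douban page may differ. The helper names `extract_comments` and `next_page_url` and the class name `comment-time` are hypothetical.

```python
from urllib.parse import urljoin
from lxml import etree

# Hand-written sample imitating the described structure (an assumption,
# not a copy of the real page).
SAMPLE_HTML = """
<html><body>
<div id="comments">
  <div class="comment-item">
    <div class="avatar"><a href="#"><img src="https://example.com/a.jpg"/></a></div>
    <div class="comment">
      <h3><span><a href="#">user_a</a> <span class="comment-time">2018-08-10</span></span></h3>
      <p><span>Great film.</span></p>
    </div>
  </div>
  <div class="comment-item">
    <div class="avatar"><a href="#"><img src="https://example.com/b.jpg"/></a></div>
    <div class="comment">
      <h3><span><a href="#">user_b</a> <span class="comment-time">2018-08-11</span></span></h3>
      <p><span>Worth watching.</span></p>
    </div>
  </div>
</div>
<div id="paginator">
  <a href="?start=0&amp;limit=20">前页</a>
  <a href="?start=40&amp;limit=20">后页 &gt;</a>
</div>
</body></html>
"""

def extract_comments(page):
    """Return one dict per comment-item div, using XPaths relative to that div."""
    comments = []
    for item in page.xpath("//div[@class='comment-item']"):
        comments.append({
            "photo": item.xpath("string(.//img/@src)"),
            "name": item.xpath("string(.//h3//a)").strip(),
            "time": item.xpath("string(.//span[@class='comment-time'])").strip(),
            "text": item.xpath("string(.//p)").strip(),
        })
    return comments

def next_page_url(page, current_url):
    """Resolve the href of the "后页" link against current_url, or return None."""
    hrefs = page.xpath("//*[@id='paginator']/a[contains(text(), '后页')]/@href")
    return urljoin(current_url, hrefs[0]) if hrefs else None

page = etree.HTML(SAMPLE_HTML)
comments = extract_comments(page)
next_url = next_page_url(
    page, "https://movie.douban.com/subject/26985127/comments?start=20&limit=20")
print(comments)
print(next_url)
```

Reading every field relative to its own `comment-item` div keeps the fields of one comment together even when another comment lacks a field. `urljoin` replaces only the query string, so the path of the comments page is preserved.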

IV. Implementation

import requests
import random
from lxml import etree

UA_LIST = [
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
  "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
  "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
  "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
  "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
  "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

headers = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
  # "br" is left out: requests can only decode Brotli when an optional Brotli package is installed
  'Accept-Encoding': 'gzip, deflate',
  'Accept-Language': 'zh-CN,zh;q=0.9',
  'Connection': 'keep-alive',
  'Host': 'movie.douban.com',
  'User-Agent': random.choice(UA_LIST)
}

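def downloadHtml(url):
    """Fetch a page and return its HTML text, or "" if the request fails."""
    try:
        r = requests.get(url, headers=headers, timeout=10)
        r.raise_for_status()
        # Use the encoding detected from the body instead of the header default
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""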

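def parse(url):
    """Print the avatar, nickname, date and text of each comment on one page."""
    response = downloadHtml(url)
    if not response:
        # downloadHtml() returns "" when the request fails, so there is nothing to parse
        print("download failed")
        return
    html = etree.HTML(response)
    try:
        # avatar image URLs
        photos = html.xpath("//*[@id='comments']/div/div[1]/a/img/@src")
        # nicknames
        names = html.xpath("//*[@id='comments']/div/div[2]/h3/span[2]/a/text()")
        # comment dates
        times = html.xpath("//*[@id='comments']/div/div[2]/h3/span[2]/span[3]/text()")
        # comment text
        introduce = html.xpath("//*[@id='comments']/div/div[2]/p/span/text()")
        # zip() stops at the shortest list, so a comment missing one field shifts the pairing
        for photos_i, names_i, time_i, introduce_i in zip(photos, names, times, introduce):
            content = {
                'photos': photos_i,
                'names': names_i,
                'time': time_i.replace('\n', '').replace(' ', ''),
                'introduce': introduce_i
            }
            print(content)
    except Exception as e:
        print("parse error:", e)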

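def URL(url):
    """Return the href of the "后页" (next page) link, or None if there is none."""
    try:
        response = downloadHtml(url)
        html = etree.HTML(response)
        # Select the link by its label: on the last page the final <a> in the
        # paginator may no longer be "后页", so taking [-1] could step backwards
        new_url = html.xpath("//*[@id='paginator']/a[contains(text(), '后页')]/@href")[0]
        return new_url
    except Exception:
        print("finished")
        return None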

if __name__ == '__main__':
    base_url = "https://movie.douban.com/subject/26985127/comments"
    start_url = base_url + "?start=0&limit=20"
    # Follow the "后页" (next page) link until there is none left
    while True:
        try:
            parse(start_url)
            next_href = URL(start_url)
        except Exception:
            break
        if not next_href:
            break
        start_url = base_url + next_href
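
The loop above downloads every page twice, once in parse() and once in URL(). One possible variant, sketched below, fetches each page once and reads both the comments and the next link from the same parsed tree. The names `crawl`, `fetch`, `max_pages` and `delay` are hypothetical, the markup in `FAKE_SITE` is assumed, and `fake_fetch` stands in for downloadHtml() so that the sketch runs without network access.

```python
import time
from urllib.parse import urljoin
from lxml import etree

def crawl(start_url, fetch, max_pages=50, delay=0.0):
    """Yield (url, parsed_tree) for each page, following the "后页" link.

    fetch is any callable that maps a URL to HTML text ("" on failure),
    for example the downloadHtml() function from this article.
    """
    url = start_url
    for _ in range(max_pages):      # hard cap so a paging bug cannot loop forever
        text = fetch(url)
        if not text:
            break                   # download failed or the page was empty
        page = etree.HTML(text)
        yield url, page
        hrefs = page.xpath("//*[@id='paginator']/a[contains(text(), '后页')]/@href")
        if not hrefs:
            break                   # no next-page link: last page reached
        url = urljoin(url, hrefs[0])
        time.sleep(delay)           # pause between requests

# Stand-in for the network: two tiny hand-written pages (assumed markup).
FAKE_SITE = {
    "https://example.com/comments?start=0":
        '<div id="paginator"><a href="?start=20">后页 &gt;</a></div>',
    "https://example.com/comments?start=20":
        '<div id="paginator"><a href="?start=0">前页</a></div>',
}

def fake_fetch(url):
    """Return the canned page for url, or "" to imitate a failed download."""
    return FAKE_SITE.get(url, "")

visited = [url for url, _page in crawl("https://example.com/comments?start=0", fake_fetch)]
print(visited)
```

Against the real site, `fetch` would be downloadHtml and a `delay` of a second or more would keep the request rate modest. The `max_pages` cap guards against a paginator that never runs out of links.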

V. Results

[Figure: screenshot of the program output]


Article link: https://www.haomeiwen.com/subject/dnpfiftx.html