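I. Abstract
The box office for The Island (一出好戏) has been climbing steadily. I like Huang Bo, a genuinely skilled actor, so in my spare time I wanted to see how satisfied viewers are with the film. It is the first film Huang Bo has directed, and I expect most viewers came away with a good impression. This post scrapes each comment's nickname, date, avatar, and text.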
II. Environment
1. PyCharm
2. Python 3.6
3. requests
4. lxml
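III. Approach
(1) The main page is https://movie.douban.com/subject/26985127/comments?status=P. Each comment is stored under its own <div class="comment-item">, and every page holds 20 comments.
(2) On each page, the "href" of the "next page" (后页) button gives the query string of the following page, so we can extract it to build the next URL quickly and page through the comments.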
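The pagination step can be sketched with the standard library alone. This is a minimal sketch, not the scraper's actual code: `next_page_url` and `page_url` are hypothetical helper names, and the sample href `?start=20&limit=20` is an assumed shape for the link, inferred from the start URL used in the implementation.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://movie.douban.com/subject/26985127/comments"
PAGE_SIZE = 20  # comments per page, as noted in step (1)


def next_page_url(current_url, href):
    """Resolve the href of the "next page" link against the current URL.

    A query-only href such as "?start=20&limit=20" keeps the path of
    current_url and replaces only its query string.
    """
    return urljoin(current_url, href)


def page_url(page_index):
    """Build the URL of a page directly from its zero-based index."""
    query = urlencode({"start": page_index * PAGE_SIZE, "limit": PAGE_SIZE})
    return f"{BASE_URL}?{query}"
```

Resolving the href with `urljoin` avoids hard-coding the path a second time. `page_url` is an alternative when the offsets are known in advance, but it cannot tell where the last page is.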
IV. Implementation
import requests
import random
from lxml import etree
UA_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
# Request headers that imitate a browser visit
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    # 'br' is left out: requests decodes Brotli only when an extra package is installed
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Host': 'movie.douban.com',
    'User-Agent': random.choice(UA_LIST)
}
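def downloadHtml(url):
    """Fetch a page and return its HTML text, or an empty string on failure."""
    try:
        # Pick a fresh User-Agent for every request rather than once at import time
        request_headers = {**headers, 'User-Agent': random.choice(UA_LIST)}
        r = requests.get(url, headers=request_headers, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""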
def parse(url):
    """Print the avatar, nickname, date, and text of every comment on one page."""
    response = downloadHtml(url)
    if not response:
        # Nothing was downloaded, so there is nothing to parse
        print("Error: failed to download " + url)
        return
    try:
        html = etree.HTML(response)
        # Avatar image
        photos = html.xpath("//*[@id='comments']/div/div[1]/a/img/@src")
        # Nickname
        names = html.xpath("//*[@id='comments']/div/div[2]/h3/span[2]/a/text()")
        # Date
        times = html.xpath("//*[@id='comments']/div/div[2]/h3/span[2]/span[3]/text()")
        # Comment text
        introduce = html.xpath("//*[@id='comments']/div/div[2]/p/span/text()")
        for photos_i, names_i, time_i, introduce_i in zip(photos, names, times, introduce):
            content = {
                'photos': photos_i,
                'names': names_i,
                'time': time_i.replace('\n', '').replace(' ', ''),
                'introduce': introduce_i
            }
            print(content)
    except Exception as e:
        print("Error while parsing:", e)
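def URL(url):
    """Return the href of the "next page" (后页) link, or None if there is none."""
    try:
        response = downloadHtml(url)
        html = etree.HTML(response)
        # Match the link by its label so that a "previous page" link is never picked up
        new_url = html.xpath("//*[@id='paginator']/a[contains(., '后页')]/@href")[0]
        return new_url
    except Exception:
        print("Finished")
        return None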
if __name__ == '__main__':
    start_url = "https://movie.douban.com/subject/26985127/comments?start=0&limit=20"
    # Follow the "next page" (后页) link until there is none left
    while True:
        try:
            parse(start_url)
            next_href = URL(start_url)
        except Exception:
            break
        if not next_href:
            break
        start_url = "https://movie.douban.com/subject/26985127/comments" + next_href
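Because each comment sits in its own <div class="comment-item">, the fields can also be read one comment at a time with relative XPath expressions. The following is a minimal sketch of that idea, not the code used above. The sample markup is invented for illustration, and its class names (`comment-time` in particular) and the helper names `first` and `extract_comments` are assumptions, so the selectors would need checking against the live page.

```python
from lxml import etree

# Hypothetical, simplified markup; the real page is more deeply nested.
# The second comment deliberately has no avatar.
SAMPLE = """
<div id="comments">
  <div class="comment-item">
    <div><a href="#"><img src="https://example.com/a.jpg"></a></div>
    <div>
      <h3><span><a href="#">alice</a><span class="comment-time"> 2018-08-10 </span></span></h3>
      <p><span>Great film.</span></p>
    </div>
  </div>
  <div class="comment-item">
    <div>
      <h3><span><a href="#">bob</a><span class="comment-time"> 2018-08-11 </span></span></h3>
      <p><span>Not bad.</span></p>
    </div>
  </div>
</div>
"""


def first(node, path):
    """Return the first XPath result under node, stripped, or None if nothing matches."""
    hits = node.xpath(path)
    return hits[0].strip() if hits else None


def extract_comments(page_source):
    """Return one dict per comment, reading every field relative to its own item."""
    root = etree.HTML(page_source)
    records = []
    for item in root.xpath("//div[contains(@class, 'comment-item')]"):
        records.append({
            'photos': first(item, ".//img/@src"),
            'names': first(item, ".//h3//a/text()"),
            'time': first(item, ".//span[@class='comment-time']/text()"),
            'introduce': first(item, ".//p/span/text()"),
        })
    return records
```

Calling `extract_comments(SAMPLE)` yields two records, and the one without an avatar simply carries `None` in `'photos'`. With four parallel lists combined by `zip`, a missing field shortens one list, so later comments are either dropped or paired with another comment's data.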