A Small Douban Movie Crawler

Author: 西瓜三茶 | Published 2017-05-02 22:59

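This post uses a fairly simple approach to crawl Douban movie comments, along with some content from the movie detail pages.

Crawling approach:
(1) Start from a movie list page and collect the link of every movie on the current page;
(2) Look at how the links are composed and derive the first comment page link from each movie link;
(3) On the short-comment pages, collect each commenter's user ID, rating, comment text and so on, paging as you go (without logging in, you can currently only reach page 10).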
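Step (2) boils down to string handling on the movie link, so it can be sketched without any network access. The helper names `comments_start_url` and `comments_next_url` below are hypothetical, and the sample `next_href` value is only an assumption about what the paginator's relative link looks like; the `comments?&status=P` suffix is the one used in the script later in this post.

```python
from urllib.parse import urljoin


def _with_slash(movie_url):
    """Make sure the movie link ends with '/', so the path can be extended."""
    return movie_url if movie_url.endswith('/') else movie_url + '/'


def comments_start_url(movie_url):
    """Return the first short-comment page for a movie detail link."""
    return _with_slash(movie_url) + 'comments?&status=P'


def comments_next_url(movie_url, next_href):
    """Combine the paginator's relative href with the movie's comments path.

    Returns None when there is no next page (next_href is empty or None).
    """
    if not next_href:
        return None
    # A query-only relative reference keeps the base path and swaps the query.
    return urljoin(_with_slash(movie_url) + 'comments', next_href)


movie = 'https://movie.douban.com/subject/26345137/'
print(comments_start_url(movie))
# Assumed shape of a paginator href; the real value comes from the page itself.
print(comments_next_url(movie, '?start=20&limit=20&status=P'))
```

Returning `None` when there is no next link gives the paging loop a clean stopping condition on the last page.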

Result: [screenshot not preserved]

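Pros: simple to run, with little setup required.
Cons: (1) step (1) does not yet page through the movie list automatically; (2) because of the login restriction, a 403 error usually appears after the crawler has run for a while.
Possible improvements: adding more exception handling, passing in cookies, or simulating a login should make unattended runs more reliable. The Scrapy framework and MongoDB are also worth looking into.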
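As one possible take on the "more exception handling" idea, here is a minimal sketch of a retry wrapper that backs off when a request fails or returns a non-200 status such as 403. The function `fetch_with_retry` and its parameters are hypothetical, not part of the script in this post. The fetcher is passed in as a callable so the sketch stays independent of any particular HTTP library.

```python
import random
import time


def fetch_with_retry(url, fetch, max_retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url) until it returns HTTP 200 or the retries run out.

    `fetch` is any callable that returns an object with `status_code` and
    `text` attributes. Returns the page text, or None if every attempt failed.
    """
    for attempt in range(max_retries):
        try:
            response = fetch(url)
        except Exception as exc:  # deliberately broad: the fetcher is injected
            print(url, "request failed:", exc)
        else:
            if response.status_code == 200:
                return response.text
            print(url, "returned status", response.status_code)
        # Exponential backoff with a little jitter before the next attempt.
        if attempt < max_retries - 1:
            sleep(base_delay * (2 ** attempt) + random.random())
    return None
```

With requests, `fetch` could be something like `lambda u: session.get(u, headers=headers, timeout=10)` where `session = requests.Session()`. A session also keeps cookies across requests, which is one way to try the cookie idea. Whether this is enough to avoid the 403 responses is untested here.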

Environment: Mac, Python 3.5

import requests
import random
import time
import csv
import re
from bs4 import BeautifulSoup
# cookielib is not used yet; it is kept for the planned cookie support
try:
    import cookielib  # Python 2
except ImportError:
    import http.cookiejar as cookielib  # Python 3
# Request headers
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'Host': 'movie.douban.com',
    'Referer' : 'https://movie.douban.com/subject/26345137/collections',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36'
}

timeout = random.choice(range(60,180))

# Douban rating labels mapped to star scores (from "highly recommended" = 5 down to "very poor" = 1)
gradeDic = {
    '力荐':5,
    '推荐':4,
    '还行':3,
    '较差':2,
    '很差':1
}

# Start page: the crawl begins from a movie list page
movielist = 'https://movie.douban.com/tag/2016'

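# Step (1): fetch a page and return its HTML text
def get_html(url):
    while True:
        try:
            rep = requests.get(url, headers=headers, timeout=timeout)
            print(rep)
            break
        except requests.RequestException:
            print(url, "request failed")
            time.sleep(random.choice(range(1, 5)))  # pause before retrying instead of hammering the server
    return rep.text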

# On the list page, collect the link of every movie into the list `temp`
def get_movie(html):
    # Paging through the list is not implemented yet, so the next-page link is not read here
    article = BeautifulSoup(html, "html.parser").body.find(class_='article')  # movie list area
    movie_list = article.find_all(class_='item')
    temp = []
    for movie in movie_list:
        movie_href = movie.find(class_="pl2").find("a").get("href")
        temp.append(movie_href)
    return temp


# Parse one page of short comments
def get_data(html):
    final = []
    body = BeautifulSoup(html, "html.parser").body
    comment_area = body.find(class_='mod-bd')  # comment area
    movie_href = body.find(class_='aside').find(class_='pl2').find('a').get("href")  # movie link from the sidebar

    for comment in comment_area.find_all(class_='comment-item'):
        grade = comment.find(class_=re.compile("allstar"))
        # Some comments carry no rating; skip them
        if grade is None:
            continue

        rating = grade.get('title')  # rating label
        username = comment.find(class_="avatar").find('a').get('title')  # user name
        datacid = comment.get('data-cid')  # comment id
        num_rating = gradeDic.get(rating)  # numeric score for the label (None if the label is unknown)
        usefulness = comment.find(class_='votes').get_text()  # number of "useful" votes
        words = comment.find(class_='comment').find('p').get_text().strip()  # comment text

        # Skip the row if any required field is missing
        if rating is None or num_rating is None or username is None or datacid is None or words is None:
            continue

        final.append([username, datacid, rating, num_rating, usefulness, words, movie_href])
    return final

# Page through the short comments of each movie
def turn_page(temp):
    # The list page from step (1) holds 20 movies; to stay within the crawl limits,
    # you can write `for url in temp[0:10]` and crawl only part of them first
    for url in temp:
        count = 0
        currentUrl = url + 'comments?&status=P'  # first comment page, derived from the movie link
        # Fetch at most 9 pages: without logging in, Douban asks for a login past page 10
        while currentUrl is not None and count < 9:
            print(currentUrl)
            html = get_html(currentUrl)  # fetch the page
            next_link = BeautifulSoup(html, "html.parser").body.find(id='paginator').find(class_='next')
            nextUrl = next_link.get('href') if next_link is not None else None  # relative URL of the next page

            data = get_data(html)  # extract the fields to save
            write_data(data, "1.csv")  # append to the CSV file
            # Stop paging when there is no next link (last page)
            currentUrl = url + 'comments' + nextUrl if nextUrl else None
            count += 1
            print(count)
            time.sleep(random.choice(range(1, 5)))

# Append rows to a CSV file
def write_data(data, name):
    file_name = name
    # On Windows, encoding='utf-8-sig' apparently does not need to be specified
    with open(file_name, 'a', errors='ignore', newline='', encoding='utf-8-sig') as f:
        f_csv = csv.writer(f)
        f_csv.writerows(data)

movie_html = get_html(movielist)  # fetch the movie list page
movie_temp = get_movie(movie_html)
turn_page(movie_temp)
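
The CSV produced by the script has no header row, which makes it harder to read back later. Below is a minimal sketch of a writer that adds the header only when the file is new or empty. The function `append_rows` and the `HEADER` constant are hypothetical additions; the column names simply follow the order in which `get_data` builds each row.

```python
import csv
import os

# Column order follows the rows built in get_data
HEADER = ['username', 'datacid', 'rating', 'num_rating',
          'usefulness', 'words', 'movie_href']


def append_rows(path, rows, header=HEADER):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        if need_header:
            writer.writerow(header)
        writer.writerows(rows)
```

Called as `append_rows("1.csv", data)`, it could stand in for `write_data(data, "1.csv")` inside `turn_page`.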