Bringing the Movie Library Home: Notes from a Web-Scraping Trip

Author: 随想的翅膀 | Published: 2019-06-22 10:43

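A web crawler, also called a web scraper, is a program that downloads and processes content from the Web. The search engines we use every day, such as Baidu, Sogou, and Google, rely heavily on this technique: each runs many crawling programs that index web pages.

Below I use Python, step by step, to scrape movie information from Douban and build a movie database from it.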

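I. Environment Setup
First, prepare the Python modules (and the database) that this scraping project uses:
1. requests: downloads files and web pages from the Internet;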

>>> import requests
>>> res=requests.get('http://www.baidu.com')
>>> print(res.text[:100])
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charse

2. Beautiful Soup: parses HTML, the format web pages are written in;

>>> import bs4
>>> res.encoding='utf-8'
>>> soup=bs4.BeautifulSoup(res.text,'html.parser')
>>> print(soup.select('#su'))
[<input class="bg s_btn" id="su" type="submit" value="百度一下"/>]

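3. selenium: launches and controls a web browser. It can fill in forms and simulate mouse clicks in that browser, so, as you may have guessed, it is mainly used for automated testing of web pages;

>>> from selenium import webdriver
>>> browser=webdriver.Firefox()

[Figure: the browser controlled by selenium]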

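4. MongoDB: every application needs persistent storage. There are generally three mechanisms for it: files, database systems, and hybrids of the two, with databases being the most popular choice. MongoDB is a non-relational (NoSQL) database: where a relational database has tables, rows, and columns, MongoDB has collections, documents, and keys. MongoDB stores data as JSON-like documents, and JSON closely resembles a Python dict, so working with MongoDB from Python feels natural. In addition, each record stored in MongoDB is essentially one object, so what takes several two-dimensional tables in a relational database can often be handled by a single collection.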
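To make the table-versus-document comparison concrete, here is a minimal sketch that uses only Python's standard library. The field names and values are illustrative assumptions, not the schema used later in this article; the point is only that one nested dict can hold what a relational design would spread over several tables, and that it maps directly onto JSON.

```python
import json

# Hypothetical movie record (illustrative fields, not the article's schema).
# In a relational design the cast and genres would usually live in their
# own tables; here they are simply nested inside one document.
movie = {
    "title": "Example Movie",
    "rate": 7.3,
    "genres": ["comedy", "drama"],
    "cast": [
        {"name": "Actor A", "role": "lead"},
        {"name": "Actor B", "role": "support"},
    ],
}

# A dict of plain values serializes to JSON text and back without loss.
as_json = json.dumps(movie, ensure_ascii=False)
restored = json.loads(as_json)

print(restored["cast"][0]["name"])
```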
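Installing the tools
1. Install Python 3. I will skip the details here; the official website has everything you need.
2. Once Python is installed, open a command-line window and run the following commands: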

pip install requests
pip install beautifulsoup4
pip install lxml
pip install pymongo
pip install selenium

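II. Page Analysis
Open Douban in Chrome, go to the "选电影" (Explore Movies) page, and press F12 to open the developer tools and inspect the site structure. The page source shows only a pile of HTML code, with none of the actual movie content visible.

[Figure: what the site looks like]
Scroll to the bottom of the page and you will find a "加载更多" (Load more) link. Clicking it displays more movie posters. After clicking several times, I found that each click adds 20 posters without refreshing the page, which confirms that the movie data is loaded via AJAX.

[Figure: the "Load more" link]
Switch the developer tools to the Network tab and click XHR to view the AJAX requests. Each click on "Load more" produces new requests. Looking at them closely, only page_start changes between requests, and it increases by 20 each time. Comparing Query String Parameters with Request URL shows that everything after https://movie.douban.com/j/search_subjects? is simply the parameters concatenated in order: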

https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=0
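The way the query string is assembled can be reproduced offline with the standard library. This is a small sketch, not part of the original program: the function name `build_search_url` is made up for illustration, and it only rebuilds the URL shown above without sending any request.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/j/search_subjects"

def build_search_url(tag, page_start, page_limit=20):
    """Rebuild the XHR URL seen in the Network tab for a tag and an offset."""
    params = {
        "type": "movie",
        "tag": tag,              # urlencode percent-encodes this as UTF-8
        "sort": "recommend",
        "page_limit": page_limit,
        "page_start": page_start,
    }
    return BASE_URL + "?" + urlencode(params)

# Offsets of the first three pages: 0, 20, 40
for offset in range(0, 60, 20):
    print(build_search_url("热门", offset))
```

The tag 热门 becomes %E7%83%AD%E9%97%A8 in the URL, which is why the address copied from the browser looks unreadable even though the parameter is plain Chinese text.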

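[Figure: the AJAX request]
Now look at the data preview for this request: title is the movie name, rate is the rating, cover is the poster, and url is the address of the movie's detail page.

[Figure: data preview of the AJAX request]
And which parameters does the request carry?

[Figure: parameters of the AJAX request]
Open a movie's detail page: all the movie information we want to scrape is here.

[Figure: movie detail page]
III. Design
1. Build the URL from https://movie.douban.com/j/search_subjects? plus its parameters, iterate over the AJAX requests, and collect each movie's url;
2. With a movie's detail url in hand, fetch the HTML with requests and parse the page elements with BeautifulSoup to extract the movie's details. Douban has anti-scraping measures, and fetching pages this way gets your IP banned quickly, so multithreading is out of the question. After countless tests I switched to selenium, which drives a real browser the way a person would, and the bans became less frequent. My guess is that Douban throttles by request rate per unit of time; against that, the only defense is to go slowly. That approach will not work for scraping massive amounts of data, which would require looking into distributed crawling.
3. Finally, once the movie details are in hand, store them in MongoDB.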
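The "go slowly" idea from step 2 can be isolated into a small helper. The sketch below is an assumption about how one might package it, not code from the original program: the class name `Throttle` and its parameters are hypothetical, and the sleep function is injectable only so the helper can be tried out without actually waiting.

```python
import random
import time

class Throttle:
    """Sleep for a random interval before each request (hypothetical helper)."""

    def __init__(self, min_delay, max_delay, sleep=time.sleep):
        if min_delay < 0 or max_delay < min_delay:
            raise ValueError("need 0 <= min_delay <= max_delay")
        self.min_delay = min_delay
        self.max_delay = max_delay
        self._sleep = sleep  # injectable so the helper can be tested without waiting

    def wait(self):
        """Pause for a random delay in [min_delay, max_delay] and return it."""
        delay = random.uniform(self.min_delay, self.max_delay)
        self._sleep(delay)
        return delay

# Demo with a tiny window so it returns quickly; a crawl would use
# something like Throttle(8, 15) and call wait() before every page fetch.
throttle = Throttle(0.0, 0.01)
waited = throttle.wait()
print(f"waited {waited:.4f}s")
```

A random interval is used instead of a fixed one so that requests do not arrive on a perfectly regular beat, which is easier for a rate limiter to spot.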
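IV. Implementation
1. Fetch the JSON data from the AJAX request by building the request. Only page_start and tag vary between requests, so I put the values of these two parameters in lists and loop over them to fetch the JSON: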

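page_start, page_end = 0, 25  # first page index and number of pages
page_start_list=[page_start*20 for page_start in range(page_start,page_end)]  # offsets in steps of 20: 0, 20, 40, ...
tags=['热门','最新','经典','豆瓣高分','冷门佳片','华语','欧美', '韩国','日本','动作','喜剧','爱情','科幻','悬疑','恐怖','成长']  # list of category tags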
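For readers who find the list comprehension above hard to read, the same offsets can be produced with a stepped `range`, and `itertools.product` lists every (tag, offset) pair the crawler will visit. This is an equivalent sketch with a shortened tag list; the constant names `PAGE_SIZE` and `PAGE_COUNT` are mine, not the article's.

```python
from itertools import product

PAGE_SIZE = 20   # movies returned per request, as observed in the XHR calls
PAGE_COUNT = 25  # number of pages requested per tag

# range() with a step yields the offsets 0, 20, 40, ... directly
offsets = list(range(0, PAGE_COUNT * PAGE_SIZE, PAGE_SIZE))

tags = ['热门', '最新']  # shortened list for the example

# Every (tag, offset) pair, in the order the nested loops would visit them
jobs = list(product(tags, offsets))

print(offsets[:3], offsets[-1])
print(len(jobs), jobs[0], jobs[-1])
```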

Implementation:

def get_page(page_start,tag):
    params={
        'type': 'movie',
        'tag': tag,
        'sort': 'recommend',
        'page_limit': '20',
        'page_start': page_start
    }
    url="https://movie.douban.com/j/search_subjects"
    try:
        res=requests.get(url,params=params)  # requests appends params to the url as the query string
        res.raise_for_status()
        if res.status_code==200:
            return res.json()
        else:
            print("Page access denied")
            return None
    except Exception as exc:
        print("Failed to open page: %s"%exc)
        return None
def main():
    page_start = 0
    page_end = 25
    page_start_list=[page_start*20 for page_start in range(page_start,page_end)]
    tags=['热门','最新','经典','豆瓣高分','冷门佳片','华语','欧美', '韩国','日本','动作','喜剧','爱情','科幻','悬疑','恐怖','成长']
    for tag in tags:
        for page_start in page_start_list:
            json_data = get_page(page_start,tag)
            print(json_data)

The JSON that comes back looks like this:

{'subjects': [{'rate': '7.3', 'cover_x': 2150, 'title': '一首小夜曲', 'url': 'https://movie.douban.com/subject/30165542/', 'playable': False, 'cover': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2556866372.jpg', 'id': '30165542', 'cover_y': 3041, 'is_new': True}
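Pulling the useful fields out of this structure needs no network access, so it is easy to try in isolation. The sketch below copies one entry from the sample response above into a dict; the helper name `iter_movies` is hypothetical and is written to tolerate a failed request that returned None.

```python
# One entry copied from the sample response; a real response holds up to 20.
sample = {
    'subjects': [
        {'rate': '7.3', 'title': '一首小夜曲',
         'url': 'https://movie.douban.com/subject/30165542/',
         'cover': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2556866372.jpg',
         'id': '30165542'},
    ]
}

def iter_movies(payload):
    """Yield (title, rate, url) per movie; an empty or missing payload yields nothing."""
    subjects = (payload or {}).get('subjects') or []
    for subject in subjects:
        yield subject.get('title'), subject.get('rate'), subject.get('url')

for title, rate, url in iter_movies(sample):
    print(title, rate, url)
```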

2. Fetch the information from the static pages:
The first approach uses requests to fetch the page:

def get_child_page(item,tag):
    try:
        res = requests.get(item.get('url').split('\n')[0])
        res.raise_for_status()
        if res.status_code==200:
            soup=bs4.BeautifulSoup(res.text,features='lxml')
            movie_info_list=soup.select('#info')[0].get_text().strip().split('\n')
            movie_info_dict={'电影名称':item.get('title')}
            movie_info_dict['豆瓣评分']=item.get('rate')
            movie_info_dict['分类']=tag
            movie_info_dict['电影链接']=item.get('url')
            for info_item in movie_info_list[0:10]:
                try:
                    info_item_list=info_item.split(':',1)  # split on the first colon only, so values that contain ':' (such as URLs) stay intact
                    movie_info_dict[info_item_list[0]]=info_item_list[1]
                except Exception:
                    continue
            movie_content_str=soup.select("div.indent>span")
            movie_info_dict['内容简介']=''.join(movie_content_str[0].get_text().split())
            print(movie_info_dict)
            return movie_info_dict
        else:
            return None
    except Exception as exc:
        print("Request page failed!error code:%s"%exc)
        return None

The second approach uses selenium to drive a browser and request the page the way a person would:

# Requires: from selenium.webdriver.common.by import By
# (the find_element_by_* helpers were removed in recent selenium releases)
def get_page_by_seleninum(item,tag):
    try:
        time.sleep(random.randint(8,15))  # pause 8 to 15 seconds between pages to keep the request rate low
        browser.get(item.get('url'))
        try:
            browser.find_element(By.CLASS_NAME,'more-actor').click()  # expand the full cast list if it is collapsed
        except Exception:
            pass
        movie_info_list=browser.find_element(By.ID,'info').text.strip().split('\n')
        movie_info_dict = {'电影名称': item.get('title')}
        movie_info_dict['豆瓣评分'] = item.get('rate')
        movie_info_dict['分类'] = tag
        movie_info_dict['电影链接'] = item.get('url')
        for info_item in movie_info_list[:10]:
            info_item_list=info_item.split(':',1)  # "key: value" lines; split on the first colon only
            try:
                movie_info_dict[info_item_list[0]]=info_item_list[1]
            except Exception:
                continue
        movie_introduce=browser.find_element(By.CSS_SELECTOR,'.indent>span')
        movie_info_dict['内容简介']=movie_introduce.text
        rating_people=browser.find_element(By.CLASS_NAME,'rating_people').text
        movie_info_dict['评价人数']=re.search(r'\d+',rating_people).group()
        rating_on_weight=browser.find_element(By.CLASS_NAME,'ratings-on-weight').text
        rating_on_weight_list=rating_on_weight.split('\n')
        for x in range(0,10,2):  # pairs of lines: star label, then percentage
            movie_info_dict[rating_on_weight_list[x]] = rating_on_weight_list[x + 1]
        return movie_info_dict
    except Exception as exc:
        print("Request page failed!error code:%s" % exc)
        return None

The data obtained looks like this:

{'电影名称': '西虹市首富', '豆瓣评分': '6.5', '分类': '喜剧', '电影链接': 'https://movie.douban.com/subject/27605698/', '导演': ' 闫非 / 彭大魔', '编剧': ' 闫非 / 彭大魔 / 林炳宝', '主演': ' 沈腾 / 宋芸桦 / 张一鸣 / 张晨光 / 常远 / 魏翔 / 赵自强 / 九孔 / 李立群 / 王成思 / 徐冬冬 / 艾伦 / 杨皓宇 / 黄才伦 / 王力宏 / 包贝尔 / 郎咸平 / 张绍刚 / 杨文哲 / 陶亮 / 王赞 / 黄杨 / 刘鉴 / 杨沅翰 / 林炳宝 / 骆佳 / 陈昊明 / 臧一人', '类型': ' 喜剧', '制片国家/地区': ' 中国大陆', '语言': ' 汉语普通话', '上映日期': ' 2018-07-27(中国大陆)', '片长': ' 118分钟', '又名': ' Hello Mr. Billionaire', 'IMDb链接': ' tt8529186', '内容简介': '西虹市丙级球队大翔队的守门员王多鱼（沈腾 饰）因比赛失利被教练开除，一筹莫展之际王多鱼突然收到神秘人士金老板（张晨光 饰）的邀请，被告知自己竟然是保险大亨王老太爷（李立群 饰）的唯一继承人，遗产高达百亿！但是王老太爷给出了一个非常奇葩的条件，那就是要求王多鱼在一个月内花光十亿，还不能告诉身边人，否则失去继承权。王多鱼毫不犹豫签下了“军令状”，与好友庄强（张一鸣 饰）以及财务夏竹（宋芸桦 饰）一起开启了“挥金之旅”，即将成为西虹市首富的王多鱼，第一次感受到了做富人的快乐，同时也发现想要挥金如土实在没有那么简单！', '评价人数': '660407', '5星': '9.6%', '4星': '29.1%', '3星': '44.4%', '2星': '13.1%', '1星': '3.8%'}
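Every value in the record is a string, and many carry a leading space left over from splitting on the colon. Before any analysis it helps to normalize a few fields. The following is one possible cleanup, offered as a sketch: the function name `normalize` and the choice of which fields to convert are my assumptions, and the input is a shortened copy of the record above.

```python
import re

def normalize(record):
    """Return a copy with trimmed strings and a few typed fields (illustrative)."""
    clean = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
    # The rating arrives as text such as '6.5'; use None if it is missing or empty.
    clean['豆瓣评分'] = float(clean['豆瓣评分']) if clean.get('豆瓣评分') else None
    # The runtime looks like '118分钟'; keep only the leading digits.
    match = re.match(r'\d+', clean.get('片长', ''))
    clean['片长'] = int(match.group()) if match else None
    # Multi-valued fields are separated by '/'.
    for key in ('导演', '主演', '类型', '制片国家/地区'):
        if key in clean:
            clean[key] = [part.strip() for part in clean[key].split('/')]
    return clean

raw = {'电影名称': '西虹市首富', '豆瓣评分': '6.5', '导演': ' 闫非 / 彭大魔',
       '类型': ' 喜剧', '制片国家/地区': ' 中国大陆', '片长': ' 118分钟'}
print(normalize(raw))
```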

3. Download the movie posters. A poster is a binary JPG file, so each one is saved as a separate file:

def save_img(item):
    time.sleep(random.randint(1, 4))
    os.makedirs("douban",exist_ok=True)  # create the output folder on first use
    file_path='{0}/{1}.{2}'.format("douban",item.get('title'),'jpg')
    if not os.path.isfile(file_path):  # skip posters that were already saved
        try:
            img_url=item.get("img_url")
            res=requests.get(img_url)
            res.raise_for_status()
            if res.status_code==200:
                with open(file_path,'wb') as f:
                    f.write(res.content)
        except Exception as exc:
            print('Failed to save image: %s' % exc)
    else:
        print("Already downloaded",file_path)

[Figure: movie posters]
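One detail worth guarding against when a movie title becomes a file name: a title that contains a slash or another reserved character would produce an invalid or unintended path. The helper below is a hypothetical addition, not part of the original program, and the title 'Face/Off' is just an example input.

```python
import re
from pathlib import Path

def poster_path(title, folder="douban"):
    """Build a poster path, replacing characters that are unsafe in file names."""
    safe = re.sub(r'[\\/:*?"<>|]', '_', title).strip() or 'untitled'
    return Path(folder) / f"{safe}.jpg"

print(poster_path('西虹市首富'))
print(poster_path('Face/Off'))
```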

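4. Write the movie information to the database. Because the IP was banned so often during scraping, restarting from the beginning would produce many duplicate records, so I added a check to deduplicate the data: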

class MongoDB(object):
    def __init__(self):
        self.cxn=pymongo.MongoClient("mongodb://localhost:27017/")
    def connDB(self,dbName):
        self.db=self.cxn[dbName]
    def insert_update_one(self,collection,item):
        self.collection=self.db[collection]
        query={'电影链接':item['电影链接']}
        document=self.collection.find_one(query)
        if not document:
            self.collection.insert_one(item)
        else:
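            if item['分类'] not in document['分类'].split('/'):  # exact tag match instead of a regex search
                newValue={'$set':{'分类':document['分类']+'/'+item['分类']}}
                print('Movie already exists, adding category: %s'%newValue)
                self.collection.update_one(query,newValue)
            else:
                print('Movie already up to date. Stored categories: %s, incoming category: %s'%(document['分类'],item['分类']))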
    def insert_one(self, collection, item):
        self.collection = self.db[collection]
        self.collection.insert_one(item)
    def update_one(self,collection,query,newValue):
        self.collection=self.db[collection]
        self.collection.update_one(query,newValue)
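The category handling inside insert_update_one can be tried without a database. The sketch below extracts just that rule into a pure function; the name `merge_category` is hypothetical. It checks for an exact tag among the '/'-separated parts, which avoids surprises from substring or regular-expression matching.

```python
def merge_category(existing, new_tag, sep='/'):
    """Append new_tag to an 'a/b/c' category string unless it is already listed."""
    tags = [t for t in existing.split(sep) if t]
    if new_tag not in tags:
        tags.append(new_tag)
    return sep.join(tags)

print(merge_category('喜剧', '爱情'))
print(merge_category('喜剧/爱情', '爱情'))
```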

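[Figure: the Douban movie database]

Querying animated films rated 8.0 or higher on Douban

[Figure: animated films rated 8.0 or higher on Douban]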
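The screenshot of this query is not reproduced here, so the following is only a guess at what such a query could look like. It assumes the documents have the shape of the sample record shown earlier, where '类型' is free text and '豆瓣评分' is stored as a string; the function name `high_rated_filter` is hypothetical. Because the ratings are strings, `$gte` compares them as text, which works for values such as '8.3' but would misorder a rating of '10.0'.

```python
def high_rated_filter(genre, min_rate):
    """Build a MongoDB filter: genre text contains `genre`, rating >= min_rate."""
    return {
        '类型': {'$regex': genre},
        # Ratings were stored as strings such as '8.3', so compare as text.
        '豆瓣评分': {'$gte': f'{min_rate:.1f}'},
    }

query = high_rated_filter('动画', 8.0)
print(query)

# With a running MongoDB and the collection filled by the crawler, the
# filter could be used like this (not executed here):
#   for doc in movieDB.db['movieInfoByselenium'].find(query):
#       print(doc['电影名称'], doc['豆瓣评分'])
```

Storing the rating as a number at insert time would make this comparison robust; the string form is kept here only because that is how the crawler saved it.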

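[Figure: films directed by Christopher Nolan]

[Figure: film length analysis]

[Figure: production country analysis]

[Figure: film genre analysis]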
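The charts themselves are not reproduced here, but the counting behind a country or genre breakdown is simple. The sketch below is one way to do it with the standard library; the function name `count_values` is hypothetical, and the three records are hand-made samples in the shape of the scraped data, not real results.

```python
from collections import Counter

def count_values(records, key, sep='/'):
    """Count each value of a '/'-separated field across all records."""
    counter = Counter()
    for record in records:
        for part in record.get(key, '').split(sep):
            part = part.strip()
            if part:
                counter[part] += 1
    return counter

# Tiny hand-made sample in the shape of the scraped records (not real results).
records = [
    {'制片国家/地区': ' 中国大陆', '类型': ' 喜剧'},
    {'制片国家/地区': ' 美国 / 英国', '类型': ' 科幻 / 悬疑'},
    {'制片国家/地区': ' 美国', '类型': ' 喜剧 / 爱情'},
]

print(count_values(records, '制片国家/地区').most_common())
print(count_values(records, '类型').most_common(1))
```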

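V. Summary
1. The key to scraping a website is still analyzing the page structure, locating the fields you need, and finding the patterns; the tools come second;
2. There are two ways to scrape: one works from the page source, using XPath, CSS selectors, and the like to extract data; the other works from AJAX requests. Both were used above;
3. A large-scale crawler should use a crawling framework such as scrapy; this exercise was only meant to show how a crawler works.
VI. Complete Program Code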

import os
import random
import re
import threading
import time

import bs4
import pymongo
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By

def get_page(page_start,tag):
    params={
        'type': 'movie',
        'tag': tag,
        'sort': 'recommend',
        'page_limit': '20',
        'page_start': page_start
    }
    url="https://movie.douban.com/j/search_subjects"
    try:
        res=requests.get(url,params=params)  # requests appends params to the url as the query string
        res.raise_for_status()
        if res.status_code==200:
            return res.json()
        else:
            print("Page access denied")
            return None
    except Exception as exc:
        print("Failed to open page: %s"%exc)
        return None
def get_page_by_seleninum(item,tag):
    try:
        time.sleep(random.randint(8,15))  # pause 8 to 15 seconds between pages to keep the request rate low
        browser.get(item.get('url'))
        try:
            browser.find_element(By.CLASS_NAME,'more-actor').click()  # expand the full cast list if it is collapsed
        except Exception:
            pass
        movie_info_list=browser.find_element(By.ID,'info').text.strip().split('\n')
        movie_info_dict = {'电影名称': item.get('title')}
        movie_info_dict['豆瓣评分'] = item.get('rate')
        movie_info_dict['分类'] = tag
        movie_info_dict['电影链接'] = item.get('url')
        for info_item in movie_info_list[:10]:
            info_item_list=info_item.split(':',1)  # "key: value" lines; split on the first colon only
            try:
                movie_info_dict[info_item_list[0]]=info_item_list[1]
            except Exception:
                continue
        movie_introduce=browser.find_element(By.CSS_SELECTOR,'.indent>span')
        movie_info_dict['内容简介']=movie_introduce.text
        rating_people=browser.find_element(By.CLASS_NAME,'rating_people').text
        movie_info_dict['评价人数']=re.search(r'\d+',rating_people).group()
        rating_on_weight=browser.find_element(By.CLASS_NAME,'ratings-on-weight').text
        rating_on_weight_list=rating_on_weight.split('\n')
        for x in range(0,10,2):  # pairs of lines: star label, then percentage
            movie_info_dict[rating_on_weight_list[x]] = rating_on_weight_list[x + 1]
        return movie_info_dict
    except Exception as exc:
        print("Request page failed!error code:%s" % exc)
        return None
def get_child_page(item,tag):
    try:
        res = requests.get(item.get('url').split('\n')[0])
        res.raise_for_status()
        if res.status_code==200:
            soup=bs4.BeautifulSoup(res.text,features='lxml')
            movie_info_list=soup.select('#info')[0].get_text().strip().split('\n')
            movie_info_dict={'电影名称':item.get('title')}
            movie_info_dict['豆瓣评分']=item.get('rate')
            movie_info_dict['分类']=tag
            movie_info_dict['电影链接']=item.get('url')
            for info_item in movie_info_list[0:10]:
                try:
                    info_item_list=info_item.split(':',1)  # split on the first colon only, so values that contain ':' (such as URLs) stay intact
                    movie_info_dict[info_item_list[0]]=info_item_list[1]
                except Exception:
                    continue
            movie_content_str=soup.select("div.indent>span")
            movie_info_dict['内容简介']=''.join(movie_content_str[0].get_text().split())
            print(movie_info_dict)
            return movie_info_dict
        else:
            return None
    except Exception as exc:
        print("Request page failed!error code:%s"%exc)
        return None
class MongoDB(object):
    def __init__(self):
        self.cxn=pymongo.MongoClient("mongodb://localhost:27017/")
    def connDB(self,dbName):
        self.db=self.cxn[dbName]
    def insert_update_one(self,collection,item):
        self.collection=self.db[collection]
        query={'电影链接':item['电影链接']}
        document=self.collection.find_one(query)
        if not document:
            self.collection.insert_one(item)
        else:
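            if item['分类'] not in document['分类'].split('/'):  # exact tag match instead of a regex search
                newValue={'$set':{'分类':document['分类']+'/'+item['分类']}}
                print('Movie already exists, adding category: %s'%newValue)
                self.collection.update_one(query,newValue)
            else:
                print('Movie already up to date. Stored categories: %s, incoming category: %s'%(document['分类'],item['分类']))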
    def insert_one(self, collection, item):
        self.collection = self.db[collection]
        self.collection.insert_one(item)
    def update_one(self,collection,query,newValue):
        self.collection=self.db[collection]
        self.collection.update_one(query,newValue)
def save_img(item):
    time.sleep(random.randint(1, 4))
    os.makedirs("douban",exist_ok=True)  # create the output folder on first use
    file_path='{0}/{1}.{2}'.format("douban",item.get('title'),'jpg')
    if not os.path.isfile(file_path):  # skip posters that were already saved
        try:
            img_url=item.get("img_url")
            res=requests.get(img_url)
            res.raise_for_status()
            if res.status_code==200:
                with open(file_path,'wb') as f:
                    f.write(res.content)
        except Exception as exc:
            print('Failed to save image: %s' % exc)
    else:
        print("Already downloaded",file_path)
def get_json(jsondata):
    if jsondata and jsondata.get("subjects"):  # get_page returns None when the request fails
        for item in jsondata.get("subjects"):
            title=item.get('title')
            img_url = item.get('cover')
            url = item.get('url')
            rate = item.get('rate')
            yield{
                "title":title,
                "img_url":img_url,
                "url":url,
                "rate":rate
            }
def myThread(page_start,tag):
    print(f"Thread {threading.current_thread().name} started at: {time.strftime('%Y-%m-%d %H:%M:%S')}")
    log={"线程名称":threading.current_thread().name,'开始时间':time.strftime('%Y-%m-%d %H:%M:%S'),
         '结束时间':time.strftime('%Y-%m-%d %H:%M:%S')}
    movieDB.insert_one('spiderLog', log)
    json_data = get_page(page_start,tag)
    for item in get_json(json_data):
        print(item)
        movie_info_dict=get_child_page(item,tag)
        if movie_info_dict:  # skip pages that failed to load
            movieDB.insert_update_one('movieInfo',movie_info_dict)
        # save_img(item)
    query={"线程名称":threading.current_thread().name}
    newValue={"$set":{ "结束时间":time.strftime('%Y-%m-%d %H:%M:%S')}}
    movieDB.update_one('spiderLog',query,newValue)
    print(f"Thread {threading.current_thread().name} finished at: {time.strftime('%Y-%m-%d %H:%M:%S')}")
def main():
    print(f"Douban crawler started at: {time.strftime('%Y-%m-%d %H:%M:%S')}")
    page_start = 0
    page_end = 25
    page_start_list=[page_start*20 for page_start in range(page_start,page_end)]
    # tags=['热门','最新','经典','豆瓣高分','冷门佳片','华语','欧美',
    #       '韩国','日本','动作','喜剧','爱情','科幻','悬疑','恐怖','成长']
    tags = [ '喜剧','爱情']
    # tags=['豆瓣高分']
    for tag in tags:
        log = {"分类标签": tag, '开始时间': time.strftime('%Y-%m-%d %H:%M:%S'),
               '结束时间': time.strftime('%Y-%m-%d %H:%M:%S')}
        movieDB.insert_one('spiderLogSele', log)
        for page_start in page_start_list:
            # time.sleep(random.randint(1, 10))
            json_data = get_page(page_start,tag)
            print(json_data)
            for item in get_json(json_data):
                # print(get_child_page(item))
                # time.sleep(random.randint(1,10))
                # movie_info_dict=get_child_page(item,tag)  # scrape with requests: fast
                movie_info_dict=get_page_by_seleninum(item,tag)  # scrape with selenium, imitating a human visitor: slow
                if movie_info_dict:
                    print(movie_info_dict)
                    # movieDB.insert_one('movieInfo',movie_info_dict)
                    movieDB.insert_update_one('movieInfoByselenium',movie_info_dict)
                # time.sleep(random.randint(1,10))
                # save_img(item)
        query = {"分类标签": tag}
        newValue = {"$set": {"结束时间": time.strftime('%Y-%m-%d %H:%M:%S')}}
        movieDB.update_one('spiderLogSele', query, newValue)
    # thread_list=[]
    # thread_child_list=[]
    # index=0
    # for tag in tags:
    #     thread_child_list.clear()
    #     for page_start in page_start_list:
    #         p=threading.Thread(target=myThread,args=[page_start,tag])
    #         thread_child_list.append(p)
    #     thread_list.append(thread_child_list)
    #     for thread in thread_list[index]:
    #         time.sleep(random.randint(1,10))
    #         thread.start()
    #     for thread in thread_list[index]:
    #         thread.join()
    #     index+=1

    print(f"Douban crawler finished at: {time.strftime('%Y-%m-%d %H:%M:%S')}")
if __name__=="__main__":
    browser=webdriver.Firefox()
    movieDB = MongoDB()
    movieDB.connDB('doubanMovie')
    main()