
Scrapy_redis distributed crawling of a movie site (resumable downloads + download progress bar)

Author: 艾胖胖胖 | Published 2018-11-01 16:54

    1. Background

    • OS and environment
    Operating systems: Win10 (master), Ubuntu (worker)
    Python version: Python 3.6
    Scrapy version: Scrapy 1.5.1
    scrapy_redis: must be installed on both machines
    Redis database: the Redis instance on the master must allow remote connections (see the quick check right after this list)
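    A quick way to confirm that the worker can actually reach the master's Redis. This is only a sketch: it assumes the master's LAN address used later in settings.py, and that redis.conf on the master no longer binds only to 127.0.0.1 (and has protected-mode relaxed or a password configured).

    # Run this on the worker; True means the worker can reach the master's Redis
    import redis

    rds = redis.StrictRedis(host='10.36.133.11', port=6379, db=0)
    print(rds.ping())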
    

    Since this post is only meant to show how to do a simple distributed crawl, a site with a fairly simple structure was chosen (the URL is not suitable for publishing; it is used for learning purposes only).

    2. Code

    • Main idea
      The site is crawled in a distributed way with the scrapy_redis framework, in the following steps (the Redis keys involved are sketched right after this list):
      1. The first spider crawls the URLs that need to be downloaded and pushes them into a queue in the Redis database (this spider only runs on the master). The workers take the URLs they need to crawl from that Redis queue.
      2. The second spider extracts the movie information and passes the items to the pipelines for persistent storage.
      3. While downloading the movies, resumable downloads and a progress bar are set up.
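    Under the default scrapy_redis settings these queues can be watched directly in Redis. A minimal sketch (the key names assume scrapy_redis defaults plus the redis_key used later; 10.36.133.11 is the master from settings.py):

    # -*- coding: utf-8 -*-
    # Rough sketch for inspecting the Redis keys while the crawl runs.
    import redis

    rds = redis.StrictRedis(host='10.36.133.11', port=6379, db=0, decode_responses=True)

    print("start urls waiting:", rds.llen('video6969:start_urls'))     # list filled by the first spider
    print("items mirrored    :", rds.llen('video_6969:items'))         # list written by RedisPipeline
    print("urls already seen :", rds.scard('video_6969:dupefilter'))   # set kept by RFPDupeFilter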

    • Project directory structure

      - crawlall.py: starts all of the spiders
      - crawl_url.py: crawls the URLs and saves them to the Redis queue
      - video_6969.py: crawls the movies
      - items.py: defines the movie fields
      - pipelines.py: downloads the movies (resumable, with a progress bar) and saves data to the Redis database
      - settings.py: configuration
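    The original screenshot of the project tree is not reproduced here. Assuming the standard Scrapy project skeleton plus the commands package that COMMANDS_MODULE points at in settings.py, the layout looks roughly like this:

    Video_6969/
    ├── scrapy.cfg
    └── Video_6969/
        ├── __init__.py
        ├── items.py
        ├── pipelines.py
        ├── settings.py
        ├── commands/
        │   ├── __init__.py
        │   └── crawlall.py
        └── spiders/
            ├── __init__.py
            ├── crawl_url.py
            └── video_6969.py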
    
    • First, configure our settings.py file
    # -*- coding: utf-8 -*-
    
    # Scrapy settings for Video_6969 project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'Video_6969'
    
    SPIDER_MODULES = ['Video_6969.spiders']
    NEWSPIDER_MODULE = 'Video_6969.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'Video_6969 (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    CONCURRENT_REQUESTS = 150
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    DOWNLOAD_DELAY = 0
    # The download delay setting will honor only one of:
    CONCURRENT_REQUESTS_PER_DOMAIN = 200
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'Video_6969.middlewares.Video6969SpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'Video_6969.middlewares.Video6969DownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       # In a distributed crawl the data does not have to go through a local pipeline (nothing has to be stored locally).
       # The data should be stored in the Redis database, so a Redis pipeline component is added here
       'Video_6969.pipelines.Video6969Pipeline': 300,
       "scrapy_redis.pipelines.RedisPipeline": 100,  # item数据会报错到redis
       "Video_6969.pipelines.CrawlUrls": 50,
       # 'Video_6969.pipelines.Video6969Info': 200,
    }
    
    
    # Redis-related configuration
    # Redis host address
    REDIS_HOST = '10.36.133.11'  # master host
    REDIS_PORT = 6379  # port
    # REDIS_PARAMS = {"password": "xxxx"}  # password
    
    
    # Switch the scheduler to the scrapy_redis scheduler (a rewrite of the native scheduler by the
    # scrapy_redis component that adds distributed scheduling)
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    
    # Use the scrapy_redis de-duplication component
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    
    # Whether the crawl may be paused and resumed (keep the Redis queues instead of clearing them on close)
    SCHEDULER_PERSIST = True
    
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
    # Logging
    # Disable logging or adjust the log level
    # LOG_ENABLED = False
    # LOG_LEVEL = 'ERROR'
    
    
    LOG_LEVEL = 'DEBUG'
    """
    CRITICAL - 严重错误
    ERROR - 一般错误
    WARNING - 警告信息
    INFO - 一般信息
    DEBUG - 调试信息
    """
    
    # Log file
    LOG_FILE = '6969.log'
    
    # Whether logging is enabled
    LOG_ENABLED = True  # (default is True: logging enabled)
    
    # If True, all standard output (including errors) of the process is redirected into the log
    LOG_STDOUT = False
    
    # Log encoding
    LOG_ENCODING = 'utf-8'
    
    
    # Register the custom command module that starts all spiders
    COMMANDS_MODULE = 'Video_6969.commands'
    
    # MongoDB configuration
    MONGO_HOST = "127.0.0.1"  # host IP
    MONGO_PORT = 27017  # port
    MONGO_DB = "6969"  # database name
    MONGO_COLL = "ViodeInfo"  # collection name
    # If a username and password are required
    # MONGO_USER = "zhangsan"
    # MONGO_PSW = "123456"
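    One detail worth calling out: the MongoDB pipeline (Video6969Info in pipelines.py) is commented out in ITEM_PIPELINES above, so these MONGO_* settings are not used until you enable it, e.g.:

    ITEM_PIPELINES = {
        'Video_6969.pipelines.CrawlUrls': 50,
        'scrapy_redis.pipelines.RedisPipeline': 100,
        'Video_6969.pipelines.Video6969Info': 200,   # MongoDB storage
        'Video_6969.pipelines.Video6969Pipeline': 300,
    }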
    
    

    Note: the crawling spider now has to inherit from RedisCrawlSpider, and its URLs are fetched from the Redis database via the key configured in redis_key, so start_urls has to be commented out. We will put the start URL into Redis later.
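    For reference, seeding that start URL by hand is a single lpush (the URL below is only a placeholder matching the /vod/... pattern, not a real page); in this project the CrawlUrls pipeline shown later does the same thing automatically for every page it discovers:

    import redis

    rds = redis.StrictRedis(host='10.36.133.11', port=6379, db=0)
    # hypothetical detail-page URL, pushed onto the key configured as redis_key
    rds.lpush("video6969:start_urls", "https://www.6969qq.com/vod/12345/1.html")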

    • The crawl_url.py file
    # -*- coding: utf-8 -*-
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule, CrawlSpider
    from Video_6969.items import UrlItem
    
    
    class Video6969(CrawlSpider):
        name = 'crawl_urls'
        start_urls = ['https://www.6969qq.com']
        rules = (
            Rule(LinkExtractor(allow=r'/html/\d+/'), follow=True),  # category pages
            Rule(LinkExtractor(allow=r'/vod/\d+/.+?html'), callback='video_info', follow=True),  # "more" / detail pages
        )
    
        def video_info(self, response):
            item = UrlItem()
            item['html_url'] = response.url
            yield item
    
    

    The crawl_url.py file crawls the URL pages we need to download and then stores them in the Redis queue through the pipelines. (You could also persist them directly inside crawl_url.)
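    If you prefer that second option, a rough sketch of a variant that pushes to Redis straight from the callback (hypothetical class name, same connection details as settings.py) could look like this:

    # -*- coding: utf-8 -*-
    import redis
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import Rule, CrawlSpider


    class CrawlUrlsDirect(CrawlSpider):
        """Hypothetical variant of crawl_urls that persists the URLs itself."""
        name = 'crawl_urls_direct'
        start_urls = ['https://www.6969qq.com']
        rules = (
            Rule(LinkExtractor(allow=r'/html/\d+/'), follow=True),
            Rule(LinkExtractor(allow=r'/vod/\d+/.+?html'), callback='video_info', follow=True),
        )

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.rds = redis.StrictRedis(host='10.36.133.11', port=6379, db=0)

        def video_info(self, response):
            # push the detail-page URL straight onto the workers' start_urls key
            self.rds.lpush("video6969:start_urls", response.url)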

    • The video_6969.py file
    # -*- coding: utf-8 -*-
    
    from scrapy_redis.spiders import RedisCrawlSpider
    from Video_6969.items import Video6969Item
    
    
    class Video6969(RedisCrawlSpider):
        name = 'video_6969'
    
        # the worker spiders block on this Redis key and pop their start URLs from it
        redis_key = "video6969:start_urls"
    
        def parse(self, response):
            item = Video6969Item()
            item['html_url'] = response.url
            item['name'] = response.xpath("//h1/text()").extract_first()
            item['video_type'] = response.xpath("//div[@class = 'play_nav hidden-xs']//a/@title").extract_first()
            # the real .mp4 address is embedded in the page source
            item['video_url'] = response.selector.re(r"(https://\w+.xia12345.com/.+?mp4)")[0]
            yield item
    
    

    The other worker machines do not need the crawl_url file; they use this spider to match the movie information and download it.
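    Since a worker only needs the project with video_6969.py (plus items.py, pipelines.py and a settings.py pointing at the master's Redis), it can be started with the ordinary crawl command instead of crawlall, e.g.:

    scrapy crawl video_6969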

    • The items.py file
    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class Video6969Item(scrapy.Item):
        video_type = scrapy.Field()
        name = scrapy.Field()
        html_url = scrapy.Field()
        video_url = scrapy.Field()
    
    
    class UrlItem(scrapy.Item):
        html_url = scrapy.Field()
    
    
    • The pipelines.py file
    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import os
    import pymongo
    import redis
    import requests
    import sys
    
    
    from Video_6969.items import UrlItem, Video6969Item
    
    
    #  Movie download (resumable, with a progress bar)
    class Video6969Pipeline(object):
        dir_path = r'G:\Video_6969'
    
        def process_item(self, item, spider):
            if isinstance(item, Video6969Item):
                type_path = os.path.join(self.dir_path, item['video_type'])
                if not os.path.exists(type_path):
                    os.makedirs(type_path)
                name_path = os.path.join(type_path, item['name'])
                path = name_path + ".mp4"  # target file: <dir>\<type>\<name>.mp4
    
                try:
                    headers = {
                        "User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.3.2.1000 Chrome/30.0.1599.101 Safari/537.36"
                    }
                    # now_length = 0  # bytes already downloaded
                    # receive the video data in a loop
                    while True:
                        # if the file already exists, resume: tell the server where to continue from
                        if os.path.exists(path):
                            now_length = os.path.getsize(path)
                            print("Resuming after a network hiccup. Already downloaded: {}MB".format(now_length // 1024 // 1024))
                            headers['Range'] = 'bytes=%d-' % now_length  # use the local file size as the resume offset, in bytes
                        else:
                            now_length = 0  # bytes already downloaded
                        res = requests.get(item['video_url'], stream=True,
                                           headers=headers)  # stream=True, so the body is fetched chunk by chunk
                        total_length = int(res.headers['Content-Length'])  # size of the response body
                        print("About to download: 【{}】{} {}MB".format(item["video_type"], item["name"], total_length // 1024 // 1024))
                        # if the current response is smaller than what we already have, or the local file has
                        # reached the response size, the video can be considered complete
                        if total_length < now_length or (
                                os.path.exists(path) and os.path.getsize(path) >= total_length):
                            # print("File downloaded: 【{}】{} {}MB".format(item["video_type"], item["name"], total_length // 1024 // 1024))
                            break
    
                        # write the received video data
                        with open(path, 'ab') as file:
                            for chunk in res.iter_content(chunk_size=1024):
                                # if chunk:
                                file.write(chunk)
                                now_length += len(chunk)
                                # flush so the data is written out a little at a time
                                file.flush()
                                # show the download progress
                                done = int(50 * now_length / total_length)
                                sys.stdout.write(
                                    "\r【%s%s】%d%%" % ('█' * done, ' ' * (50 - done), 100 * now_length / total_length))
                                sys.stdout.flush()
                        print()
    
                except Exception as e:
                    print(e)
                    raise IOError
    
                print("【{}】{}下载完毕:{}MB".format(item["video_type"], item["name"], now_length // 1024 // 1024))
                return item
    
    
    # Store the item info in MongoDB
    class Video6969Info(object):
    
        def __init__(self, mongo_host, mongo_db, mongo_coll):
            self.mongo_host = mongo_host
            self.mongo_db = mongo_db
            self.mongo_coll = mongo_coll
            self.count = 0
    
    
        @classmethod
        def from_crawler(cls, crawler):
            return cls(
                mongo_host=crawler.settings['MONGO_HOST'],
                mongo_db=crawler.settings['MONGO_DB'],
                mongo_coll=crawler.settings['MONGO_COLL']
            )
    
        def open_spider(self, spider):
            #  connect to the database
            self.client = pymongo.MongoClient(self.mongo_host)
            self.db = self.client[self.mongo_db]  # handle to the database
            self.coll = self.db[self.mongo_coll]  # handle to the collection
    
        def close_spider(self, spider):
            self.client.close()  # close the database connection
    
        def process_item(self, item, spider):
            data = dict(item)  # convert the item into a plain dict
            try:
                self.coll.insert_one(data)  # insert the document
                self.count += 1
            except:
                raise IOError
            if not self.count % 100:
                print("Items stored: %d" % self.count)
            return item
    
    
    # Push crawled URLs into the Redis queue
    class CrawlUrls(object):
        def process_item(self, item, spider):
            rds = redis.StrictRedis(host='10.36.133.11', port=6379, db=0)
            if isinstance(item, UrlItem):
                # every detail-page URL becomes a start URL for the video_6969 workers
                rds.lpush("video6969:start_urls", item['html_url'])
            return item
    

    Here requests is used to fetch the movie's binary data and write it to disk. Because network fluctuations can easily corrupt the video file, resumable (break-point) downloading is implemented here as well.
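    One caveat about the resume logic above: when a server honours the Range header it replies 206 Partial Content, and Content-Length then describes only the remaining bytes rather than the whole file. If you want the real total for the progress bar, a small hypothetical helper (not part of the original project) can read it from the Content-Range header:

    def full_size(res):
        """Full size of the remote file behind a requests response, even for 206 Partial Content."""
        if res.status_code == 206 and 'Content-Range' in res.headers:
            # e.g. "Content-Range: bytes 1048576-52428799/52428800"
            return int(res.headers['Content-Range'].split('/')[-1])
        return int(res.headers['Content-Length'])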

    • The crawlall.py file
    from scrapy.commands import ScrapyCommand
    
    
    class Command(ScrapyCommand):
        requires_project = True
    
        def syntax(self):
            return '[options]'
    
        def short_desc(self):
            return 'Runs all of the spiders'
    
        def run(self, args, opts):
            # look up every spider registered in the project and schedule all of them
            spider_list = self.crawler_process.spiders.list()
            for name in spider_list:
                self.crawler_process.crawl(name, **opts.__dict__)
            self.crawler_process.start()
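    For `scrapy crawlall` to be recognised, this file has to live in the Video_6969/commands package referenced by COMMANDS_MODULE in settings.py, next to an empty __init__.py. If you would rather not register a custom command at all, a roughly equivalent stand-alone script (hypothetical run_all.py in the project root) is:

    # run_all.py -- hypothetical alternative to the custom crawlall command
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    for name in process.spider_loader.list():  # e.g. 'crawl_urls' and 'video_6969'
        process.crawl(name)
    process.start()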
    
    • Launch
    scrapy crawlall
    

    After launching, the crawl_url spider crawls URLs and stores them in the Redis queue, and the other workers start downloading as soon as they receive URLs. Of course, you can also set up the distributed crawl in other ways.
    Note: when saving the movies, check that you have read/write permission on the target directory.
