1. Background
- Operating systems and environment
OS: Windows 10 (master), Ubuntu (slave)
Python version: 3.6
Scrapy version: 1.5.1
scrapy_redis: must be installed on both machines
redis: the master's redis server must allow remote connections
Since the goal is only to demonstrate simple distributed crawling, a structurally simple site was chosen (the URL is not suitable for publishing; it is used for study purposes only).
2. Code
- Main idea
We use the scrapy_redis framework to crawl this site in a distributed fashion. The work breaks down into the following steps:
1. A first spider collects the URLs that need downloading and pushes them into a redis queue (it runs on the master only). The slaves take the URLs to crawl from that redis queue.
2. A second spider extracts each movie's information and hands it to the pipelines for persistent storage.
3. Movie downloads support resuming after interruption and display a progress bar.
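The hand-off in step 1 can be sketched with an in-memory stand-in for the redis list. This is only an illustration of the queue idea, not scrapy_redis's actual implementation (its pop order depends on the queue class it is configured with):

```python
from collections import deque

# In-memory stand-in for the redis list "video6969:start_urls"
queue = deque()

def master_push(url):
    """Master side: like rds.lpush(key, url) - new URLs go in on the left."""
    queue.appendleft(url)

def slave_pop():
    """Slave side: pop from the right, so URLs come out first-in, first-out."""
    return queue.pop() if queue else None

master_push("https://www.6969qq.com/vod/1/1.html")
master_push("https://www.6969qq.com/vod/2/2.html")
print(slave_pop())  # → https://www.6969qq.com/vod/1/1.html
```

With redis the slaves can sit on other machines; the list in the shared redis server plays the role of `queue` here.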
- Project layout
- crawlall.py: starts all of the spiders
- crawl_url.py: collects URLs and pushes them into the redis queue
- video_6969.py: crawls the movies
- items.py: defines the movie fields
- pipelines.py: downloads movies (with resume support and a progress bar) and writes to the redis database
- settings.py: configuration
- First, configure our settings.py file:
# -*- coding: utf-8 -*-
# Scrapy settings for Video_6969 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'Video_6969'
SPIDER_MODULES = ['Video_6969.spiders']
NEWSPIDER_MODULE = 'Video_6969.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Video_6969 (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 150
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 0
# The download delay setting will honor only one of:
CONCURRENT_REQUESTS_PER_DOMAIN = 200
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Video_6969.middlewares.Video6969SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'Video_6969.middlewares.Video6969DownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # In a distributed crawl the data need not pass through a local pipeline
    # (it does not have to be stored locally). The data goes into redis,
    # so the scrapy_redis pipeline component is added here.
    'Video_6969.pipelines.Video6969Pipeline': 300,
    "scrapy_redis.pipelines.RedisPipeline": 100,  # item data is saved to redis
    "Video_6969.pipelines.CrawlUrls": 50,
    # 'Video_6969.pipelines.Video6969Info': 200,
}
# Redis settings
# Redis host address
REDIS_HOST = '10.36.133.11'  # master
REDIS_PORT = 6379  # port
# REDIS_PARAMS = {"password": "xxxx"}  # password
# Swap the scheduler for the scrapy_redis one (a rewrite of the native
# scheduler that adds distributed scheduling algorithms on top of redis)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Use the scrapy_redis duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Keep the queue and dupefilter in redis so the crawl can be paused and resumed
SCHEDULER_PERSIST = True
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# Logging
# Disable logging or adjust the level
# LOG_ENABLED = False
# LOG_LEVEL = 'ERROR'
LOG_LEVEL = 'DEBUG'
"""
CRITICAL - 严重错误
ERROR - 一般错误
WARNING - 警告信息
INFO - 一般信息
DEBUG - 调试信息
"""
# Log file
LOG_FILE = '6969.log'
# Whether logging is enabled (True by default)
LOG_ENABLED = True
# If True, all stdout (including errors) of the process is redirected to the log
LOG_STDOUT = False
# Log encoding
LOG_ENCODING = 'utf-8'
# Custom command module used to start all spiders at once
COMMANDS_MODULE = 'Video_6969.commands'
# MongoDB settings
MONGO_HOST = "127.0.0.1"  # host IP
MONGO_PORT = 27017  # port
MONGO_DB = "6969"  # database name
MONGO_COLL = "ViodeInfo"  # collection name
# If a username and password are required
# MONGO_USER = "zhangsan"
# MONGO_PSW = "123456"
Note: the spider must now inherit from RedisCrawlSpider, and its URLs are fetched from the redis database under the key configured by redis_key, so we have to comment out start_urls. Later we will seed our start URL into redis.
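As a configuration example (host, port, and key taken from the settings above, assuming redis-cli is available on the master), the start URL can be seeded like this:

```shell
# Push the initial URL onto the list the spider listens on (the redis_key value)
redis-cli -h 10.36.133.11 -p 6379 lpush video6969:start_urls "https://www.6969qq.com"
```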
- The crawl_url.py file
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from scrapy_redis.spiders import RedisCrawlSpider
from Video_6969.items import Video6969Item, UrlItem
class Video6969(CrawlSpider):
    name = 'crawl_urls'
    start_urls = ['https://www.6969qq.com']
    rules = (
        Rule(LinkExtractor(allow=r'/html/\d+/'), follow=True),  # category pages
        Rule(LinkExtractor(allow=r'/vod/\d+/.+?html'), callback='video_info', follow=True),  # "more" pages
    )

    def video_info(self, response):
        item = UrlItem()
        item['html_url'] = response.url
        yield item
The crawl_url.py file collects the page URLs we need to download and stores them in the redis queue via the pipelines. (The persistence could also be done directly inside crawl_url.)
- The video_6969.py file
# -*- coding: utf-8 -*-
from scrapy_redis.spiders import RedisCrawlSpider
from Video_6969.items import Video6969Item
class Video6969(RedisCrawlSpider):
    name = 'video_6969'
    redis_key = "video6969:start_urls"

    def parse(self, response):
        item = Video6969Item()
        item['html_url'] = response.url
        item['name'] = response.xpath("//h1/text()").extract_first()
        item['video_type'] = response.xpath("//div[@class = 'play_nav hidden-xs']//a/@title").extract_first()
        item['video_url'] = response.selector.re(r"(https://\w+\.xia12345\.com/.+?mp4)")[0]
        yield item
The slaves do not need the crawl_url file; they use this file to extract the movie information and download it.
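Outside Scrapy, the same URL extraction can be sketched with the re module. The HTML snippet below is fabricated for illustration; note the escaped dots in the pattern, which the spider's version leaves unescaped (an unescaped `.` matches any character, so it still works but is less precise):

```python
import re

# Sample page markup (made up for this example)
html = '<script>var now = "https://v2.xia12345.com/2019/movie.mp4";</script>'

# Same idea as response.selector.re(...) in the spider, with the dots escaped
pattern = r'(https://\w+\.xia12345\.com/.+?mp4)'
match = re.search(pattern, html)
print(match.group(1))  # → https://v2.xia12345.com/2019/movie.mp4
```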
- The items.py file
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class Video6969Item(scrapy.Item):
    video_type = scrapy.Field()
    name = scrapy.Field()
    html_url = scrapy.Field()
    video_url = scrapy.Field()

class UrlItem(scrapy.Item):
    html_url = scrapy.Field()
- The pipelines.py file
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
import pymongo
import redis
import requests
import sys
from Video_6969.items import UrlItem, Video6969Item
# Movie download pipeline
class Video6969Pipeline(object):
    dir_path = r'G:\Video_6969'

    def process_item(self, item, spider):
        if isinstance(item, Video6969Item):
            type_path = os.path.join(self.dir_path, item['video_type'])
            if not os.path.exists(type_path):
                os.makedirs(type_path)
            path = os.path.join(type_path, item['name'] + ".mp4")
            try:
                headers = {
                    "User-Agent": "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.3.2.1000 Chrome/30.0.1599.101 Safari/537.36"
                }
                # Receive the video data in a loop
                while True:
                    # If the file already exists, resume: ask the server
                    # for the bytes after what is already on disk
                    if os.path.exists(path):
                        now_length = os.path.getsize(path)
                        print("Resuming after network hiccup. Downloaded so far: {} MB".format(now_length // 1024 // 1024))
                        headers['Range'] = 'bytes=%d-' % now_length  # local file size is the resume offset, in bytes
                    else:
                        now_length = 0  # bytes downloaded so far
                    res = requests.get(item['video_url'], stream=True,
                                       headers=headers)  # stream=True lets us iterate over the response body
                    total_length = int(res.headers['Content-Length'])  # size of the response body
                    print("Starting download: [{}] {} {} MB".format(item["video_type"], item["name"], total_length // 1024 // 1024))
                    # If the response body is smaller than what we already have, or the
                    # file on disk has reached the full length, the video is complete
                    if total_length < now_length or (
                            os.path.exists(path) and os.path.getsize(path) >= total_length):
                        break
                    # Write the received video data
                    with open(path, 'ab') as file:
                        for chunk in res.iter_content(chunk_size=1024):
                            file.write(chunk)
                            now_length += len(chunk)
                            # Flush so the bytes hit the disk as they arrive
                            file.flush()
                            # Render the download progress bar
                            done = int(50 * now_length / total_length)
                            sys.stdout.write(
                                "\r[%s%s]%d%%" % ('█' * done, ' ' * (50 - done), 100 * now_length / total_length))
                            sys.stdout.flush()
                    print()
            except Exception as e:
                print(e)
                raise IOError(e)
            print("[{}] {} finished: {} MB".format(item["video_type"], item["name"], now_length // 1024 // 1024))
        return item
# Store into MongoDB
class Video6969Info(object):
    def __init__(self, mongo_host, mongo_db, mongo_coll):
        self.mongo_host = mongo_host
        self.mongo_db = mongo_db
        self.mongo_coll = mongo_coll
        self.count = 0

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_host=crawler.settings['MONGO_HOST'],
            mongo_db=crawler.settings['MONGO_DB'],
            mongo_coll=crawler.settings['MONGO_COLL']
        )

    def open_spider(self, spider):
        # Connect to the database
        self.client = pymongo.MongoClient(self.mongo_host)
        self.db = self.client[self.mongo_db]  # database handle
        self.coll = self.db[self.mongo_coll]  # collection handle

    def close_spider(self, spider):
        self.client.close()  # close the connection

    def process_item(self, item, spider):
        data = dict(item)  # convert the item to a dict
        try:
            self.coll.insert_one(data)  # insert one document
            self.count += 1
        except Exception as e:
            raise IOError(e)
        if not self.count % 100:
            print("Items stored: %d" % self.count)
        return item
# Push URLs onto the redis queue
class CrawlUrls(object):
    def process_item(self, item, spider):
        rds = redis.StrictRedis(host='10.36.133.11', port=6379, db=0)
        if isinstance(item, UrlItem):
            rds.lpush("video6969:start_urls", item['html_url'])
        return item
Here requests fetches the movie's binary data and writes it to disk. Because network hiccups easily corrupt the video file, resumable downloading is layered on top.
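The resume logic in the pipeline above can be isolated into a small helper. A minimal sketch (the file name below is hypothetical) of computing the HTTP Range header from what is already on disk:

```python
import os
import tempfile

def resume_headers(path, base_headers=None):
    """Return (headers, bytes_already_downloaded) for resuming a download.

    If a partial file exists at `path`, add an HTTP Range header so the
    server sends only the remaining bytes; otherwise start from zero.
    """
    headers = dict(base_headers or {})
    if os.path.exists(path):
        offset = os.path.getsize(path)
        headers['Range'] = 'bytes=%d-' % offset
        return headers, offset
    return headers, 0

# Demo with a throwaway partial file
with tempfile.TemporaryDirectory() as tmp:
    part = os.path.join(tmp, 'movie.mp4')
    with open(part, 'wb') as f:
        f.write(b'\x00' * 1024)  # pretend 1 KiB was already downloaded
    headers, offset = resume_headers(part)
    print(headers['Range'], offset)  # → bytes=1024- 1024
```

Note that when the server honors a Range request, Content-Length reports the remaining bytes, not the full file size, which is why the pipeline compares it against the bytes already on disk.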
- The crawlall.py file
from scrapy.commands import ScrapyCommand

class Command(ScrapyCommand):
    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()
- Launch
scrapy crawlall
For Scrapy to pick up the custom command, the Video_6969/commands package (the value of COMMANDS_MODULE) needs an __init__.py next to crawlall.py; the command name comes from the file name. After launch, the crawl_urls spider collects URLs into the redis queue, and the slaves fetch them and start downloading. Of course, other schemes for distributing the crawl would work as well.
Note: when saving movies, make sure you have read/write permission on the target directory.