Several ways to run multiple Scrapy spiders at the same time (using a custom Scrapy project command)


Author: 玢仼 | Published 2018-01-18 10:31

Think back: all of the earlier experiments and examples used a single spider. A real-world crawling project, however, almost never stops at one. That raises two questions: 1. How do you create multiple spiders inside the same project? 2. Once there are several spiders, how do you run them all?

Note: this article builds on the previous posts and experiments. If you missed them, or anything here is unclear, you can find them below:

Pitfalls I hit installing the Python crawler framework Scrapy, and some thoughts beyond programming

Scrapy crawler diary: creating a project, extracting data, and saving it as JSON

Scrapy crawler diary: writing scraped content into a MySQL database

How to keep your Scrapy crawler from being banned

I. Creating spiders

1. Create multiple spiders with scrapy genspider spidername domain

    scrapy genspider CnblogsHomeSpider cnblogs.com

The command above creates a spider whose name is CnblogsHomeSpider and whose start_urls is http://www.cnblogs.com/.
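For reference, the generated file under spiders/ looks roughly like the sketch below; the exact class name and formatting depend on the template that ships with your Scrapy version, so treat it as an approximation rather than the literal output:

    import scrapy

    class CnblogshomespiderSpider(scrapy.Spider):
        name = "CnblogsHomeSpider"
        allowed_domains = ["cnblogs.com"]
        start_urls = ["http://www.cnblogs.com/"]

        def parse(self, response):
            # extraction logic for each downloaded page goes here
            pass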

2. Check how many spiders the project has with scrapy list

    [root@bogon cnblogs]# scrapy list
    CnblogsHomeSpider
    CnblogsSpider

This shows that my project contains two spiders, one named CnblogsHomeSpider and the other CnblogsSpider.

For more on Scrapy commands, see: http://doc.scrapy.org/en/latest/topics/commands.html

II. Running several spiders at the same time

Our project now has two spiders; how do we get both of them running at the same time? You might suggest a shell script that calls them one by one, or a Python script that launches them in turn, and judging from stackoverflow.com quite a few people have indeed done it that way. The official documentation, however, describes the approaches below.

1. Run Scrapy from a script

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider(scrapy.Spider):
        # Your spider definition
        ...

    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(MySpider)
    process.start() # the script will block here until the crawling is finished

The key here is scrapy.crawler.CrawlerProcess, which runs a spider from within a script. More examples can be found at: https://github.com/scrapinghub/testspiders
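In a real project you usually want the script to load your project settings and run existing spiders by their name attribute instead of defining spider classes inline. A minimal sketch of that, assuming the script lives at the root of the cnblogs project (next to scrapy.cfg) and runs under Scrapy 1.0 or later:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # load settings.py so pipelines, middlewares, USER_AGENT, etc. are applied
    process = CrawlerProcess(get_project_settings())

    # spiders can be scheduled by name; the spider loader resolves them
    process.crawl('CnblogsHomeSpider')
    process.crawl('CnblogsSpider')
    process.start()  # blocks until both crawls finish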

2. Running multiple spiders in the same process

Using CrawlerProcess

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start() # the script will block here until all crawling jobs are finished

Using CrawlerRunner

    import scrapy
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    configure_logging()
    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())

    reactor.run() # the script will block here until all crawling jobs are finished

Using CrawlerRunner with chained deferreds to run the spiders sequentially

    import scrapy
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    configure_logging()
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        yield runner.crawl(MySpider1)
        yield runner.crawl(MySpider2)
        reactor.stop()

    crawl()
    reactor.run() # the script will block here until the last crawl call is finished

These are the approaches the official documentation offers for running spiders from a script. In short, CrawlerProcess starts and stops the Twisted reactor for you, whereas CrawlerRunner assumes you manage the reactor yourself (hence the explicit reactor.run() and reactor.stop() calls), and chaining the crawl deferreds in the last example makes MySpider2 start only after MySpider1 has finished.

III. Running the spiders via a custom Scrapy command

Custom project commands are documented here: http://doc.scrapy.org/en/master/topics/commands.html?highlight=commands_module#custom-project-commands

1. Create a commands directory

    mkdir commands

Note: the commands directory is a sibling of the spiders directory.
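Assuming the project from the earlier posts is named cnblogs, the layout after all the steps in this section should look roughly like this (crawlall.py and __init__.py are added in the steps that follow):

    cnblogs/
    ├── scrapy.cfg
    └── cnblogs/
        ├── __init__.py
        ├── items.py
        ├── pipelines.py
        ├── settings.py
        ├── commands/
        │   ├── __init__.py
        │   └── crawlall.py
        └── spiders/
            ├── __init__.py
            └── ...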

2. Add a file named crawlall.py under commands

The idea is to adapt Scrapy's built-in crawl command so that it runs all spiders at once. The source of crawl is here: https://github.com/scrapy/scrapy/blob/master/scrapy/commands/crawl.py

    from scrapy.commands import ScrapyCommand
    from scrapy.crawler import CrawlerRunner
    from scrapy.exceptions import UsageError
    from scrapy.utils.conf import arglist_to_dict

    class Command(ScrapyCommand):

        requires_project = True

        def syntax(self):
            return '[options]'

        def short_desc(self):
            return 'Runs all of the spiders'

        def add_options(self, parser):
            ScrapyCommand.add_options(self, parser)
            parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                              help="set spider argument (may be repeated)")
            parser.add_option("-o", "--output", metavar="FILE",
                              help="dump scraped items into FILE (use - for stdout)")
            parser.add_option("-t", "--output-format", metavar="FORMAT",
                              help="format to use for dumping items with -o")

        def process_options(self, args, opts):
            ScrapyCommand.process_options(self, args, opts)
            try:
                opts.spargs = arglist_to_dict(opts.spargs)
            except ValueError:
                raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)

        def run(self, args, opts):
            # settings = get_project_settings()
            spider_loader = self.crawler_process.spider_loader
            # schedule every spider (or only those passed as arguments), then start them together
            for spidername in args or spider_loader.list():
                print "*********crawlall spidername************" + spidername
                self.crawler_process.crawl(spidername, **opts.spargs)
            self.crawler_process.start()

The crucial parts are self.crawler_process.spider_loader.list(), which returns every spider in the project, and self.crawler_process.crawl(), which schedules each spider before a single call to self.crawler_process.start() runs them.

3. Add an __init__.py file under the commands directory

    touch __init__.py

Note: do not skip this step. It is exactly what cost me a whole day of debugging; I suppose that is the price of being self-taught.

If you leave it out, you will get an exception like this:

    Traceback (most recent call last):
      File "/usr/local/bin/scrapy", line 9, in <module>
        load_entry_point('Scrapy==1.0.0rc2', 'console_scripts', 'scrapy')()
      File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 122, in execute
        cmds = _get_commands_dict(settings, inproject)
      File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 50, in _get_commands_dict
        cmds.update(_get_commands_from_module(cmds_module, inproject))
      File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 29, in _get_commands_from_module
        for cmd in _iter_command_classes(module):
      File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 20, in _iter_command_classes
        for module in walk_modules(module_name):
      File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/utils/misc.py", line 63, in walk_modules
        mod = import_module(path)
      File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
        __import__(name)
    ImportError: No module named commands

At first I could not find the cause no matter where I looked, and it ate an entire day before someone on http://stackoverflow.com/ set me straight. Once again, thank goodness for the almighty internet; how much nicer things would be without that wall in the way! But I digress, back to the topic.

4. Create a setup.py alongside settings.py (leaving this step out makes no difference in practice; I am not sure what the official docs intend by including it.)

    from setuptools import setup, find_packages

    setup(
        name='scrapy-mymodule',
        entry_points={
            'scrapy.commands': [
                'crawlall=cnblogs.commands:crawlall',
            ],
        },
    )

This file declares a crawlall command, where cnblogs.commands is the module that holds the command and crawlall is the command name. For what it is worth, an entry point under 'scrapy.commands' only takes effect once the package is actually installed (for example with pip install -e .), which is presumably why dropping the file in place by itself seems to do nothing; inside the project it is the COMMANDS_MODULE setting in the next step that actually registers the command.

5. Add the following setting to settings.py:

    COMMANDS_MODULE = 'cnblogs.commands'

6. Run the command: scrapy crawlall
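Because run() iterates over args or spider_loader.list(), running the bare command starts every spider in the project, while passing spider names should restrict the run to just those spiders, and -a NAME=VALUE forwards the same argument to each spider that is started:

    scrapy crawlall
    scrapy crawlall CnblogsHomeSpider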

The source code, updated to this point, is available at: https://github.com/jackgitgz/CnblogsSpider
