Launching one or more Scrapy spiders through the core API

Author: BABYMISS | Published: 2020-05-07 16:06


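1. You can use the API to run Scrapy from a script instead of the typical way of running it with scrapy crawl. Scrapy is built on the Twisted asynchronous networking library, so it has to run inside the Twisted reactor. Two APIs can run one or more spiders: scrapy.crawler.CrawlerProcess and scrapy.crawler.CrawlerRunner.

2. The first utility for starting spiders is scrapy.crawler.CrawlerProcess. This class starts a Twisted reactor for you, configures logging, and sets up shutdown handlers. It is the class used by all Scrapy commands.

Example that runs a single spider: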


import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    # Your spider definition
    ...


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished

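Arguments can be passed in through CrawlerProcess, and get_project_settings returns a Settings instance holding the project settings. Inside a project, a spider can then be referred to by its name.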

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'followall' is the name of one of the spiders of the project.
process.crawl('followall', domain='scrapinghub.com')
process.start()  # the script will block here until the crawling is finished

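There is another Scrapy utility that gives more control over the crawling process: scrapy.crawler.CrawlerRunner. This class is a thin wrapper around some simple helpers for running multiple crawlers, but it will not start or interfere with existing reactors in any way.

With this class, you run the reactor explicitly after scheduling your spiders. If your application already uses Twisted and you want to run Scrapy in the same reactor, use CrawlerRunner rather than CrawlerProcess.

Note that you also have to shut down the Twisted reactor yourself once the spider has finished. You can do this by adding callbacks to the Deferred returned by the CrawlerRunner.crawl method.

Here is a usage example, with a callback that manually stops the reactor after MySpider has finished running.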

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class MySpider(scrapy.Spider):
    # Your spider definition
    ...


configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished

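Running multiple spiders in the same process

By default, Scrapy runs a single spider per process when you run scrapy crawl. However, Scrapy supports running multiple spiders per process through the internal API.

Here is an example that runs multiple spiders simultaneously: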

import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...


class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...


process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished

The same example using CrawlerRunner:

import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...


class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...


configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run()  # the script will block here until all crawling jobs are finished

The same example, but running the spiders sequentially by chaining the Deferreds:

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging


class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...


class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...


configure_logging()
runner = CrawlerRunner()


@defer.inlineCallbacks
def crawl():
    # Each yield waits for the previous crawl to finish before starting the next.
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()


crawl()
reactor.run()  # the script will block here until the last crawl call is finished