The requirement: a Scrapy spider should run in an endless loop, continuously pulling tasks from Redis or an HTTP API, and the spider must never close.
A first version implemented the loop inside start_requests:
def start_requests(self):
    # ...
    while True:
        yield scrapy.Request(url, dont_filter=True)
    # ...
However, with this approach the loop keeps pulling new tasks over and over, while the yielded requests never get a chance to move on to the next step and actually be downloaded and processed.
It was later reimplemented with signals:
import time

import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider

class AutoengSpider(scrapy.Spider):
    name = "autoeng"  # assumed spider name, derived from the class name

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(AutoengSpider, cls).from_crawler(crawler, *args, **kwargs)
        # Call spider_idle() whenever the engine runs out of requests
        crawler.signals.connect(spider.spider_idle, signal=signals.spider_idle)
        return spider

    def start_requests(self):
        # next_req() builds a Request from the next Redis/API task (sketch below)
        yield self.next_req()

    def spider_idle(self, spider):
        request = self.next_req()
        if request:
            # note: newer Scrapy versions removed engine.schedule();
            # self.crawler.engine.crawl(request) may be needed there instead
            self.crawler.engine.schedule(request, self)
        else:
            # nothing pending right now; back off briefly before the next idle signal
            time.sleep(2)
        # keep the spider alive so it is not closed while waiting for new tasks
        raise DontCloseSpider()
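The post does not show next_req(). Below is a minimal sketch of what it might look like, assuming tasks are JSON strings pushed onto a Redis list; the key name spider:tasks, the redis-py client, and the task format are all assumptions, not part of the original post:

import json

import redis
import scrapy

# Assumed module-level client; connection details are placeholders
redis_client = redis.Redis(host="localhost", port=6379, db=0)

def next_req(self):
    # Pop one task without blocking; return None when the queue is empty
    raw = redis_client.lpop("spider:tasks")  # key name is an assumption
    if raw is None:
        return None
    task = json.loads(raw)
    # dont_filter=True so the dupefilter does not drop repeated task URLs
    return scrapy.Request(task["url"], dont_filter=True, meta={"task": task})

With a helper shaped like this, each spider_idle signal polls the queue once: when a task is available it is scheduled immediately, and when the queue is empty the handler just waits briefly and keeps the spider open until the next idle check.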