(2018-05-22. Python from Zero to One) 6. (爬

Author: lyh165 | Published 2018-05-22 23:19

    pipelines.py
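    This module implements the distributed-processing part: it stores each Item in redis so that items can be handled in a distributed fashion. Because it needs to read configuration, it uses the from_crawler() function.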
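To make the configuration step concrete, here is a rough sketch of the settings.py entries such a pipeline would read. REDIS_ITEMS_KEY and REDIS_ITEMS_SERIALIZER are the setting names used in the code of this article. The module path `scrapy_redis.pipelines.RedisPipeline`, the priority `300`, and the serializer path `myproject.utils.dump_item` are assumptions for illustration; the article does not show a settings file.

```python
# settings.py (illustrative sketch, not taken from the article)

# Assumption: the pipeline is importable under this dotted path and
# registered with priority 300.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}

# Key template read by from_settings(); %(spider)s is filled in with
# spider.name by item_key().
REDIS_ITEMS_KEY = '%(spider)s:items'

# Optional: dotted path to a custom serializer, resolved with load_object().
# 'myproject.utils.dump_item' is a hypothetical name.
# REDIS_ITEMS_SERIALIZER = 'myproject.utils.dump_item'
```

If neither REDIS_ITEMS_KEY nor REDIS_ITEMS_SERIALIZER is set, the constructor defaults apply.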

    from scrapy.utils.misc import load_object
    from scrapy.utils.serialize import ScrapyJSONEncoder
    from twisted.internet.threads import deferToThread

    from . import connection

    default_serialize = ScrapyJSONEncoder().encode

    class RedisPipeline(object):
        """Pushes serialized item into a redis list/queue"""

        def __init__(self, server,
                     key='%(spider)s:items',
                     serialize_func=default_serialize):
            self.server = server
            self.key = key
            self.serialize = serialize_func

        @classmethod
        def from_settings(cls, settings):
            # Build the constructor arguments from the settings; anything
            # not configured falls back to the defaults in __init__.
            params = {
                'server': connection.from_settings(settings),
            }
            if settings.get('REDIS_ITEMS_KEY'):
                params['key'] = settings['REDIS_ITEMS_KEY']
            if settings.get('REDIS_ITEMS_SERIALIZER'):
                params['serialize_func'] = load_object(
                    settings['REDIS_ITEMS_SERIALIZER']
                )

            return cls(**params)

        @classmethod
        def from_crawler(cls, crawler):
            # Entry point called by Scrapy; delegates to from_settings().
            return cls.from_settings(crawler.settings)

        def process_item(self, item, spider):
            # Run the blocking redis call in a thread and return a Deferred.
            return deferToThread(self._process_item, item, spider)

        def _process_item(self, item, spider):
            key = self.item_key(item, spider)
            data = self.serialize(item)
            # Append the serialized item to the tail of the redis list.
            self.server.rpush(key, data)
            return item

        def item_key(self, item, spider):
            """Returns redis key based on given spider.
            Override this function to use a different key depending on the item
            and/or spider.
            """
            return self.key % {'spider': spider.name}
    

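    The pipelines file implements an item pipeline class, the same kind of object as Scrapy's own item pipeline. It takes the REDIS_ITEMS_KEY we configured in settings as the key, serializes each item, and stores it in the corresponding value in the redis database (the value is a list, and each item is one element of that list). The pipeline stores the extracted items mainly so that the data can be processed later.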
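The store-now, process-later flow can be tried out without a redis server. The following is a minimal sketch using only the standard library: `FakeRedis`, `push_item`, and `drain_items` are hypothetical names invented for this illustration, and `json.dumps` stands in for the ScrapyJSONEncoder used above. Only `rpush` and `lpop` mirror real redis list commands.

```python
import json
from collections import defaultdict, deque


class FakeRedis:
    """Hypothetical in-memory stand-in for a redis client (lists only)."""

    def __init__(self):
        self.lists = defaultdict(deque)

    def rpush(self, key, value):
        # Append to the tail of the list, like the redis RPUSH command.
        self.lists[key].append(value)
        return len(self.lists[key])

    def lpop(self, key):
        # Pop from the head of the list, like LPOP; None when empty.
        items = self.lists.get(key)
        return items.popleft() if items else None


def push_item(server, item, spider_name, key_template='%(spider)s:items'):
    """Mimic _process_item(): build the key, serialize, then RPUSH."""
    key = key_template % {'spider': spider_name}
    server.rpush(key, json.dumps(item))
    return key


def drain_items(server, key):
    """Deferred processing: pop items in FIFO order and decode them."""
    items = []
    while True:
        raw = server.lpop(key)
        if raw is None:
            break
        items.append(json.loads(raw))
    return items


server = FakeRedis()
key = push_item(server, {'title': 'first'}, 'demo')
push_item(server, {'title': 'second'}, 'demo')
drained = drain_items(server, key)
print(key, drained)
```

Because items are pushed to the tail and popped from the head, a separate consumer process reading the same key sees them in the order the spiders produced them.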


    Link to this article: https://www.haomeiwen.com/subject/mwhljftx.html