15. Multi-page crawling with scrapy-redis: feeding URLs through redis one page at a time


Author: starrymusic | Published 2019-04-01 16:53

    Much like the previous post, here is the modified csdn2.py file; the code is as follows:

    # -*- coding: utf-8 -*-
    import scrapy
    import example.items
    from scrapy_redis.spiders import RedisSpider
    
    
    class Csdn2Spider(RedisSpider):
        name = 'csdn2'
        redis_key = 'csdn2:start_urls'
    
        def __init__(self, *args, **kwargs):
            # Dynamically build the allowed domains list from the optional
            # "domain" spider argument (e.g. -a domain=edu.csdn.net).
            domain = kwargs.pop('domain', '')
            self.allowed_domains = list(filter(None, domain.split(',')))
            super(Csdn2Spider, self).__init__(*args, **kwargs)
    
        def parse(self, response):
            # Each <dl class="lector_list"> block on the listing page is one lecturer entry.
            for pagedata in response.xpath("//dl[@class='lector_list']"):
                item = example.items.Csdn2Item()
                item['teacher'] = pagedata.xpath("./dd[1]/ul/li/a/text()").extract()
                item['lessons'] = pagedata.xpath("./dd[1]/ul/li[2]/span/text()").extract()
                item['student'] = pagedata.xpath("./dd[1]/ul/li[3]/span/text()").extract()
                item['describe'] = pagedata.xpath("./dd[1]/p/text()").extract()
                yield item
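
    Because the constructor pops an optional domain argument (the same pattern used by the scrapy-redis example spider), the allowed domains can be supplied when the crawl is started. A hypothetical invocation (the domain value here is an assumption, matching the site being scraped):

    scrapy crawl csdn2 -a domain=edu.csdn.net

    If no domain argument is given, allowed_domains stays empty and the offsite filter will not restrict the URLs pushed through redis.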
    

    Here is the modified items.py file; the code is as follows:

    # Define here the models for your scraped items
    #
    # See documentation in:
    # http://doc.scrapy.org/topics/items.html
    
    from scrapy.item import Item, Field
    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import MapCompose, TakeFirst, Join
    
    class Csdn2Item(Item):
        teacher = Field()
        lessons = Field()
        student = Field()
        describe = Field()
        crawled = Field()
        spider = Field()
       
    
    class ExampleItem(Item):
        name = Field()
        description = Field()
        link = Field()
        crawled = Field()
        spider = Field()
        url = Field()
    
    
    class ExampleLoader(ItemLoader):
        default_item_class = ExampleItem
        default_input_processor = MapCompose(lambda s: s.strip())
        default_output_processor = TakeFirst()
        description_out = Join()
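
    The loader imports above are only exercised by ExampleLoader, which belongs to the scrapy-redis example project. If you wanted the same processor pattern for the CSDN items, a loader for Csdn2Item could look like the sketch below (the Csdn2Loader name and the choice to join the description lines are assumptions, not part of the original code):

    class Csdn2Loader(ItemLoader):
        # Hypothetical loader: strip whitespace on input, keep the first
        # value for each field, and join the description into one string.
        default_item_class = Csdn2Item
        default_input_processor = MapCompose(lambda s: s.strip())
        default_output_processor = TakeFirst()
        describe_out = Join()

    Inside csdn2.py's parse() you would then build each item with Csdn2Loader(selector=pagedata), call add_xpath() for every field, and yield loader.load_item() instead of filling the fields by hand.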
    

    The other files need no changes. If you want to see the scraped results, you can add a single print statement in pipelines.py, as in the sketch below.
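
    A minimal sketch of what that could look like, assuming the ExamplePipeline that ships with the scrapy-redis example project (your pipelines.py may differ slightly):

    from datetime import datetime


    class ExamplePipeline(object):
        def process_item(self, item, spider):
            # Stamp the item the way the example project does, then print it
            # so each scraped lecturer shows up in the console.
            item["crawled"] = datetime.utcnow()
            item["spider"] = spider.name
            print(item)
            return item
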
    Open redis-cli.exe and connect to the redis server. In cmd.exe, change into the project directory and run "scrapy crawl csdn2"; after it starts, the spider sits waiting for URLs. At that point, enter the following in redis-cli.exe:

    lpush csdn2:start_urls https://edu.csdn.net/lecturer?&page=1
    

    cmd.exe will then show the lecturer information scraped from the first page.
    To keep the crawl going, keep pushing the URLs you want crawled into the redis client, one page at a time.
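
    If typing each lpush by hand gets tedious, a small helper script can queue a batch of listing pages in one go. A sketch, assuming the redis-py package is installed and the redis server is running locally on the default port:

    import redis

    # Push listing pages 1-10 onto the list the spider is watching.
    r = redis.StrictRedis(host='localhost', port=6379)
    for page in range(1, 11):
        r.lpush('csdn2:start_urls',
                'https://edu.csdn.net/lecturer?&page=%d' % page)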
