[Scrapy-5] Commonly Used Spiders

Author: 禅与发现的乐趣 | Published 2018-05-23 15:54

    Crawling data with POST requests

    In most cases Scrapy's default GET requests cover ordinary crawling needs, but some sites require you to log in, or to submit certain data via a POST request, before the data you want becomes available. FormRequest handles this:

    import scrapy

    class MySpider(scrapy.Spider):
        name = 'myspider'
    
        def start_requests(self):
            return [scrapy.FormRequest("http://www.example.com/login",
                                       formdata={'user': 'john', 'pass': 'secret'},
                                       callback=self.logged_in)]
    
        def logged_in(self, response):
            # here you would extract links to follow and return Requests for
            # each of them, with another callback
            pass
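
    The logged_in callback above is left empty; here is a minimal sketch of what it might do, assuming the site shows an error message on failed logins and that the follow-up callback is named parse_page (both are assumptions, not part of the original example):

    def logged_in(self, response):
        # Assumed failure marker; adjust it to whatever the real site returns.
        if b'authentication failed' in response.body:
            self.logger.error('Login failed')
            return
        # Follow links found on the post-login page with another callback.
        for href in response.css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_page)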
    

    Restricting domains and logging

    allowed_domains restricts the crawl to the listed domains: with the offsite middleware (enabled by default), requests for URLs outside those domains are filtered out. For logging, the scrapy.Spider base class initializes a logger named after the spider:

    @property
    def logger(self):
        logger = logging.getLogger(self.name)
        return logging.LoggerAdapter(logger, {'spider': self})
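
    Every spider can therefore write to its own log through self.logger, as in this example: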
    
    class MySpider(scrapy.Spider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = [
            'http://www.example.com/1.html',
            'http://www.example.com/2.html',
            'http://www.example.com/3.html',
        ]
    
        def parse(self, response):
            self.logger.info('A response from %s just arrived!', response.url)
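
    Since self.logger wraps a standard library logger, the usual level methods (debug, info, warning, error) are all available; a small sketch (the missing-title check is just an illustration):

    def parse(self, response):
        self.logger.debug('Parsing %s', response.url)
        if not response.xpath('//title'):
            self.logger.warning('No <title> found on %s', response.url)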
    

    Spider arguments

    When you start a spider from the command line with the crawl command, you can pass it arguments with the -a option. Arguments are always passed in as strings, so convert them yourself if you need other types:

    scrapy crawl myspider -a category=electronics
    

    You can receive the arguments in the __init__ method:

    import scrapy
    
    class MySpider(scrapy.Spider):
        name = 'myspider'
    
        def __init__(self, category=None, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            self.start_urls = ['http://www.example.com/categories/%s' % category]
            # ...
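
    Because arguments arrive as strings, convert them in __init__ when you need another type; a sketch with an assumed numeric limit argument (hypothetical, not from the original example):

    def __init__(self, limit=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # -a limit=50 arrives as the string '50'; convert it explicitly
        self.limit = int(limit) if limit is not None else None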
    

    The default __init__ method picks up any spider arguments and copies them to the spider as attributes, so you can also use an argument directly as an attribute:

    import scrapy
    
    class MySpider(scrapy.Spider):
        name = 'myspider'
    
        def start_requests(self):
            yield scrapy.Request('http://www.example.com/categories/%s' % self.category)
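
    Reading self.category raises AttributeError when no -a category=... was given; getattr with a fallback avoids that (the 'default' value here is just a placeholder):

    def start_requests(self):
        # Fall back to a placeholder category when the argument is omitted
        category = getattr(self, 'category', 'default')
        yield scrapy.Request('http://www.example.com/categories/%s' % category)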
    

    CrawlSpider

    CrawlSpider is the most commonly used spider for crawling regular websites: you define a set of Rule objects that tell it which links to extract and follow. If several rules match the same link, the first matching rule wins, in the order they are defined. Note that CrawlSpider implements its own logic in the parse method, so do not override parse in a subclass:

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    
    class MySpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']
    
        rules = (
            # Extract links matching 'category.php' (but not matching 'subsection.php')
            # and follow links from them (since no callback means follow=True by default).
            Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),
    
            # Extract links matching 'item.php' and parse them with the spider's method parse_item
            Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
        )
    
        def parse_item(self, response):
            self.logger.info('Hi, this is an item page! %s', response.url)
            item = {}  # use a plain dict; a bare scrapy.Item() declares no fields, so item['id'] would raise KeyError
            item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
            item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
            item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
            return item
    

    XMLFeedSpider example

    XMLFeedSpider is built for scraping XML feeds: it iterates over the nodes named by itertag and calls parse_node for each one.

    from scrapy.spiders import XMLFeedSpider
    from myproject.items import TestItem
    
    class MySpider(XMLFeedSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/feed.xml']
        iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
        itertag = 'item'
    
        def parse_node(self, response, node):
            self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))
    
            item = TestItem()
            item['id'] = node.xpath('@id').extract()
            item['name'] = node.xpath('name').extract()
            item['description'] = node.xpath('description').extract()
            return item
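
    Both feed examples import TestItem from myproject.items; a minimal definition matching the fields they assign might look like this:

    import scrapy

    class TestItem(scrapy.Item):
        id = scrapy.Field()
        name = scrapy.Field()
        description = scrapy.Field()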
    

    CSVFeedSpider example

    CSVFeedSpider iterates over the rows of a CSV feed and calls parse_row for each row; delimiter, quotechar, and headers describe the file format.

    from scrapy.spiders import CSVFeedSpider
    from myproject.items import TestItem
    
    class MySpider(CSVFeedSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/feed.csv']
        delimiter = ';'
        quotechar = "'"
        headers = ['id', 'name', 'description']
    
        def parse_row(self, response, row):
            self.logger.info('Hi, this is a row!: %r', row)
    
            item = TestItem()
            item['id'] = row['id']
            item['name'] = row['name']
            item['description'] = row['description']
            return item
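
    A hypothetical feed.csv matching these settings, with ';' as the delimiter and single quotes as the quote character; since headers supplies the column names, the file is assumed to contain data rows only (the values are made up):

    1;'Product one';'First description'
    2;'Product two';'Second description'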
    
    
