Python Learning: Day 18 Notes

Author: Peng_001 | Published: 2020-03-10 21:33

    scrapy 009 Using Item Containers and Storing in JSON, XML and CSV

    • A good way to structure a scraping job is
      Extracted data --> Temporary containers (items) --> Storing in database

    • Edit the items.py file

    import scrapy
    
    
    class QuotetutorialItem(scrapy.Item):
        # define the fields for your item here like:
        title = scrapy.Field()
        author = scrapy.Field()
        tag = scrapy.Field()
    
    • Use the items container in the spider
    import scrapy
    
    from ..items import QuotetutorialItem
    
    
    class QuoteSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = [
            'http://quotes.toscrape.com/'
        ]
    
        def parse(self, response):
            all_div_quotes = response.css('div.quote')
    
            for quote in all_div_quotes:
                # Create a fresh item per quote so yielded items do not share state.
                item = QuotetutorialItem()
    
                # No trailing commas here: a trailing comma would wrap each
                # list in a tuple, so item['title'][0] would be a list, not a str.
                item['title'] = quote.css('span.text::text').extract()
                item['author'] = quote.css('.author::text').extract()
                item['tag'] = quote.css('.tag::text').extract()
    
                yield item
    
    • Save the scraped items to a file in the chosen format
      scrapy crawl quotes -o items.json
      scrapy crawl quotes -o items.csv
      scrapy crawl quotes -o items.xml
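
    As a quick check of a JSON export, the file can be loaded back with the standard library. The sketch below is only an illustration under stated assumptions: instead of real crawl output it writes a small hypothetical sample in the shape the spider would produce (every field is a list, because extract() returns a list), and load_items is a helper name invented here.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample; the real content of items.json depends on the crawl.
sample = [
    {"title": ["Sample quote text"], "author": ["Sample Author"], "tag": ["life", "books"]},
]


def load_items(path):
    """Load a JSON export, assumed to be a single JSON array of item objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Write the sample to a temporary items.json and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "items.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    items = load_items(path)

for item in items:
    print(item["author"][0], "-", len(item["tag"]), "tags")
```

    To inspect real output, point load_items at the items.json written by the crawl command instead of the temporary file.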

    scrapy 010 Pipelines in Web Scraping

    • Scraped data -> items container -> Pipeline -> SQL/Mongo database

    • The pipeline setting in settings.py

    ITEM_PIPELINES = {
       'quotetutorial.pipelines.QuotetutorialPipeline': 300,
    }
    # Lower numbers run first: items pass through the pipelines in ascending order of this value.
    

    More pipelines can be added to this dictionary, each with its own value to set the order.
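
    Scrapy passes each item through the enabled pipelines from the lowest value to the highest. The plain-Python sketch below only mimics that ordering and does not call Scrapy; SaveToDatabasePipeline and run_order are hypothetical names added for illustration.

```python
# Hypothetical settings: SaveToDatabasePipeline is an invented second pipeline.
ITEM_PIPELINES = {
    "quotetutorial.pipelines.SaveToDatabasePipeline": 800,
    "quotetutorial.pipelines.QuotetutorialPipeline": 300,
}


def run_order(pipelines):
    """Return the pipeline paths sorted by ascending order value."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = run_order(ITEM_PIPELINES)
for position, path in enumerate(order, start=1):
    print(position, path)
```

    The position in the dictionary does not matter here; only the numeric values decide which pipeline sees the item first.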

    class QuotetutorialPipeline(object):
        def process_item(self, item, spider):
    
            print("Pipelines :")
            return item
    # Adding a print call in pipelines.py makes this text appear in the crawl output, confirming that the data passed through the pipeline.
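
    Besides printing, process_item can also change the item before returning it. The sketch below is a hypothetical example, not part of the tutorial: JoinTitlePipeline is an invented class, it assumes the title field is a list of strings as extract() returns, and it uses a plain dict as a stand-in item so that it runs without Scrapy.

```python
class JoinTitlePipeline:
    """Hypothetical pipeline: collapse the 'title' list into one string."""

    def process_item(self, item, spider):
        title = item.get("title")
        # Only touch the field when it has the assumed list shape.
        if isinstance(title, list):
            item["title"] = " ".join(part.strip() for part in title)
        return item


# Stand-in for a scraped item; a real run would receive a QuotetutorialItem.
demo_item = {"title": [" Sample quote text "], "author": ["Sample Author"]}
result = JoinTitlePipeline().process_item(demo_item, spider=None)
print(result["title"])
```

    A pipeline like this would only run in a crawl after being listed in ITEM_PIPELINES.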
    
    • PS: Following the tutorial, I changed the pipeline as shown here and crawled quotes again, but it kept raising an error, which left me puzzled.
    class QuotetutorialPipeline(object):
        def process_item(self, item, spider):
    
            print("Pipelines :" + item['title'][0])
            return item
    

    The error was TypeError: can only concatenate str (not "list") to str, which means item['title'][0] was a list rather than a string.
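
    One likely cause is a trailing comma after the assignment in the spider: it turns the value into a one-element tuple, so [0] returns the whole list instead of its first string. The sketch below reproduces this in plain Python, with a hypothetical list standing in for the result of extract().

```python
extracted = ["Sample quote text"]  # hypothetical stand-in for .extract() output

title_with_comma = extracted,    # trailing comma: a tuple containing the list
title_without_comma = extracted  # no comma: the list itself

print(type(title_with_comma[0]).__name__)     # list
print(type(title_without_comma[0]).__name__)  # str

# Concatenating a str with the inner list raises the TypeError from the notes.
try:
    "Pipelines :" + title_with_comma[0]
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

# Without the trailing comma, [0] is a string and concatenation works.
message = "Pipelines :" + title_without_comma[0]
print(message)
```

    Removing the trailing commas in the spider's parse method should therefore let item['title'][0] be a string.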

