Python Learning Notes, Day 18

Author: Peng_001 | Published 2020-03-10 21:33

scrapy 009 Using Item Containers and Storing in JSON, XML and CSV

  • The best approach for handling scraped information is:
    Extracted data --> Temporary containers (items) --> Storing in database

  • Modify the items.py file

import scrapy


class QuotetutorialItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    author = scrapy.Field()
    tag = scrapy.Field()
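
Item objects behave like dictionaries, but they only accept the fields declared above; assigning to any other key raises a KeyError. A quick illustration (the summary field name is made up):

item = QuotetutorialItem()
item['title'] = 'A sample quote'  # fine: title is a declared field
item['summary'] = 'oops'          # KeyError: summary is not declared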
  • Use the items container in the spider
import scrapy

from ..items import QuotetutorialItem


class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/'
    ]

    def parse(self, response):

        all_div_quotes = response.css('div.quote')

        for quote in all_div_quotes:
            # create a fresh item for each quote
            items = QuotetutorialItem()

            # no trailing commas here: a stray comma after extract()
            # wraps the returned list in a tuple (see the PS at the end)
            title = quote.css('span.text::text').extract()
            author = quote.css('.author::text').extract()
            tag = quote.css('.tag::text').extract()

            items['title'] = title
            items['author'] = author
            items['tag'] = tag

            yield items
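
A side note on extract(): it always returns a list, even when there is a single match. If you want one string per field, extract_first() is a common alternative; a sketch of the loop body under that change (tags stay a list, since one quote can carry several):

for quote in all_div_quotes:
    items = QuotetutorialItem()
    # extract_first() returns the first match as a plain string (or None)
    items['title'] = quote.css('span.text::text').extract_first()
    items['author'] = quote.css('.author::text').extract_first()
    items['tag'] = quote.css('.tag::text').extract()  # genuinely multi-valued
    yield items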
  • Save the results to a file in the format you choose:
    scrapy crawl quotes -o items.json
    scrapy crawl quotes -o items.csv
    scrapy crawl quotes -o items.xml
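
As an alternative to the -o flag, newer Scrapy versions (2.1+) can configure exports once in settings.py through the FEEDS setting; a minimal sketch (the file names here are just examples):

FEEDS = {
    'items.json': {'format': 'json', 'encoding': 'utf8'},
    'items.csv': {'format': 'csv'},
}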

scrapy 010 Pipelines in Web Scraping

  • Scraped data -> items container -> Pipeline -> SQL/Mongo database

  • Pipeline settings in the settings.py file

ITEM_PIPELINES = {
   'quotetutorial.pipelines.QuotetutorialPipeline': 300,
}
# The lower the number, the earlier that pipeline runs (higher priority).

You can add other pipelines to this dictionary and control their order through the numeric values, as in the sketch below.
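
For instance, registering a second pipeline with a higher number makes it run after the first. The SaveToDatabasePipeline entry here is hypothetical; a sketch of that class appears at the end of these notes:

ITEM_PIPELINES = {
   'quotetutorial.pipelines.QuotetutorialPipeline': 300,   # runs first
   'quotetutorial.pipelines.SaveToDatabasePipeline': 800,  # hypothetical, runs second
}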

class QuotetutorialPipeline(object):
    def process_item(self, item, spider):

        print("Pipelines :")
        return item
# Adding a print statement in the pipeline file and seeing its output in the scrapy log confirms that the data passed through the pipeline.
  • PS: following the tutorial, I modified the pipeline as below and crawled quotes, but it kept throwing an error, which confused me at first.
class QuotetutorialPipeline(object):
    def process_item(self, item, spider):

        print("Pipelines :" + item['title'][0])
        return item

The error was TypeError: can only concatenate str (not "list") to str. The cause turned out to be the trailing commas after extract() in the original spider code: each comma wrapped the extracted list in a tuple, so item['title'][0] was the whole list rather than a string. With the commas removed (as in the spider above), item['title'][0] is the quote text and the concatenation works.
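
The diagram at the top of the 010 section ends at a SQL/Mongo database. As a preview, here is a minimal sketch of what a SQLite-backed pipeline could look like; the class name, table name, and schema are assumptions, and it expects title and author to be single strings as in the extract_first() variant above:

import sqlite3


class SaveToDatabasePipeline(object):
    # hypothetical pipeline: writes each item to a local SQLite file

    def open_spider(self, spider):
        # runs once when the spider starts
        self.conn = sqlite3.connect('quotes.db')
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS quotes (title TEXT, author TEXT, tag TEXT)")

    def close_spider(self, spider):
        # runs once when the spider finishes
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.conn.execute(
            "INSERT INTO quotes VALUES (?, ?, ?)",
            (item['title'], item['author'], ', '.join(item['tag'])))
        return item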
