scrapy 009 Using Item Containers and Storing in JSON, XML and CSV
- The recommended flow for scraped information is: Extracted data --> Temporary containers (items) --> Storing in a database
- Modify the items.py file:
import scrapy

class QuotetutorialItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    author = scrapy.Field()
    tag = scrapy.Field()
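An Item behaves like a dict, except that only declared fields can be set. A quick check (not from the tutorial; it assumes the project is named quotetutorial, matching the pipeline path used later):

from quotetutorial.items import QuotetutorialItem

item = QuotetutorialItem()
item['title'] = 'a quote'    # fine: 'title' is a declared field
print(item['title'])
# item['year'] = 2021        # would raise KeyError: undeclared field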
- Use the items container in the spider:
import scrapy
from ..items import QuotetutorialItem

class QuoteSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = [
        'http://quotes.toscrape.com/'
    ]

    def parse(self, response):
        items = QuotetutorialItem()
        all_div_quotes = response.css('div.quote')
        for quotes in all_div_quotes:
            # the selector is 'span.text', not '.span.text', and there must be
            # no trailing commas after extract() (see the PS below)
            title = quotes.css('span.text::text').extract()
            author = quotes.css('.author::text').extract()
            tag = quotes.css('.tag::text').extract()
            items['title'] = title
            items['author'] = author
            items['tag'] = tag
            yield items
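A variation, not from the tutorial: Scrapy selectors also offer .get() and .getall() (newer names for extract_first() and extract()), and creating a fresh item per quote avoids reusing one instance across yields. A minimal sketch of the same parse():

def parse(self, response):
    for quote in response.css('div.quote'):
        item = QuotetutorialItem()
        item['title'] = quote.css('span.text::text').get()   # single string
        item['author'] = quote.css('.author::text').get()
        item['tag'] = quote.css('.tag::text').getall()       # list of tags
        yield item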
- Save the output to a file in the chosen format:
scrapy crawl quotes -o items.json
scrapy crawl quotes -o items.csv
scrapy crawl quotes -o items.xml
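The same exports can be configured once in settings.py instead of passing -o on every run. A sketch using the FEEDS setting (available since Scrapy 2.1; the file names here are arbitrary):

# settings.py
FEEDS = {
    'items.json': {'format': 'json', 'encoding': 'utf8'},
    'items.csv': {'format': 'csv'},
    'items.xml': {'format': 'xml'},
}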
scrapy 010 Pipelines in Web Scraping
- Scraped data -> items container -> Pipeline -> SQL/Mongo database
- Pipeline settings in the settings.py file:

ITEM_PIPELINES = {
    'quotetutorial.pipelines.QuotetutorialPipeline': 300,
}
# The lower the number, the higher the pipeline's priority (it runs earlier).
# Other pipelines can be added to this dict, ordered by these values.
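For example, a second (hypothetical) pipeline could be registered and ordered after the first; values are conventionally in the 0-1000 range, and lower values run first:

ITEM_PIPELINES = {
    'quotetutorial.pipelines.QuotetutorialPipeline': 300,  # runs first
    'quotetutorial.pipelines.MongoPipeline': 400,          # hypothetical second stage
}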
In pipelines.py:

class QuotetutorialPipeline(object):
    def process_item(self, item, spider):
        print("Pipelines :")
        return item

# With this print added to the pipeline file, scrapy's output shows the
# message for each item, confirming that the data passed through the pipeline.
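To reach the "SQL database" stage of the flow above, here is a minimal sketch of an SQLite pipeline (the database file, table name, and class name are my assumptions, not from the tutorial):

import sqlite3

class SQLitePipeline(object):
    def open_spider(self, spider):
        # runs once when the spider starts
        self.conn = sqlite3.connect('quotes.db')
        self.cur = self.conn.cursor()
        self.cur.execute(
            'CREATE TABLE IF NOT EXISTS quotes (title TEXT, author TEXT, tag TEXT)')

    def close_spider(self, spider):
        # runs once when the spider finishes
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # each field is a list because extract() was used in the spider
        self.cur.execute('INSERT INTO quotes VALUES (?, ?, ?)',
                         (item['title'][0], item['author'][0], ', '.join(item['tag'])))
        return item

It would also need its own entry in ITEM_PIPELINES to be enabled.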
- PS: Following the tutorial, I modified the pipeline as below and crawled quotes, but kept getting an error. Puzzling.
class QuotetutorialPipeline(object):
    def process_item(self, item, spider):
        print("Pipelines :" + item['title'][0])
        return item
The error was TypeError: can only concatenate str (not "list") to str. The cause is the trailing commas in parse(): writing title = quotes.css('span.text::text').extract(), makes title a one-element tuple whose only element is the extracted list, so item['title'][0] is that list rather than a string. With the commas removed (as in the spider code above), the print works.
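The pitfall in isolation:

# a trailing comma turns the assignment into a 1-tuple
title = ['quote text'],
print(type(title))       # <class 'tuple'>
print(type(title[0]))    # <class 'list'>
print("x" + title[0])    # TypeError: can only concatenate str (not "list") to str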