PM2.5 Data Collection Under the Dome: scrapy.defultspi

By 我叫钱小钱 | Published 2017-05-09 02:38, 68 reads

This article uses Scrapy in its simplest form to scrape PM2.5 data and store it in MongoDB or a TXT file.
The code below is deliberately basic. Try writing it yourself first; if you can't, that's a sign you should go back and strengthen your fundamentals.

spider code

# -*- coding: utf-8 -*-
import re

import scrapy

from pm25.items import Pm25Item


class InfospSpider(scrapy.Spider):
    name = "infosp"
    allowed_domains = ["pm25.com"]
    start_urls = ['http://www.pm25.com/rank/1day.html']
    custom_settings = {
        'ITEM_PIPELINES': {
            'pm25.pipelines.MongodbPipeline': 30,  # pipeline switches
            # 'pm25.pipelines.TxtPipeline': 50,
        }
    }

    def parse(self, response):
        re_time = re.compile(r"\d+-\d+-\d+")
        # extract the date separately from the page header
        date = response.xpath("/html/body/div[4]/div/div/div[2]/span").extract()[0]
        # narrow the parsing scope to the ranking list
        selector = response.xpath("/html/body/div[5]/div/div[3]/ul[2]/li")
        # walk the list entry by entry
        for subselector in selector:
            try:
                # [0] raises IndexError on malformed entries
                rank = subselector.xpath("span[1]/text()").extract()[0]
                quality = subselector.xpath("span/em/text()").extract()[0]
                city = subselector.xpath("a/text()").extract()[0]
                province = subselector.xpath("span[3]/text()").extract()[0]
                aqi = subselector.xpath("span[4]/text()").extract()[0]
                pm25 = subselector.xpath("span[5]/text()").extract()[0]
            except IndexError:
                continue  # skip entries with a missing field

            item = Pm25Item()
            item['date'] = re_time.findall(date)[0]
            item['rank'] = rank
            item['quality'] = quality
            item['province'] = province
            item['city'] = city
            item['aqi'] = aqi
            item['pm25'] = pm25
            yield item

items code

# -*- coding: utf-8 -*-
import scrapy


class Pm25Item(scrapy.Item):
    # the most conventional way to declare fields
    date = scrapy.Field()
    rank = scrapy.Field()
    quality = scrapy.Field()
    province = scrapy.Field()
    city = scrapy.Field()
    aqi = scrapy.Field()
    pm25 = scrapy.Field()

pipelines code

import time

import pymongo


class TxtPipeline(object):
    # append each item as one comma-separated line in a dated TXT file
    def process_item(self, item, spider):
        today = time.strftime("%y%m%d", time.localtime())
        fname = today + ".txt"
        with open(fname, "a", encoding="utf-8") as f:
            f.write(item["date"] + "," +
                    item["rank"] + "," +
                    item["quality"] + "," +
                    item["province"] + "," +
                    item["city"] + "," +
                    item["aqi"] + "," +
                    item["pm25"] + "\n")
        return item


class MongodbPipeline(object):
    # write each item into MongoDB; the connection
    # parameters live in the project settings
    def __init__(self, server, port, db, collection):
        client = pymongo.MongoClient(server, port)
        self.coll = client[db][collection]

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook and hands us access to settings.py
        s = crawler.settings
        return cls(s["MONGODB_SERVER"], s["MONGODB_PORT"],
                   s["MONGODB_DB"], s["MONGODB_COLLECTION"])

    def process_item(self, item, spider):
        self.coll.insert_one(dict(item))
        return item
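MongodbPipeline reads its connection parameters from the project settings. A sketch of the matching entries in settings.py (host, port, and names here are placeholders, adjust them to your deployment):

```python
# settings.py (excerpt) - placeholder values for illustration
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "pm25"
MONGODB_COLLECTION = "1day"
```

With these in place, the crawl runs as usual with `scrapy crawl infosp`, and whichever pipeline is enabled in `ITEM_PIPELINES` receives the items.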


Original post: https://www.haomeiwen.com/subject/zbxbtxtx.html