Scraping Moji Weather with Scrapy: Nationwide

Author: 这个太难了 | Published 2018-08-13 22:13

    This started as a Scrapy practice project; I'm writing it down so I don't forget and so I can look it back up later.
    Start URL: https://tianqi.moji.com/weather/china/ as shown in the screenshot below:


    Approach: crawl this page to collect the URL of every province. Inspecting the source shows that all of the province links sit under the same tag, but they are not complete URLs, so we have to build the full URL ourselves, for example: url = https://tianqi.moji.com/weather/china/anhui
    The snippet below builds the URL of every province, then yields a Request for each one; the callback allcity_url parses each province page for its cities so we can get their weather. The province URLs are constructed as follows:
        urls = response.css('.city_list.clearfix a::attr(href)').extract()
        for url in urls:
            url = parse.urljoin('https://tianqi.moji.com', url)
            yield Request(url=url, callback=self.allcity_url)
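
    As a quick check of the URL construction, urljoin turns the relative href from the page into an absolute URL; the href value below is an assumption based on the example URL above:

        from urllib import parse

        # hypothetical href taken from the province list (the exact relative path is assumed)
        href = '/weather/china/anhui'
        print(parse.urljoin('https://tianqi.moji.com', href))
        # -> https://tianqi.moji.com/weather/china/anhui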
    

    Once the province URLs are built, the next step is to parse out each province's districts and cities. The screenshot shows the district URLs under Anhui province.

    Inspecting the page source shows that these district URLs all sit inside a ul (a list) under a tag with class city_hot, so we can parse out the URL of every district in a province.
    That yields URLs like https://tianqi.moji.com/weather/china/anhui/fanchang-county, as shown in the screenshot. Here I chose the 7-day forecast page to scrape. After opening it, the URL changes, but only in one place: the weather segment of https://tianqi.moji.com/weather/china/anhui/fanchang-county becomes forecast7, so the 7-day forecast URL is https://tianqi.moji.com/forecast7/china/anhui/fanchang-county. That settles the plan: send a Request to each province URL obtained earlier, pass the response to the allcity_url function, and let it parse the city URLs out of that response. The URLs parsed this way don't point to the 7-day forecast I want, so I use replace to build the URL I need: https://tianqi.moji.com/forecast7/china/anhui/fanchang-county
    The allcity_url function:
        def allcity_url(self, response):
            city_urls = response.css('.city_hot li a::attr(href)').extract()  # all cities under one province
            for city_url in city_urls:
                city_url = city_url.replace('weather', 'forecast7')  # rewrite the URL so we get each district's 7-day forecast (the page actually returns 8 days)
                yield Request(url=city_url, callback=self.detail_parse)  # again, yield a Request and let detail_parse extract the results we want
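
    A quick check of that rewrite on the example URL from above (just an illustration of the replace call):

        url = 'https://tianqi.moji.com/weather/china/anhui/fanchang-county'
        print(url.replace('weather', 'forecast7'))
        # -> https://tianqi.moji.com/forecast7/china/anhui/fanchang-county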
    

    With that, I have the 7-day-forecast URL for every city. In the detail_parse function I extract, for each district: the location (addr), the dates (days), the weekdays (weeks), the weather (weathers), and the temperatures (temps).

    The 7-day forecast page looks like this:

    You'll notice that each day's weather has two descriptions and both share the same class attribute; the weekday and the date also share one class, so the selectors return them interleaved. Below I split them apart to make later extraction and use easier (a slice-based alternative is sketched after the function).

    The detail_parse function:
        def detail_parse(self, response):
            weeks = []       # weekday names
            days = []        # dates, e.g. 8/12
            wea_befor = []   # Moji gives two weather descriptions per day; this holds the first one
            wea_after = []   # this holds the second one
            temps = []       # temperatures
            weathers = []    # combined weather (the two descriptions joined together)
            # get the location
            addr = response.css('.search_default em::text').extract_first('')
            # weekday and date
            week = response.css('.week::text').extract()
            # weekdays: the selector returns weekday and date interleaved, e.g. ['星期日', '8/12', '星期一', '8/13'],
            # so stepping through with a stride of 2 separates them into clean weekday and date lists
            for i in range(0, len(week), 2):
                weeks.append(week[i])
            # dates
            for i in range(1, len(week), 2):
                days.append(week[i])
            # weather
            weather = response.css('.wea::text').extract()
            # first weather description
            for i in range(0, len(weather), 2):  # same idea as weekday/date
                wea_befor.append(weather[i])
            # second weather description
            for i in range(1, len(weather), 2):
                wea_after.append(weather[i])
            # temperatures
            max_tmp = response.css('.tree.clearfix p b::text').extract()
            min_tmp = response.css('.tree.clearfix p strong::text').extract()
            for i in range(len(max_tmp)):
                t = max_tmp[i] + '/' + min_tmp[i]
                temps.append(t)  # build temperatures like 23/22
            for i in range(len(wea_after)):
                we = wea_befor[i] + '/' + wea_after[i]
                weathers.append(we)  # build weather strings like 阵雨/多云 (showers/cloudy)
            item = TianqiItem()  # instantiate the item
            item['addr'] = addr  # store the data in the item
            item['weeks'] = weeks
            item['days'] = days
            item['temps'] = temps
            item['weathers'] = weathers
            print(item)
            yield item  # yield the item so it is passed on to the pipelines
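
    As a side note, the same interleaved lists can also be split with slice steps instead of index loops; a minimal sketch (not the code used above):

        week = ['星期日', '8/12', '星期一', '8/13']
        weeks = week[0::2]  # every even index -> ['星期日', '星期一']
        days = week[1::2]   # every odd index  -> ['8/12', '8/13']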
    
    Now the data can be processed, for example saved as JSON, CSV, and so on. Here I save both a JSON file and a CSV file; both are implemented with my own classes in pipelines.py. The project structure looks like this:

    Code for each module

    tianqi.py
    # -*- coding: utf-8 -*-
    import scrapy
    from TianQi import main  # importing main lets us start the crawl by running this file directly, without opening main.py
    from urllib import parse
    from scrapy.http import Request
    from TianQi.items import TianqiItem
    
    class TianqiSpider(scrapy.Spider):
        name = 'tianqi'
        # allowed_domains = ['moji.com']
        start_urls = ['https://tianqi.moji.com/weather/china/']
    
        def parse(self, response):
            # province names
            shengfen = response.css('.city_list.clearfix a::text').extract()
            # province URLs
            urls = response.css('.city_list.clearfix a::attr(href)').extract()
            for url in urls:
                url = parse.urljoin('https://tianqi.moji.com',url)
                yield Request(url=url,callback=self.allcity_url)
    
    
        def allcity_url(self,response):
            city_urls = response.css('.city_hot li a::attr(href)').extract()
            for city_url in city_urls:
                city_url = city_url.replace('weather','forecast7')
                yield Request(url=city_url,callback=self.detail_parse)
        def detail_parse(self,response):
            weeks = []
            days = []
            wea_befor = []
            wea_after = []
            temps = []
            weathers = []
            # get the location
            addr = response.css('.search_default em::text').extract_first('')
            # weekday and date
            week = response.css('.week::text').extract()
            # weekdays
            for i in range(0,len(week),2):
                weeks.append(week[i])
            # dates
            for i in range(1, len(week), 2):
                days.append(week[i])
            # weather
            weather = response.css('.wea::text').extract()
            # first weather description
            for i in range(0,len(weather),2):
                wea_befor.append(weather[i])
            # second weather description
            for i in range(1,len(weather),2):
                wea_after.append(weather[i])
            # temperatures
            max_tmp = response.css('.tree.clearfix p b::text').extract()
            min_tmp = response.css('.tree.clearfix p strong::text').extract()
            for i in range(len(max_tmp)):
                t = max_tmp[i]+'/'+min_tmp[i]
                temps.append(t)
            for i in range(len(wea_after)):
                we = wea_befor[i]+'/'+wea_after[i]
                weathers.append(we)
            item = TianqiItem()
            item['addr'] = addr
            item['weeks'] = weeks
            item['days'] = days
            item['temps'] = temps
            item['weathers'] = weathers
            print(item)
            yield item
    
    main.py
    import os
    import sys
    from scrapy.cmdline import execute
    # make sure the project directory is on sys.path so the TianQi package can be found
    dir_file = os.path.dirname(os.path.abspath(__file__))
    sys.path.append(dir_file)
    # equivalent to running "scrapy crawl tianqi --nolog" on the command line
    execute(['scrapy','crawl','tianqi','--nolog'])
    
    items.py
    # -*- coding: utf-8 -*-
    
    # Define here the models for your scraped items
    #
    # See documentation in:
    # https://doc.scrapy.org/en/latest/topics/items.html
    
    import scrapy
    
    
    class TianqiItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        addr = scrapy.Field()
        weeks = scrapy.Field()
        days = scrapy.Field()
        weathers = scrapy.Field()
        temps = scrapy.Field()
    
    pipelines.py
    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    
    
    class TianqiPipeline(object):
        def process_item(self, item, spider):
            return item
    
    import codecs
    import json
    class weatherPipeline(object):
        def __init__(self):
            self.file = codecs.open('weather.json','w','utf-8')
        def process_item(self, item, spider):
            lines = json.dumps(dict(item),ensure_ascii=False)+'\n'
            self.file.write(lines)
            return item
        def close_spider(self,spider):
            self.file.close()
    
    
    import csv
    import os
    class Pipeline_ToCSV(object):
    
        def __init__(self):
            # path of the csv file; it does not need to exist beforehand
            store_file = os.path.dirname(__file__) + '/spiders/weather.csv'
            # open (create) the file
            self.file = open(store_file, 'w', newline="")
            # csv writer; dialect is the style the csv file is written in (default "excel"), and delimiter="\t" would set the separator used when writing
            self.writer = csv.writer(self.file, dialect="excel")
            # self.writer.writerow(['地点'])
        def process_item(self, item, spider):
            # insert one row of data into the csv; each element of the list goes into one cell (a loop can insert multiple rows)
            self.writer.writerow(['地点:',item['addr'].replace(',','/')])
            self.writer.writerow(['星期','日期','气温','天气'])
            for i in range(len(item['weeks'])):
                self.writer.writerow([item['weeks'][i],item['days'][i],item['temps'][i],item['weathers'][i]])
            self.writer.writerow([" "])
            return item
    
        def close_spider(self, spider):
            # when the spider closes, flush and close the file
            self.file.close()
    
    settings.py
    # -*- coding: utf-8 -*-
    
    # Scrapy settings for TianQi project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://doc.scrapy.org/en/latest/topics/settings.html
    #     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    
    BOT_NAME = 'TianQi'
    
    SPIDER_MODULES = ['TianQi.spiders']
    NEWSPIDER_MODULE = 'TianQi.spiders'
    
    
    # Crawl responsibly by identifying yourself (and your website) on the user-agent
    #USER_AGENT = 'TianQi (+http://www.yourdomain.com)'
    
    # Obey robots.txt rules
    ROBOTSTXT_OBEY = True
    
    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    #CONCURRENT_REQUESTS = 32
    
    # Configure a delay for requests for the same website (default: 0)
    # See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16
    
    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False
    
    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False
    
    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}
    
    # Enable or disable spider middlewares
    # See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'TianQi.middlewares.TianqiSpiderMiddleware': 543,
    #}
    
    # Enable or disable downloader middlewares
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'TianQi.middlewares.TianqiDownloaderMiddleware': 543,
    #}
    
    # Enable or disable extensions
    # See https://doc.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}
    
    # Configure item pipelines
    # See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    ITEM_PIPELINES = {
       # 'TianQi.pipelines.TianqiPipeline': 300,
        'TianQi.pipelines.weatherPipeline': 200,
        'TianQi.pipelines.Pipeline_ToCSV': 250,
    }
    
    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False
    
    # Enable and configure HTTP caching (disabled by default)
    # See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
    
    That completes the crawler for nationwide weather. The saved JSON file and CSV file are shown in the screenshots. The part most worth noting is the analysis itself: to crawl the weather of every province I first need each province's URL; each province in turn has many districts, so after getting the province URLs I go one level deeper, parse each province page for its district URLs, and then use those URLs to fetch each district's weather.
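
    Since weatherPipeline writes one JSON object per line, the saved file can be read back line by line; a minimal sketch, assuming weather.json sits in the directory the crawl was run from:

        import json

        with open('weather.json', encoding='utf-8') as f:
            for line in f:
                record = json.loads(line)              # one district per line
                print(record['addr'], record['temps'])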

    --------------------------------- Divider ----------------------------------


    A problem I ran into today: while scraping phones on Taobao, I wanted to get into each phone's detail page, and from inspecting the requests the key to the detail page is the phone's id (the last item in the screenshot).

    The id can't be parsed straight out of the page source; it turns out to sit inside a script tag (it lives in the JS). Simply refreshing the page does not return the JS response that carries the id; you have to click 综合 or another sort option to trigger it. I didn't work that out in advance, I stumbled on it by trying things. The response is JSON data, so json.loads(response.text) turns it into a dict, and together with the captured request you can dig the id out. The value obtained this way isn't a clean id, it has a string of other stuff attached, which is where string splitting comes in: id = str.split('_')[-1], and the last piece after splitting is the id. With the id extracted, building the URL
    "https://s.taobao.com/searchq=手机&app=detailproduct&pspuid="+id gets you into each phone's detail page (a small sketch of this id handling follows).
