This is a practice exercise with Scrapy. I'm writing it down so I don't forget it and so I can look it up later.
Starting URL: https://tianqi.moji.com/weather/china/, shown in the figure below:
The idea: crawl this page to get the URL of every province. Inspecting the page shows that all the province links sit under this tag, but the hrefs are incomplete, so we have to build the full URLs ourselves, e.g. url = https://tianqi.moji.com/weather/china/anhui
The code below builds the URL for every province, then uses the yield generator to issue a Request for each one, with callback pointing at the allcity_url function, which parses each province's page for its cities so we can fetch their weather. The province URLs are constructed like this:
urls = response.css('.city_list.clearfix a::attr(href)').extract()
for url in urls:
    url = parse.urljoin('https://tianqi.moji.com', url)
    yield Request(url=url, callback=self.allcity_url)
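parse.urljoin is what glues the incomplete href onto the site root. A minimal illustration, assuming the extracted href looks like /weather/china/anhui (the actual value comes from the page):

from urllib import parse

href = '/weather/china/anhui'  # assumed shape of an href extracted above
full_url = parse.urljoin('https://tianqi.moji.com', href)
print(full_url)  # https://tianqi.moji.com/weather/china/anhui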
With the province URLs built, the next step is to parse out the districts under each province. The figure shows the district URLs under Anhui province.
Checking the page source shows that these district URLs all sit inside a ul (a list) under a tag with the class attribute city_hot, so we can parse out each district's URL, e.g. https://tianqi.moji.com/weather/china/anhui/fanchang-county, as shown in the figure. Here I chose to go into the 7-day forecast page and parse that. After clicking through, the URL changes, and the only change is the "weather" part of https://tianqi.moji.com/weather/china/anhui/fanchang-county, which becomes "forecast7", giving the 7-day forecast URL https://tianqi.moji.com/forecast7/china/anhui/fanchang-county. So the plan is clear: take the province URLs parsed earlier, issue a Request for each one, and let the returned response be handled by the allcity_url function, which parses the city URLs out of it. Those URLs don't point at the 7-day forecast I want to crawl, so I use replace to build the URL I actually need: https://tianqi.moji.com/forecast7/china/anhui/fanchang-county
The allcity_url function:
def allcity_url(self, response):
    city_urls = response.css('.city_hot li a::attr(href)').extract()  # all the cities under one province
    for city_url in city_urls:
        city_url = city_url.replace('weather', 'forecast7')  # rebuild the URL so we can fetch each district's 7-day forecast (the page actually gives 8 days)
        yield Request(url=city_url, callback=self.detail_parse)  # again, yield a Request and let detail_parse extract the results we want
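Just to make the replace concrete, this is what it does to the example URL from above (a literal-string illustration, nothing more):

url = 'https://tianqi.moji.com/weather/china/anhui/fanchang-county'
print(url.replace('weather', 'forecast7'))
# https://tianqi.moji.com/forecast7/china/anhui/fanchang-county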
At this point I have the URL of every city/district (the 7-day forecast one). In the detail_parse function I extract, for each district, the location (addr), dates (days), weekdays (weeks), weather (weathers) and temperatures (temps).
The 7-day forecast page looks like this:
You'll notice that each day's weather is described by two values, and they share the same class attribute; the weekday and the date also share one class attribute, so each selector returns two kinds of values interleaved. Below I split them apart to make later extraction and use easier.
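As a side note, that kind of interleaved list can also be split with slice steps instead of the loops used below; a small sketch using the sample structure from the comments in detail_parse (made-up values):

week = ['星期日', '8/12', '星期一', '8/13']
weeks = week[0::2]  # ['星期日', '星期一'] -> the weekdays
days = week[1::2]   # ['8/12', '8/13']    -> the dates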
The detail_parse function:
def detail_parse(self, response):
    weeks = []       # weekday names
    days = []        # dates, e.g. 8/12
    wea_befor = []   # Moji gives two weather descriptions per day; this holds the first one
    wea_after = []   # the second weather description
    temps = []       # temperatures
    weathers = []    # combined weather (both descriptions joined together)
    # get the location
    addr = response.css('.search_default em::text').extract_first('')
    # get weekday and date
    week = response.css('.week::text').extract()
    # weekdays
    for i in range(0, len(week), 2):  # weekday and date come interleaved, e.g. ['星期日', '8/12', '星期一', '8/13'], so step by 2 to separate them
        weeks.append(week[i])
    # dates
    for i in range(1, len(week), 2):
        days.append(week[i])
    # weather
    weather = response.css('.wea::text').extract()
    # the first weather description of each day
    for i in range(0, len(weather), 2):  # same interleaving as weekday/date
        wea_befor.append(weather[i])
    # the second weather description of each day
    for i in range(1, len(weather), 2):
        wea_after.append(weather[i])
    # temperatures
    max_tmp = response.css('.tree.clearfix p b::text').extract()
    min_tmp = response.css('.tree.clearfix p strong::text').extract()
    for i in range(len(max_tmp)):
        t = max_tmp[i] + '/' + min_tmp[i]
        temps.append(t)  # build temperatures like 23/22
    for i in range(len(wea_after)):
        we = wea_befor[i] + '/' + wea_after[i]
        weathers.append(we)  # build weather like 阵雨/多云
    item = TianqiItem()  # instantiate the item
    item['addr'] = addr  # store the data on the item
    item['weeks'] = weeks
    item['days'] = days
    item['temps'] = temps
    item['weathers'] = weathers
    print(item)
    yield item  # yield the item so the pipelines can process it
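The two pairing loops at the end could equally be written with zip, which walks the max/min and before/after lists in lockstep; a minimal equivalent sketch using the same variable names:

temps = [mx + '/' + mn for mx, mn in zip(max_tmp, min_tmp)]      # e.g. '23/22'
weathers = [b + '/' + a for b, a in zip(wea_befor, wea_after)]   # e.g. '阵雨/多云'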
Now we can do whatever we like with the data, such as saving it as JSON or CSV. Here I save both a JSON file and a CSV file, each implemented as a custom pipeline in pipelines.py. The project structure looks like this:
Code for each module:
tianqi.py
# -*- coding: utf-8 -*-
import scrapy
from TianQi import main  # importing main lets us run the spider from this file directly instead of going back to main.py
from urllib import parse
from scrapy.http import Request
from TianQi.items import TianqiItem


class TianqiSpider(scrapy.Spider):
    name = 'tianqi'
    # allowed_domains = ['moji.com']
    start_urls = ['https://tianqi.moji.com/weather/china/']

    def parse(self, response):
        # province names
        shengfen = response.css('.city_list.clearfix a::text').extract()
        # province URLs
        urls = response.css('.city_list.clearfix a::attr(href)').extract()
        for url in urls:
            url = parse.urljoin('https://tianqi.moji.com', url)
            yield Request(url=url, callback=self.allcity_url)

    def allcity_url(self, response):
        city_urls = response.css('.city_hot li a::attr(href)').extract()
        for city_url in city_urls:
            city_url = city_url.replace('weather', 'forecast7')
            yield Request(url=city_url, callback=self.detail_parse)

    def detail_parse(self, response):
        weeks = []
        days = []
        wea_befor = []
        wea_after = []
        temps = []
        weathers = []
        # location
        addr = response.css('.search_default em::text').extract_first('')
        # weekday and date
        week = response.css('.week::text').extract()
        # weekdays
        for i in range(0, len(week), 2):
            weeks.append(week[i])
        # dates
        for i in range(1, len(week), 2):
            days.append(week[i])
        # weather
        weather = response.css('.wea::text').extract()
        # first weather description of each day
        for i in range(0, len(weather), 2):
            wea_befor.append(weather[i])
        # second weather description of each day
        for i in range(1, len(weather), 2):
            wea_after.append(weather[i])
        # temperatures
        max_tmp = response.css('.tree.clearfix p b::text').extract()
        min_tmp = response.css('.tree.clearfix p strong::text').extract()
        for i in range(len(max_tmp)):
            t = max_tmp[i] + '/' + min_tmp[i]
            temps.append(t)
        for i in range(len(wea_after)):
            we = wea_befor[i] + '/' + wea_after[i]
            weathers.append(we)
        item = TianqiItem()
        item['addr'] = addr
        item['weeks'] = weeks
        item['days'] = days
        item['temps'] = temps
        item['weathers'] = weathers
        print(item)
        yield item
main.py
import os
import sys
from scrapy.cmdline import execute
dir_file = os.path.dirname(os.path.abspath(__file__))
sys.path.append(dir_file)
execute(['scrapy','crawl','tianqi','--nolog'])
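With this file in place, running it (for example python main.py from the project root, or running it straight from the IDE) is equivalent to typing scrapy crawl tianqi --nolog on the command line; execute() simply hands that argument list to Scrapy's command-line entry point.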
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class TianqiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    addr = scrapy.Field()
    weeks = scrapy.Field()
    days = scrapy.Field()
    weathers = scrapy.Field()
    temps = scrapy.Field()
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs
import csv
import json
import os


class TianqiPipeline(object):
    def process_item(self, item, spider):
        return item


class weatherPipeline(object):
    def __init__(self):
        self.file = codecs.open('weather.json', 'w', 'utf-8')

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(lines)
        return item

    def close_spider(self, spider):
        self.file.close()


class Pipeline_ToCSV(object):
    def __init__(self):
        # path of the csv file; it does not need to exist beforehand
        store_file = os.path.dirname(__file__) + '/spiders/weather.csv'
        # open (create) the file
        self.file = open(store_file, 'w', newline="")
        # dialect is the csv flavour to write (default "excel"); a delimiter="\t" argument would change the separator
        self.writer = csv.writer(self.file, dialect="excel")
        # self.writer.writerow(['地点'])

    def process_item(self, item, spider):
        # write the location, a header row, then one row per forecast day (each list element goes into its own cell)
        self.writer.writerow(['地点:', item['addr'].replace(',', '/')])
        self.writer.writerow(['星期', '日期', '气温', '天气'])
        for i in range(len(item['weeks'])):
            self.writer.writerow([item['weeks'][i], item['days'][i], item['temps'][i], item['weathers'][i]])
        self.writer.writerow([" "])
        return item

    def close_spider(self, spider):
        # close the file when the spider shuts down so everything is flushed to disk
        self.file.close()
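As an aside, for a plain JSON or CSV dump Scrapy's built-in feed exports can do this without any custom pipeline, e.g.:

scrapy crawl tianqi -o weather.json
scrapy crawl tianqi -o weather.csv

The custom Pipeline_ToCSV above is still worth keeping if you want the per-day row layout, since feed exports write one row per item.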
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for TianQi project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'TianQi'
SPIDER_MODULES = ['TianQi.spiders']
NEWSPIDER_MODULE = 'TianQi.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'TianQi (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'TianQi.middlewares.TianqiSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'TianQi.middlewares.TianqiDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'TianQi.pipelines.TianqiPipeline': 300,
    'TianQi.pipelines.weatherPipeline': 200,
    'TianQi.pipelines.Pipeline_ToCSV': 250,
}
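# Note: these numbers are priorities; pipelines run in ascending order (0-1000),
# so weatherPipeline (200) sees each item before Pipeline_ToCSV (250).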
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
That completes the program that crawls the weather for the whole country. The saved JSON file:
The saved CSV file:
The most valuable part is the analysis itself: to crawl the weather of every province I first need each province's URL; each province in turn contains many districts, so after getting the province URLs I go one level deeper, parse each province's page for the URLs of its districts, and then use those URLs to fetch each district's weather.
--------------------------------- divider ----------------------------------
A problem I ran into today: while crawling phone listings on Taobao I wanted to get into each phone's detail page, and from observation the key to reaching the detail page is the phone's id (the last item in the figure).
The id can't be parsed directly out of the page source, though; it turns out to live in a script tag (i.e. it's stored in the JS). Simply refreshing the page doesn't return the JS response that carries the id; you have to click "综合" (sort by relevance) or another option to trigger it. I didn't learn this from anywhere, I stumbled on it by accident. As you can see, the response is JSON, so it can be turned into a dict with json.loads(response.text) and then, combined with the page captured while sniffing the traffic, analyzed to pull out the id. The value extracted this way isn't the bare id, it has a string of other stuff attached, which is where the string split function comes in: id = str.split('_')[-1], and the last segment after splitting is the id. With the id extracted, building the URL "https://s.taobao.com/searchq=手机&app=detailproduct&pspuid=" + id gets you into each phone's detail page.
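A rough sketch of that id extraction. The JSON key path is left out because it depends on the captured response, and raw_value is just an illustrative placeholder for the "id plus extra stuff" string:

import json

data = json.loads(response.text)     # the captured response is JSON, so parse it into a dict
raw_value = 'itemlist_12345678'      # placeholder for the field dug out of data that holds the id with a prefix
item_id = raw_value.split('_')[-1]   # the last '_'-separated segment is the bare id
detail_url = 'https://s.taobao.com/searchq=手机&app=detailproduct&pspuid=' + item_id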