Scraping Moji Weather (Nationwide) with Scrapy

Author: 这个太难了 | Published 2018-08-13 22:13

This was a Scrapy practice project; I'm writing it up so I don't forget it and can refer back to it later.
Starting URL: https://tianqi.moji.com/weather/china/, as shown below:


Approach: crawl this page to collect the URL of every province. Inspecting the page shows that all the province links sit under one element, but the hrefs are not complete URLs, so we have to build the full URLs ourselves, e.g. url = https://tianqi.moji.com/weather/china/anhui
The code below builds the URL for every province, then uses a yield generator to issue a Request for each one, with callback pointing at the allcity_url function, which parses the cities out of each province page so their weather can be fetched. The province URLs are built like this:
        urls = response.css('.city_list.clearfix a::attr(href)').extract()
        for url in urls:
            url = parse.urljoin('https://tianqi.moji.com', url)
            yield Request(url=url, callback=self.allcity_url)
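As a quick sanity check (plain Python, run outside the spider; the href value here is illustrative), urljoin stitches a relative province href onto the site root:

from urllib import parse

# joining the site root with a relative province href
print(parse.urljoin('https://tianqi.moji.com', '/weather/china/anhui'))
# -> https://tianqi.moji.com/weather/china/anhui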

Once we have each province's URL, we need to parse out the cities and counties under it; the figure shows the district URLs under Anhui province.

Checking the source shows that these district links all sit inside a ul (a list) under an element whose class is city_hot, so we can parse out the URL of every district in the province.
That gives URLs like https://tianqi.moji.com/weather/china/anhui/fanchang-county, as shown in the figure. I chose to go into the 7-day forecast page. Once inside, the URL changes, but comparing carefully, the only thing that changes in https://tianqi.moji.com/weather/china/anhui/fanchang-county is the weather segment, which becomes forecast7, so the 7-day forecast URL is https://tianqi.moji.com/forecast7/china/anhui/fanchang-county. That settles the approach: take the province URLs parsed earlier, issue a Request for each, and hand the response to allcity_url, which parses the city URLs out of it. Those URLs don't point at the 7-day forecast I want, so replace is used to rewrite them into the form I need: https://tianqi.moji.com/forecast7/china/anhui/fanchang-county
The allcity_url function:
    def allcity_url(self, response):
        city_urls = response.css('.city_hot li a::attr(href)').extract()  # every city under one province
        for city_url in city_urls:
            city_url = city_url.replace('weather', 'forecast7')  # rewrite the URL so we get each district's 7-day forecast (in practice the page returns 8 days)
            yield Request(url=city_url, callback=self.detail_parse)  # again, yield a Request and let detail_parse extract what we want
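One caveat worth noting: str.replace swaps every occurrence of 'weather' in the string. That is fine here because 'weather' only shows up as the path segment we want to change, but a slightly more defensive variant (my own tweak, not part of the original spider) limits the replacement to that segment:

# defensive variant: only rewrite the '/weather/' path segment, and only once
city_url = 'https://tianqi.moji.com/weather/china/anhui/fanchang-county'
forecast_url = city_url.replace('/weather/', '/forecast7/', 1)
print(forecast_url)  # https://tianqi.moji.com/forecast7/china/anhui/fanchang-county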

Now I have the 7-day-forecast URL for every city. In detail_parse I extract, for each district, the location (addr), dates (days), weekdays (weeks), weather (weathers), and temperatures (temps).

The 7-day forecast page looks like this:

Notice that each day's weather has two descriptions and both sit under the same class attribute; likewise the weekday and the date share one class attribute, so each selector comes back with the values interleaved. Below I split them apart to make later extraction and use easier.

The detail_parse function:
    def detail_parse(self, response):
        weeks = []       # weekdays
        days = []        # dates (month/day)
        wea_befor = []   # Moji gives two weather descriptions per day; this holds the first
        wea_after = []   # ... and this holds the second
        temps = []       # temperatures
        weathers = []    # combined weather descriptions
        # location
        addr = response.css('.search_default em::text').extract_first('')
        # weekday and date come back from the same selector
        week = response.css('.week::text').extract()
        # weekdays: the list is interleaved like ['星期日', '8/12', '星期一', '8/13'],
        # so stepping by 2 separates the weekdays from the dates
        for i in range(0, len(week), 2):
            weeks.append(week[i])
        # dates
        for i in range(1, len(week), 2):
            days.append(week[i])
        # weather descriptions
        weather = response.css('.wea::text').extract()
        # first description of each day (same interleaving as above)
        for i in range(0, len(weather), 2):
            wea_befor.append(weather[i])
        # second description of each day
        for i in range(1, len(weather), 2):
            wea_after.append(weather[i])
        # temperatures
        max_tmp = response.css('.tree.clearfix p b::text').extract()
        min_tmp = response.css('.tree.clearfix p strong::text').extract()
        for i in range(len(max_tmp)):
            t = max_tmp[i] + '/' + min_tmp[i]
            temps.append(t)  # format temperatures as '23/22'
        for i in range(len(wea_after)):
            we = wea_befor[i] + '/' + wea_after[i]
            weathers.append(we)  # format weather as '阵雨/多云' (shower/cloudy)
        item = TianqiItem()      # instantiate the item
        item['addr'] = addr      # store the fields on the item
        item['weeks'] = weeks
        item['days'] = days
        item['temps'] = temps
        item['weathers'] = weathers
        print(item)
        yield item  # yield the item so the pipelines can process it
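For what it's worth, the interleaved lists can also be split with step slicing and paired up with zip; this is equivalent to the range(..., 2) loops above (the sample values are just for illustration):

week = ['星期日', '8/12', '星期一', '8/13']   # what the .week selector returns
weeks = week[0::2]                            # ['星期日', '星期一']
days = week[1::2]                             # ['8/12', '8/13']

max_tmp = ['23', '30']
min_tmp = ['18', '22']
temps = [hi + '/' + lo for hi, lo in zip(max_tmp, min_tmp)]   # ['23/18', '30/22']
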
With that, the data can be post-processed however we like, e.g. saved as JSON or CSV. Here I save both; the JSON and CSV writers are functions I configured myself in pipelines.py. The project structure looks like this:

The code for each module:

tianqi.py
# -*- coding: utf-8 -*-
import scrapy
from TianQi import main  # this saves us from having to switch to main.py to run the program; with this import the spider can be run directly from this file
from urllib import parse
from scrapy.http import Request
from TianQi.items import TianqiItem

class TianqiSpider(scrapy.Spider):
    name = 'tianqi'
    # allowed_domains = ['moji.com']
    start_urls = ['https://tianqi.moji.com/weather/china/']

    def parse(self, response):
        # province names (extracted but not used further)
        shengfen = response.css('.city_list.clearfix a::text').extract()
        # province URLs
        urls = response.css('.city_list.clearfix a::attr(href)').extract()
        for url in urls:
            url = parse.urljoin('https://tianqi.moji.com',url)
            yield Request(url=url,callback=self.allcity_url)


    def allcity_url(self,response):
        city_urls = response.css('.city_hot li a::attr(href)').extract()
        for city_url in city_urls:
            city_url = city_url.replace('weather','forecast7')
            yield Request(url=city_url,callback=self.detail_parse)
    def detail_parse(self,response):
        weeks = []
        days = []
        wea_befor = []
        wea_after = []
        temps = []
        weathers = []
        # location
        addr = response.css('.search_default em::text').extract_first('')
        # weekday and date come back from the same selector
        week = response.css('.week::text').extract()
        # weekdays (even indices of the interleaved list)
        for i in range(0,len(week),2):
            weeks.append(week[i])
        # dates (odd indices)
        for i in range(1, len(week), 2):
            days.append(week[i])
        # weather descriptions
        weather = response.css('.wea::text').extract()
        # first description of each day
        for i in range(0,len(weather),2):
            wea_befor.append(weather[i])
        # second description of each day
        for i in range(1,len(weather),2):
            wea_after.append(weather[i])
        # temperatures
        max_tmp = response.css('.tree.clearfix p b::text').extract()
        min_tmp = response.css('.tree.clearfix p strong::text').extract()
        for i in range(len(max_tmp)):
            t = max_tmp[i]+'/'+min_tmp[i]
            temps.append(t)
        for i in range(len(wea_after)):
            we = wea_befor[i]+'/'+wea_after[i]
            weathers.append(we)
        item = TianqiItem()
        item['addr'] = addr
        item['weeks'] = weeks
        item['days'] = days
        item['temps'] = temps
        item['weathers'] = weathers
        print(item)
        yield item
main.py
import os
import sys
from scrapy.cmdline import execute

# put the project directory on sys.path, then launch the spider as if
# 'scrapy crawl tianqi --nolog' had been typed on the command line
dir_file = os.path.dirname(os.path.abspath(__file__))
sys.path.append(dir_file)
execute(['scrapy','crawl','tianqi','--nolog'])
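As a side note, another way to launch the spider from a script, without the from TianQi import main trick inside the spider module, is Scrapy's CrawlerProcess API; a minimal sketch:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from TianQi.spiders.tianqi import TianqiSpider

# load the project settings (pipelines included) and run the spider in-process
process = CrawlerProcess(get_project_settings())
process.crawl(TianqiSpider)
process.start()   # blocks until the crawl finishes
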
items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TianqiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    addr = scrapy.Field()
    weeks = scrapy.Field()
    days = scrapy.Field()
    weathers = scrapy.Field()
    temps = scrapy.Field()
pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class TianqiPipeline(object):
    def process_item(self, item, spider):
        return item

import codecs
import json
class weatherPipeline(object):
    def __init__(self):
        self.file = codecs.open('weather.json','w','utf-8')
    def process_item(self, item, spider):
        lines = json.dumps(dict(item),ensure_ascii=False)+'\n'
        self.file.write(lines)
        return item
    def close_spider(self,spider):
        self.file.close()


import csv
import os
class Pipeline_ToCSV(object):

    def __init__(self):
        # where the csv file will be written; it does not need to exist beforehand
        store_file = os.path.dirname(__file__) + '/spiders/weather.csv'
        # open (create) the file
        self.file = open(store_file, 'w', newline='', encoding='utf-8-sig')  # utf-8-sig so Excel detects the encoding
        # dialect is the csv flavour to write with (default 'excel'); a delimiter argument would change the separator
        self.writer = csv.writer(self.file, dialect='excel')
        # self.writer.writerow(['地点'])
    def process_item(self, item, spider):
        # writerow writes one csv row, putting each element of the list into its own cell (loop to write several rows)
        self.writer.writerow(['地点:',item['addr'].replace(',','/')])
        self.writer.writerow(['星期','日期','气温','天气'])
        for i in range(len(item['weeks'])):
            self.writer.writerow([item['weeks'][i],item['days'][i],item['temps'][i],item['weathers'][i]])
        self.writer.writerow([" "])
        return item

    def close_spider(self, spider):
        # close the file (and flush the output) when the spider shuts down
        self.file.close()
settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for TianQi project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'TianQi'

SPIDER_MODULES = ['TianQi.spiders']
NEWSPIDER_MODULE = 'TianQi.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'TianQi (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'TianQi.middlewares.TianqiSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'TianQi.middlewares.TianqiDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # 'TianQi.pipelines.TianqiPipeline': 300,
    'TianQi.pipelines.weatherPipeline': 200,
    'TianQi.pipelines.Pipeline_ToCSV': 250,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
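As an aside, Scrapy's built-in feed exports can also dump the items straight to JSON without a custom pipeline (they produce a flat dump rather than the per-city CSV layout above); a minimal sketch of the extra settings:

# add to settings.py (or pass -o weather.json on the command line)
FEED_FORMAT = 'json'
FEED_URI = 'weather.json'
FEED_EXPORT_ENCODING = 'utf-8'   # write Chinese text directly instead of \u escapes
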
That completes the nationwide weather crawler. The saved JSON file and the saved CSV file are shown in the figures. The part most worth noting is the analysis itself: to crawl the weather for every province, I first need every province's URL; each province in turn contains many districts, so after getting the province URLs the next step is to parse each province page for its district URLs, and then use those URLs to fetch each district's weather.

----------------------------------------------------------------------


A problem I ran into today: while scraping phone listings on Taobao I wanted to get into each phone's detail page, and from observation the key to reaching the detail page is the phone's id (the last item in the figure).

But the id can't be parsed straight out of the page source; it turns out to live inside a script tag (i.e. in the JS), and the request that returns the ids is not made on a plain refresh. You have to click 综合 (default sort) or another sort option to trigger it; I didn't work that out systematically, I stumbled on it by trial and error. As the figure shows, the response is JSON, so json.loads(response.text) turns it into a dict, and then, working from the captured request, the id can be picked out. The value obtained this way is not the bare id, though; it has extra text attached, so the string split function is used: id = str.split('_')[-1] (where str is the raw value), and the last piece after splitting on '_' is the id. With the id in hand, build a URL like
"https://s.taobao.com/search?q=手机&app=detailproduct&pspuid=" + id, which opens each phone's detail page.
