This is a practice exercise with Scrapy. I'm writing it down so I don't forget it and so I can look it up later.
Starting URL: https://tianqi.moji.com/weather/china/, shown in the figure below:
The idea: crawl this page to get the URL of every province. Inspecting the page shows that all the province links sit under this tag, but the hrefs are incomplete, so we have to build the full URLs ourselves, e.g. url = https://tianqi.moji.com/weather/china/anhui
The code below builds the URL for every province, then uses the yield generator to issue a Request for each one, with callback pointing at the allcity_url function, which parses each province's page for its cities so we can fetch their weather. The province URLs are constructed like this:
urls = response.css('.city_list.clearfix a::attr(href)').extract()
for url in urls:
    url = parse.urljoin('https://tianqi.moji.com', url)
    yield Request(url=url, callback=self.allcity_url)
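parse.urljoin is what glues the incomplete href onto the site root. A minimal illustration, assuming the extracted href looks like /weather/china/anhui (the actual value comes from the page):

from urllib import parse

href = '/weather/china/anhui'  # assumed shape of an href extracted above
full_url = parse.urljoin('https://tianqi.moji.com', href)
print(full_url)  # https://tianqi.moji.com/weather/china/anhui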
With the province URLs built, the next step is to parse out the districts under each province. The figure shows the district URLs under Anhui province.
Checking the page source shows that these district URLs all sit inside a ul (a list) under a tag with the class attribute city_hot, so we can parse out each district's URL, e.g. https://tianqi.moji.com/weather/china/anhui/fanchang-county, as shown in the figure. Here I chose to go into the 7-day forecast page and parse that. After clicking through, the URL changes, and the only change is the "weather" part of https://tianqi.moji.com/weather/china/anhui/fanchang-county, which becomes "forecast7", giving the 7-day forecast URL https://tianqi.moji.com/forecast7/china/anhui/fanchang-county. So the plan is clear: take the province URLs parsed earlier, issue a Request for each one, and let the returned response be handled by the allcity_url function, which parses the city URLs out of it. Those URLs don't point at the 7-day forecast I want to crawl, so I use replace to build the URL I actually need: https://tianqi.moji.com/forecast7/china/anhui/fanchang-county
The allcity_url function:
def allcity_url(self, response):
    city_urls = response.css('.city_hot li a::attr(href)').extract()  # all the cities under one province
    for city_url in city_urls:
        city_url = city_url.replace('weather', 'forecast7')  # rebuild the URL so we can fetch each district's 7-day forecast (the page actually gives 8 days)
        yield Request(url=city_url, callback=self.detail_parse)  # again, yield a Request and let detail_parse extract the results we want
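Just to make the replace concrete, this is what it does to the example URL from above (a literal-string illustration, nothing more):

url = 'https://tianqi.moji.com/weather/china/anhui/fanchang-county'
print(url.replace('weather', 'forecast7'))
# https://tianqi.moji.com/forecast7/china/anhui/fanchang-county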
At this point I have the URL of every city/district (the 7-day forecast one). In the detail_parse function I extract, for each district, the location (addr), dates (days), weekdays (weeks), weather (weathers) and temperatures (temps).
The 7-day forecast page looks like this:
You'll notice that each day's weather is described by two values, and they share the same class attribute; the weekday and the date also share one class attribute, so each selector returns two kinds of values interleaved. Below I split them apart to make later extraction and use easier.
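As a side note, that kind of interleaved list can also be split with slice steps instead of the loops used below; a small sketch using the sample structure from the comments in detail_parse (made-up values):

week = ['星期日', '8/12', '星期一', '8/13']
weeks = week[0::2]  # ['星期日', '星期一'] -> the weekdays
days = week[1::2]   # ['8/12', '8/13']    -> the dates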
The detail_parse function:
def detail_parse(self, response):
    weeks = []       # weekday names
    days = []        # dates, e.g. 8/12
    wea_befor = []   # Moji gives two weather descriptions per day; this holds the first one
    wea_after = []   # the second weather description
    temps = []       # temperatures
    weathers = []    # combined weather (both descriptions joined together)
    # get the location
    addr = response.css('.search_default em::text').extract_first('')
    # get weekday and date
    week = response.css('.week::text').extract()
    # weekdays
    for i in range(0, len(week), 2):  # weekday and date come interleaved, e.g. ['星期日', '8/12', '星期一', '8/13'], so step by 2 to separate them
        weeks.append(week[i])
    # dates
    for i in range(1, len(week), 2):
        days.append(week[i])
    # weather
    weather = response.css('.wea::text').extract()
    # the first weather description of each day
    for i in range(0, len(weather), 2):  # same interleaving as weekday/date
        wea_befor.append(weather[i])
    # the second weather description of each day
    for i in range(1, len(weather), 2):
        wea_after.append(weather[i])
    # temperatures
    max_tmp = response.css('.tree.clearfix p b::text').extract()
    min_tmp = response.css('.tree.clearfix p strong::text').extract()
    for i in range(len(max_tmp)):
        t = max_tmp[i] + '/' + min_tmp[i]
        temps.append(t)  # build temperatures like 23/22
    for i in range(len(wea_after)):
        we = wea_befor[i] + '/' + wea_after[i]
        weathers.append(we)  # build weather like 阵雨/多云
    item = TianqiItem()  # instantiate the item
    item['addr'] = addr  # store the data on the item
    item['weeks'] = weeks
    item['days'] = days
    item['temps'] = temps
    item['weathers'] = weathers
    print(item)
    yield item  # yield the item so the pipelines can process it
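The two pairing loops at the end could equally be written with zip, which walks the max/min and before/after lists in lockstep; a minimal equivalent sketch using the same variable names:

temps = [mx + '/' + mn for mx, mn in zip(max_tmp, min_tmp)]      # e.g. '23/22'
weathers = [b + '/' + a for b, a in zip(wea_befor, wea_after)]   # e.g. '阵雨/多云'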
Now we can do whatever we like with the data, such as saving it as JSON or CSV. Here I save both a JSON file and a CSV file, each implemented as a custom pipeline in pipelines.py. The project structure looks like this:
Code for each module:
tianqi.py
# -*- coding: utf-8 -*-
import scrapy
from TianQi import main  # importing main lets us run the spider from this file directly instead of going back to main.py
from urllib import parse
from scrapy.http import Request
from TianQi.items import TianqiItem


class TianqiSpider(scrapy.Spider):
    name = 'tianqi'
    # allowed_domains = ['moji.com']
    start_urls = ['https://tianqi.moji.com/weather/china/']

    def parse(self, response):
        # province names
        shengfen = response.css('.city_list.clearfix a::text').extract()
        # province URLs
        urls = response.css('.city_list.clearfix a::attr(href)').extract()
        for url in urls:
            url = parse.urljoin('https://tianqi.moji.com', url)
            yield Request(url=url, callback=self.allcity_url)

    def allcity_url(self, response):
        city_urls = response.css('.city_hot li a::attr(href)').extract()
        for city_url in city_urls:
            city_url = city_url.replace('weather', 'forecast7')
            yield Request(url=city_url, callback=self.detail_parse)

    def detail_parse(self, response):
        weeks = []
        days = []
        wea_befor = []
        wea_after = []
        temps = []
        weathers = []
        # location
        addr = response.css('.search_default em::text').extract_first('')
        # weekday and date
        week = response.css('.week::text').extract()
        # weekdays
        for i in range(0, len(week), 2):
            weeks.append(week[i])
        # dates
        for i in range(1, len(week), 2):
            days.append(week[i])
        # weather
        weather = response.css('.wea::text').extract()
        # first weather description of each day
        for i in range(0, len(weather), 2):
            wea_befor.append(weather[i])
        # second weather description of each day
        for i in range(1, len(weather), 2):
            wea_after.append(weather[i])
        # temperatures
        max_tmp = response.css('.tree.clearfix p b::text').extract()
        min_tmp = response.css('.tree.clearfix p strong::text').extract()
        for i in range(len(max_tmp)):
            t = max_tmp[i] + '/' + min_tmp[i]
            temps.append(t)
        for i in range(len(wea_after)):
            we = wea_befor[i] + '/' + wea_after[i]
            weathers.append(we)
        item = TianqiItem()
        item['addr'] = addr
        item['weeks'] = weeks
        item['days'] = days
        item['temps'] = temps
        item['weathers'] = weathers
        print(item)
        yield item
main.py
import os
import sys
from scrapy.cmdline import execute
dir_file = os.path.dirname(os.path.abspath(__file__))
sys.path.append(dir_file)
execute(['scrapy','crawl','tianqi','--nolog'])
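With this file in place, running it (for example python main.py from the project root, or running it straight from the IDE) is equivalent to typing scrapy crawl tianqi --nolog on the command line; execute() simply hands that argument list to Scrapy's command-line entry point.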
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class TianqiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    addr = scrapy.Field()
    weeks = scrapy.Field()
    days = scrapy.Field()
    weathers = scrapy.Field()
    temps = scrapy.Field()
pipelines.py
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import codecs
import csv
import json
import os


class TianqiPipeline(object):
    def process_item(self, item, spider):
        return item


class weatherPipeline(object):
    def __init__(self):
        self.file = codecs.open('weather.json', 'w', 'utf-8')

    def process_item(self, item, spider):
        lines = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(lines)
        return item

    def close_spider(self, spider):
        self.file.close()


class Pipeline_ToCSV(object):
    def __init__(self):
        # path of the csv file; it does not need to exist beforehand
        store_file = os.path.dirname(__file__) + '/spiders/weather.csv'
        # open (create) the file
        self.file = open(store_file, 'w', newline="")
        # dialect is the csv flavour to write (default "excel"); a delimiter="\t" argument would change the separator
        self.writer = csv.writer(self.file, dialect="excel")
        # self.writer.writerow(['地点'])

    def process_item(self, item, spider):
        # write the location, a header row, then one row per forecast day (each list element goes into its own cell)
        self.writer.writerow(['地点:', item['addr'].replace(',', '/')])
        self.writer.writerow(['星期', '日期', '气温', '天气'])
        for i in range(len(item['weeks'])):
            self.writer.writerow([item['weeks'][i], item['days'][i], item['temps'][i], item['weathers'][i]])
        self.writer.writerow([" "])
        return item

    def close_spider(self, spider):
        # close the file when the spider shuts down so everything is flushed to disk
        self.file.close()
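As an aside, for a plain JSON or CSV dump Scrapy's built-in feed exports can do this without any custom pipeline, e.g.:

scrapy crawl tianqi -o weather.json
scrapy crawl tianqi -o weather.csv

The custom Pipeline_ToCSV above is still worth keeping if you want the per-day row layout, since feed exports write one row per item.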
settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for TianQi project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'TianQi'
SPIDER_MODULES = ['TianQi.spiders']
NEWSPIDER_MODULE = 'TianQi.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'TianQi (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'TianQi.middlewares.TianqiSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'TianQi.middlewares.TianqiDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'TianQi.pipelines.TianqiPipeline': 300,
    'TianQi.pipelines.weatherPipeline': 200,
    'TianQi.pipelines.Pipeline_ToCSV': 250,
}
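# Note: these numbers are priorities; pipelines run in ascending order (0-1000),
# so weatherPipeline (200) sees each item before Pipeline_ToCSV (250).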
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
That completes the program that crawls the weather for the whole country. The saved JSON file:
The saved CSV file:
The most valuable part is the analysis itself: to crawl the weather of every province I first need each province's URL; each province in turn contains many districts, so after getting the province URLs I go one level deeper, parse each province's page for the URLs of its districts, and then use those URLs to fetch each district's weather.
--------------------------------- divider ----------------------------------
A problem I ran into today: while crawling phone listings on Taobao I wanted to get into each phone's detail page, and from observation the key to reaching the detail page is the phone's id (the last item in the figure).
The id can't be parsed directly out of the page source, though; it turns out to live in a script tag (i.e. it's stored in the JS). Simply refreshing the page doesn't return the JS response that carries the id; you have to click "综合" (sort by relevance) or another option to trigger it. I didn't learn this from anywhere, I stumbled on it by accident. As you can see, the response is JSON, so it can be turned into a dict with json.loads(response.text) and then, combined with the page captured while sniffing the traffic, analyzed to pull out the id. The value extracted this way isn't the bare id, it has a string of other stuff attached, which is where the string split function comes in: id = str.split('_')[-1], and the last segment after splitting is the id. With the id extracted, building the URL "https://s.taobao.com/searchq=手机&app=detailproduct&pspuid=" + id gets you into each phone's detail page.
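A rough sketch of that id extraction. The JSON key path is left out because it depends on the captured response, and raw_value is just an illustrative placeholder for the "id plus extra stuff" string:

import json

data = json.loads(response.text)     # the captured response is JSON, so parse it into a dict
raw_value = 'itemlist_12345678'      # placeholder for the field dug out of data that holds the id with a prefix
item_id = raw_value.split('_')[-1]   # the last '_'-separated segment is the bare id
detail_url = 'https://s.taobao.com/searchq=手机&app=detailproduct&pspuid=' + item_id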