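Key topics:
1. Python
2. The Scrapy crawling framework + MongoDB database
3. The website http://www.gpsspg.com/maps.htm
Background:
Recently a client asked for the latitude and longitude of 500 residential communities. Looking them up by hand on the website would take about a day, and the same manual work would have to be repeated whenever the client makes a similar request. Fortunately, Python's Scrapy framework can crawl the coordinates of these communities and automate the whole process.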
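Analyzing the data on http://www.gpsspg.com/maps.htm:
Typing an address into the site automatically pops up the search results; the first 10 can be compared to pick the most accurate GPS result.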
Tracing the page's background network traffic reveals the search request it sends and the data that comes back.
Request URL analysis: the value after wd= is the percent-encoded (URL-encoded) text typed into the input box.
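As a minimal sketch of that encoding, the standard library's urllib.parse.quote percent-encodes the UTF-8 bytes of the search text, and urlencode can assemble a whole query string. The helper name build_query and its argument names are illustrative assumptions, and only four of the request's parameters (qt, wd, pn, rn) are included here.

```python
from urllib.parse import quote, urlencode


def build_query(keyword, page=0, page_size=10):
    """Build a query string with the keyword percent-encoded (illustrative helper)."""
    # urlencode percent-encodes each value and joins the pairs with '&'.
    return urlencode({"qt": "poi", "wd": keyword, "pn": page, "rn": page_size})


# quote() encodes the UTF-8 bytes of each non-ASCII character as %XX.
encoded = quote("北京")
print(encoded)               # %E5%8C%97%E4%BA%AC
print(build_query("北京"))   # qt=poi&wd=%E5%8C%97%E4%BA%AC&pn=0&rn=10
```

Building the query from a dict avoids forgetting a separator between parameters when the URL is concatenated by hand.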
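The site returns JSON data (wrapped in a JSONP callback, since the request uses output=jsonp). Comparison shows that the pois field holds 10 query results, which correspond exactly to the first 10 results displayed on the site.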
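Conclusion: the JSON returned by the site contains the GPS information we need. After extracting the JSON with Python and processing it, we can find the GPS coordinates that meet our requirements.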
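The extraction step might look like the sketch below, which strips a JSONP wrapper and reads the pois list. The helper name unwrap_jsonp and the sample response are hypothetical and heavily shortened; only the field names detail, pois, name, pointx and pointy are taken from the spider's code.

```python
import json


def unwrap_jsonp(text):
    """Return the JSON payload inside a JSONP response such as 'cb && cb({...})'."""
    start = text.index("{")       # the first brace opens the JSON object
    end = text.rindex("}") + 1    # the last brace closes it
    return json.loads(text[start:end])


# Hypothetical, shortened response; a real one carries many more fields.
sample = ('cb && cb({"detail": {"pois": [{"name": "Example Garden", '
          '"pointx": "116.40", "pointy": "39.90"}]}})')
pois = unwrap_jsonp(sample)["detail"]["pois"]
print(pois[0]["name"], pois[0]["pointx"], pois[0]["pointy"])
```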
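The Scrapy crawling framework:
Python has several crawling frameworks, such as Scrapy and pyspider. For crawling a small site I usually use Scrapy, which is relatively simple.
Project layout:
spiders/main.py is the program entry point
spiders/getpoint.py is the Scrapy spider that does the crawling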
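The source of main.py is not shown here. One common shape for such an entry point, sketched under the assumption that it simply launches the getpoint spider through Scrapy's command line, is:

```python
import subprocess


def crawl_command(spider_name):
    """Command line that starts one spider through the scrapy CLI."""
    return ["scrapy", "crawl", spider_name]


def main():
    # Run from the project root (where scrapy.cfg lives) so Scrapy finds the project settings.
    subprocess.run(crawl_command("getpoint"), check=True)


print(" ".join(crawl_command("getpoint")))   # scrapy crawl getpoint
```

Calling main() at the end of main.py would start the crawl; it is left uncalled here so the sketch can be run without a Scrapy project.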
Data processing logic:
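getpoint.py source:
getpoint.py reads the names of the communities to look up from source.xlsx, fetches up to 10 candidate coordinates for each name directly from the site, and picks the most accurate GPS result by taking the candidate whose name has the highest similarity score.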
# -*- coding: utf-8 -*-
import difflib
import json
from urllib.parse import quote

import chardet
import pandas as pd
import scrapy

from gpsspg.items import PointItem


class GetpointSpider(scrapy.Spider):
    name = 'getpoint'
    # Domain names only, without a scheme; the requests go to the map API host.
    allowed_domains = ['apis.map.qq.com']
    base_url = 'https://apis.map.qq.com/jsapi'
    # Columns used: 城市 (city), 县区 (district), 小区名称 (community name).
    df = pd.read_excel("source.xlsx", sheet_name='Sheet1')

    def start_requests(self):
        for i in self.df.index:
            city = self.df.loc[i, '城市']
            area = self.df.loc[i, '县区']
            name = self.df.loc[i, '小区名称']
            # Search for city + district + community name, percent-encoded.
            searchkey = city + area + name
            para = r'?qt=poi&wd=' + quote(searchkey) + '&pn=0&rn=10&rich_source=qipao&rich=web&nj=0&c=1&key=FBOBZ-VODWU-C7SVF-B2BDI-UK3JE-YBFUS&output=jsonp&pf=jsapi&ref=jsapi&cb=qq.maps._svcb3.search_service_0'
            url = self.base_url + para
            yield scrapy.Request(url=url, method='GET', callback=self.parse,
                                 meta={'name': name, 'city': city, 'area': area})

    def parse(self, response):
        # Detect the body encoding, falling back to UTF-8 when detection fails.
        cs = chardet.detect(response.body)
        rsp = response.body.decode(cs.get('encoding') or 'utf-8')
        # Strip the JSONP wrapper "callback && callback( ... )" to leave plain JSON.
        rsp = rsp[rsp.index('(') + 1:rsp.rindex(')')]
        self.logger.debug(rsp)
        data = json.loads(rsp)
        name = response.meta['name']
        city = response.meta['city']
        area = response.meta['area']
        ls = data['detail']['pois']
        for l in ls:
            # Score only candidates in the same city and district.
            if city == l['POI_PATH'][1]['cname'] and area == l['POI_PATH'][0]['cname']:
                r = difflib.SequenceMatcher(None, name, l['name']).quick_ratio()
            else:
                r = 0
            l['result'] = r
        # Most similar candidate first.
        tuple_data = sorted(ls, key=lambda x: x['result'], reverse=True)
        # Keep the best candidate only if it is similar enough.
        if tuple_data and tuple_data[0]['result'] > 0.62:
            best = tuple_data[0]
            item = PointItem()
            item['city'] = city
            item['area'] = area
            item['name'] = name
            item['x'] = best['pointx']
            item['y'] = best['pointy']
            item['scity'] = best['POI_PATH'][1]['cname']
            item['sarea'] = best['POI_PATH'][0]['cname']
            item['saddr'] = best['addr']
            item['sname'] = best['name']
            item['sresult'] = best['result']
            yield item
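The matching step in parse() can also be isolated into a pure function, which makes it easy to try out without running a crawl. The sketch below assumes the same POI_PATH layout the spider relies on (index 0 is the district, index 1 is the city); the function name pick_best and the sample candidates are hypothetical.

```python
import difflib


def pick_best(name, city, area, pois, threshold=0.62):
    """Return (poi, score) for the best same-city, same-district match, or None."""
    best, best_score = None, 0.0
    for poi in pois:
        path = poi["POI_PATH"]
        # Candidates outside the requested city or district score 0.
        if city == path[1]["cname"] and area == path[0]["cname"]:
            score = difflib.SequenceMatcher(None, name, poi["name"]).quick_ratio()
        else:
            score = 0.0
        if score > best_score:
            best, best_score = poi, score
    return (best, best_score) if best_score > threshold else None


# Hypothetical candidates, shaped like the fields the spider reads.
chaoyang = [{"cname": "Chaoyang"}, {"cname": "Beijing"}]
haidian = [{"cname": "Haidian"}, {"cname": "Beijing"}]
pois = [
    {"name": "Sunny Garden", "POI_PATH": chaoyang, "pointx": "116.48", "pointy": "39.95"},
    {"name": "Sunny Garden East Gate", "POI_PATH": chaoyang, "pointx": "116.49", "pointy": "39.95"},
    {"name": "Sunny Garden", "POI_PATH": haidian, "pointx": "116.30", "pointy": "39.98"},
]
match = pick_best("Sunny Garden", "Beijing", "Chaoyang", pois)
print(match[0]["pointx"], match[0]["pointy"], match[1])
```

Returning None when nothing clears the threshold mirrors the spider, which yields no item in that case.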
pipelines.py source:
This module's main job is to save the extracted latitude and longitude information to the MongoDB database.
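from pymongo import MongoClient


class GpsspgPipeline(object):
    def __init__(self, server, port, db_name, collection_name):
        self.client = MongoClient(server, port)
        self.collection_companyinfo = self.client[db_name][collection_name]

    @classmethod
    def from_crawler(cls, crawler):
        # Read the MongoDB connection settings from the project's settings.
        s = crawler.settings
        return cls(s.get('MONGODB_SERVER'), s.getint('MONGODB_PORT'),
                   s.get('MONGODB_DB'), s.get('COLLECTION_POINT'))

    def process_item(self, item, spider):
        print(item['name'] + ":::" + item['sname'] + ":::" + str(item['sresult']))
        # Store the item as one MongoDB document.
        self.collection_companyinfo.insert_one(dict(item))
        # Return the item so that any later pipeline still receives it.
        return item

    def close_spider(self, spider):
        self.client.close()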
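The pipeline reads four settings keys (MONGODB_SERVER, MONGODB_PORT, MONGODB_DB, COLLECTION_POINT) and has to be enabled in the project's settings.py. The fragment below is a sketch of what that file might contain: every value is a placeholder for a local MongoDB instance, and the module path assumes the project package is named gpsspg, as the import in getpoint.py suggests.

```python
# settings.py (fragment): placeholder values for a local MongoDB instance.
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "gpsspg"          # assumed database name
COLLECTION_POINT = "point"     # assumed collection name

# Enable the pipeline; lower numbers run earlier when several pipelines are listed.
ITEM_PIPELINES = {
    "gpsspg.pipelines.GpsspgPipeline": 300,
}
```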
items.py source:
import scrapy
from scrapy import Field


class PointItem(scrapy.Item):
    x = Field()
    y = Field()
    scity = Field()
    sarea = Field()
    saddr = Field()
    sname = Field()
    sresult = Field()
    name = Field()
    addr = Field()
    area = Field()
    city = Field()
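Results:
After a run, each matched community is stored in the MongoDB database as one document with the fields defined in items.py.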
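Key algorithm:
Ten candidate positions are extracted from the JSON, each candidate's name is compared with the community name, and the coordinates of the most similar name are used. The similarity measure I had in mind is edit distance, first proposed by the Russian scientist Vladimir Levenshtein and also known as Levenshtein distance. Strictly speaking, though, the code in getpoint.py calls difflib.SequenceMatcher.quick_ratio(), which returns a fast upper bound on difflib's similarity ratio based on the characters the two strings share, rather than a true Levenshtein distance.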
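For comparison, here is a minimal pure-Python sketch of Levenshtein distance, together with a normalized similarity that could be checked against a threshold. Neither function is part of the project; the names levenshtein and levenshtein_similarity are illustrative.

```python
import difflib


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute, or keep if equal
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Scale the distance to the range 0..1 so it can be compared with a threshold."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))   # 3
# quick_ratio() ignores character order; edit distance does not.
print(difflib.SequenceMatcher(None, "abc", "cba").quick_ratio())   # 1.0
print(levenshtein("abc", "cba"))          # 2
```

Because the two measures can disagree on reordered names, a threshold such as 0.62 tuned for one of them would need to be re-checked before switching to the other.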
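Main code: the scoring loop in parse() of getpoint.py above, that is, the difflib.SequenceMatcher(None, name, l['name']).quick_ratio() call followed by the 0.62 threshold check.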