Automatically scraping GPS coordinates for place names with Python!

Author: 14e61d025165 | Published: 2019-04-28 15:49

    Key points:

    1. Python

    2. The Scrapy crawler framework + the MongoDB database

    3. The website http://www.gpsspg.com/maps.htm

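    Background:

    A client recently asked for the latitude and longitude of 500 residential communities. By my estimate, looking them up by hand on the website would take about a day, and the same manual work would have to be repeated whenever a similar request came in. Fortunately, Python's Scrapy framework can scrape the coordinates and automate the whole process.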

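    Analyzing the data behind http://www.gpsspg.com/maps.htm:

    Entering an address on this page brings up a list of search results automatically. The first 10 results can be compared to pick the most accurate GPS coordinates.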

    Tracing the requests the page sends in the background shows the following:

    Request URL analysis: the value after wd= is the percent-encoded (UTF-8) text typed into the search box.

    Request URL: https://apis.map.qq.com/jsapi?qt=poi&wd=%E7%9F%B3%E5%AE%B6%E5%BA%84%E6%A1%A5%E8%A5%BF%E5%8C%BA%E7%95%99%E8%90%A5%E5%8D%8E%E8%8B%91&pn=0&rn=10&rich_source=qipao&rich=web&nj=0&c=1&key=FBOBZ-VODWU-C7SVF-B2BDI-UK3JE-YBFUS&output=jsonp&pf=jsapi&ref=jsapi&cb=qq.maps._svcb3.search_service_0
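
    As a rough sketch of how such a request URL can be assembled with the standard library: the parameter names and values below are copied from the captured URL, while `build_search_url`, its `api_key` argument, and the "YOUR-KEY" placeholder are hypothetical names introduced only for illustration.

```python
from urllib.parse import quote, unquote, urlencode

BASE_URL = "https://apis.map.qq.com/jsapi"


def build_search_url(keyword, api_key, page=0, rows=10):
    """Assemble a POI search URL; the keyword is percent-encoded as UTF-8."""
    params = {
        "qt": "poi",
        "wd": keyword,
        "pn": page,
        "rn": rows,
        "rich_source": "qipao",
        "rich": "web",
        "nj": 0,
        "c": 1,
        "key": api_key,
        "output": "jsonp",
        "pf": "jsapi",
        "ref": "jsapi",
        "cb": "qq.maps._svcb3.search_service_0",
    }
    # quote_via=quote encodes spaces as %20 instead of '+'
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


# The wd value captured in the browser decodes back to the text that was typed.
captured = "%E7%9F%B3%E5%AE%B6%E5%BA%84%E6%A1%A5%E8%A5%BF%E5%8C%BA%E7%95%99%E8%90%A5%E5%8D%8E%E8%8B%91"
keyword = unquote(captured)
url = build_search_url(keyword, api_key="YOUR-KEY")
print(keyword)
print(url)
```

    Building the query from a dict keeps every parameter separated by `&`, which is easy to get wrong when concatenating strings by hand.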

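    The endpoint returns JSON (wrapped in a JSONP callback, since the request asks for output=jsonp). Comparing it with the page shows that the pois field holds 10 results, which correspond exactly to the first 10 results displayed on the site.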

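    Conclusion: the JSON returned by the site contains the GPS information we need. Extracting and processing that JSON in Python is enough to find the coordinates that match each query.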
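
    A minimal sketch of that extraction step, assuming the response body is a JSONP call whose single argument is the JSON object. `strip_jsonp` is a hypothetical helper name, and the sample payload is made up to mirror the field names used later in the spider (detail, pois, name, pointx, pointy); it is not real API output.

```python
import json


def strip_jsonp(text):
    """Return the JSON payload of a JSONP response as a Python object."""
    # Keep only what lies between the first '(' and the last ')'.
    start = text.index("(") + 1
    end = text.rindex(")")
    return json.loads(text[start:end])


# Made-up payload in the shape described above (not real API output).
sample = (
    'qq.maps._svcb3.search_service_0 && qq.maps._svcb3.search_service_0('
    '{"detail": {"pois": [{"name": "Example Garden", '
    '"pointx": "114.0", "pointy": "38.0"}]}}'
    ');'
)
data = strip_jsonp(sample)
first = data["detail"]["pois"][0]
print(first["name"], first["pointx"], first["pointy"])
```

    Slicing between the outermost parentheses avoids depending on the exact callback name or on whether the body ends with a semicolon.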

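    The Scrapy crawler framework:

    There are many Python-based crawler frameworks, such as Scrapy and pyspider. For scraping small sites I usually reach for Scrapy, which is relatively simple.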

    Project layout:

    spiders/main.py is the program entry point

    spiders/getpoint.py is the Scrapy spider that performs the crawl

    Data-processing logic:

    [image]

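    getpoint.py source:

    getpoint.py reads the names of the communities to look up from source.xlsx, fetches up to 10 candidate coordinates for each name directly from the site's search endpoint, and picks the most accurate one by string similarity (the candidate whose name scores highest against the community name).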

    # -*- coding: utf-8 -*-
    import difflib
    import json
    from urllib.parse import quote

    import chardet
    import pandas as pd
    import scrapy

    from gpsspg.items import PointItem


    class GetpointSpider(scrapy.Spider):
        name = 'getpoint'
        # Requests go to the map API host; entries must be bare domains without a scheme
        allowed_domains = ['apis.map.qq.com']
        base_url = 'https://apis.map.qq.com/jsapi'
        # Input sheet with the columns 城市 (city), 县区 (district), 小区名称 (community name), 地址 (address)
        df = pd.read_excel("source.xlsx", sheet_name='Sheet1')

        def start_requests(self):
            for i in self.df.index:
                # Search keyword: city + district + community name
                searchkey = self.df.loc[i, '城市'] + self.df.loc[i, '县区'] + self.df.loc[i, '小区名称']
                address = self.df.loc[i, '地址']  # read from the sheet but not used below
                para = r'?qt=poi&wd=' + quote(searchkey) + '&pn=0&rn=10&rich_source=qipao&rich=web&nj=0&c=1&key=FBOBZ-VODWU-C7SVF-B2BDI-UK3JE-YBFUS&output=jsonp&pf=jsapi&ref=jsapi&cb=qq.maps._svcb3.search_service_0'
                url = self.base_url + para
                yield scrapy.Request(url=url, method='GET', callback=self.parse,
                                     meta={'name': self.df.loc[i, '小区名称'],
                                           'city': self.df.loc[i, '城市'],
                                           'area': self.df.loc[i, '县区']})

        def parse(self, response):
            cs = chardet.detect(response.body)
            # chardet may report encoding=None, so fall back to UTF-8
            rsp = response.body.decode(cs.get('encoding') or 'utf-8')
            # The body looks like
            # qq.maps._svcb3.search_service_0 && qq.maps._svcb3.search_service_0({...})
            # so keep only what lies between the first '(' and the last ')'
            rsp = rsp[rsp.index('(') + 1:rsp.rindex(')')]
            self.logger.debug(rsp)
            data = json.loads(rsp)
            name = response.meta['name']
            city = response.meta['city']
            area = response.meta['area']
            pois = data['detail']['pois']
            for poi in pois:
                # Score only candidates in the expected city and district
                if city == poi['POI_PATH'][1]['cname'] and area == poi['POI_PATH'][0]['cname']:
                    # quick_ratio() is a fast upper bound on SequenceMatcher.ratio()
                    r = difflib.SequenceMatcher(None, name, poi['name']).quick_ratio()
                else:
                    r = 0
                poi['result'] = r
            ranked = sorted(pois, key=lambda x: x['result'], reverse=True)
            # Emit an item only when a sufficiently similar candidate exists
            if ranked and ranked[0]['result'] > 0.62:
                best = ranked[0]
                item = PointItem()
                item['city'] = city
                item['area'] = area
                item['name'] = name
                item['x'] = best['pointx']
                item['y'] = best['pointy']
                item['scity'] = best['POI_PATH'][1]['cname']
                item['sarea'] = best['POI_PATH'][0]['cname']
                item['saddr'] = best['addr']
                item['sname'] = best['name']
                item['sresult'] = best['result']
                yield item
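
    The matching step in parse can also be tried in isolation. The sketch below is a simplified stand-in, not the spider's exact code: `pick_best_poi` is a hypothetical helper, and the sample POIs use made-up romanized names rather than real API output. The field names (name, POI_PATH, cname) and the 0.62 threshold are taken from the spider.

```python
import difflib


def pick_best_poi(name, city, area, pois, threshold=0.62):
    """Return (poi, score) for the most similar POI in the right city/district, or None."""
    best, best_score = None, 0.0
    for poi in pois:
        path = poi.get("POI_PATH", [])
        # POI_PATH[0] is the district and POI_PATH[1] the city, as in the spider.
        if len(path) < 2 or path[1]["cname"] != city or path[0]["cname"] != area:
            continue
        score = difflib.SequenceMatcher(None, name, poi["name"]).quick_ratio()
        if score > best_score:
            best, best_score = poi, score
    if best is None or best_score <= threshold:
        return None
    return best, best_score


# Made-up candidates: same name in the wrong district, and an unrelated name.
pois = [
    {"name": "Liuying Huayuan",
     "POI_PATH": [{"cname": "Other"}, {"cname": "Shijiazhuang"}]},
    {"name": "Liuying Huayuan",
     "POI_PATH": [{"cname": "Qiaoxi"}, {"cname": "Shijiazhuang"}]},
    {"name": "Central Park",
     "POI_PATH": [{"cname": "Qiaoxi"}, {"cname": "Shijiazhuang"}]},
]
match = pick_best_poi("Liuying Huayuan", "Shijiazhuang", "Qiaoxi", pois)
print(match)
```

    Returning None when nothing clears the threshold makes the "no reliable match" case explicit, so the caller can skip the record instead of storing a doubtful coordinate.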
    

    pipelines.py source:

    This module saves the extracted latitude/longitude records to a MongoDB database.

    from pymongo import MongoClient
    from scrapy.utils.project import get_project_settings


    class GpsspgPipeline(object):

        def __init__(self):
            # Connection details come from the project's settings.py
            settings = get_project_settings()
            self.client = MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
            db = self.client[settings['MONGODB_DB']]
            self.collection_companyinfo = db[settings['COLLECTION_POINT']]

        def process_item(self, item, spider):
            print(item['name'] + ":::" + item['sname'] + ":::" + str(item['sresult']))
            self.collection_companyinfo.insert_one(dict(item))
            # Return the item so that any later pipeline still receives it
            return item

        def close_spider(self, spider):
            self.client.close()
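
    The pipeline reads four settings that the article does not show. A possible settings.py fragment might look like the following; the setting names and the pipeline path follow the code above, but every value (host, port, database and collection names, priority 300) is an assumption for a local MongoDB instance.

```python
# settings.py (fragment): all values below are placeholders, not the author's.
ITEM_PIPELINES = {
    "gpsspg.pipelines.GpsspgPipeline": 300,
}

MONGODB_SERVER = "localhost"   # assumed: MongoDB running on the same machine
MONGODB_PORT = 27017           # MongoDB's default port
MONGODB_DB = "gpsspg"          # hypothetical database name
COLLECTION_POINT = "points"    # hypothetical collection name
```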
    

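    items.py source:

    import scrapy
    from scrapy import Field


    class PointItem(scrapy.Item):
        # Coordinates of the best-matching search result
        x = Field()
        y = Field()
        # City, district, address and name as returned by the search
        scity = Field()
        sarea = Field()
        saddr = Field()
        sname = Field()
        # Similarity score of the best match
        sresult = Field()
        # Values taken from the input spreadsheet
        name = Field()
        addr = Field()
        area = Field()
        city = Field()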

    Results:

    Contents of the MongoDB database:

    [image]

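    Key algorithm:

    The JSON yields up to 10 candidate locations. Each candidate's name field is compared with the community name, the most similar name is selected, and its coordinates are taken as the match. The underlying idea is that of edit distance, first proposed by the Russian scientist Vladimir Levenshtein and also known as Levenshtein distance. Strictly speaking, the spider calls difflib.SequenceMatcher, which derives a 0 to 1 similarity ratio from the characters the two strings share rather than computing the Levenshtein distance itself.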
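
    For comparison, here is a minimal sketch of a true Levenshtein distance next to the difflib ratio. The function names `levenshtein` and `edit_similarity` are introduced only for illustration, and normalizing by the longer string's length is one common convention, not something the article specifies.

```python
import difflib


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Scale the distance to 0..1 so that 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(edit_similarity("kitten", "sitting"))
print(difflib.SequenceMatcher(None, "kitten", "sitting").ratio())
```

    The two scores generally differ, so a threshold such as 0.62 tuned for one measure would need to be re-checked before switching to the other.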

    Main code:

    [image]

    Article link: https://www.haomeiwen.com/subject/ctxjnqtx.html