818: Digging Into Sushi Delivery Data


Author: aboutlikefish | Published 2018-06-17 21:42

    A few months ago, in some spare time, I scraped sushi data from a food-delivery platform (I will absolutely not admit that I was living on takeout during that period). Now that I have a free moment, I'm writing it up to share. This post is suitable for casual onlookers and readers with just a little bit of programming background.
    Tip: to speed up crawling, this crawler uses asynchronous coroutines. If you only need a small amount of data, I don't recommend doing this, because you may well get banned; you can rewrite it as regular synchronous code (a sketch of that is shown right after section 2.1 below).
    Following the ETL flow of a data-analysis task, this little crawler is explained as follows:

    1. First, prepare the following Python third-party packages:
    import requests
    import aiohttp
    import asyncio
    from multiprocessing.pool import Pool
    from datetime import date
    import pandas as pd          # used later for the analysis step
    import pymysql
    from sqlalchemy import create_engine
    import collections
    

    2. Next, pick a food-delivery platform to analyze. I chose ele.me, for one simple reason: it's simple! Simple! Simple!
    On ele.me you can find the data API directly with Chrome's developer tools, without running into many anti-scraping tricks.
    Now for the main course~~~~
    2.1 First, define a function for requesting data:

    async def gethtml(url):
        header = {
            'Accept': 'application/json, text/plain, */*',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'www.ele.me',
            'Referer': 'https://www.ele.me/place/wsbrgts6d1ry?latitude=28.111704&longitude=113.011304',
            'x-shard': 'loc=113.011304,28.111704',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
        }
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url=url, headers=header) as r:
                    r.raise_for_status()        # raise on non-2xx responses
                    data = await r.json()
                    return data
        except Exception as e:
            print(e)
    

    All subsequent data requests go through this function; since we are using asynchronous coroutines, it is defined with async. As mentioned in the tip above, if your data volume is small, a plain synchronous version works just as well; see the sketch below.
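    A minimal synchronous sketch of the same request using requests instead of aiohttp (pass in the same header dict defined inside gethtml above); this is the "regular synchronous code" alternative from the tip, not part of the original script:

    def gethtml_sync(url, header):
        # Plain blocking request -- slower, but far less likely to get you banned.
        try:
            r = requests.get(url, headers=header, timeout=10)
            r.raise_for_status()
            return r.json()
        except Exception as e:
            print(e)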
    2.2 Next come the data-extraction functions:

    def getshopid(html):
        shop_id = {i['restaurant']['id'] for i in html['restaurant_with_foods']}
        return shop_id
    
    
    def geturl(ids):
        restaurant_url = {'https://www.ele.me/restapi/shopping/restaurant/%s?latitude=28.09515&longitude=113.012001&terminal=web' %
                          shop_id for shop_id in ids}
        foodurl = {'https://www.ele.me/restapi/shopping/v2/menu?restaurant_id=%s&terminal=web' %
                   shop_id for shop_id in ids}
        return restaurant_url, foodurl
    

    These functions respectively collect the shop ids and build the shop-detail and menu URLs. One thing to watch out for when extracting data is deduplication; here I use the simple, brute-force approach of a set data structure, as the small illustration below shows.
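    A quick illustration of the set-based deduplication, using a made-up search-result payload (the field names match what getshopid expects; the ids are hypothetical):

    mock_html = {'restaurant_with_foods': [
        {'restaurant': {'id': 101}},
        {'restaurant': {'id': 102}},
        {'restaurant': {'id': 101}},   # the same shop returned twice
    ]}
    print(getshopid(mock_html))   # {101, 102} -- duplicates collapse automatically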
    2.3 With the raw data fetched, the next step is to parse it into tables that pandas can load for the final analysis, as follows:

    def food_table(foodlists):
        foods = {(y['specfoods'][0]['restaurant_id'], y['name'], y['specfoods'][0]['price'],y['month_sales'], date.today().strftime('%Y-%m-%d'), date.today().strftime('%A')) for foodlist in foodlists for x in foodlist for y in x['foods']}
        return foods
    
    
    def shop_table(shoplist):
        shop_detail = {(shop['id'], shop['name'], shop['distance'], shop['float_delivery_fee'],shop['float_minimum_order_amount'], shop['rating'], shop['rating_count']) for shop in shoplist}
        return shop_detail
    

    These functions respectively generate the food-detail table and the shop-detail table, each as a set of tuples.
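    A tiny example of what shop_table produces, using a made-up shop record (the field names match what the function reads; the values are hypothetical):

    mock_shops = [{'id': 101, 'name': 'Some Sushi Bar', 'distance': 850,
                   'float_delivery_fee': 4.0, 'float_minimum_order_amount': 20.0,
                   'rating': 4.7, 'rating_count': 321}]
    print(shop_table(mock_shops))
    # {(101, 'Some Sushi Bar', 850, 4.0, 20.0, 4.7, 321)}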
    2.4 The last step is the analysis itself, handled with pandas. Here I use a simple metric: each shop's estimated total monthly sales revenue:

    def join_table(shoptable, foodtable):
        shoptable = pd.DataFrame(list(shoptable), columns=[ 'id', 'name', 'distance', 'delivery_fee', 'minimum_order_amount', 'rating', 'rating_count'])
        foodtable = pd.DataFrame(list(foodtable), columns=['id', 'fname', 'price', 'msale', 'date', 'weekday'])
        # print(foodtable.values)
        new = pd.merge(shoptable, foodtable, on='id')
        new['total'] = new['msale'] * new['price']
        group = new.groupby(['name', 'id'])
        return new, group.sum()
    

    This step uses pandas in place of SQL for the processing; you can also store the data in MySQL and process it there. The code is as follows (the detail and k variables come from the execution loop in section 4):

    connect = create_engine( 'mysql+pymysql://root:12345678@localhost:3306/waimai?charset=utf8')
    pd.io.sql.to_sql(frame=detail, name=k, con=connect, if_exists='append')
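    And a minimal sketch of reading the data back out of MySQL later for further analysis (the table name 'sushi' is an assumption; it matches one of the keys of the `lists` dict defined below in section 4):

    connect = create_engine('mysql+pymysql://root:12345678@localhost:3306/waimai?charset=utf8')
    df = pd.read_sql('sushi', con=connect)   # load the whole table back into a DataFrame
    print(df.head())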
    

    3. With all the processing functions defined, we can write the main function:

    async def main(name):
        pool = Pool(8)
        # html = await gethtml(yangqi)
        htasks = [asyncio.ensure_future(gethtml(url))for url in name]
        htmls = await asyncio.gather(*htasks)
        # ids = getshopid(html)
        # print(htmls)
        ids = [getshopid(html) for html in htmls if html]   # skip pages that failed to download
        # print(ids)
        # merge the id sets from every result page before building the detail URLs
        restaurant_url, food_url = geturl(set.union(*ids))
        print('async crawl...')
        shoptasks = [asyncio.ensure_future(
            gethtml(url)) for url in restaurant_url]
        foodtasks = [asyncio.ensure_future(
            gethtml(url)) for url in food_url]
        fdone, fpending = await asyncio.wait(foodtasks)
        sdone, spending = await asyncio.wait(shoptasks)
        shoplist = [task.result() for task in sdone]
        foodlist = [task.result() for task in fdone]
        print('distribute parse...')
        sparse_jobs = [pool.apply_async(shop_table, args=(shoplist,))]
        fparse_jobs = [pool.apply_async(food_table, args=(foodlist,))]
        shoptable = [x.get() for x in sparse_jobs][0]
        foodtable = [x.get() for x in fparse_jobs][0]
        new, result = join_table(shoptable, foodtable)
    
        return new, result
    

    4. The final move: run the main function:

    # Keep retrying until every area in `lists` has been crawled successfully.
    while len(lists) > 0:
        for k, v in list(lists.items()):
            try:
                loop = asyncio.get_event_loop()
                tasks = asyncio.ensure_future(main(v))
                loop.run_until_complete(tasks)
                detail, totals = tasks.result()

                lists.pop(k)                        # drop the area once it succeeds
                print('done:{}'.format(k))
            except KeyError:
                # a missing field in the response -- leave the area in `lists` and retry
                print('fail:{}'.format(k))
            else:
                # write the merged detail table into MySQL, one table per area keyword
                connect = create_engine('mysql+pymysql://root:12345678@localhost:3306/waimai?charset=utf8')
                pd.io.sql.to_sql(frame=detail, name=k, con=connect, if_exists='append')
    

    Because the code is asynchronous, it has to run inside an event loop. The `lists` dict holds the search-URL lists for the areas you want to cover; a few example lists are given below:

    wuyisquare=['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E5%92%96%E5%95%A1&latitude=28.19652&limit=100&longitude=112.977361&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
    sushi = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E5%AF%BF%E5%8F%B8&latitude=28.111704&limit=100&longitude=113.011304&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
    yangqi = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E8%8C%B6&latitude=28.23188&limit=100&longitude=112.871522&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
    tea = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E5%92%96%E5%95%A1&latitude=28.09515&limit=100&longitude=113.012001&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
    fen = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E7%AD%92%E5%AD%90%E9%AA%A8%E7%B2%89&latitude=28.111704&limit=100&longitude=113.011304&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
    gaosheng = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E7%B2%89&latitude=28.09515&limit=100&longitude=113.012001&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
    fangcun = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E6%96%B9%E5%AF%B8%E5%AF%BF%E5%8F%B8&latitude=28.23188&limit=100&longitude=112.871522&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
    luoyide = ['https://www.ele.me/restapi/shopping/restaurants/search?extras%5B%5D=activity&keyword=%E7%BD%97%E4%B9%89%E5%BE%B7&latitude=28.23188&limit=100&longitude=112.871522&offset={0}&terminal=web'.format(x) for x in range(0, 120, 24)]
    lists={'sushi':sushi,'tea':tea,'fen':fen,'gaosheng':gaosheng,'luoyide':luoyide,'fangcun':fangcun}
    

    To search any area you like, just swap out the keyword and the latitude/longitude in the URL; coordinates can be obtained from any of the usual map APIs (no free advertising here). A small helper for building these URL lists is sketched below.
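    A hypothetical helper (not part of the original script) for generating the URL lists above from a keyword and a pair of coordinates:

    from urllib.parse import quote

    def build_search_urls(keyword, latitude, longitude, pages=5, page_size=24):
        # Reproduces the pattern above: 5 pages of 24 results each by default.
        base = ('https://www.ele.me/restapi/shopping/restaurants/search'
                '?extras%5B%5D=activity&keyword={kw}&latitude={lat}'
                '&limit=100&longitude={lng}&offset={off}&terminal=web')
        return [base.format(kw=quote(keyword), lat=latitude, lng=longitude, off=x)
                for x in range(0, pages * page_size, page_size)]

    # e.g. rebuild the sushi list from above:
    sushi = build_search_urls('寿司', 28.111704, 113.011304)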

    This crawler exercises some basic skills: asynchronous requests, set-based deduplication, and writing DataFrames to a database with pandas, so it's good practice material. As for the value of the data, dig into it at your own pace; it's quite interesting. For example, you can explore how monthly sales relate to various dimensions with scatter plots, bar charts, or calendar heat maps, roughly along the lines of the sketch below.
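    A minimal plotting sketch (assuming `new` is the merged DataFrame returned by join_table and matplotlib is installed; the chosen columns are just examples):

    import matplotlib.pyplot as plt

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    # scatter: per-item monthly sales against delivery distance
    ax1.scatter(new['distance'], new['msale'], alpha=0.5)
    ax1.set_xlabel('distance')
    ax1.set_ylabel('monthly sales')
    # bar chart: top 10 shops by estimated monthly revenue
    new.groupby('name')['total'].sum().sort_values(ascending=False).head(10).plot.bar(ax=ax2)
    ax2.set_ylabel('monthly revenue')
    plt.tight_layout()
    plt.show()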



    Next time I'll play around with WeChat and QQ bots. Stay tuned~~~~~
