January Work Log

Author: Chelsea_Dagger | Published: 2018-01-17 11:24

    Keywords
    K_means, ARIMA


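    Preface

    The main work in January was as follows:

    Fine-grained data preprocessing
    Filtered out macs seen at only one place, filtered out macs that appeared on fewer than 10 days, and further subdivided the place list.

    Data indexing
    Kept two copies of the raw data, stored under different indexes, to make later lookups easier:
    a. timestamp, place -> mac
    b. date, mac -> time range: place

    Headcount distribution statistics

    Clustering preparation
    Represented each person's time distribution over places as an ndarray (after data processing)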
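The two filtering rules from the preprocessing step can be sketched as follows. This is only an illustration under assumptions: the function name `filter_macs` and the `(timestamp, place_id, mac)` tuple layout are made up for the example, and the thresholds are read from the description above (at least two distinct places, at least 10 distinct days).

```python
from collections import defaultdict

def filter_macs(rows, min_days=10, min_places=2):
    """Keep rows whose mac was seen at >= min_places places and on >= min_days days.

    rows: iterable of (timestamp, place_id, mac) tuples (hypothetical layout).
    """
    rows = list(rows)
    places = defaultdict(set)   # mac -> set of place ids
    days = defaultdict(set)     # mac -> set of 'YYYY-MM-DD' dates
    for timestamp, pid, mac in rows:
        places[mac].add(pid)
        days[mac].add(timestamp[:10])
    keep = {m for m in places
            if len(places[m]) >= min_places and len(days[m]) >= min_days}
    return [r for r in rows if r[2] in keep]

rows = [
    ('2017-09-11 00:00:00', 141, 'aaa'),
    ('2017-09-12 08:00:00', 181, 'aaa'),
    ('2017-09-11 00:00:00', 150, 'bbb'),
    ('2017-09-12 00:00:00', 150, 'bbb'),
]
# With a 2-day threshold, 'aaa' (two places, two days) stays and 'bbb' (one place) goes
kept = filter_macs(rows, min_days=2)
print(kept)
```

With the default 10-day threshold, neither mac in this toy input would survive.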


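    1. Data preprocessing and indexing

    The first part of the work was just a simple modification of earlier code and does not carry much meaning, so I won't record it in detail here~
    For data indexing, here is a detailed record of how place ids are looked up by date and mac:

    Input parameters:

    start_time  start time
    end_time    end time
    mac         the mac address to look up
    

    Output

    stime1,etime1,pid1  stay period 1
    stime2,etime2,pid2  stay period 2
    ...
    stimen,etimen,pidn  stay period n
    

    Data sample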

    2017-09-11 00:00:00,0,141,dormitory,382dd1da2381,N
    2017-09-11 00:00:00,0,142,dormitory,5844988f54a5,N
    2017-09-11 00:00:00,0,145,dormitory,c8f23075fa06,N
    2017-09-11 00:00:00,0,148,dormitory,1c77f6ab931e,N
    2017-09-11 00:00:00,0,149,dormitory,10f681e38ca9,N
    2017-09-11 00:00:00,0,149,dormitory,4c49e3406f61,N
    2017-09-11 00:00:00,0,150,dormitory,bc201040b118,N
    2017-09-11 00:00:00,0,150,dormitory,6021013f5b85,N
    2017-09-11 00:00:00,0,150,dormitory,a444d1108e48,N
    2017-09-11 00:00:00,0,151,dormitory,c8f230a5c86f,N
    2017-09-11 00:00:00,0,151,dormitory,483c0cc230cc,N
    2017-09-11 00:00:00,0,151,dormitory,bc7574a0e1fa,N
    2017-09-11 00:00:00,0,158,edu,8056f2ea0cd9,N
    2017-09-11 00:00:00,0,168,edu,74042bcb3a77,N
    2017-09-11 00:00:00,0,181,canteen,40f02f4c670d,N
    2017-09-11 00:00:00,0,193,edu,8844773c62e3,N
    2017-09-11 00:00:00,0,240,canteen,4c1a3d3f0f21,N
    2017-09-11 00:01:00,0,141,dormitory,382dd1da2381,N
    2017-09-11 00:01:00,0,142,dormitory,5844988f54a5,N
    2017-09-11 00:01:00,0,145,dormitory,c8f23075fa06,N
    2017-09-11 00:01:00,0,148,dormitory,1c77f6ab931e,N
    2017-09-11 00:01:00,0,149,dormitory,10f681e38ca9,N
    2017-09-11 00:01:00,0,150,dormitory,bc201040b118,N
    2017-09-11 00:01:00,0,150,dormitory,6021013f5b85,N
    2017-09-11 00:01:00,0,150,dormitory,a444d1108e48,N
    2017-09-11 00:01:00,0,151,dormitory,483c0cc230cc,N
    2017-09-11 00:01:00,0,151,dormitory,bc7574a0e1fa,N
    2017-09-11 00:01:00,0,151,dormitory,c8f230a5c86f,N
    2017-09-11 00:01:00,0,158,edu,8056f2ea0cd9,N
    2017-09-11 00:01:00,0,168,edu,74042bcb3a77,N
    

    Python code:

    # -*- coding: UTF-8 -*-
    
    import pandas as pd
    
    __author__ = 'SuZibo'
    """
    Data re-indexing 2
    date, mac -> time range: place
    
    Data format after re-indexing:
    start time 1,end time 1,place id 1
    start time 2,end time 2,place id 2
    start time 3,end time 3,place id 3
    ...
    Also produces the trajectory array for the given period:
    [142, 202, 142, 202, 200, 202, 200, 142, 142](example)
    Input parameters: mac address, start time, end time
    """
    
    start_time = '2017-09-11 00:00:00'
    end_time = '2017-09-18 00:00:00'
    mac = '205d4717e6de'
    
    def findpathByMacDate(mac, start_time, end_time):
    
        records = pd.read_csv('./macdata/normalinfo_trans.txt', names=['timestamp','timerange','pid','ptype','mac','isholiday'])
        # Read the source data and name the columns (timestamp, time range, place id, place type, mac, holiday flag)
        records_select = records[(records['mac'] == mac) & (records['timestamp'] > start_time) & (records['timestamp'] < end_time)]
        # Select this mac's records inside the time window (both bounds exclusive)
        records_select = records_select.reset_index(drop=True)
        # Re-index the rows from 0 to n-1
    
        if records_select.empty:
            # No records for this mac in the window, so there is nothing to write
            return []
    
        filepath = './macdata/path/' + mac + '_pathinfo' + '.txt'
        times = records_select['timestamp']
        pids = records_select['pid'].astype(int)
        n = len(records_select)
    
        change = [i + 1 for i in range(n - 1) if pids[i] != pids[i + 1]]
        # change list: row indexes at which this mac's place changes
        starts = [0] + change
        # First row of each stay (head, middle segments, tail)
        ends = [c - 1 for c in change] + [n - 1]
        # Last row of each stay
    
        place = []
        # place list: this mac's place trajectory
    
        with open(filepath, 'w') as rs:
            for s, e in zip(starts, ends):
                line = str(times[s]) + ',' + str(times[e]) + ',' + str(pids[s])
                print(line)
                rs.write(line + '\n')
                place.append(int(pids[s]))
    
        print(place)
        return place
    
    findpathByMacDate(mac, start_time, end_time)
    

    Re-indexing the data looks cumbersome, but with pandas filtering and selection it turns out to need little code and is easy to implement~
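The core of the re-indexing is collapsing consecutive records at the same place into one stay. A minimal sketch of that step using only the standard library is shown below; the helper name `stays` and the `(timestamp, pid)` tuple layout are assumptions for the example, and `itertools.groupby` is swapped in for the explicit change-index list.

```python
from itertools import groupby

def stays(records):
    """Collapse time-ordered (timestamp, pid) pairs into (start, end, pid) stays."""
    result = []
    # groupby only merges *adjacent* equal keys, which is exactly a run of one place
    for pid, run in groupby(records, key=lambda r: r[1]):
        run = list(run)
        result.append((run[0][0], run[-1][0], pid))
    return result

records = [
    ('2017-09-11 00:00:00', 142),
    ('2017-09-11 00:01:00', 142),
    ('2017-09-11 00:02:00', 202),
    ('2017-09-11 00:03:00', 142),
]
segments = stays(records)
path = [pid for _, _, pid in segments]
print(segments)
print(path)  # [142, 202, 142]
```

Because a return to an earlier place starts a new run, the trajectory keeps repeated places, as in the example array in the docstring above.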


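    2. Headcount distribution statistics

    Work done:

    Count macs along three dimensions: date, time period, and place type (place). The bar chart shows two dimensions at a time (with the third fixed), and the third dimension can be switched when viewing to make patterns easier to spot.

    Input: start_time, end_time
    Daily output: number of macs per place type
    Per-period output: number of macs per place type

    Columns of the output file:
    dormitory, canteen, teaching building, stadium/student activity center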
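The daily and per-period outputs differ only in how the timestamp is truncated. A rough sketch under assumptions: the helper `count_macs` is hypothetical, the record layout follows the data sample in section 1, and "period" is taken to mean the hour of the timestamp, which the post does not specify.

```python
from collections import defaultdict

def count_macs(lines, key_len):
    """Count distinct macs per (time key, place type).

    key_len=10 keys on the date 'YYYY-MM-DD';
    key_len=13 keys on the hour 'YYYY-MM-DD HH' (an assumed notion of "period").
    """
    seen = defaultdict(set)
    for line in lines:
        timestamp, _, _, ptype, mac, _ = line.strip().split(',')
        seen[(timestamp[:key_len], ptype)].add(mac)
    return {key: len(macs) for key, macs in seen.items()}

lines = [
    '2017-09-11 00:00:00,0,141,dormitory,382dd1da2381,N',
    '2017-09-11 00:00:00,0,142,dormitory,5844988f54a5,N',
    '2017-09-11 00:01:00,0,141,dormitory,382dd1da2381,N',
    '2017-09-11 00:00:00,0,158,edu,8056f2ea0cd9,N',
]
by_day = count_macs(lines, 10)
by_hour = count_macs(lines, 13)
print(by_day)
print(by_hour)
```

Using a set per key means a mac that reports every minute is still counted once, and the file is read in a single pass rather than once per day.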

    Python code

    # -*- coding: UTF-8 -*-
    
    import datetime
    
    __author__ = 'SuZibo'
    
    """
    Count macs along three dimensions: date, time period, and place type (place). The bar chart shows two dimensions at a time (with the third fixed), and the third dimension can be switched when viewing to make patterns easier to spot.
    
    Input: start_time, end_time
    Daily output: number of macs per place type
    Per-period output: number of macs per place type
    
    Columns of the output file:
    dormitory, canteen, teaching building, stadium/student activity center
    """
    
    dormitory =[141,142,145,146,148,149,150,151,152,153]
    canteen =[171,172,173,174,175,176,177,178,179,180,181,182,240]
    edu =[54,60,133,134,136,154,155,156,157,158,159,160,161,162,164,165,166,167,168,169,193,194,195,196,197,198,199,200,201,202,203,204,205,207,208,209,213,214,215,216,218,219,227,230,231,233,234,235,236,237,238,239]
    stadium =[183,184,185,186,187,188,189,190,191,232]
    
    stime = '2017-09-11 00:00:00'
    etime = '2017-09-12 00:00:00'
    # Lower and upper time bounds used inside the loop (one day at a time)
    
    weekdaylist = []
    start_date = '2017-09-11'
    end_date = '2017-11-13'
    # Lower and upper date bounds of the outer loop
    
    sdate = datetime.datetime.strptime(start_date, '%Y-%m-%d')
    edate = datetime.datetime.strptime(end_date, '%Y-%m-%d')
    while sdate < edate:
        weekdaylist.append(sdate.strftime('%Y-%m-%d'))
        sdate += datetime.timedelta(days=1)
    
    def getMacCountInfoByDay(stime, etime):
        # Headcount distribution between stime and etime (both bounds exclusive)
        place_types = ['dormitory', 'canteen', 'edu', 'stadium']
        dic = dict((ptype, dict()) for ptype in place_types)
        # dic[place type][mac][day] = number of records
    
        with open('../macinfo/macdata/normalinfo_trans.txt') as f:
    
            for line in f:
                line = line.rstrip('\n').split(',')
                # line: timestamp, time range, place id, place type, mac, holiday flag
                day = line[0][5:10]
    
                if stime < line[0] < etime and line[3] in dic:
                    macs = dic[line[3]]
                    if line[4] not in macs:
                        macs[line[4]] = dict()
                    macs[line[4]][day] = macs[line[4]].get(day, 0) + 1
    
        mac_count = [stime] + [len(dic[ptype]) for ptype in place_types]
        # [start time, dormitory macs, canteen macs, edu macs, stadium macs]
        return mac_count
        # Return the mac_count list
    
    rs = open('./plotdata/maccountbyday.txt', 'w')
    
    for i in range(len(weekdaylist)):
        # Run getMacCountInfoByDay once per day to get the mac counts by day and place type between start_date and end_date
        mac_count = getMacCountInfoByDay(stime, etime)
        rs.write(','.join([mac_count[0][0:10]] + [str(c) for c in mac_count[1:]]) + '\n')
        # Write the returned mac_count list to the file
        stime = datetime.datetime.strptime(stime, '%Y-%m-%d %H:%M:%S')
        etime = datetime.datetime.strptime(etime, '%Y-%m-%d %H:%M:%S')
        stime += datetime.timedelta(days=1)
        etime += datetime.timedelta(days=1)
        stime = str(stime)
        etime = str(etime)
    rs.close()
    

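    The headcount statistics rely heavily on the get method of Python dictionaries.
    A brief summary of the dictionary get method:

    Syntax
    Syntax of get():

    dict.get(key, default=None)
    

    Parameters
    key -- the key to look up in the dictionary.
    default -- the value to return if the key does not exist.

    Return value
    Returns the value for the given key; if the key is not in the dictionary, returns the default value (None unless one is given).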
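A small illustration of the counting pattern built on `get`, using made-up mac names and days:

```python
visits = {}
records = [('mac1', '09-11'), ('mac1', '09-11'), ('mac1', '09-12'), ('mac2', '09-11')]

for mac, day in records:
    if mac not in visits:
        visits[mac] = {}
    # get(day, 0) returns 0 the first time a day is seen, so no KeyError is raised
    visits[mac][day] = visits[mac].get(day, 0) + 1

print(visits)                       # {'mac1': {'09-11': 2, '09-12': 1}, 'mac2': {'09-11': 1}}
print(visits['mac2'].get('09-12'))  # None: key missing and no default given
print(len(visits['mac1']))          # 2 distinct days for mac1
```

The length of the inner dictionary is the number of distinct days, which is how a nested dictionary of this shape can double as a per-day presence count.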


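    3. Building the person-by-place time distribution matrix

    Work done:
    Using male_dor,famale_dor,postgraduate_dor,net,hospital,canteen,edu,lab,stadium,activity,administration,library as attributes,
    build a matrix of how long each person was present at each kind of place (indexed by mac)

    Python code: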

    # -*- coding: UTF-8 -*-
    
    from pandas import DataFrame
    
    __author__ = 'SuZibo'
    
    """
    Per-person time feature matrix (distribution over places)
    
    Place lists
    male_dor=[141,145,146,149,151]
    # male dormitories
    famale_dor=[148,150,152,153]
    # female dormitories
    postgraduate_dor=[142]
    # postgraduate dormitory
    net=[217,229]
    # network center
    hospital=[192]
    # campus hospital
    canteen=[171,172,173,174,175,176,177,178,179,180,181,182,240]
    # canteens
    edu=[54,60,133,134,136,196,197,198,199,200,201,202,203,204,205,207,208,209,213,214,215,216,218,219,230,231,239]
    # teaching buildings
    lab=[155,156,157,158,159,160,161,162,164,165,166,167,168,169,233,234,235,236,237,238]
    # laboratories
    stadium=[189,190,191]
    # stadium
    activity=[183,184,185,186,187,188,232]
    # student activity center
    administration=[221,222,223]
    # administration building
    library=[193,194,195,227]
    # library
    """
    mac_time_dic = dict()
    # Dictionary holding the time statistics per mac. The source data is sampled once per minute, so the accumulated record count is exactly the duration in minutes
    
    with open('../macinfo/macdata/normalinfo_trans_v2.txt') as f:
    
        for line in f:
            line = line.rstrip('\n').split(',')
    
            if line[4] not in mac_time_dic:
                mac_time_dic[line[4]] = dict()
            mac_time_dic[line[4]][line[3]] = mac_time_dic[line[4]].get(line[3], 0) + 1
    # {mac1:{place1:m,place2:n,...},...}
    # {'10b1f8f3a4d0': {'famale_dor': 10, 'male_dor': 507, 'hospital': 10, 'activity': 10, 'library': 4, 'edu': 41, 'canteen': 86...},...}
    
    frame = DataFrame(list(mac_time_dic.values()),columns=['male_dor','famale_dor','postgraduate_dor','net','hospital','canteen','edu','lab','stadium','activity','administration','library'],index=list(mac_time_dic.keys()))
    # Convert to a DataFrame with mac as the index
    frame = frame.dropna(how='all')
    # Drop rows in which every value is NA
    frame = frame.fillna(0)
    # Fill the remaining NA values with 0
    frame.to_csv('./data/user_time_array_includex.csv')
    frame.to_csv('./data/user_time_array.csv',index=False,header=False)
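To make the shape of the result concrete, here is a pandas-free sketch that turns the nested dictionary into rows with a fixed column order. The helper `to_matrix` and the toy values are assumptions for illustration; the column names are the twelve attributes listed in the work description for this section.

```python
COLUMNS = ['male_dor', 'famale_dor', 'postgraduate_dor', 'net', 'hospital', 'canteen',
           'edu', 'lab', 'stadium', 'activity', 'administration', 'library']

def to_matrix(mac_time_dic, columns=COLUMNS):
    """Return (macs, matrix) where matrix[i][j] is the minutes mac i spent at place type j."""
    macs = sorted(mac_time_dic)
    # A place type that is not in `columns` is silently left out of the matrix
    matrix = [[mac_time_dic[mac].get(col, 0) for col in columns] for mac in macs]
    return macs, matrix

mac_time_dic = {
    'mac_b': {'lab': 30, 'canteen': 5},
    'mac_a': {'male_dor': 507, 'edu': 41},
}
macs, matrix = to_matrix(mac_time_dic)
print(macs)
print(matrix)
```

Since only the listed columns are read, every place type that should take part in clustering has to appear in the column list.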
    
    

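    4. Generating the person-by-place frequency matrix

    Continuing from 3: Android and iOS behave differently (the former stays connected to wifi after the screen is locked, while the latter drops the wireless connection a short while after locking), so dwell time is not an accurate measure of a person's characteristics. I therefore wanted to build the person-by-place vector matrix using visit frequency as the unit instead.

    PS: I also wanted to split certain areas by time period to tell groups of people apart, e.g. two periods for teaching buildings: 7:00-22:00 and all other times.
    So some extra data operations were added on top of the above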
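One way to express the time-period split is to relabel each record before counting. The sketch below is an assumption-laden illustration: the function name `period_label` is hypothetical, the `_extra` suffix mirrors the column names used later in this section, and treating the window as half-open (start hour included, end hour excluded) is a choice made for the example.

```python
from datetime import datetime

def period_label(ptype, timestamp, start_hour=7, end_hour=22):
    """Return ptype for records inside [start_hour, end_hour), else ptype + '_extra'."""
    hour = datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S').hour
    if start_hour <= hour < end_hour:
        return ptype
    return ptype + '_extra'

print(period_label('edu', '2017-09-11 08:30:00'))  # edu
print(period_label('edu', '2017-09-11 23:10:00'))  # edu_extra
```

Feeding the returned label into the same nested-dictionary counting as before would yield separate in-window and out-of-window frequencies without a second pass over the data.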

    Python code 1:
    Place frequency counts that need no time-period split

    # -*- coding: UTF-8 -*-
    
    __author__ = 'SuZibo'
    
    """
    Per-person time feature matrix (distribution over places)
    
    Place lists
    male_dor=[141,145,146,149,151]
    famale_dor=[148,150,152,153]
    postgraduate_dor=[142]
    net=[217,229]
    hospital=[192]
    canteen=[171,172,173,174,175,176,177,178,179,180,181,182,240]
    edu=[54,60,133,134,136,196,197,198,199,200,201,202,203,204,205,207,208,209,213,214,215,216,218,219,230,231,239]
    lab=[155,156,157,158,159,160,161,162,164,165,166,167,168,169,233,234,235,236,237,238]
    stadium=[189,190,191]
    activity=[183,184,185,186,187,188,232]
    administration=[221,222,223]
    library=[193,194,195,227]
    
    
    Final data structure: {'teaching building (07:00-22:00)': 1, 'teaching building (other times)': 0, 'male dormitory': 0, 'postgraduate dormitory': 0, 'female dormitory': 0, 'student activity center (07:00-21:00)': 0, 'student activity center (other times)': 0, 'administration building (07:00-21:00)': 0, 'administration building (other times)': 0, 'lab building (07:00-21:00)': 0, 'lab building (other times)': 0, 'canteen (07:00-23:00)': 0, 'canteen (other times)': 0}
    edu,edu1,male_dor,postgraduate_dor,famale_dor,activity,activity1,administration,administration1,lab,lab1,canteen,canteen1,library,hospital,stadium
    """
    
    mac_count_dic = dict()
    with open('../macinfo/macdata/normalinfo_trans_v2.txt') as f:
        for line in f:
            line = line.rstrip('\n').split(',')
            day = line[0][5:10]
    
            if line[4] not in mac_count_dic:
                mac_count_dic[line[4]] = dict()
    
            if line[3] not in mac_count_dic[line[4]]:
                mac_count_dic[line[4]][line[3]] = dict()
    
            mac_count_dic[line[4]][line[3]][day] = mac_count_dic[line[4]][line[3]].get(day,0)+1
    # Build the nested dictionary
    # mac_count_dic[mac][place][day] = number of records
    
    places = ['male_dor','famale_dor','postgraduate_dor','net','hospital','stadium']
    # Places counted without a time-period split
    
    rs = open('./data/user_count_array_includex.csv','w')
    
    for mac in mac_count_dic:
        # Iterate over the dictionary
        dis = mac_count_dic[mac]
        # Un-nest: dis maps place -> {day: record count}
        counts = [len(dis.get(place, {})) for place in places]
        # Frequency = number of distinct days on which the mac was seen at the place (0 if never)
        rs.write(str(mac) + ',' + ','.join(str(c) for c in counts) + '\n')
    rs.close()
    # mac,male_dor_count,famale_dor_count,...
    # mac is the index
    

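    In the same way, build the frequency dictionary for the 7:00-22:00 period and the one for the extra period (all other times)
    Then create three DataFrame objects named df1, df2 and df3

    Python code 2:
    Merging the DataFrame objects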

    # -*- coding: UTF-8 -*-
    
    import pandas as pd
    
    __author__ = 'SuZibo'
    
    df1 = pd.read_csv('./data/user_count_array_includex_1.csv',names=['canteen','edu','lab','activity','administration','library'])
    df2 = pd.read_csv('./data/user_count_array_includex_1_extra.csv',names=['canteen_extra','edu_extra','lab_extra','activity_extra','administration_extra','library_extra'])
    df3 = pd.read_csv('./data/user_count_array_includex.csv',names=['male_dor','famale_dor','postgraduate_dor','net','hospital','stadium'])
    # Each file has one more column (the leading mac) than there are names, so the mac column becomes the index
    
    df = df2.join(df1)
    df = df.join(df3)
    # join aligns the rows on the mac index
    df = df.dropna(how='all')
    df = df.fillna(0)
    df.to_csv('./data/user_TimeArray_includex.csv')
    # Write the csv with index and header
    df.to_csv('./data/user_TimeArray.csv',index=False,header=False)
    # Write the csv without index or header
    

    This completes the generation of the person frequency vector matrix
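One detail worth keeping in mind when merging: `DataFrame.join` performs a left join on the index by default, so only macs present in the left-hand frame appear in the result. The toy frames below (made-up macs and values) contrast that with an outer join; whether dropping the unmatched macs is desired here is not stated in the post.

```python
import pandas as pd

df_a = pd.DataFrame({'edu': [9, 1]}, index=['mac1', 'mac2'])
df_b = pd.DataFrame({'male_dor': [5, 12]}, index=['mac2', 'mac3'])

# Default left join: keeps mac1 and mac2, mac3 is dropped, mac1 gets NaN for male_dor
left = df_a.join(df_b)

# Outer join: keeps every mac from both frames, missing counts filled with 0
outer = df_a.join(df_b, how='outer').fillna(0)

print(left)
print(outer)
```

The NaN values introduced by a join also turn integer columns into floats, which is consistent with values such as 15.0 appearing in some columns of the merged csv.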

    Matrix sample:

    ,canteen_extra,edu_extra,lab_extra,activity_extra,administration_extra,library_extra,canteen,edu,lab,activity,administration,library,male_dor,famale_dor,postgraduate_dor,net,hospital,stadium
    483b38cac86d,15,10,0,0,0,3,15.0,9.0,0.0,0.0,0.0,3.0,0,12,0,3,1,1
    786256354ae3,9,1,1,0,0,0,9.0,1.0,1.0,0.0,0.0,0.0,5,0,0,0,1,0
    908d6c7faa0c,7,13,0,0,0,2,7.0,13.0,0.0,0.0,0.0,2.0,0,6,0,0,0,0
    4c49e31c7c69,20,10,3,6,13,19,20.0,10.0,3.0,6.0,13.0,19.0,0,1,22,0,4,3
    58449877c1c5,3,7,8,0,2,1,3.0,7.0,8.0,0.0,2.0,1.0,0,0,4,0,0,2
    64cc2e771dd3,21,10,6,0,2,6,21.0,10.0,6.0,0.0,2.0,6.0,3,38,0,0,4,1
    9cb2b2c7ad65,3,10,2,0,10,0,3.0,10.0,2.0,0.0,10.0,0.0,0,0,0,1,0,0
    742344e4ff39,10,3,1,0,0,1,10.0,3.0,1.0,0.0,0.0,1.0,5,0,0,0,0,0
    1ccde57a678a,7,4,0,0,0,6,7.0,4.0,0.0,0.0,0.0,6.0,11,0,0,0,1,0
    ecdf3ad00c44,15,9,3,0,0,0,15.0,9.0,3.0,0.0,0.0,0.0,0,13,0,1,0,0
    f431c39cf8cc,8,4,0,0,0,0,8.0,4.0,0.0,0.0,0.0,0.0,12,0,0,2,1,1
    f40e22420be9,18,32,14,0,11,8,17.0,32.0,12.0,0.0,11.0,8.0,5,0,0,0,0,1
    68fb7eee63e9,13,6,0,0,1,1,13.0,6.0,0.0,0.0,1.0,1.0,0,15,0,0,2,0
    205d47642a4c,17,12,4,2,0,1,17.0,12.0,4.0,2.0,0.0,1.0,27,4,0,0,2,12
    b0e235c341d5,13,11,0,1,0,1,13.0,11.0,0.0,1.0,0.0,1.0,0,18,0,0,0,0
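For the clustering preparation, the index-free csv (no mac column, no header) can be loaded straight into an ndarray. The sketch below reads a few shortened rows from an in-memory string instead of the real file; the row-share scaling at the end is an optional assumption for illustration, not a step described in this post.

```python
import io
import numpy as np

# Three rows in the index-free layout, shortened to four columns for the example
sample = io.StringIO(
    "15,10,0,3\n"
    "9,1,1,0\n"
    "7,13,0,2\n"
)
X = np.loadtxt(sample, delimiter=',')
print(X.shape)  # (3, 4): one row per mac, one column per place attribute

# Optional (assumed) scaling: turn each row into shares so heavy and light users are comparable
row_sums = X.sum(axis=1, keepdims=True)
X_share = X / row_sums
print(X_share.sum(axis=1))
```

Rows that sum to zero would need to be handled before dividing; the toy rows here all have non-zero totals.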
    

    The next post will describe and study the ARIMA model
