Occupancy Detection from Sensor Data: Modeling and Analysis

Author: 任海亮 | Published: 2019-01-03 23:21

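    Occupancy Detection, using data from the UCI Machine Learning Repository.
    The goal is to determine whether a room is occupied from indoor light, temperature, humidity, and CO2 readings.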

    In [1]:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    %matplotlib inline
    

    date: timestamp, year-month-day hour:minute:second
    Temperature: in Celsius
    Relative Humidity: in %
    Light: in Lux
    CO2: in ppm
    Humidity Ratio: derived quantity from temperature and relative humidity, in kg water-vapor / kg air
    Occupancy: 0 or 1 (0 for not occupied, 1 for occupied status)
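The Humidity Ratio column is described as a quantity derived from temperature and relative humidity, but the exact formula behind the dataset is not given here. The sketch below shows one common way to compute such a value. The Magnus approximation for saturation vapor pressure, the standard sea-level pressure of 1013.25 hPa, and the function name `humidity_ratio` are all assumptions made for illustration.

```python
import math


def humidity_ratio(temp_c, rel_humidity_pct, pressure_hpa=1013.25):
    """Approximate humidity ratio in kg water vapor per kg dry air."""
    # Saturation vapor pressure in hPa (Magnus approximation; an assumption here).
    p_sat = 6.112 * math.exp(17.62 * temp_c / (243.12 + temp_c))
    # Partial pressure of water vapor at the given relative humidity.
    p_w = rel_humidity_pct / 100.0 * p_sat
    # 0.622 is the ratio of the molar masses of water and dry air.
    return 0.622 * p_w / (pressure_hpa - p_w)


# Hypothetical indoor reading: 23 degrees Celsius at 27% relative humidity.
w = humidity_ratio(23.0, 27.0)
print(f"{w:.5f}")
```

Because relative humidity enters this formula almost linearly at a roughly constant room temperature, a derived column of this kind would be expected to track Humidity closely.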

    Load the data

    In [11]:

    train = pd.read_csv("E:/figure/occupancy_data/datatraining.txt")
    test = pd.read_csv("E:/figure/occupancy_data/datatest.txt")
    train.head()
    

    Out[11]:

    train.png

    Explore the data

    In [4]:

    train.info()
    

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 8143 entries, 1 to 8143
    Data columns (total 7 columns):
    date 8143 non-null object
    Temperature 8143 non-null float64
    Humidity 8143 non-null float64
    Light 8143 non-null float64
    CO2 8143 non-null float64
    HumidityRatio 8143 non-null float64
    Occupancy 8143 non-null int64
    dtypes: float64(5), int64(1), object(1)
    memory usage: 477.1+ KB

    The training set has 8,143 rows. Next, check for NA values.

    In [5]:

    train.isnull().sum()
    

    Out[5]:

    date 0
    Temperature 0
    Humidity 0
    Light 0
    CO2 0
    HumidityRatio 0
    Occupancy 0
    dtype: int64

    In [6]:

    # Check the summary statistics for abnormal values
    train.describe()
    

    Out[6]:

    describe.png

    Feature correlations

    In [7]:

    # Exclude the non-numeric date column before computing correlations
    train.drop("date", axis=1).corr()
    

    Out[7]:

    correlation.png

    HumidityRatio and Humidity are strongly correlated, with a coefficient above 0.95.
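Reading a correlation matrix by eye works for seven columns, but a small helper can list the redundant pairs directly. The following is a minimal sketch on synthetic data, not on the UCI files. The helper name `highly_correlated_pairs`, the 0.95 threshold, and the generated values are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.95):
    """Return (col_a, col_b, r) for numeric column pairs with |r| >= threshold."""
    corr = df.select_dtypes(include="number").corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Synthetic stand-in for the sensor table (hypothetical values).
rng = np.random.default_rng(0)
humidity = rng.uniform(20, 40, size=200)
demo = pd.DataFrame({
    "Humidity": humidity,
    # Nearly a rescaled copy of Humidity, plus a little noise.
    "HumidityRatio": humidity * 1.5e-4 + rng.normal(0, 1e-5, size=200),
    "Light": rng.uniform(0, 500, size=200),
})

pairs = highly_correlated_pairs(demo)
print(pairs)
```

A pair flagged this way is a candidate for dropping one of the two columns, although the analysis here keeps both.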

    Convert date to datetime format

    In [8]:

    # Parse the date strings into pandas datetime values
    train["date"] = pd.to_datetime(train["date"], format="%Y-%m-%d %H:%M:%S")
    

    Plot each variable against Occupancy

    In [72]:

    plt.style.use("classic")
    # Four stacked panels sharing one time axis
    ax1 = plt.subplot2grid((4, 1), (0, 0), rowspan=1, colspan=1)
    ax2 = plt.subplot2grid((4, 1), (1, 0), rowspan=1, colspan=1, sharex=ax1)
    ax3 = plt.subplot2grid((4, 1), (2, 0), rowspan=1, colspan=1, sharex=ax1)
    ax4 = plt.subplot2grid((4, 1), (3, 0), rowspan=1, colspan=1, sharex=ax1)

    # Each panel: sensor reading on the left axis, Occupancy (green) on a twin right axis
    ax1.plot(train["date"], train["Temperature"])
    ax1.set_ylabel("Temperature")
    ax5 = ax1.twinx()
    ax5.plot(train["date"], train["Occupancy"], color="g")

    ax2.plot(train["date"], train["Humidity"])
    ax2.set_ylabel("Humidity")
    ax6 = ax2.twinx()
    ax6.plot(train["date"], train["Occupancy"], color="g")

    ax3.plot(train["date"], train["Light"])
    ax3.set_ylabel("Light")
    ax7 = ax3.twinx()
    ax7.plot(train["date"], train["Occupancy"], color="g")

    ax4.plot(train["date"], train["CO2"])
    ax4.set_ylabel("CO2")
    ax8 = ax4.twinx()
    ax8.plot(train["date"], train["Occupancy"], color="g")
    

    Out[72]:
    [<matplotlib.lines.Line2D at 0xe6ecfb0>]

    Figure: each variable plotted against Occupancy (各变量与Occupancy.png)

    <matplotlib.figure.Figure at 0xe077bb0>

    Compare Humidity and HumidityRatio over time

    In [78]:

    plt.style.use("ggplot")
    fig=plt.figure()
    ax1 = fig.add_subplot(111)
    ax1.plot(train["date"],train["Humidity"])
    ax2=ax1.twinx()
    ax2.plot(train["date"],train["HumidityRatio"],color="b")
    

    Out[78]:
    [<matplotlib.lines.Line2D at 0xf453890>]

    HumidityRatio/Humidity.png

    Modeling

    In [22]:

    x_train = train.drop(["Occupancy","date"],axis=1)
    y_train = train["Occupancy"]
    x_test = test.drop(["Occupancy","date"],axis=1)
    y_test = test["Occupancy"]
    

    In [24]:

    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.metrics import precision_recall_curve
    

    Standardize the features

    In [25]:

    from sklearn.preprocessing import StandardScaler
    std = StandardScaler()
    std.fit(x_train)
    x_train_std = std.transform(x_train)
    x_test_std = std.transform(x_test)
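The scaler is fitted on the training set only and then applied to both sets, so the test data is scaled with the training mean and standard deviation. The sketch below checks that behavior by hand on a toy array. The numbers and the names `x_train_demo`, `x_test_demo`, `train_mean` and `train_std` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
x_train_demo = np.array([[20.0, 400.0], [22.0, 800.0], [24.0, 1200.0]])
x_test_demo = np.array([[21.0, 600.0]])

scaler = StandardScaler().fit(x_train_demo)

# Manual equivalent: subtract the training mean and divide by the training
# standard deviation (population form, ddof=0, which StandardScaler uses).
train_mean = x_train_demo.mean(axis=0)
train_std = x_train_demo.std(axis=0)
manual_test = (x_test_demo - train_mean) / train_std

print(scaler.transform(x_test_demo))
print(manual_test)
```

Fitting on the training set alone keeps information about the test distribution out of the preprocessing step.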
    

    Logistic regression model

    In [31]:

    logreg = LogisticRegression()
    logreg.fit(x_train_std, y_train)
    y_predlog = logreg.predict(x_test_std)
    # Accuracy on the training set
    logreg.score(x_train_std, y_train)
    

    Out[31]:

    0.9860002456097261

    Model evaluation

    Check the F1 score on the test set

    In [28]:

    f1_score(y_test,y_predlog)
    

    Out[28]:

    0.9709418837675351

    Plot the precision-recall curve

    In [32]:

    prob = logreg.predict_proba(x_test_std)
    precision, recall, thresholds = precision_recall_curve(y_test,prob[:,1])
    plt.plot(precision,recall)
    plt.xlabel("precision")
    plt.ylabel("recall")
    

    Out[32]:
    <matplotlib.text.Text at 0xb688b30>

    precison recall.png
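Beyond plotting, the arrays returned by `precision_recall_curve` can be used to pick a decision threshold. The following is a minimal sketch on synthetic data from `make_classification`, standing in for the sensor features. Choosing the threshold that maximizes F1 is an assumption made for illustration, not a step taken in this analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data standing in for the sensor features (assumption).
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_te, scores)

# precision and recall have one more entry than thresholds (the final
# point, precision=1 and recall=0, has no threshold), so drop it.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / (p + r + 1e-12)  # small constant avoids division by zero
best = int(np.argmax(f1))
print(f"best threshold={thresholds[best]:.3f}, F1={f1[best]:.3f}")
```

`predict` uses a fixed 0.5 cutoff on the predicted probability, so a threshold chosen this way may trade a little precision for recall or the reverse.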

    Generate the classification report

    In [33]:

    from sklearn.metrics import classification_report
    print(classification_report(y_test,logreg.predict(x_test_std),target_names=["unoccupancy","occupancy"]))
    
                    precision    recall  f1-score   support
    unoccupancy          1.00      0.97      0.98      1693
    occupancy            0.95      0.99      0.97       972
    avg / total          0.98      0.98      0.98      2665
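The f1-score column in the report is the harmonic mean of the precision and recall columns. As a quick check, the sketch below recomputes it from the rounded values in the two class rows. The helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall taken from the report rows.
f1_occupied = f1_from(0.95, 0.99)
f1_unoccupied = f1_from(1.00, 0.97)

print(round(f1_occupied, 2))    # 0.97
print(round(f1_unoccupied, 2))  # 0.98
```

Both results agree with the f1-score column to two decimal places.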
    

    SVM model

    In [34]:

    svc = SVC()
    svc.fit(x_train_std, y_train)
    y_predsvc = svc.predict(x_test_std)
    # Accuracy on the training set
    svc.score(x_train_std, y_train)
    

    Out[34]:

    0.98882475746039544

    In [35]:

    f1_score(y_test,y_predsvc)
    

    Out[35]:

    0.96035678889990084

    Random forest model

    In [39]:

    rf = RandomForestClassifier()
    rf.fit(x_train_std, y_train)
    y_predrf = rf.predict(x_test_std)
    # Accuracy on the training set
    rf.score(x_train_std, y_train)
    

    Out[39]:

    0.99987719513692741

    In [44]:

    f1_score(y_test, y_predrf)
    

    Out[44]:

    0.90160427807486643

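    The random forest reaches 99.99% accuracy on the training set, but its test-set F1 score is only about 0.90, which suggests overfitting.

    Comparing the three models, logistic regression is the one to use: it has the highest test-set F1 score (0.971, versus 0.960 for the SVM and 0.902 for the random forest).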
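The same comparison can be run in one loop so that every model goes through identical preprocessing and scoring. The following is a minimal sketch on synthetic data, since the UCI files are not bundled here. The `make_classification` data, the pipeline wrapper, and the fixed `random_state` values are assumptions, and the numbers it prints are not the results reported in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary data standing in for the sensor features (assumption).
X, y = make_classification(n_samples=600, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "logistic regression": LogisticRegression(),
    "SVM": SVC(),
    "random forest": RandomForestClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    # The scaler is fitted on the training split only, inside the pipeline.
    pipe = make_pipeline(StandardScaler(), model)
    pipe.fit(X_tr, y_tr)
    results[name] = {
        "train_accuracy": pipe.score(X_tr, y_tr),
        "test_f1": f1_score(y_te, pipe.predict(X_te)),
    }

for name, r in results.items():
    print(f"{name:20s} train acc={r['train_accuracy']:.3f}  test F1={r['test_f1']:.3f}")
```

A large gap between training accuracy and test F1 for one model is the overfitting signal seen with the random forest.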

          Article link: https://www.haomeiwen.com/subject/mdscrqtx.html