Real-estate valuation: model training and prediction results

Author: 潇洒坤 | Source: published 2018-06-26 20:38, read 89 times

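    0. Download the dataset

    The source data for the valuation model in this article is housing-price data for Xiamen. Download link: https://pan.baidu.com/s/1vOact6MsyZZlTSxjmMqTbw (password: 8zg6)
    Opened, the downloaded file looks like this:

    [Figure: 文件打开图示.png, a screenshot of the opened data file]
    As the screenshot shows, the data has already been lightly preprocessed and needs only minor adjustments before it can be used for training.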

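    1. Comparing the MLPR and GBR models

    df_y = df['unitPrice'] selects the unitPrice column of the DataFrame;
    y = df_y.values returns a numpy.ndarray of shape (21935,), i.e. a one-dimensional array of length 21935;
    df_x = df.drop(['unitPrice'],axis=1) keeps every column of the DataFrame except unitPrice;
    x = df_x.values returns a numpy.ndarray of shape (21935,120), i.e. a two-dimensional array of size 21935*120.
    The data is standardized with sklearn's preprocessing.StandardScaler(): the scaler is fitted on the training set only, and the test set is then transformed with that same fitted scaler.
    MLPRegressor() creates the multi-layer perceptron regression model, which is fitted on the training set and then scored on the test set.
    GradientBoostingRegressor() creates the ensemble (gradient boosting) regression model, which is likewise fitted on the training set and scored on the test set.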
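
To make the shapes and the "fit on train, transform test" step concrete, here is a minimal sketch on a tiny made-up table. Only the column name unitPrice comes from the article; the area and rooms columns, their values, and the 4/1 split are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny made-up table; only the column name unitPrice comes from the article.
df = pd.DataFrame({
    "unitPrice": [30000.0, 42000.0, 38000.0, 51000.0, 45000.0],
    "area":      [60.0, 89.0, 75.0, 120.0, 98.0],
    "rooms":     [2.0, 3.0, 2.0, 4.0, 3.0],
})

y = df["unitPrice"].values                 # 1-D array, shape (5,)
x = df.drop(["unitPrice"], axis=1).values  # 2-D array, shape (5, 2)

# Hypothetical split: first four rows for training, last row for testing.
train_x, test_x = x[:4], x[4:]

ss_x = StandardScaler()
train_x1 = ss_x.fit_transform(train_x)  # learns mean and std from the training rows
test_x1 = ss_x.transform(test_x)        # reuses the training mean and std

print(y.shape, x.shape)
print(train_x1.mean(axis=0))  # approximately 0 for each column
```

Fitting the scaler on the training rows only keeps information about the test rows out of the preprocessing step.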

    from sklearn.model_selection import train_test_split
    from sklearn import preprocessing
    from sklearn.neural_network import MLPRegressor
    from sklearn.ensemble import GradientBoostingRegressor
    import pandas as pd
    
    # Load the preprocessed Xiamen housing data
    df = pd.read_excel("数据处理结果.xlsx")
    df_y = df['unitPrice']
    df_x = df.drop(['unitPrice'],axis=1)
    x = df_x.values
    y = df_y.values
    
    train_x,test_x,train_y,test_y = train_test_split(x,y,train_size=0.8,\
                                                      random_state=33)
    ss_x = preprocessing.StandardScaler()
    train_x1 = ss_x.fit_transform(train_x)
    test_x1 = ss_x.transform(test_x)
    
    ss_y = preprocessing.StandardScaler()
    train_y1 = ss_y.fit_transform(train_y.reshape(-1,1))
    test_y1 = ss_y.transform(test_y.reshape(-1,1))
    
    # Multi-layer perceptron with three hidden layers of 20 units each
    model_mlp = MLPRegressor(solver='lbfgs',hidden_layer_sizes=(20,20,20),random_state=1)
    model_mlp.fit(train_x1,train_y1.ravel())
    mlp_score = model_mlp.score(test_x1,test_y1.ravel())
    print("sklearn MLP regressor score",mlp_score)
    
    # Gradient boosting regressor with default parameters
    model_gbr = GradientBoostingRegressor()
    model_gbr.fit(train_x1,train_y1.ravel())
    gbr_score = model_gbr.score(test_x1,test_y1.ravel())
    print("sklearn gradient boosting regressor score",gbr_score)
    

    The printed result is:

    sklearn MLP regressor score 0.683941816792
    sklearn gradient boosting regressor score 0.762351806857

    For a first pass at the models, this result is acceptable.
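
The score printed by a scikit-learn regressor's score() method is the coefficient of determination R^2 (1.0 is a perfect fit, 0 is no better than always predicting the mean). The sketch below checks this on synthetic data; the data, the LinearRegression model, and all variable names are assumptions chosen to keep the example small and deterministic, not part of the article's pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data (illustrative only): a noisy linear relationship.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 3))
y_demo = x_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Fit on the first 150 rows, evaluate on the remaining 50.
model = LinearRegression().fit(x_demo[:150], y_demo[:150])
test_x_demo, test_y_demo = x_demo[150:], y_demo[150:]
pred = model.predict(test_x_demo)

score_from_model = model.score(test_x_demo, test_y_demo)
score_from_metric = r2_score(test_y_demo, pred)

# R^2 written out: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((test_y_demo - pred) ** 2)
ss_tot = np.sum((test_y_demo - test_y_demo.mean()) ** 2)
score_by_hand = 1 - ss_res / ss_tot

print(score_from_model, score_from_metric, score_by_hand)
```

All three numbers agree, so the scores reported in this article can be read as R^2 values on the held-out data.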

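    2. Handling outliers

    [Figure: image.png]
    The figure shows that some homes have a unit price in the hundreds of thousands or even above a million; outliers like these need to be removed.
    I have not found a ready-made function that removes outliers directly, so we write our own. The code below defines a cleanOutlier function whose main job is to delete outliers. First, the lower and upper quartiles: given 100 values sorted in ascending order, the median is the 50th value, the lower quartile is the 25th value, and the upper quartile is the 75th value.
    The interquartile range (IQR) is the upper quartile minus the lower quartile. For example, with an upper quartile of 900 and a lower quartile of 700, the IQR is 200.
    Outliers are values that are too large or too small. In this method, any value below (lower quartile - 3 * IQR) or above (upper quartile + 3 * IQR) is treated as an outlier and removed. For example, with an upper quartile of 900 and a lower quartile of 700, values below 100 (700 - 3 * 200) or above 1500 (900 + 3 * 200) are removed.
    A DataFrame is converted to an ndarray with df.values. Models are usually trained on float values, so df.values.astype('float') is used to get a float array.
    cleanOutlier removes the outliers; column 0 is then assigned to y, and columns 1 through the last are assigned to x.
    Because x is mostly one-hot encoded, it does not need to be standardized again.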
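
The rule described above can also be sketched in vectorized form with NumPy. This is an alternative illustration, not the author's cleanOutlier: the helper names iqr_bounds and drop_outliers are hypothetical, np.percentile interpolates between neighbouring values so its quartiles can differ slightly from the index-based ones used below, and rows keep their original order instead of being sorted.

```python
import numpy as np

def iqr_bounds(values, mul=3.0):
    """Return (lower cut-off, upper cut-off) for the mul * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - mul * iqr, q3 + mul * iqr

def drop_outliers(data, column, mul=3.0):
    """Keep only the rows whose value in `column` lies inside the cut-offs."""
    low_cut, high_cut = iqr_bounds(data[:, column], mul)
    keep = (data[:, column] >= low_cut) & (data[:, column] <= high_cut)
    return data[keep]

# Made-up values chosen so that the quartiles are 700 and 900, as in the text.
values = np.array([50.0, 700.0, 700.0, 800.0, 900.0, 900.0, 2000.0])
data = np.column_stack([values, np.arange(len(values))])

low_cut, high_cut = iqr_bounds(data[:, 0])
cleaned = drop_outliers(data, 0)

print(low_cut, high_cut)  # cut-offs 100.0 and 1500.0
print(cleaned[:, 0])      # 50 and 2000 are gone
```
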
    from sklearn.model_selection import train_test_split
    from sklearn import preprocessing
    from sklearn.neural_network import MLPRegressor
    from sklearn.ensemble import GradientBoostingRegressor
    import pandas as pd
    
    def cleanOutlier(data,column,mul=3):
        data = data[data[:,column].argsort()] # rows sorted by the given column
        l = len(data)
        low = int(l/4)
        high = int(l/4*3)
        lowValue = data[low,column]   # lower quartile
        highValue = data[high,column] # upper quartile
        print("lower quartile {}  upper quartile {}".format(lowValue,highValue))
        # clamp the cut-offs to the observed minimum and maximum
        if lowValue - mul * (highValue - lowValue) < data[0,column] :
            delLowValue = data[0,column]
        else:
            delLowValue = lowValue - mul * (highValue - lowValue)
        if highValue + mul * (highValue - lowValue) > data[-1,column]:
            delHighValue = data[-1,column]
        else:
            delHighValue = highValue + mul * (highValue - lowValue)
        print("removing values in column {} below {} or above {}".format(column,
              delLowValue,delHighValue))
        # defaults in case neither loop below finds a matching row
        recordLow,recordHigh = low,high
        for i in range(low):
            if data[i,column] >= delLowValue:
                recordLow = i 
                break
        for i in range(len(data)-1,high,-1):
            if data[i,column] <= delHighValue:
                recordHigh = i
                break
        # print a summary of the outlier removal
        print("original array has {} rows".format(len(data)),end=', ')
        print("keeping rows {} to {}".format(recordLow,recordHigh),end=', ')
        data = data[recordLow:recordHigh+1]
        print("after removing outliers from column {}, {} rows remain".format(column,
              recordHigh+1-recordLow))
        return data
    
    df = pd.read_excel("数据处理结果.xlsx")
    data = df.values.astype('float')
    data = cleanOutlier(data,0)
    x = data[:,1:]
    y = data[:,0]
    
    train_x,test_x,train_y,test_y = train_test_split(x,y,train_size=0.8,\
                                                      random_state=33)
    
    ss_y = preprocessing.StandardScaler()
    train_y = ss_y.fit_transform(train_y.reshape(-1,1))
    test_y = ss_y.transform(test_y.reshape(-1,1))
    
    model_mlp = MLPRegressor(solver='lbfgs',hidden_layer_sizes=(20,20,20),random_state=1)
    model_mlp.fit(train_x,train_y.ravel())
    mlp_score = model_mlp.score(test_x,test_y.ravel())
    print("sklearn MLP regressor score",mlp_score)
    
    # y was already standardized above, so it is used as-is here
    model_gbr = GradientBoostingRegressor(learning_rate=0.1)
    model_gbr.fit(train_x,train_y.ravel())
    gbr_score = model_gbr.score(test_x,test_y.ravel())
    print("sklearn gradient boosting regressor score",gbr_score)
    

    The printed result is:

    sklearn MLP regressor score 0.795028773029
    sklearn gradient boosting regressor score 0.767157061712

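    After this second adjustment, the MLP regressor's score improves markedly (from about 0.684 to 0.795), while the gradient boosting regressor's score barely changes (from about 0.762 to 0.767). Overall, the outlier handling was a success.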

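    3. Normalization (log transform)

    Normalization here means taking the natural logarithm (base e) of each value of y and assigning the result back to y.
    It is done with a loop: for i in range(len(y)): y[i] = math.log(y[i])
    In principle, standardization should no longer be needed after the log transform, but in experiments standardizing both x and y still improved the scores.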
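
Because the models are trained on standardized log prices, a prediction has to be mapped back before it can be read as a price. The following is a minimal sketch of that round trip, assuming made-up prices; the helper to_price is hypothetical and does not appear in the article, and np.log is used as a vectorized equivalent of the math.log loop.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical unit prices, for illustration only.
prices = np.array([18000.0, 25000.0, 32000.0, 41000.0, 60000.0])

# Vectorized equivalent of the math.log loop.
log_prices = np.log(prices)

# Standardize the log prices the same way the article standardizes y.
ss_y = StandardScaler()
scaled = ss_y.fit_transform(log_prices.reshape(-1, 1)).ravel()

def to_price(scaled_log_price, scaler):
    """Map standardized log prices back to prices in the original unit."""
    column = np.asarray(scaled_log_price, dtype=float).reshape(-1, 1)
    log_price = scaler.inverse_transform(column)
    return np.exp(log_price).ravel()

restored = to_price(scaled, ss_y)
print(restored)  # matches the original prices up to floating-point error
```

In the article's scripts the same helper could be applied to model.predict(test_x) together with the fitted ss_y, under the assumption that y was log-transformed and standardized exactly as shown here.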

    from sklearn.model_selection import train_test_split
    from sklearn import preprocessing
    from sklearn.neural_network import MLPRegressor
    from sklearn.ensemble import GradientBoostingRegressor
    import pandas as pd
    import math
    
    def cleanOutlier(data,column,mul=3):
        data = data[data[:,column].argsort()] # rows sorted by the given column
        l = len(data)
        low = int(l/4)
        high = int(l/4*3)
        lowValue = data[low,column]   # lower quartile
        highValue = data[high,column] # upper quartile
        print("lower quartile {}  upper quartile {}".format(lowValue,highValue))
        # clamp the cut-offs to the observed minimum and maximum
        if lowValue - mul * (highValue - lowValue) < data[0,column] :
            delLowValue = data[0,column]
        else:
            delLowValue = lowValue - mul * (highValue - lowValue)
        if highValue + mul * (highValue - lowValue) > data[-1,column]:
            delHighValue = data[-1,column]
        else:
            delHighValue = highValue + mul * (highValue - lowValue)
        print("removing values in column {} below {} or above {}".format(column,
              delLowValue,delHighValue))
        # defaults in case neither loop below finds a matching row
        recordLow,recordHigh = low,high
        for i in range(low):
            if data[i,column] >= delLowValue:
                recordLow = i 
                break
        for i in range(len(data)-1,high,-1):
            if data[i,column] <= delHighValue:
                recordHigh = i
                break
        # print a summary of the outlier removal
        print("original array has {} rows".format(len(data)),end=', ')
        print("keeping rows {} to {}".format(recordLow,recordHigh),end=', ')
        data = data[recordLow:recordHigh+1]
        print("after removing outliers from column {}, {} rows remain".format(column,
              recordHigh+1-recordLow))
        return data
    
    df = pd.read_excel("数据处理结果.xlsx")
    data = df.values.astype('float')
    data = cleanOutlier(data,0)
    x = data[:,1:]
    y = data[:,0]
    for i in range(len(y)):
        y[i] = math.log(y[i])
        
    train_x,test_x,train_y,test_y = train_test_split(x,y,train_size=0.8,\
                                                      random_state=33)
    
    ss_x = preprocessing.StandardScaler()
    train_x = ss_x.fit_transform(train_x)
    test_x = ss_x.transform(test_x)
    
    ss_y = preprocessing.StandardScaler()
    train_y = ss_y.fit_transform(train_y.reshape(-1,1))
    test_y = ss_y.transform(test_y.reshape(-1,1))
    
    model_mlp = MLPRegressor(solver='lbfgs',hidden_layer_sizes=(20,20,20),random_state=1)
    model_mlp.fit(train_x,train_y.ravel())
    mlp_score = model_mlp.score(test_x,test_y.ravel())
    print("sklearn MLP regressor score",mlp_score)
    
    model_gbr = GradientBoostingRegressor(learning_rate=0.1)
    model_gbr.fit(train_x,train_y.ravel())
    gbr_score = model_gbr.score(test_x,test_y.ravel())
    print("sklearn gradient boosting regressor score",gbr_score)
    

    The printed result is:

    sklearn MLP regressor score 0.831448099649
    sklearn gradient boosting regressor score 0.780133207248

    Compared with the previous run, the scores improved again, so this adjustment was also a success.

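    4. Cross-validation

    The training and test sets are chosen with KFold from sklearn.model_selection.
    kf = KFold(n_splits=5,shuffle=True) initializes the KFold object.
    for train_index,test_index in kf.split(x): iterates over kf.split(x), which is a generator that yields n_splits (here 5) tuples; each tuple holds the index arrays of the training set and the test set for one fold. Because shuffle=True is used without a random_state, the folds, and therefore the scores, vary from run to run.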
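
A small sketch of what kf.split produces, on a made-up array of 10 samples. The array x_demo and the random_state=0 argument are additions for this illustration (the latter only so that the folds are reproducible); they are not part of the article's code.

```python
import numpy as np
from sklearn.model_selection import KFold

x_demo = np.arange(20).reshape(10, 2)  # 10 samples, 2 features

# random_state is an addition here so that the folds are reproducible.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# kf.split returns a generator; list() collects its (train, test) tuples.
folds = list(kf.split(x_demo))
for fold_no, (train_index, test_index) in enumerate(folds):
    print(fold_no, "train:", train_index, "test:", test_index)
```

Each sample index appears in exactly one test fold, so every row is used for testing once and for training four times.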

    from sklearn import preprocessing
    from sklearn.neural_network import MLPRegressor
    from sklearn.ensemble import GradientBoostingRegressor
    import pandas as pd
    import math
    from sklearn.model_selection import KFold
    
    def cleanOutlier(data,column,mul=3):
        data = data[data[:,column].argsort()] # rows sorted by the given column
        l = len(data)
        low = int(l/4)
        high = int(l/4*3)
        lowValue = data[low,column]   # lower quartile
        highValue = data[high,column] # upper quartile
        print("lower quartile {}  upper quartile {}".format(lowValue,highValue))
        # clamp the cut-offs to the observed minimum and maximum
        if lowValue - mul * (highValue - lowValue) < data[0,column] :
            delLowValue = data[0,column]
        else:
            delLowValue = lowValue - mul * (highValue - lowValue)
        if highValue + mul * (highValue - lowValue) > data[-1,column]:
            delHighValue = data[-1,column]
        else:
            delHighValue = highValue + mul * (highValue - lowValue)
        print("removing values in column {} below {} or above {}".format(column,
              delLowValue,delHighValue))
        # defaults in case neither loop below finds a matching row
        recordLow,recordHigh = low,high
        for i in range(low):
            if data[i,column] >= delLowValue:
                recordLow = i 
                break
        for i in range(len(data)-1,high,-1):
            if data[i,column] <= delHighValue:
                recordHigh = i
                break
        # print a summary of the outlier removal
        print("original array has {} rows".format(len(data)),end=', ')
        print("keeping rows {} to {}".format(recordLow,recordHigh),end=', ')
        data = data[recordLow:recordHigh+1]
        print("after removing outliers from column {}, {} rows remain".format(column,
              recordHigh+1-recordLow))
        return data
    
    df = pd.read_excel("数据处理结果.xlsx")
    data = df.values.astype('float')
    data = cleanOutlier(data,0)
    x = data[:,1:]
    y = data[:,0]
    for i in range(len(y)):
        y[i] = math.log(y[i])
    
    kf = KFold(n_splits=5,shuffle=True)
    
    for train_index,test_index in kf.split(x):
        train_x = x[train_index]    
        test_x = x[test_index]
        train_y = y[train_index]
        test_y = y[test_index]
     
        ss_x = preprocessing.StandardScaler()
        train_x = ss_x.fit_transform(train_x)
        test_x = ss_x.transform(test_x)
        
        ss_y = preprocessing.StandardScaler()
        train_y = ss_y.fit_transform(train_y.reshape(-1,1))
        test_y = ss_y.transform(test_y.reshape(-1,1))
        
        model_mlp = MLPRegressor(solver='lbfgs',hidden_layer_sizes=(20,20,20),random_state=1)
        model_mlp.fit(train_x,train_y.ravel())
        mlp_score = model_mlp.score(test_x,test_y.ravel())
        print("sklearn MLP regressor score",mlp_score)
        
        model_gbr = GradientBoostingRegressor(learning_rate=0.1)
        model_gbr.fit(train_x,train_y.ravel())
        gbr_score = model_gbr.score(test_x,test_y.ravel())
        print("sklearn gradient boosting regressor score",gbr_score)
    

    The printed result is:

    sklearn MLP regressor score 0.8427725943791746
    sklearn gradient boosting regressor score 0.7915684454283963
    sklearn MLP regressor score 0.8317854959807023
    sklearn gradient boosting regressor score 0.7705608099963528
    sklearn MLP regressor score 0.8369280445356948
    sklearn gradient boosting regressor score 0.7851823734454625
    sklearn MLP regressor score 0.8364897250676866
    sklearn gradient boosting regressor score 0.7833199279062474
    sklearn MLP regressor score 0.8335782493590231
    sklearn gradient boosting regressor score 0.7722233325504181
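
The ten fold scores are easier to compare as a mean and spread per model. The sketch below simply copies the numbers from the printed result above and summarizes them; the variable names are illustrative.

```python
import statistics

# Fold scores copied from the printed result above.
mlp_scores = [0.8427725943791746, 0.8317854959807023, 0.8369280445356948,
              0.8364897250676866, 0.8335782493590231]
gbr_scores = [0.7915684454283963, 0.7705608099963528, 0.7851823734454625,
              0.7833199279062474, 0.7722233325504181]

summary = {}
for name, scores in (("MLP", mlp_scores), ("GBR", gbr_scores)):
    mean = statistics.mean(scores)
    spread = statistics.pstdev(scores)  # population standard deviation
    summary[name] = (mean, spread)
    print("{}: mean {:.4f}, std {:.4f}".format(name, mean, spread))
```

Across these five folds the multi-layer perceptron averages about 0.836 and the gradient boosting regressor about 0.781, in line with the single train/test split in section 3.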

    Article link: https://www.haomeiwen.com/subject/woeryftx.html