
Linear Regression Analysis: Boston House Price Prediction

Author: a_big_cat | Published 2021-03-29 08:39

Analysis Objective:

Using the Boston house price dataset, we build a linear regression model that relates the median house price to the explanatory variables.

1. Loading the Dataset



# Load the dataset
import pandas as pd
import numpy as np
Boston=pd.read_csv("./BostonHousePriceDataset.csv",
                   usecols=['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV'])

# Feature   Description
# CRIM      per-capita crime rate by town
# ZN        proportion of residential land zoned for lots over 25,000 sq. ft.
# INDUS     proportion of non-retail business acres per town
# CHAS      Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)
# NOX       nitric oxide concentration (parts per 10 million)
# RM        average number of rooms per dwelling
# AGE       proportion of owner-occupied units built before 1940
# DIS       weighted distance to five Boston employment centers
# RAD       index of accessibility to radial highways
# TAX       full-value property tax rate per $10,000
# PTRATIO   pupil-teacher ratio by town
# B         1000*(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
# LSTAT     percentage of lower-status population
# MEDV      median value of owner-occupied homes (in $1000s)

Boston.head()  # first 5 rows of the dataset
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.04741 0.0 11.93 0 0.573 6.030 80.8 2.5050 1 273.0 21.0 396.90 7.88 11.9
1 0.10959 0.0 11.93 0 0.573 6.794 89.3 2.3889 1 273.0 21.0 393.45 6.48 22.0
2 0.06076 0.0 11.93 0 0.573 6.976 91.0 2.1675 1 273.0 21.0 396.90 5.64 23.9
3 0.04527 0.0 11.93 0 0.573 6.120 76.7 2.2875 1 273.0 21.0 396.90 9.08 20.6
4 0.06263 0.0 11.93 0 0.573 6.593 69.1 2.4786 1 273.0 21.0 391.99 9.67 22.4

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# from sklearn.datasets import load_boston
# dataset = load_boston()

plt.rcParams['font.sans-serif'] = ['SimHei']  # use a CJK-capable font for plot labels
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly on axes

2. Data Exploration


Boston.info()  # dataset summary
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    int64  
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB

None of the columns contain missing values.
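As a quick cross-check (a minimal sketch, not part of the original notebook), the per-column missing-value counts can also be printed directly:

print(Boston.isnull().sum())  # every count should be 0, matching the non-null counts from Boston.info()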

Boston.describe().T  # summary statistics
count mean std min 25% 50% 75% max
CRIM 506.0 3.613524 8.601545 0.00632 0.082045 0.25651 3.677082 88.9762
ZN 506.0 11.363636 23.322453 0.00000 0.000000 0.00000 12.500000 100.0000
INDUS 506.0 11.136779 6.860353 0.46000 5.190000 9.69000 18.100000 27.7400
CHAS 506.0 0.069170 0.253994 0.00000 0.000000 0.00000 0.000000 1.0000
NOX 506.0 0.554695 0.115878 0.38500 0.449000 0.53800 0.624000 0.8710
RM 506.0 6.284634 0.702617 3.56100 5.885500 6.20850 6.623500 8.7800
AGE 506.0 68.574901 28.148861 2.90000 45.025000 77.50000 94.075000 100.0000
DIS 506.0 3.795043 2.105710 1.12960 2.100175 3.20745 5.188425 12.1265
RAD 506.0 9.549407 8.707259 1.00000 4.000000 5.00000 24.000000 24.0000
TAX 506.0 408.237154 168.537116 187.00000 279.000000 330.00000 666.000000 711.0000
PTRATIO 506.0 18.455534 2.164946 12.60000 17.400000 19.05000 20.200000 22.0000
B 506.0 356.674032 91.294864 0.32000 375.377500 391.44000 396.225000 396.9000
LSTAT 506.0 12.653063 7.141062 1.73000 6.950000 11.36000 16.955000 37.9700
MEDV 506.0 22.532806 9.197104 5.00000 17.025000 21.20000 25.000000 50.0000

Several variables (for example CRIM, ZN, and B) contain extreme values far from their interquartile ranges.
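To quantify this, the small sketch below counts values outside the conventional 1.5×IQR whiskers in each column (the 1.5×IQR rule is my own assumption; the original analysis does not define "extreme"):

# Count observations beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR in each column
Q1 = Boston.quantile(0.25)
Q3 = Boston.quantile(0.75)
IQR = Q3 - Q1
outlier_counts = ((Boston < Q1 - 1.5 * IQR) | (Boston > Q3 + 1.5 * IQR)).sum()
print(outlier_counts.sort_values(ascending=False))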



import matplotlib.pyplot as plt
# Helper that draws a scatter plot of one feature against house price
def drawing(x, y, xlabel):
    plt.scatter(x, y)
    plt.title('%s vs. house price' % xlabel)
    plt.xlabel(xlabel)
    plt.ylabel('House price (MEDV)')
    plt.yticks(range(0, 60, 5))
    plt.grid()
    plt.show()


Scatter plots of each variable against house price:



plt.figure(figsize=(15,10.5))
plot_count = 1
for feature in list(Boston.columns)[0:12]:
    plt.subplot(3,4,plot_count)
    plt.scatter(Boston[feature], Boston['MEDV'])
    plt.xlabel(feature.replace('_',' ').title())
    plt.ylabel('MEDV')
    plot_count+=1
plt.show()


[Figure output_14_0.png: scatter plots of each feature against MEDV]


# Compute the correlation matrix
corr = Boston.corr()
print(corr)
# Plot the correlation matrix as a heatmap
import seaborn as sn
varcorr = Boston[['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']].corr()
mask = np.array(varcorr)
mask[np.tril_indices_from(mask)] = False  # zero the lower triangle; the non-zero upper-triangle cells get masked out
sn.heatmap(varcorr, mask=mask, vmax=.8, square=True, annot=False)


             CRIM        ZN     INDUS      CHAS       NOX        RM       AGE  \
CRIM     1.000000 -0.200469  0.406583 -0.055892  0.420972 -0.219247  0.352734   
ZN      -0.200469  1.000000 -0.533828 -0.042697 -0.516604  0.311991 -0.569537   
INDUS    0.406583 -0.533828  1.000000  0.062938  0.763651 -0.391676  0.644779   
CHAS    -0.055892 -0.042697  0.062938  1.000000  0.091203  0.091251  0.086518   
NOX      0.420972 -0.516604  0.763651  0.091203  1.000000 -0.302188  0.731470   
RM      -0.219247  0.311991 -0.391676  0.091251 -0.302188  1.000000 -0.240265   
AGE      0.352734 -0.569537  0.644779  0.086518  0.731470 -0.240265  1.000000   
DIS     -0.379670  0.664408 -0.708027 -0.099176 -0.769230  0.205246 -0.747881   
RAD      0.625505 -0.311948  0.595129 -0.007368  0.611441 -0.209847  0.456022   
TAX      0.582764 -0.314563  0.720760 -0.035587  0.668023 -0.292048  0.506456   
PTRATIO  0.289946 -0.391679  0.383248 -0.121515  0.188933 -0.355501  0.261515   
B       -0.385064  0.175520 -0.356977  0.048788 -0.380051  0.128069 -0.273534   
LSTAT    0.455621 -0.412995  0.603800 -0.053929  0.590879 -0.613808  0.602339   
MEDV    -0.388305  0.360445 -0.483725  0.175260 -0.427321  0.695360 -0.376955   

              DIS       RAD       TAX   PTRATIO         B     LSTAT      MEDV  
CRIM    -0.379670  0.625505  0.582764  0.289946 -0.385064  0.455621 -0.388305  
ZN       0.664408 -0.311948 -0.314563 -0.391679  0.175520 -0.412995  0.360445  
INDUS   -0.708027  0.595129  0.720760  0.383248 -0.356977  0.603800 -0.483725  
CHAS    -0.099176 -0.007368 -0.035587 -0.121515  0.048788 -0.053929  0.175260  
NOX     -0.769230  0.611441  0.668023  0.188933 -0.380051  0.590879 -0.427321  
RM       0.205246 -0.209847 -0.292048 -0.355501  0.128069 -0.613808  0.695360  
AGE     -0.747881  0.456022  0.506456  0.261515 -0.273534  0.602339 -0.376955  
DIS      1.000000 -0.494588 -0.534432 -0.232471  0.291512 -0.496996  0.249929  
RAD     -0.494588  1.000000  0.910228  0.464741 -0.444413  0.488676 -0.381626  
TAX     -0.534432  0.910228  1.000000  0.460853 -0.441808  0.543993 -0.468536  
PTRATIO -0.232471  0.464741  0.460853  1.000000 -0.177383  0.374044 -0.507787  
B        0.291512 -0.444413 -0.441808 -0.177383  1.000000 -0.366087  0.333461  
LSTAT   -0.496996  0.488676  0.543993  0.374044 -0.366087  1.000000 -0.737663  
MEDV     0.249929 -0.381626 -0.468536 -0.507787  0.333461 -0.737663  1.000000  





[Figure output_15_2.png: lower-triangle correlation heatmap]


print(Boston.corr().abs().nlargest(4, 'MEDV').index)


Index(['MEDV', 'LSTAT', 'RM', 'PTRATIO'], dtype='object')

The three variables most strongly correlated (in absolute value) with the target MEDV are LSTAT, RM, and PTRATIO.

3. Building the Model

X = Boston[['LSTAT','RM','PTRATIO']]
Y = Boston['MEDV']
from sklearn.model_selection import train_test_split
x_train, x_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2,random_state=5)

print(x_train.shape)
print(Y_train.shape)
print(x_test.shape)
print(Y_test.shape)
(404, 3)
(404,)
(102, 3)
(102,)

from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, Y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)


print(model.intercept_)


15.418645832987309


coeffcients = pd.DataFrame([x_train.columns,model.coef_]).T
coeffcients = coeffcients.rename(columns={0: 'Attribute', 1: 'Coefficients'})
coeffcients


  Attribute  Coefficients
0     LSTAT     -0.566666
1        RM       4.99206
2   PTRATIO     -0.929461

The fitted regression equation is approximately: MEDV = 15.42 - 0.57*LSTAT + 4.99*RM - 0.93*PTRATIO
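As a sanity check (a small sketch, not part of the original article), the equation can be verified against model.predict for a single test sample:

# Rebuild the prediction for the first test sample from the intercept and
# coefficients, then compare with model.predict; the two values should match.
sample = x_test.iloc[0]
manual_pred = model.intercept_ + np.dot(model.coef_, sample.values)
print(manual_pred, model.predict(x_test.iloc[[0]])[0])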

4. Model Evaluation

print('R-Squared: %.4f' % model.score(x_test,Y_test))
R-Squared: 0.6035

The model explains about 60% of the variance in the test set, a reasonable but not strong fit.
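R² alone does not say how large the errors are in dollar terms; the sketch below (assuming the same x_test / Y_test split as above) adds RMSE and MAE, both in units of MEDV, i.e. thousands of dollars:

# Error metrics in the original units of MEDV ($1000s)
from sklearn.metrics import mean_squared_error, mean_absolute_error
test_pred = model.predict(x_test)
rmse = np.sqrt(mean_squared_error(Y_test, test_pred))
mae = mean_absolute_error(Y_test, test_pred)
print('RMSE: %.2f, MAE: %.2f' % (rmse, mae))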

price_pred = model.predict(x_test)
plt.scatter(Y_test, price_pred)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted prices")
plt.title("Actual prices vs Predicted prices")
[Figure output_29_1.png: actual prices vs. predicted prices]

Check the distribution of the target variable:



from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sn
sn.distplot(Boston['MEDV'], hist=True);              # histogram + kernel density of MEDV
fig = plt.figure()
res = stats.probplot(Boston['MEDV'], plot=plt)       # normal Q-Q plot of MEDV


[Figures output_31_0.png, output_31_1.png: histogram/KDE and normal Q-Q plot of MEDV]

The histogram and Q-Q plot of MEDV show a cluster of observations at 50 (the upper cap of MEDV in this dataset); these outliers have a strong influence on the linear regression.
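Before dropping anything, it is worth counting how many observations actually sit at the cap (a quick sketch; this count does not appear in the original output):

# MEDV tops out at 50.0 in this dataset; count the capped observations
print((Boston['MEDV'] == 50).sum(), 'observations have MEDV == 50')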

# Drop observations with MEDV == 50
Boston_new=Boston[Boston['MEDV']<50]
Boston_new.info()
X_new = Boston_new[ ['LSTAT','RM','PTRATIO']]
Y_new = Boston_new['MEDV']
from sklearn.model_selection import train_test_split
x_new_train, x_new_test, Y_new_train, Y_new_test = train_test_split(X_new, Y_new, test_size = 0.2,random_state=5)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_new_train, Y_new_train)
print('R-Squared: %.4f' % model.score(x_new_test,Y_new_test))


<class 'pandas.core.frame.DataFrame'>
Int64Index: 490 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     490 non-null    float64
 1   ZN       490 non-null    float64
 2   INDUS    490 non-null    float64
 3   CHAS     490 non-null    int64  
 4   NOX      490 non-null    float64
 5   RM       490 non-null    float64
 6   AGE      490 non-null    float64
 7   DIS      490 non-null    float64
 8   RAD      490 non-null    int64  
 9   TAX      490 non-null    float64
 10  PTRATIO  490 non-null    float64
 11  B        490 non-null    float64
 12  LSTAT    490 non-null    float64
 13  MEDV     490 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 57.4 KB
R-Squared: 0.7458

After removing the MEDV = 50 samples, the refitted model reaches an R² of 0.7458 on the test set, a clear improvement over the first fit.
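Because a single 80/20 split can be lucky or unlucky, a short cross-validation sketch (my own addition, reusing the cleaned X_new / Y_new from above) checks whether the improvement holds across folds:

# 5-fold cross-validated R^2 on the cleaned data, as a robustness check
# for the single train/test split used above
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(LinearRegression(), X_new, Y_new, cv=5, scoring='r2')
print('CV R^2: mean %.4f, std %.4f' % (cv_scores.mean(), cv_scores.std()))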

from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sn
sn.distplot(Boston_new['MEDV'], hist=True);
fig = plt.figure()
res = stats.probplot(Boston_new['MEDV'], plot=plt)
[Figures output_35_0.png, output_35_1.png: histogram/KDE and normal Q-Q plot of MEDV after removing the capped samples]

print(model.intercept_)

coeffcients = pd.DataFrame([x_new_train.columns,model.coef_]).T
coeffcients = coeffcients.rename(columns={0: 'Attribute', 1: 'Coefficients'})
coeffcients



22.52450276250959
  Attribute  Coefficients
0     LSTAT     -0.533567
1        RM       3.75825
2   PTRATIO     -0.944395

5. Summary

The final regression equation is: MEDV = 22.52 - 0.53*LSTAT + 3.76*RM - 0.94*PTRATIO

The refitted model achieves an R² of 0.7458 on the test set.
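As a usage illustration (the feature values below are hypothetical, chosen near the dataset medians; they are not from the original article), the final model can score a new neighborhood:

# Predict the median home price for a hypothetical neighborhood;
# LSTAT=11, RM=6.2, PTRATIO=19 are illustrative values near the column medians
new_obs = pd.DataFrame({'LSTAT': [11.0], 'RM': [6.2], 'PTRATIO': [19.0]})
print('Predicted MEDV (in $1000s): %.2f' % model.predict(new_obs)[0])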

