Analysis objective:
Using the Boston house-price dataset, we build a linear regression model that relates the house price to the explanatory variables.
1. Load the dataset
# Load the data
import pandas as pd
import numpy as np
Boston=pd.read_csv("./BostonHousePriceDataset.csv",
usecols=['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV'])
# Feature    Description
# CRIM       per-capita crime rate by town
# ZN         proportion of residential land zoned for lots over 25,000 sq. ft.
# INDUS      proportion of non-retail business acres per town
# CHAS       Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
# NOX        nitric oxide concentration (parts per 10 million)
# RM         average number of rooms per dwelling
# AGE        proportion of owner-occupied units built before 1940
# DIS        weighted distance to five Boston employment centres
# RAD        index of accessibility to radial highways
# TAX        full-value property tax rate per $10,000
# PTRATIO    pupil-teacher ratio by town
# B          1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
# LSTAT      percentage of lower-status population
# MEDV       median value of owner-occupied homes (in $1000s; the target variable)
Boston.head()  # first five rows of the dataset
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.04741 | 0.0 | 11.93 | 0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
| 1 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
| 2 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
| 3 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
| 4 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# from sklearn.datasets import load_boston
# dataset = load_boston()
plt.rcParams['font.sans-serif'] = ['SimHei']  # use the SimHei font so Chinese labels display correctly
plt.rcParams['axes.unicode_minus'] = False    # display minus signs correctly on the axes
2. Data exploration
Boston.info()  # summary of the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRIM 506 non-null float64
1 ZN 506 non-null float64
2 INDUS 506 non-null float64
3 CHAS 506 non-null int64
4 NOX 506 non-null float64
5 RM 506 non-null float64
6 AGE 506 non-null float64
7 DIS 506 non-null float64
8 RAD 506 non-null int64
9 TAX 506 non-null float64
10 PTRATIO 506 non-null float64
11 B 506 non-null float64
12 LSTAT 506 non-null float64
13 MEDV 506 non-null float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB
None of the variables contain missing values.
Boston.describe().T  # basic descriptive statistics of the dataset
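A quick check confirms this; a minimal sketch using the already-loaded Boston DataFrame:
# Count missing values per column; every count should be 0
print(Boston.isnull().sum())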
| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CRIM | 506.0 | 3.613524 | 8.601545 | 0.00632 | 0.082045 | 0.25651 | 3.677082 | 88.9762 |
| ZN | 506.0 | 11.363636 | 23.322453 | 0.00000 | 0.000000 | 0.00000 | 12.500000 | 100.0000 |
| INDUS | 506.0 | 11.136779 | 6.860353 | 0.46000 | 5.190000 | 9.69000 | 18.100000 | 27.7400 |
| CHAS | 506.0 | 0.069170 | 0.253994 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 1.0000 |
| NOX | 506.0 | 0.554695 | 0.115878 | 0.38500 | 0.449000 | 0.53800 | 0.624000 | 0.8710 |
| RM | 506.0 | 6.284634 | 0.702617 | 3.56100 | 5.885500 | 6.20850 | 6.623500 | 8.7800 |
| AGE | 506.0 | 68.574901 | 28.148861 | 2.90000 | 45.025000 | 77.50000 | 94.075000 | 100.0000 |
| DIS | 506.0 | 3.795043 | 2.105710 | 1.12960 | 2.100175 | 3.20745 | 5.188425 | 12.1265 |
| RAD | 506.0 | 9.549407 | 8.707259 | 1.00000 | 4.000000 | 5.00000 | 24.000000 | 24.0000 |
| TAX | 506.0 | 408.237154 | 168.537116 | 187.00000 | 279.000000 | 330.00000 | 666.000000 | 711.0000 |
| PTRATIO | 506.0 | 18.455534 | 2.164946 | 12.60000 | 17.400000 | 19.05000 | 20.200000 | 22.0000 |
| B | 506.0 | 356.674032 | 91.294864 | 0.32000 | 375.377500 | 391.44000 | 396.225000 | 396.9000 |
| LSTAT | 506.0 | 12.653063 | 7.141062 | 1.73000 | 6.950000 | 11.36000 | 16.955000 | 37.9700 |
| MEDV | 506.0 | 22.532806 | 9.197104 | 5.00000 | 17.025000 | 21.20000 | 25.000000 | 50.0000 |
Some variables contain extreme values; for example, CRIM has a median of 0.26 but a maximum of 88.98.
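Boxplots make such extreme values easy to spot; below is a small sketch (not part of the original notebook) that draws one boxplot per column:
import matplotlib.pyplot as plt
# One boxplot per feature; points beyond the whiskers are the extreme values noted above
Boston.plot(kind='box', subplots=True, layout=(4, 4), figsize=(15, 12), sharex=False, sharey=False)
plt.show()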
import matplotlib.pyplot as plt
# Helper function for plotting one feature against the house price
def drawing(x, y, xlabel):
    plt.scatter(x, y)
    plt.title('%s与房价散点图' % xlabel)  # '<feature> vs. house price'
    plt.xlabel(xlabel)
    plt.ylabel('房价')  # 'house price'
    plt.yticks(range(0, 60, 5))
    plt.grid()
    plt.show()
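The helper is not called by the loop below, but using it for a single feature such as RM would look like this:
# Example call: scatter plot of RM (average rooms per dwelling) against the house price
drawing(Boston['RM'], Boston['MEDV'], 'RM')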
Scatter plots of each variable against the house price:
plt.figure(figsize=(15, 10.5))
plot_count = 1
for feature in list(Boston.columns)[0:12]:  # first 12 features, so they fit a 3x4 grid
    plt.subplot(3, 4, plot_count)
    plt.scatter(Boston[feature], Boston['MEDV'])
    plt.xlabel(feature.replace('_', ' ').title())
    plt.ylabel('MEDV')
    plot_count += 1
plt.show()
[Figure: scatter plots of the first 12 features against MEDV]
# Compute the correlation matrix
corr = Boston.corr()
print(corr)
# Plot the correlation matrix as a heatmap
import seaborn as sn
varcorr = Boston[['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']].corr()
mask = np.zeros_like(varcorr, dtype=bool)
mask[np.triu_indices_from(mask, k=1)] = True  # hide the redundant upper triangle
sn.heatmap(varcorr, mask=mask, vmax=.8, square=True, annot=False)
CRIM ZN INDUS CHAS NOX RM AGE \
CRIM 1.000000 -0.200469 0.406583 -0.055892 0.420972 -0.219247 0.352734
ZN -0.200469 1.000000 -0.533828 -0.042697 -0.516604 0.311991 -0.569537
INDUS 0.406583 -0.533828 1.000000 0.062938 0.763651 -0.391676 0.644779
CHAS -0.055892 -0.042697 0.062938 1.000000 0.091203 0.091251 0.086518
NOX 0.420972 -0.516604 0.763651 0.091203 1.000000 -0.302188 0.731470
RM -0.219247 0.311991 -0.391676 0.091251 -0.302188 1.000000 -0.240265
AGE 0.352734 -0.569537 0.644779 0.086518 0.731470 -0.240265 1.000000
DIS -0.379670 0.664408 -0.708027 -0.099176 -0.769230 0.205246 -0.747881
RAD 0.625505 -0.311948 0.595129 -0.007368 0.611441 -0.209847 0.456022
TAX 0.582764 -0.314563 0.720760 -0.035587 0.668023 -0.292048 0.506456
PTRATIO 0.289946 -0.391679 0.383248 -0.121515 0.188933 -0.355501 0.261515
B -0.385064 0.175520 -0.356977 0.048788 -0.380051 0.128069 -0.273534
LSTAT 0.455621 -0.412995 0.603800 -0.053929 0.590879 -0.613808 0.602339
MEDV -0.388305 0.360445 -0.483725 0.175260 -0.427321 0.695360 -0.376955
DIS RAD TAX PTRATIO B LSTAT MEDV
CRIM -0.379670 0.625505 0.582764 0.289946 -0.385064 0.455621 -0.388305
ZN 0.664408 -0.311948 -0.314563 -0.391679 0.175520 -0.412995 0.360445
INDUS -0.708027 0.595129 0.720760 0.383248 -0.356977 0.603800 -0.483725
CHAS -0.099176 -0.007368 -0.035587 -0.121515 0.048788 -0.053929 0.175260
NOX -0.769230 0.611441 0.668023 0.188933 -0.380051 0.590879 -0.427321
RM 0.205246 -0.209847 -0.292048 -0.355501 0.128069 -0.613808 0.695360
AGE -0.747881 0.456022 0.506456 0.261515 -0.273534 0.602339 -0.376955
DIS 1.000000 -0.494588 -0.534432 -0.232471 0.291512 -0.496996 0.249929
RAD -0.494588 1.000000 0.910228 0.464741 -0.444413 0.488676 -0.381626
TAX -0.534432 0.910228 1.000000 0.460853 -0.441808 0.543993 -0.468536
PTRATIO -0.232471 0.464741 0.460853 1.000000 -0.177383 0.374044 -0.507787
B 0.291512 -0.444413 -0.441808 -0.177383 1.000000 -0.366087 0.333461
LSTAT -0.496996 0.488676 0.543993 0.374044 -0.366087 1.000000 -0.737663
MEDV 0.249929 -0.381626 -0.468536 -0.507787 0.333461 -0.737663 1.000000
<matplotlib.axes._subplots.AxesSubplot at 0x1e33967a948>
[Figure: correlation heatmap of all variables]
print(Boston.corr().abs().nlargest(4, 'MEDV').index)
Index(['MEDV', 'LSTAT', 'RM', 'PTRATIO'], dtype='object')
The three variables most strongly correlated (in absolute value) with the target MEDV are LSTAT, RM and PTRATIO.
3. Build the model
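For reference, their correlation coefficients with MEDV can be read straight from the matrix computed above:
# Correlation of the three strongest predictors with the target MEDV
print(corr['MEDV'][['LSTAT', 'RM', 'PTRATIO']])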
X = Boston[['LSTAT','RM','PTRATIO']]
Y = Boston['MEDV']
from sklearn.model_selection import train_test_split
x_train, x_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2,random_state=5)
print(x_train.shape)
print(Y_train.shape)
print(x_test.shape)
print(Y_test.shape)
(404, 3)
(404,)
(102, 3)
(102,)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, Y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
print(model.intercept_)
15.418645832987309
coefficients = pd.DataFrame([x_train.columns, model.coef_]).T
coefficients = coefficients.rename(columns={0: 'Attribute', 1: 'Coefficients'})
coefficients
| | Attribute | Coefficients |
| --- | --- | --- |
| 0 | LSTAT | -0.566666 |
| 1 | RM | 4.99206 |
| 2 | PTRATIO | -0.929461 |
The fitted equation is therefore MEDV ≈ 15.42 - 0.57*LSTAT + 4.99*RM - 0.93*PTRATIO.
4. Model evaluation
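As a sanity check, the equation can be evaluated by hand for one test sample and compared with model.predict (a small sketch, using the coefficients in the order of x_train's columns):
# Manual evaluation of the fitted equation vs. model.predict for the first test sample
sample = x_test.iloc[[0]]
manual = (model.intercept_
          + model.coef_[0] * sample['LSTAT'].values[0]
          + model.coef_[1] * sample['RM'].values[0]
          + model.coef_[2] * sample['PTRATIO'].values[0])
print(manual, model.predict(sample)[0])  # the two numbers should match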
print('R-Squared: %.4f' % model.score(x_test,Y_test))
R-Squared: 0.6035
With an R-squared of about 0.60 on the test set, the model gives a reasonable but far from perfect fit.
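R-squared alone says nothing about the typical size of the prediction error; RMSE and MAE on the test set add that view (a sketch, not part of the original analysis):
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
y_pred = model.predict(x_test)
print('RMSE: %.4f' % np.sqrt(mean_squared_error(Y_test, y_pred)))  # root mean squared error, in $1000s
print('MAE:  %.4f' % mean_absolute_error(Y_test, y_pred))          # mean absolute error, in $1000s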
price_pred = model.predict(x_test)
plt.scatter(Y_test, price_pred)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted prices")
plt.title("Actual prices vs Predicted prices")
Text(0.5, 1.0, 'Actual prices vs Predicted prices')
[Figure: scatter plot of actual vs. predicted prices]
Examine the distribution of the target variable MEDV.
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sn
sn.distplot(Boston['MEDV'], hist=True);
fig = plt.figure()
res = stats.probplot(Boston['MEDV'], plot=plt)
[Figure: histogram and density of MEDV]
[Figure: normal probability (Q-Q) plot of MEDV]
The histogram of MEDV shows a cluster of observations at the maximum value of 50; these points act as outliers and have a strong influence on the linear regression.
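Counting the observations that sit exactly at this cap shows how many samples are involved (consistent with the 490 remaining rows reported by info() further below):
# Number of observations capped at MEDV = 50
print((Boston['MEDV'] == 50).sum())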
# Drop the observations with MEDV = 50
Boston_new=Boston[Boston['MEDV']<50]
Boston_new.info()
X_new = Boston_new[['LSTAT','RM','PTRATIO']]
Y_new = Boston_new['MEDV']
from sklearn.model_selection import train_test_split
x_new_train, x_new_test, Y_new_train, Y_new_test = train_test_split(X_new, Y_new, test_size = 0.2,random_state=5)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_new_train, Y_new_train)
print('R-Squared: %.4f' % model.score(x_new_test,Y_new_test))
<class 'pandas.core.frame.DataFrame'>
Int64Index: 490 entries, 0 to 505
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRIM 490 non-null float64
1 ZN 490 non-null float64
2 INDUS 490 non-null float64
3 CHAS 490 non-null int64
4 NOX 490 non-null float64
5 RM 490 non-null float64
6 AGE 490 non-null float64
7 DIS 490 non-null float64
8 RAD 490 non-null int64
9 TAX 490 non-null float64
10 PTRATIO 490 non-null float64
11 B 490 non-null float64
12 LSTAT 490 non-null float64
13 MEDV 490 non-null float64
dtypes: float64(12), int64(2)
memory usage: 57.4 KB
R-Squared: 0.7458
After dropping the MEDV = 50 samples, the refitted model reaches an R-squared of 0.7458 on the test set, a clear improvement over the first fit (0.6035).
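Since a single train/test split depends on random_state, cross-validation gives a more robust estimate of this improvement; a minimal sketch (not in the original notebook):
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
# 5-fold cross-validated R-squared on the filtered data
scores = cross_val_score(LinearRegression(), X_new, Y_new, cv=5, scoring='r2')
print('CV R-squared: %.4f (+/- %.4f)' % (scores.mean(), scores.std()))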
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sn
sn.distplot(Boston_new['MEDV'], hist=True);
fig = plt.figure()
res = stats.probplot(Boston_new['MEDV'], plot=plt)
[Figure: histogram and density of MEDV after removing the capped samples]
[Figure: normal probability (Q-Q) plot of MEDV after removing the capped samples]
print(model.intercept_)
coefficients = pd.DataFrame([x_new_train.columns, model.coef_]).T
coefficients = coefficients.rename(columns={0: 'Attribute', 1: 'Coefficients'})
coefficients
22.52450276250959

| | Attribute | Coefficients |
| --- | --- | --- |
| 0 | LSTAT | -0.533567 |
| 1 | RM | 3.75825 |
| 2 | PTRATIO | -0.944395 |
5. Summary:
The final regression equation is MEDV ≈ 22.52 - 0.53*LSTAT + 3.76*RM - 0.94*PTRATIO, with an R-squared of 0.7458 on the test set.
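To show how the final model would be applied, a hypothetical neighborhood (the feature values below are purely illustrative) can be scored with the refitted model:
import pandas as pd
# Hypothetical inputs: LSTAT = 10%, RM = 6 rooms, PTRATIO = 18 (illustrative values only)
new_area = pd.DataFrame({'LSTAT': [10.0], 'RM': [6.0], 'PTRATIO': [18.0]})
print(model.predict(new_area))  # predicted MEDV, in units of $1000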