Analysis objective:
Using the Boston house-price dataset, we build a linear regression model that relates the house price to the explanatory variables.
1. Load the dataset
# Load the data
import pandas as pd
import numpy as np
Boston=pd.read_csv("./BostonHousePriceDataset.csv",
usecols=['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV'])
# Feature    Description
# CRIM       per-capita crime rate by town
# ZN         proportion of residential land zoned for lots over 25,000 sq. ft.
# INDUS      proportion of non-retail business acres per town
# CHAS       Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
# NOX        nitric oxide concentration (parts per 10 million)
# RM         average number of rooms per dwelling
# AGE        proportion of owner-occupied units built before 1940
# DIS        weighted distance to five Boston employment centres
# RAD        index of accessibility to radial highways
# TAX        full-value property tax rate per $10,000
# PTRATIO    pupil-teacher ratio by town
# B          1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
# LSTAT      percentage of lower-status population
# MEDV       median value of owner-occupied homes (in $1000s; the target variable)
Boston.head()  # first five rows of the dataset
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.04741 | 0.0 | 11.93 | 0 | 0.573 | 6.030 | 80.8 | 2.5050 | 1 | 273.0 | 21.0 | 396.90 | 7.88 | 11.9 |
| 1 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
| 2 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273.0 | 21.0 | 396.90 | 5.64 | 23.9 |
| 3 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273.0 | 21.0 | 396.90 | 9.08 | 20.6 |
| 4 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# from sklearn.datasets import load_boston
# dataset = load_boston()
plt.rcParams['font.sans-serif'] = ['SimHei']  # use the SimHei font so Chinese labels display correctly
plt.rcParams['axes.unicode_minus'] = False    # display minus signs correctly on the axes
2. Data exploration
Boston.info()  # summary of the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRIM 506 non-null float64
1 ZN 506 non-null float64
2 INDUS 506 non-null float64
3 CHAS 506 non-null int64
4 NOX 506 non-null float64
5 RM 506 non-null float64
6 AGE 506 non-null float64
7 DIS 506 non-null float64
8 RAD 506 non-null int64
9 TAX 506 non-null float64
10 PTRATIO 506 non-null float64
11 B 506 non-null float64
12 LSTAT 506 non-null float64
13 MEDV 506 non-null float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB
None of the variables contain missing values.
Boston.describe().T  # basic descriptive statistics of the dataset
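A quick check confirms this; a minimal sketch using the already-loaded Boston DataFrame:
# Count missing values per column; every count should be 0
print(Boston.isnull().sum())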
| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CRIM | 506.0 | 3.613524 | 8.601545 | 0.00632 | 0.082045 | 0.25651 | 3.677082 | 88.9762 |
| ZN | 506.0 | 11.363636 | 23.322453 | 0.00000 | 0.000000 | 0.00000 | 12.500000 | 100.0000 |
| INDUS | 506.0 | 11.136779 | 6.860353 | 0.46000 | 5.190000 | 9.69000 | 18.100000 | 27.7400 |
| CHAS | 506.0 | 0.069170 | 0.253994 | 0.00000 | 0.000000 | 0.00000 | 0.000000 | 1.0000 |
| NOX | 506.0 | 0.554695 | 0.115878 | 0.38500 | 0.449000 | 0.53800 | 0.624000 | 0.8710 |
| RM | 506.0 | 6.284634 | 0.702617 | 3.56100 | 5.885500 | 6.20850 | 6.623500 | 8.7800 |
| AGE | 506.0 | 68.574901 | 28.148861 | 2.90000 | 45.025000 | 77.50000 | 94.075000 | 100.0000 |
| DIS | 506.0 | 3.795043 | 2.105710 | 1.12960 | 2.100175 | 3.20745 | 5.188425 | 12.1265 |
| RAD | 506.0 | 9.549407 | 8.707259 | 1.00000 | 4.000000 | 5.00000 | 24.000000 | 24.0000 |
| TAX | 506.0 | 408.237154 | 168.537116 | 187.00000 | 279.000000 | 330.00000 | 666.000000 | 711.0000 |
| PTRATIO | 506.0 | 18.455534 | 2.164946 | 12.60000 | 17.400000 | 19.05000 | 20.200000 | 22.0000 |
| B | 506.0 | 356.674032 | 91.294864 | 0.32000 | 375.377500 | 391.44000 | 396.225000 | 396.9000 |
| LSTAT | 506.0 | 12.653063 | 7.141062 | 1.73000 | 6.950000 | 11.36000 | 16.955000 | 37.9700 |
| MEDV | 506.0 | 22.532806 | 9.197104 | 5.00000 | 17.025000 | 21.20000 | 25.000000 | 50.0000 |
Some variables contain extreme values; for example, CRIM has a median of 0.26 but a maximum of 88.98.
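Boxplots make such extreme values easy to spot; below is a small sketch (not part of the original notebook) that draws one boxplot per column:
import matplotlib.pyplot as plt
# One boxplot per feature; points beyond the whiskers are the extreme values noted above
Boston.plot(kind='box', subplots=True, layout=(4, 4), figsize=(15, 12), sharex=False, sharey=False)
plt.show()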
import matplotlib.pyplot as plt
# Helper function for plotting one feature against the house price
def drawing(x, y, xlabel):
    plt.scatter(x, y)
    plt.title('%s与房价散点图' % xlabel)  # '<feature> vs. house price'
    plt.xlabel(xlabel)
    plt.ylabel('房价')  # 'house price'
    plt.yticks(range(0, 60, 5))
    plt.grid()
    plt.show()
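The helper is not called by the loop below, but using it for a single feature such as RM would look like this:
# Example call: scatter plot of RM (average rooms per dwelling) against the house price
drawing(Boston['RM'], Boston['MEDV'], 'RM')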
Scatter plots of each variable against the house price:
plt.figure(figsize=(15, 10.5))
plot_count = 1
for feature in list(Boston.columns)[0:12]:  # first 12 features, so they fit a 3x4 grid
    plt.subplot(3, 4, plot_count)
    plt.scatter(Boston[feature], Boston['MEDV'])
    plt.xlabel(feature.replace('_', ' ').title())
    plt.ylabel('MEDV')
    plot_count += 1
plt.show()
[Figure: scatter plots of the first 12 features against MEDV]
# Compute the correlation matrix
corr = Boston.corr()
print(corr)
# Plot the correlation matrix as a heatmap
import seaborn as sn
varcorr = Boston[['CRIM','ZN','INDUS','CHAS','NOX','RM','AGE','DIS','RAD','TAX','PTRATIO','B','LSTAT','MEDV']].corr()
mask = np.zeros_like(varcorr, dtype=bool)
mask[np.triu_indices_from(mask, k=1)] = True  # hide the redundant upper triangle
sn.heatmap(varcorr, mask=mask, vmax=.8, square=True, annot=False)
CRIM ZN INDUS CHAS NOX RM AGE \
CRIM 1.000000 -0.200469 0.406583 -0.055892 0.420972 -0.219247 0.352734
ZN -0.200469 1.000000 -0.533828 -0.042697 -0.516604 0.311991 -0.569537
INDUS 0.406583 -0.533828 1.000000 0.062938 0.763651 -0.391676 0.644779
CHAS -0.055892 -0.042697 0.062938 1.000000 0.091203 0.091251 0.086518
NOX 0.420972 -0.516604 0.763651 0.091203 1.000000 -0.302188 0.731470
RM -0.219247 0.311991 -0.391676 0.091251 -0.302188 1.000000 -0.240265
AGE 0.352734 -0.569537 0.644779 0.086518 0.731470 -0.240265 1.000000
DIS -0.379670 0.664408 -0.708027 -0.099176 -0.769230 0.205246 -0.747881
RAD 0.625505 -0.311948 0.595129 -0.007368 0.611441 -0.209847 0.456022
TAX 0.582764 -0.314563 0.720760 -0.035587 0.668023 -0.292048 0.506456
PTRATIO 0.289946 -0.391679 0.383248 -0.121515 0.188933 -0.355501 0.261515
B -0.385064 0.175520 -0.356977 0.048788 -0.380051 0.128069 -0.273534
LSTAT 0.455621 -0.412995 0.603800 -0.053929 0.590879 -0.613808 0.602339
MEDV -0.388305 0.360445 -0.483725 0.175260 -0.427321 0.695360 -0.376955
DIS RAD TAX PTRATIO B LSTAT MEDV
CRIM -0.379670 0.625505 0.582764 0.289946 -0.385064 0.455621 -0.388305
ZN 0.664408 -0.311948 -0.314563 -0.391679 0.175520 -0.412995 0.360445
INDUS -0.708027 0.595129 0.720760 0.383248 -0.356977 0.603800 -0.483725
CHAS -0.099176 -0.007368 -0.035587 -0.121515 0.048788 -0.053929 0.175260
NOX -0.769230 0.611441 0.668023 0.188933 -0.380051 0.590879 -0.427321
RM 0.205246 -0.209847 -0.292048 -0.355501 0.128069 -0.613808 0.695360
AGE -0.747881 0.456022 0.506456 0.261515 -0.273534 0.602339 -0.376955
DIS 1.000000 -0.494588 -0.534432 -0.232471 0.291512 -0.496996 0.249929
RAD -0.494588 1.000000 0.910228 0.464741 -0.444413 0.488676 -0.381626
TAX -0.534432 0.910228 1.000000 0.460853 -0.441808 0.543993 -0.468536
PTRATIO -0.232471 0.464741 0.460853 1.000000 -0.177383 0.374044 -0.507787
B 0.291512 -0.444413 -0.441808 -0.177383 1.000000 -0.366087 0.333461
LSTAT -0.496996 0.488676 0.543993 0.374044 -0.366087 1.000000 -0.737663
MEDV 0.249929 -0.381626 -0.468536 -0.507787 0.333461 -0.737663 1.000000
<matplotlib.axes._subplots.AxesSubplot at 0x1e33967a948>
[Figure: correlation heatmap of all variables]
print(Boston.corr().abs().nlargest(4, 'MEDV').index)
Index(['MEDV', 'LSTAT', 'RM', 'PTRATIO'], dtype='object')
The three variables most strongly correlated (in absolute value) with the target MEDV are LSTAT, RM and PTRATIO.
3. Build the model
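For reference, their correlation coefficients with MEDV can be read straight from the matrix computed above:
# Correlation of the three strongest predictors with the target MEDV
print(corr['MEDV'][['LSTAT', 'RM', 'PTRATIO']])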
X = Boston[['LSTAT','RM','PTRATIO']]
Y = Boston['MEDV']
from sklearn.model_selection import train_test_split
x_train, x_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2,random_state=5)
print(x_train.shape)
print(Y_train.shape)
print(x_test.shape)
print(Y_test.shape)
(404, 3)
(404,)
(102, 3)
(102,)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train, Y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
print(model.intercept_)
15.418645832987309
coefficients = pd.DataFrame([x_train.columns, model.coef_]).T
coefficients = coefficients.rename(columns={0: 'Attribute', 1: 'Coefficients'})
coefficients
| | Attribute | Coefficients |
| --- | --- | --- |
| 0 | LSTAT | -0.566666 |
| 1 | RM | 4.99206 |
| 2 | PTRATIO | -0.929461 |
The fitted equation is therefore MEDV ≈ 15.42 - 0.57*LSTAT + 4.99*RM - 0.93*PTRATIO.
4. Model evaluation
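As a sanity check, the equation can be evaluated by hand for one test sample and compared with model.predict (a small sketch, using the coefficients in the order of x_train's columns):
# Manual evaluation of the fitted equation vs. model.predict for the first test sample
sample = x_test.iloc[[0]]
manual = (model.intercept_
          + model.coef_[0] * sample['LSTAT'].values[0]
          + model.coef_[1] * sample['RM'].values[0]
          + model.coef_[2] * sample['PTRATIO'].values[0])
print(manual, model.predict(sample)[0])  # the two numbers should match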
print('R-Squared: %.4f' % model.score(x_test,Y_test))
R-Squared: 0.6035
With an R-squared of about 0.60 on the test set, the model gives a reasonable but far from perfect fit.
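R-squared alone says nothing about the typical size of the prediction error; RMSE and MAE on the test set add that view (a sketch, not part of the original analysis):
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
y_pred = model.predict(x_test)
print('RMSE: %.4f' % np.sqrt(mean_squared_error(Y_test, y_pred)))  # root mean squared error, in $1000s
print('MAE:  %.4f' % mean_absolute_error(Y_test, y_pred))          # mean absolute error, in $1000s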
price_pred = model.predict(x_test)
plt.scatter(Y_test, price_pred)
plt.xlabel("Actual Prices")
plt.ylabel("Predicted prices")
plt.title("Actual prices vs Predicted prices")
Text(0.5, 1.0, 'Actual prices vs Predicted prices')
[Figure: scatter plot of actual vs. predicted prices]
Examine the distribution of the target variable MEDV.
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sn
sn.distplot(Boston['MEDV'], hist=True);
fig = plt.figure()
res = stats.probplot(Boston['MEDV'], plot=plt)
[Figure: histogram and density of MEDV]
[Figure: normal probability (Q-Q) plot of MEDV]
The histogram of MEDV shows a cluster of observations at the maximum value of 50; these points act as outliers and have a strong influence on the linear regression.
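Counting the observations that sit exactly at this cap shows how many samples are involved (consistent with the 490 remaining rows reported by info() further below):
# Number of observations capped at MEDV = 50
print((Boston['MEDV'] == 50).sum())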
# Drop the observations with MEDV = 50
Boston_new=Boston[Boston['MEDV']<50]
Boston_new.info()
X_new = Boston_new[['LSTAT','RM','PTRATIO']]
Y_new = Boston_new['MEDV']
from sklearn.model_selection import train_test_split
x_new_train, x_new_test, Y_new_train, Y_new_test = train_test_split(X_new, Y_new, test_size = 0.2,random_state=5)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_new_train, Y_new_train)
print('R-Squared: %.4f' % model.score(x_new_test,Y_new_test))
<class 'pandas.core.frame.DataFrame'>
Int64Index: 490 entries, 0 to 505
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRIM 490 non-null float64
1 ZN 490 non-null float64
2 INDUS 490 non-null float64
3 CHAS 490 non-null int64
4 NOX 490 non-null float64
5 RM 490 non-null float64
6 AGE 490 non-null float64
7 DIS 490 non-null float64
8 RAD 490 non-null int64
9 TAX 490 non-null float64
10 PTRATIO 490 non-null float64
11 B 490 non-null float64
12 LSTAT 490 non-null float64
13 MEDV 490 non-null float64
dtypes: float64(12), int64(2)
memory usage: 57.4 KB
R-Squared: 0.7458
After dropping the MEDV = 50 samples, the refitted model reaches an R-squared of 0.7458 on the test set, a clear improvement over the first fit (0.6035).
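Since a single train/test split depends on random_state, cross-validation gives a more robust estimate of this improvement; a minimal sketch (not in the original notebook):
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
# 5-fold cross-validated R-squared on the filtered data
scores = cross_val_score(LinearRegression(), X_new, Y_new, cv=5, scoring='r2')
print('CV R-squared: %.4f (+/- %.4f)' % (scores.mean(), scores.std()))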
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sn
sn.distplot(Boston_new['MEDV'], hist=True);
fig = plt.figure()
res = stats.probplot(Boston_new['MEDV'], plot=plt)
[Figure: histogram and density of MEDV after removing the capped samples]
[Figure: normal probability (Q-Q) plot of MEDV after removing the capped samples]
print(model.intercept_)
coefficients = pd.DataFrame([x_new_train.columns, model.coef_]).T
coefficients = coefficients.rename(columns={0: 'Attribute', 1: 'Coefficients'})
coefficients
22.52450276250959

| | Attribute | Coefficients |
| --- | --- | --- |
| 0 | LSTAT | -0.533567 |
| 1 | RM | 3.75825 |
| 2 | PTRATIO | -0.944395 |
5. Summary:
The final regression equation is MEDV ≈ 22.52 - 0.53*LSTAT + 3.76*RM - 0.94*PTRATIO, with an R-squared of 0.7458 on the test set.
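To show how the final model would be applied, a hypothetical neighborhood (the feature values below are purely illustrative) can be scored with the refitted model:
import pandas as pd
# Hypothetical inputs: LSTAT = 10%, RM = 6 rooms, PTRATIO = 18 (illustrative values only)
new_area = pd.DataFrame({'LSTAT': [10.0], 'RM': [6.0], 'PTRATIO': [18.0]})
print(model.predict(new_area))  # predicted MEDV, in units of $1000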