Multiple Linear Regression
The simple linear regression discussed earlier has a single explanatory variable. When there are several explanatory variables we need multiple linear regression, whose model formula is:

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$

Here $y$ is the response variable, $x_1, \ldots, x_n$ are the explanatory variables, $\alpha$ is the intercept term, and $\beta_1, \ldots, \beta_n$ are the coefficients.
Let's look at a new example: we extend the earlier pizza use case with a number-of-toppings variable.
Training data:

Training instance | Diameter (inches) | Number of toppings | Price (dollars) |
---|---|---|---|
1 | 6 | 2 | 7 |
2 | 8 | 1 | 9 |
3 | 10 | 0 | 13 |
4 | 14 | 2 | 17.5 |
5 | 18 | 0 | 18 |
Test data:

Test instance | Diameter (inches) | Number of toppings | Price (dollars) |
---|---|---|---|
1 | 8 | 2 | 11 |
2 | 9 | 0 | 8.5 |
3 | 11 | 2 | 15 |
4 | 16 | 2 | 18 |
5 | 12 | 0 | 11 |
Now let's use both explanatory variables to predict pizza prices.
```python
from sklearn.linear_model import LinearRegression

# Each training instance is [diameter, number of toppings]
X = [[6, 2], [8, 1], [10, 0], [14, 2], [18, 0]]
y = [[7], [9], [13], [17.5], [18]]
model = LinearRegression()
model.fit(X, y)
X_test = [[8, 2], [9, 0], [11, 2], [16, 2], [12, 0]]
y_test = [[11], [8.5], [15], [18], [11]]
predictions = model.predict(X_test)
for i, prediction in enumerate(predictions):
    print('Predicted: %s, Target: %s' % (prediction, y_test[i]))
print('R-squared: %.2f' % model.score(X_test, y_test))
```
```
Predicted: [10.0625], Target: [11]
Predicted: [10.28125], Target: [8.5]
Predicted: [13.09375], Target: [15]
Predicted: [18.14583333], Target: [18]
Predicted: [13.3125], Target: [11]
R-squared: 0.77
```
Clearly, adding the number-of-toppings variable improved the model's performance.
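To see what `LinearRegression` computes under the hood, here is a minimal sketch (an addition, not part of the original example) that solves the same ordinary least squares problem directly with NumPy:

```python
import numpy as np

# Same training data as above, with a leading column of ones for the intercept
X = np.array([[1, 6, 2], [1, 8, 1], [1, 10, 0], [1, 14, 2], [1, 18, 0]])
y = np.array([7, 9, 13, 17.5, 18])

# Ordinary least squares: find beta minimizing ||X @ beta - y||^2
beta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # [intercept, coefficient for diameter, coefficient for toppings]
```

The solution vector should match `model.intercept_` and `model.coef_` from the fitted scikit-learn model, since `LinearRegression` solves this same least squares problem.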
Polynomial Regression
So far we have assumed that the relationship between the explanatory variable and the response variable is linear. In this section we use polynomial regression to model a curvilinear relationship. To keep the visualization simple, we again use pizza diameter as the only explanatory variable.
Training data:

Training instance | Diameter (inches) | Price (dollars) |
---|---|---|
1 | 6 | 7 |
2 | 8 | 9 |
3 | 10 | 13 |
4 | 14 | 17.5 |
5 | 18 | 18 |
Test data:

Test instance | Diameter (inches) | Price (dollars) |
---|---|---|
1 | 6 | 8 |
2 | 8 | 12 |
3 | 11 | 15 |
4 | 16 | 18 |
The quadratic (second-degree) polynomial regression formula is:

$$y = \alpha + \beta_1 x + \beta_2 x^2$$
The PolynomialFeatures transformer can be used to generate polynomial features from a single feature. We fit a model on these features and compare it with the simple linear regression model.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X_train = [[6], [8], [10], [14], [18]]
y_train = [[7], [9], [13], [17.5], [18]]
X_test = [[6], [8], [11], [16]]
y_test = [[8], [12], [15], [18]]

# Fit a simple linear regression model as the baseline
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Generate quadratic features (1, x, x^2) and fit a second model on them
quadratic_featurizer = PolynomialFeatures(degree=2)
X_train_quadratic = quadratic_featurizer.fit_transform(X_train)
X_test_quadratic = quadratic_featurizer.transform(X_test)
regressor_quadratic = LinearRegression()
regressor_quadratic.fit(X_train_quadratic, y_train)

# Curves to draw on the figure
xx = np.linspace(0, 26, 100)
yy = regressor.predict(xx.reshape(xx.shape[0], 1))
plt.plot(xx, yy)
xx_quadratic = quadratic_featurizer.transform(xx.reshape(xx.shape[0], 1))
plt.plot(xx, regressor_quadratic.predict(xx_quadratic), c='r', linestyle='--')
plt.title('Pizza price regressed on diameter')
plt.xlabel('Diameter in inches')
plt.ylabel('Price in dollars')
plt.axis([0, 25, 0, 25])
plt.grid(True)
plt.scatter(X_train, y_train)
plt.show()

print(X_train)
print(X_train_quadratic)
print(X_test)
print(X_test_quadratic)
print('Simple linear regression r-squared', regressor.score(X_test, y_test))
print('Quadratic regression r-squared', regressor_quadratic.score(X_test_quadratic, y_test))
```
![Pizza price regressed on diameter: linear fit (solid) vs. quadratic fit (dashed)](output_3_0.png)
```
[[6], [8], [10], [14], [18]]
[[  1.   6.  36.]
 [  1.   8.  64.]
 [  1.  10. 100.]
 [  1.  14. 196.]
 [  1.  18. 324.]]
[[6], [8], [11], [16]]
[[  1.   6.  36.]
 [  1.   8.  64.]
 [  1.  11. 121.]
 [  1.  16. 256.]]
Simple linear regression r-squared 0.809726797707665
Quadratic regression r-squared 0.8675443656345054
```
The quadratic regression's coefficient of determination is 0.87, better than that of simple linear regression. Can we do better by continuing to increase the degree? Let's try a much higher-degree polynomial: degree 9.
```python
# Generate degree-9 polynomial features and fit a third model on them
quadratic_featurizer_9 = PolynomialFeatures(degree=9)
X_train_quadratic_9 = quadratic_featurizer_9.fit_transform(X_train)
X_test_quadratic_9 = quadratic_featurizer_9.transform(X_test)
regressor_quadratic_9 = LinearRegression()
regressor_quadratic_9.fit(X_train_quadratic_9, y_train)

# Curves to draw on the figure
xx = np.linspace(0, 26, 100)
yy = regressor.predict(xx.reshape(xx.shape[0], 1))
plt.plot(xx, yy)
xx_quadratic_9 = quadratic_featurizer_9.transform(xx.reshape(xx.shape[0], 1))
plt.plot(xx, regressor_quadratic_9.predict(xx_quadratic_9), c='r', linestyle='--')
plt.title('Pizza price regressed on diameter')
plt.xlabel('Diameter in inches')
plt.ylabel('Price in dollars')
plt.axis([0, 25, 0, 25])
plt.grid(True)
plt.scatter(X_train, y_train)
plt.show()

print(X_train)
print(X_train_quadratic_9)
print(X_test)
print(X_test_quadratic_9)
print('Simple linear regression r-squared', regressor.score(X_test, y_test))
print('Quadratic regression 9 degree r-squared', regressor_quadratic_9.score(X_test_quadratic_9, y_test))
```
![Pizza price regressed on diameter: linear fit (solid) vs. degree-9 fit (dashed)](output_5_0.png)
```
[[6], [8], [10], [14], [18]]
[[1.00000000e+00 6.00000000e+00 3.60000000e+01 2.16000000e+02
  1.29600000e+03 7.77600000e+03 4.66560000e+04 2.79936000e+05
  1.67961600e+06 1.00776960e+07]
 [1.00000000e+00 8.00000000e+00 6.40000000e+01 5.12000000e+02
  4.09600000e+03 3.27680000e+04 2.62144000e+05 2.09715200e+06
  1.67772160e+07 1.34217728e+08]
 [1.00000000e+00 1.00000000e+01 1.00000000e+02 1.00000000e+03
  1.00000000e+04 1.00000000e+05 1.00000000e+06 1.00000000e+07
  1.00000000e+08 1.00000000e+09]
 [1.00000000e+00 1.40000000e+01 1.96000000e+02 2.74400000e+03
  3.84160000e+04 5.37824000e+05 7.52953600e+06 1.05413504e+08
  1.47578906e+09 2.06610468e+10]
 [1.00000000e+00 1.80000000e+01 3.24000000e+02 5.83200000e+03
  1.04976000e+05 1.88956800e+06 3.40122240e+07 6.12220032e+08
  1.10199606e+10 1.98359290e+11]]
[[6], [8], [11], [16]]
[[1.00000000e+00 6.00000000e+00 3.60000000e+01 2.16000000e+02
  1.29600000e+03 7.77600000e+03 4.66560000e+04 2.79936000e+05
  1.67961600e+06 1.00776960e+07]
 [1.00000000e+00 8.00000000e+00 6.40000000e+01 5.12000000e+02
  4.09600000e+03 3.27680000e+04 2.62144000e+05 2.09715200e+06
  1.67772160e+07 1.34217728e+08]
 [1.00000000e+00 1.10000000e+01 1.21000000e+02 1.33100000e+03
  1.46410000e+04 1.61051000e+05 1.77156100e+06 1.94871710e+07
  2.14358881e+08 2.35794769e+09]
 [1.00000000e+00 1.60000000e+01 2.56000000e+02 4.09600000e+03
  6.55360000e+04 1.04857600e+06 1.67772160e+07 2.68435456e+08
  4.29496730e+09 6.87194767e+10]]
Simple linear regression r-squared 0.809726797707665
Quadratic regression 9 degree r-squared -0.09435666704315328
```
The degree-9 model fits the training data almost perfectly! However, its coefficient of determination on the test set is -0.09. A model that fits the training data accurately but fails to approximate the true relationship is said to overfit. Regularization is commonly used to prevent overfitting.
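As a brief illustration of regularization (an addition; the original text only mentions it), the sketch below fits scikit-learn's `Ridge` on the same degree-9 features. The penalty strength `alpha=1.0` is an arbitrary choice for the example, and in practice the features would also be scaled before penalizing them:

```python
from sklearn.linear_model import Ridge

# Ridge adds an L2 penalty on coefficient sizes to the least squares cost,
# discouraging the wild high-degree coefficients that caused overfitting
ridge_9 = Ridge(alpha=1.0)
ridge_9.fit(X_train_quadratic_9, y_train)
print('Ridge degree-9 r-squared', ridge_9.score(X_test_quadratic_9, y_test))
```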
Applying Linear Regression
Let's look at a real example. The wine quality dataset from the UCI Machine Learning Repository contains 11 physicochemical attributes for 1,599 red wines. The dataset can be downloaded from http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/.
Exploring the Data
First, load the dataset and perform a simple analysis.
```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
df = pd.read_csv('./winequality-red.csv', sep=';')
df.describe()

# Plot alcohol content against quality
plt.scatter(df['alcohol'], df['quality'])
plt.xlabel('Alcohol')
plt.ylabel('Quality')
plt.title('Alcohol Against Quality')
plt.show()
```
![Alcohol Against Quality](output_9_0.png)
The figure above suggests a weak positive correlation between alcohol content and quality: wines with higher alcohol content tend to receive higher quality ratings.
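To put a number on that visual impression (an addition to the original analysis), we can compute the Pearson correlation coefficient between the two columns with pandas:

```python
# Pearson correlation between alcohol content and quality rating
print(df['alcohol'].corr(df['quality']))
```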
Fitting and Evaluating the Model
We split the data into training and test sets, train the regressor, and evaluate its predictions.
```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# The last column, quality, is the response variable; the rest are features
X = df[list(df.columns)[:-1]]
y = df['quality']
X_train, X_test, y_train, y_test = train_test_split(X, y)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_predictions = regressor.predict(X_test)
print('R-squared: %s' % regressor.score(X_test, y_test))
```
```
R-squared: 0.40863217504719407
```
We split the data into training and test sets above, and the resulting coefficient of determination is 0.41. That figure depends on how the random split happened to fall, so next we use cross-validation to produce a better estimate of the regressor's performance.
```python
from sklearn.model_selection import cross_val_score

# Score the model on 5 different train/test partitions of the data
regressor = LinearRegression()
scores = cross_val_score(regressor, X, y, cv=5)
print(scores.mean())
print(scores)
```

```
0.2900416288421962
[0.13200871 0.31858135 0.34955348 0.369145   0.2809196 ]
```
Gradient Descent
Gradient descent is widely used in machine learning; it iteratively moves toward, and converges to, a minimum of an objective function.

The basic idea of gradient descent can be likened to descending a mountain. Imagine a person trapped on a mountain who needs to get down to the valley, the lowest point. Heavy fog limits visibility, so the path down cannot be seen, and the person must use local information to find the way one step at a time. This is where the gradient descent strategy helps. Starting from the current position, find the steepest downhill direction at that position and take a step that way; then, from the new position, again find the steepest descent and take another step, repeating until the lowest point is reached.
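To connect the analogy to code, here is a minimal sketch (an addition; the learning rate and iteration count are arbitrary choices) of batch gradient descent fitting a simple linear model $y = \alpha + \beta x$ to the pizza data by minimizing the mean squared error:

```python
import numpy as np

# Pizza training data: diameter and price
x = np.array([6, 8, 10, 14, 18], dtype=float)
y = np.array([7, 9, 13, 17.5, 18])

alpha, beta = 0.0, 0.0   # intercept and slope, starting at the origin
learning_rate = 0.001    # step size: too large diverges, too small is slow
for _ in range(100000):
    error = alpha + beta * x - y           # prediction error on all instances
    grad_alpha = 2 * error.mean()          # d(MSE)/d(alpha)
    grad_beta = 2 * (error * x).mean()     # d(MSE)/d(beta)
    alpha -= learning_rate * grad_alpha    # step against the gradient
    beta -= learning_rate * grad_beta
print(alpha, beta)  # should approach the ordinary least squares solution
```

Each iteration computes the slope of the cost surface at the current parameters and steps downhill, just as the hiker in the analogy feels for the steepest descent before each step.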
The SGDRegressor class in scikit-learn is an implementation of stochastic gradient descent; it can be used to optimize different cost functions in order to fit different linear models. Let's look at an example that predicts Boston housing prices.
```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split, cross_val_score

# Load the dataset directly from the library
data = load_boston()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)

# Standardize the data; ravel() flattens y back to the 1-d shape sklearn expects
X_scaler = StandardScaler()
y_scaler = StandardScaler()
X_train = X_scaler.fit_transform(X_train)
y_train = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()
X_test = X_scaler.transform(X_test)
y_test = y_scaler.transform(y_test.reshape(-1, 1)).ravel()

regressor = SGDRegressor(loss='squared_loss')
scores = cross_val_score(regressor, X_train, y_train, cv=5)
print('Cross validation r-squared scores: %s' % scores)
print('Average cross validation r-squared score: %s' % np.mean(scores))
regressor.fit(X_train, y_train)
print('Test set r-squared score %s' % regressor.score(X_test, y_test))
```
```
Cross validation r-squared scores: [0.71427658 0.74428569 0.6566014  0.58914921 0.79699039]
Average cross validation r-squared score: 0.7002606532660949
Test set r-squared score 0.7451204468940292
```