scikit-learn Machine Learning: Multiple Linear Regression

Author: 简单一点点 | Published 2020-02-06 22:08

    Multiple Linear Regression

    The simple linear regression covered earlier uses only one explanatory variable. When there are several explanatory variables we need multiple linear regression, whose model is:

    y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n
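
    Written in matrix form, with a column of ones added to X for the intercept, ordinary least squares chooses the coefficients that minimize the sum of squared residuals. As a sketch of what LinearRegression computes (assuming X^T X is invertible), the textbook closed-form solution is:

    \beta = (X^{T} X)^{-1} X^{T} y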

    Let's look at a new example: we extend the earlier pizza example with a variable for the number of toppings.

    Training data:

    Training instance | Diameter (inches) | Number of toppings | Price (dollars)
    1                 | 6                 | 2                  | 7
    2                 | 8                 | 1                  | 9
    3                 | 10                | 0                  | 13
    4                 | 14                | 2                  | 17.5
    5                 | 18                | 0                  | 18

    Test data:

    Test instance | Diameter (inches) | Number of toppings | Price (dollars)
    1             | 8                 | 2                  | 11
    2             | 9                 | 0                  | 8.5
    3             | 11                | 2                  | 15
    4             | 16                | 2                  | 18
    5             | 12                | 0                  | 11

    Now we use both explanatory variables to predict pizza prices.

    from sklearn.linear_model import LinearRegression
    
    X = [[6, 2], [8, 1], [10, 0], [14, 2], [18, 0]]
    y = [[7], [9], [13], [17.5], [18]]
    model = LinearRegression()
    model.fit(X, y)
    X_test = [[8, 2], [9, 0], [11, 2], [16, 2], [12, 0]]
    y_test = [[11], [8.5], [15], [18], [11]]
    
    predictions = model.predict(X_test)
    
    for i, prediction in enumerate(predictions):
        print('Predicted: %s, Target: %s' % (prediction, y_test[i]))
    print('R-squared: %.2f' % model.score(X_test, y_test))
    
    Predicted: [10.0625], Target: [11]
    Predicted: [10.28125], Target: [8.5]
    Predicted: [13.09375], Target: [15]
    Predicted: [18.14583333], Target: [18]
    Predicted: [13.3125], Target: [11]
    R-squared: 0.77
    

    Clearly, adding the number of toppings improved the model's performance.
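
    To connect the fitted model back to the formula above, we can inspect the learned intercept and coefficients; a minimal sketch, continuing from the model fitted in the code above:

    # intercept_ corresponds to alpha; coef_ holds beta_1 (diameter) and beta_2 (toppings)
    print('Intercept (alpha): %s' % model.intercept_)
    print('Coefficients (beta_1, beta_2): %s' % model.coef_)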

    Polynomial Regression

    So far we have assumed that the relationship between the explanatory variables and the response variable is linear. In this section we use polynomial regression instead.

    For ease of visualization, we again use pizza diameter as the only explanatory variable.

    Training data:

    Training instance | Diameter (inches) | Price (dollars)
    1                 | 6                 | 7
    2                 | 8                 | 9
    3                 | 10                | 13
    4                 | 14                | 17.5
    5                 | 18                | 18

    Test data:

    Test instance | Diameter (inches) | Price (dollars)
    1             | 6                 | 8
    2             | 8                 | 12
    3             | 11                | 15
    4             | 16                | 18

    The second-degree (quadratic) polynomial regression formula is:

    y = \alpha + \beta_1 x + \beta_2 x^2

    The PolynomialFeatures transformer can be used to derive polynomial features from a single feature. We fit a model on these features and compare it with the simple linear regression model.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    
    X_train = [[6], [8], [10], [14], [18]]
    y_train = [[7], [9], [13], [17.5], [18]]
    X_test = [[6], [8], [11], [16]]
    y_test = [[8], [12], [15], [18]]
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    
    quadratic_featurizer = PolynomialFeatures(degree=2)
    X_train_quadratic = quadratic_featurizer.fit_transform(X_train)
    X_test_quadratic = quadratic_featurizer.transform(X_test)
    regressor_quadratic = LinearRegression()
    regressor_quadratic.fit(X_train_quadratic, y_train)
    
    # Plot the fitted lines over the data range
    xx = np.linspace(0, 26, 100)
    yy = regressor.predict(xx.reshape(xx.shape[0], 1))
    plt.plot(xx, yy)
    xx_quadratic = quadratic_featurizer.transform(xx.reshape(xx.shape[0], 1))
    plt.plot(xx, regressor_quadratic.predict(xx_quadratic), c='r', linestyle='--')
    plt.title('Pizza price regressed on diameter')
    plt.xlabel('Diameter in inches')
    plt.ylabel('Price in dollars')
    plt.axis([0, 25, 0, 25])
    plt.grid(True)
    plt.scatter(X_train, y_train)
    plt.show()
    
    print(X_train)
    print(X_train_quadratic)
    print(X_test)
    print(X_test_quadratic)
    print('Simple linear regression r-squared', regressor.score(X_test, y_test))
    print('Quadratic regression r-squared', regressor_quadratic.score(X_test_quadratic, y_test))
    
    [Figure: pizza price regressed on diameter, showing the simple linear fit (solid) and the quadratic fit (dashed)]
    [[6], [8], [10], [14], [18]]
    [[  1.   6.  36.]
     [  1.   8.  64.]
     [  1.  10. 100.]
     [  1.  14. 196.]
     [  1.  18. 324.]]
    [[6], [8], [11], [16]]
    [[  1.   6.  36.]
     [  1.   8.  64.]
     [  1.  11. 121.]
     [  1.  16. 256.]]
    Simple linear regression r-squared 0.809726797707665
    Quadratic regression r-squared 0.8675443656345054
    

    The quadratic regression's coefficient of determination is 0.87, better than that of simple linear regression. We can keep increasing the degree; let's try a higher-order polynomial, say degree 9.

    quadratic_featurizer_9 = PolynomialFeatures(degree=9)
    X_train_quadratic_9 = quadratic_featurizer_9.fit_transform(X_train)
    X_test_quadratic_9 = quadratic_featurizer_9.transform(X_test)
    regressor_quadratic_9 = LinearRegression()
    regressor_quadratic_9.fit(X_train_quadratic_9, y_train)
    
    # Plot the fitted lines over the data range
    xx = np.linspace(0, 26, 100)
    yy = regressor.predict(xx.reshape(xx.shape[0], 1))
    plt.plot(xx, yy)
    xx_quadratic_9 = quadratic_featurizer_9.transform(xx.reshape(xx.shape[0], 1))
    plt.plot(xx, regressor_quadratic_9.predict(xx_quadratic_9), c='r', linestyle='--')
    plt.title('Pizza price regressed on diameter')
    plt.xlabel('Diameter in inches')
    plt.ylabel('Price in dollars')
    plt.axis([0, 25, 0, 25])
    plt.grid(True)
    plt.scatter(X_train, y_train)
    plt.show()
    
    print(X_train)
    print(X_train_quadratic_9)
    print(X_test)
    print(X_test_quadratic_9)
    print('Simple linear regression r-squared', regressor.score(X_test, y_test))
    print('Quadratic regression 9 degree r-squared', regressor_quadratic_9.score(X_test_quadratic_9, y_test))
    
    [Figure: pizza price regressed on diameter, showing the simple linear fit (solid) and the degree-9 polynomial fit (dashed)]
    [[6], [8], [10], [14], [18]]
    [[1.00000000e+00 6.00000000e+00 3.60000000e+01 2.16000000e+02
      1.29600000e+03 7.77600000e+03 4.66560000e+04 2.79936000e+05
      1.67961600e+06 1.00776960e+07]
     [1.00000000e+00 8.00000000e+00 6.40000000e+01 5.12000000e+02
      4.09600000e+03 3.27680000e+04 2.62144000e+05 2.09715200e+06
      1.67772160e+07 1.34217728e+08]
     [1.00000000e+00 1.00000000e+01 1.00000000e+02 1.00000000e+03
      1.00000000e+04 1.00000000e+05 1.00000000e+06 1.00000000e+07
      1.00000000e+08 1.00000000e+09]
     [1.00000000e+00 1.40000000e+01 1.96000000e+02 2.74400000e+03
      3.84160000e+04 5.37824000e+05 7.52953600e+06 1.05413504e+08
      1.47578906e+09 2.06610468e+10]
     [1.00000000e+00 1.80000000e+01 3.24000000e+02 5.83200000e+03
      1.04976000e+05 1.88956800e+06 3.40122240e+07 6.12220032e+08
      1.10199606e+10 1.98359290e+11]]
    [[6], [8], [11], [16]]
    [[1.00000000e+00 6.00000000e+00 3.60000000e+01 2.16000000e+02
      1.29600000e+03 7.77600000e+03 4.66560000e+04 2.79936000e+05
      1.67961600e+06 1.00776960e+07]
     [1.00000000e+00 8.00000000e+00 6.40000000e+01 5.12000000e+02
      4.09600000e+03 3.27680000e+04 2.62144000e+05 2.09715200e+06
      1.67772160e+07 1.34217728e+08]
     [1.00000000e+00 1.10000000e+01 1.21000000e+02 1.33100000e+03
      1.46410000e+04 1.61051000e+05 1.77156100e+06 1.94871710e+07
      2.14358881e+08 2.35794769e+09]
     [1.00000000e+00 1.60000000e+01 2.56000000e+02 4.09600000e+03
      6.55360000e+04 1.04857600e+06 1.67772160e+07 2.68435456e+08
      4.29496730e+09 6.87194767e+10]]
    Simple linear regression r-squared 0.809726797707665
    Quadratic regression 9 degree r-squared -0.09435666704315328
    

    The degree-9 model fits the training data almost perfectly! However, its coefficient of determination on the test set is -0.09. A model that fits the training data accurately but fails to approximate the true relationship is said to overfit. Regularization is commonly used to prevent overfitting.
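
    As a quick illustration of regularization (not part of the original walkthrough), ridge regression adds an L2 penalty that shrinks large coefficients; a minimal sketch reusing the degree-9 features from above, with an arbitrarily chosen penalty strength:

    from sklearn.linear_model import Ridge

    # alpha controls the strength of the L2 penalty; 1.0 is only an illustrative value
    ridge_9 = Ridge(alpha=1.0)
    ridge_9.fit(X_train_quadratic_9, y_train)
    print('Ridge degree-9 r-squared', ridge_9.score(X_test_quadratic_9, y_test))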

    Applying Linear Regression

    Now let's look at a real example. The wine quality dataset from the UCI (University of California, Irvine) Machine Learning Repository contains 11 physicochemical attributes for 1599 red wines. It can be downloaded from http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/ .

    Exploring the Data

    First we load the dataset and do some simple analysis.

    import pandas as pd
    
    # Load the dataset
    df = pd.read_csv('./winequality-red.csv', sep=';')
    df.describe()
    
    import matplotlib.pyplot as plt
    
    plt.scatter(df['alcohol'], df['quality'])
    plt.xlabel('Alcohol')
    plt.ylabel('Quality')
    plt.title('Alcohol Against Quality')
    plt.show()
    
    [Figure: scatter plot of alcohol against quality]

    The scatter plot shows a weak positive correlation between alcohol content and quality: wines with higher alcohol content tend to be of higher quality.
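
    To put a number on this, we can compute the correlation of each attribute with quality; a minimal sketch using pandas, assuming df is loaded as above:

    # Pearson correlation of every column with quality; alcohol typically shows
    # the strongest positive correlation in this dataset
    print(df.corr()['quality'].sort_values(ascending=False))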

    Fitting and Evaluating the Model

    We split the data into training and test sets, train the regressor, and evaluate its predictive power.

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split
    
    X = df[list(df.columns)[:-1]]
    y = df['quality']
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    y_predictions = regressor.predict(X_test)
    print('R-squared: %s' % regressor.score(X_test, y_test))
    
    R-squared: 0.40863217504719407
    

    Above, the data was split into training and test sets, and the resulting coefficient of determination is 0.41. Next we use cross-validation to produce a better estimate of the estimator's performance.

    from sklearn.model_selection import cross_val_score
    
    regressor = LinearRegression()
    scores = cross_val_score(regressor, X, y, cv=5)
    print(scores.mean())
    print(scores)
    
    0.2900416288421962
    [0.13200871 0.31858135 0.34955348 0.369145   0.2809196 ]
    

    Gradient Descent

    Gradient descent is widely used in machine learning. It iteratively finds the minimum of an objective function, or converges toward it.

    The basic idea of gradient descent can be pictured as walking down a mountain. Imagine someone stranded on a mountainside who needs to get down, that is, to find the lowest point, the valley. Thick fog limits visibility, so the full path down cannot be planned in advance; they must use local information to find the way one step at a time. Gradient descent works the same way: from the current position, find the direction of steepest descent, take a step in that direction, then repeat from the new position until the lowest point is reached.
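
    To make the idea concrete, here is a minimal NumPy sketch (not from the original text) of batch gradient descent fitting a simple linear regression to the pizza data; the learning rate and iteration count are arbitrary illustrative choices:

    import numpy as np

    # Pizza diameters and prices from the earlier example
    x = np.array([6, 8, 10, 14, 18], dtype=float)
    y = np.array([7, 9, 13, 17.5, 18])

    alpha, beta = 0.0, 0.0   # intercept and slope, initialized at zero
    lr = 0.001               # learning rate (step size)
    for _ in range(100000):  # number of iterations
        error = alpha + beta * x - y
        # Gradients of the mean squared error with respect to alpha and beta
        grad_alpha = 2 * error.mean()
        grad_beta = 2 * (error * x).mean()
        # Step in the direction of steepest descent
        alpha -= lr * grad_alpha
        beta -= lr * grad_beta

    print('Intercept:', alpha, 'Slope:', beta)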

    The SGDRegressor class in scikit-learn is an implementation of stochastic gradient descent; it can be used to optimize different cost functions and fit different models. Let's look at an example of predicting Boston house prices.

    from sklearn.linear_model import SGDRegressor
    from sklearn.preprocessing import StandardScaler
    from sklearn.datasets import load_boston
    
    # Load the dataset directly from scikit-learn
    data = load_boston()
    X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)
    # Standardize the features and the target
    X_scaler = StandardScaler()
    y_scaler = StandardScaler()
    X_train = X_scaler.fit_transform(X_train)
    y_train = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()
    X_test = X_scaler.transform(X_test)
    y_test = y_scaler.transform(y_test.reshape(-1, 1)).ravel()
    
    regressor = SGDRegressor(loss='squared_loss')
    scores = cross_val_score(regressor, X_train, y_train, cv=5)
    print('Cross validation r-squared scores: %s' % scores)
    print('Average cross validation r-squared score: %s' % np.mean(scores))
    regressor.fit(X_train, y_train)
    print('Test set r-squared score %s' % regressor.score(X_test, y_test))
    
    Cross validation r-squared scores: [0.71427658 0.74428569 0.6566014  0.58914921 0.79699039]
    Average cross validation r-squared score: 0.7002606532660949
    Test set r-squared score 0.7451204468940292
    
    
