03. Linear Regression

Author: __豆约翰__ | Published: 2019-07-18 07:34

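    Introduction to the Linear Regression Algorithm

    Linear regression treats one dimension of a coordinate system as the output and the remaining dimensions as features (in a two-dimensional plane, for example, the horizontal axis is the feature and the vertical axis is the output). When a large number of training samples are plotted in this coordinate system, they turn out to be distributed around a straight line. The goal of linear regression is to find the straight line that best "fits" the relationship between the sample features and the sample output labels.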

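    When each sample has only one feature, the problem is called simple linear regression; a typical example is predicting house price from floor area.

    Put the feature on the x-axis and the output on the y-axis, so that each sample is a point $(x^{(i)}, y^{(i)})$. The line we are looking for is $y = ax + b$, and when a new point $x^{(j)}$ is given, the prediction is $\hat{y}^{(j)} = a x^{(j)} + b$.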

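    The line should make every prediction $\hat{y}^{(i)}$ as close as possible to the true value $y^{(i)}$. Two obvious ways of measuring the total gap are ruled out:
    • Summing the raw differences $y^{(i)} - \hat{y}^{(i)}$ does not work, because positive and negative differences cancel out.
    • Summing the absolute values $|y^{(i)} - \hat{y}^{(i)}|$ is avoided, because the absolute-value function has a point where it is not differentiable.
    Squaring the differences avoids both problems, so the objective is to find $a$ and $b$ that minimize $\sum_{i=1}^{m} (y^{(i)} - \hat{y}^{(i)})^2$.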
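    The two bullets above can be checked numerically. The sketch below is an illustration only: it reuses the five y values that appear later in this article and, as an assumption made purely for the example, takes a deliberately poor "model" that predicts the mean of y for every sample. The raw differences sum to (approximately) zero even though the fit is bad, while the squared differences do not cancel.

```python
import numpy as np

# Sample outputs (the same five y values used later in this article).
y = np.array([1., 3., 2., 3., 5.])

# A deliberately poor "model": predict the mean of y for every sample.
y_hat = np.full_like(y, y.mean())

raw_sum = np.sum(y - y_hat)             # positive and negative gaps cancel
abs_sum = np.sum(np.abs(y - y_hat))     # no cancellation, but |.| has a kink at 0
squared_sum = np.sum((y - y_hat) ** 2)  # no cancellation, differentiable everywhere

print(raw_sum, abs_sum, squared_sum)
```

    Only the squared sum is both free of cancellation and smooth, which is what makes the closed-form minimization possible.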

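    The reasoning above suggests a basic recipe shared by a whole class of machine learning algorithms. A loss function measures the gap between the true values and the predictions, and training tries to make that gap (the loss) as small as possible. A utility function measures how well the model fits, and training tries to make that fit as good as possible.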

    Deriving the Least Squares Solution for Simple Linear Regression
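    The original derivation was given as images that did not survive. The following is a reconstruction under the assumption that the objective is the sum of squared errors; it is consistent with the formulas for a and b used in the code in the next section. Setting both partial derivatives of the objective to zero gives:

```latex
\begin{aligned}
J(a, b) &= \sum_{i=1}^{m} \left( y^{(i)} - a x^{(i)} - b \right)^2 \\[4pt]
\frac{\partial J}{\partial b} &= -2 \sum_{i=1}^{m} \left( y^{(i)} - a x^{(i)} - b \right) = 0
  \;\Longrightarrow\; b = \bar{y} - a \bar{x} \\[4pt]
\frac{\partial J}{\partial a} &= -2 \sum_{i=1}^{m} \left( y^{(i)} - a x^{(i)} - b \right) x^{(i)} = 0
  \;\Longrightarrow\; a
  = \frac{\sum_{i=1}^{m} \left( x^{(i)} y^{(i)} - x^{(i)} \bar{y} \right)}
         {\sum_{i=1}^{m} \left( (x^{(i)})^2 - \bar{x} x^{(i)} \right)}
  = \frac{\sum_{i=1}^{m} (x^{(i)} - \bar{x})(y^{(i)} - \bar{y})}
         {\sum_{i=1}^{m} (x^{(i)} - \bar{x})^2}
\end{aligned}
```

    The last equality holds because $\sum_i \bar{x}\,(y^{(i)} - \bar{y}) = 0$ and $\sum_i \bar{x}\,(x^{(i)} - \bar{x}) = 0$, so subtracting these terms changes neither the numerator nor the denominator.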

    Implementing Simple Linear Regression

    import numpy as np
    import matplotlib.pyplot as plt
    
    x = np.array([1., 2., 3., 4., 5.])
    y = np.array([1., 3., 2., 3., 5.])
    
    plt.scatter(x, y)
    plt.axis([0, 6, 0, 6])
    plt.show()
    
    x_mean = np.mean(x)
    y_mean = np.mean(y)
    
    num = 0.0
    d = 0.0
    for x_i, y_i in zip(x, y):
        num += (x_i - x_mean) * (y_i - y_mean)
        d += (x_i - x_mean) ** 2
    
    a = num/d
    
    b = y_mean - a * x_mean
    
    y_hat = a * x + b
    
    plt.scatter(x, y)
    plt.plot(x, y_hat, color='r')
    plt.axis([0, 6, 0, 6])
    plt.show()
    
    x_predict = 6
    y_predict = a * x_predict + b
    y_predict
    
    5.2000000000000002
    
    Wrapping Our Own SimpleLinearRegression

    Code: SimpleLinearRegression.py

    import numpy as np


    class SimpleLinearRegression1:

        def __init__(self):
            """Initialize the Simple Linear Regression model"""
            self.a_ = None
            self.b_ = None

        def fit(self, x_train, y_train):
            """Train the Simple Linear Regression model on x_train, y_train"""
            assert x_train.ndim == 1,\
                "Simple Linear Regression can only solve single feature training data"
            assert len(x_train) == len(y_train),\
                "the size of x_train must be equal to the size of y_train"

            ## means of x and y
            x_mean = x_train.mean()
            y_mean = y_train.mean()

            ## numerator
            num = 0.0
            ## denominator
            d = 0.0

            ## accumulate the numerator and the denominator
            for x_i, y_i in zip(x_train, y_train):
                num += (x_i-x_mean)*(y_i-y_mean)
                d += (x_i-x_mean) ** 2

            ## compute the parameters a and b
            self.a_ = num/d
            self.b_ = y_mean - self.a_ * x_mean

            return self

        def predict(self, x_predict):
            """Given the data set x_predict, return the vector of predictions for it"""
            assert x_predict.ndim == 1,\
                "Simple Linear Regression can only solve single feature training data"
            assert self.a_ is not None and self.b_ is not None,\
                "must fit before predict!"

            return np.array([self._predict(x) for x in x_predict])

        def _predict(self, x_single):
            """Given a single sample x_single, return its predicted value"""
            return self.a_*x_single+self.b_

        def __repr__(self):
            return "SimpleLinearRegression1()"
    
    
    
    from playML.SimpleLinearRegression import SimpleLinearRegression1
    
    reg1 = SimpleLinearRegression1()
    reg1.fit(x, y)
    reg1.predict(np.array([x_predict]))
    
    array([ 5.2])
    
    reg1.a_
    
    0.80000000000000004
    
    reg1.b_
    
    0.39999999999999947
    
    y_hat1 = reg1.predict(x)
    
    plt.scatter(x, y)
    plt.plot(x, y_hat1, color='r')
    plt.axis([0, 6, 0, 6])
    plt.show()
    

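    Vectorization

    The numerator and the denominator of $a$ are both sums of element-wise products, so each one can be written as the dot product of two vectors. NumPy evaluates a dot product far faster than an explicit Python loop.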
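    To make the equivalence concrete, the sketch below computes the numerator and the denominator of a both ways on the five sample points used earlier and checks that the results agree. The variable names num_loop, d_loop, num_vec and d_vec are chosen for this illustration only.

```python
import numpy as np

x = np.array([1., 2., 3., 4., 5.])
y = np.array([1., 3., 2., 3., 5.])
x_mean, y_mean = x.mean(), y.mean()

# Loop version: accumulate the numerator and denominator one sample at a time.
num_loop, d_loop = 0.0, 0.0
for x_i, y_i in zip(x, y):
    num_loop += (x_i - x_mean) * (y_i - y_mean)
    d_loop += (x_i - x_mean) ** 2

# Vectorized version: the same two sums written as dot products.
num_vec = (x - x_mean).dot(y - y_mean)
d_vec = (x - x_mean).dot(x - x_mean)

a = num_vec / d_vec
b = y_mean - a * x_mean

print(np.isclose(num_loop, num_vec), np.isclose(d_loop, d_vec))
print(a, b)
```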

    Vectorized Implementation of SimpleLinearRegression

    Code: SimpleLinearRegression.py

    import numpy as np
    
    
    class SimpleLinearRegression2:
    
        def __init__(self):
            """Initialize the Simple Linear Regression model"""
            self.a_ = None
            self.b_ = None
    
        def fit(self, x_train, y_train):
            """Train the Simple Linear Regression model on the training set x_train, y_train"""
            assert x_train.ndim == 1, \
                "Simple Linear Regressor can only solve single feature training data."
            assert len(x_train) == len(y_train), \
                "the size of x_train must be equal to the size of y_train"
    
            x_mean = np.mean(x_train)
            y_mean = np.mean(y_train)
    
            self.a_ = (x_train - x_mean).dot(y_train - y_mean) / (x_train - x_mean).dot(x_train - x_mean)
            self.b_ = y_mean - self.a_ * x_mean
    
            return self
    
        def predict(self, x_predict):
            """Given the data set x_predict, return the vector of predictions for it"""
            assert x_predict.ndim == 1, \
                "Simple Linear Regressor can only solve single feature training data."
            assert self.a_ is not None and self.b_ is not None, \
                "must fit before predict!"
    
            return np.array([self._predict(x) for x in x_predict])
    
        def _predict(self, x_single):
            """Given a single sample x_single, return its predicted value"""
            return self.a_ * x_single + self.b_
    
        def __repr__(self):
            return "SimpleLinearRegression2()"
    
    
    from playML.SimpleLinearRegression import SimpleLinearRegression2
    
    reg2 = SimpleLinearRegression2()
    reg2.fit(x, y)
    reg2.predict(np.array([x_predict]))
    
    array([ 5.2])
    
    reg2.a_
    
    0.80000000000000004
    
    reg2.b_
    
    0.39999999999999947
    
    Performance Test of the Vectorized Implementation
    m = 1000000
    big_x = np.random.random(size=m)
    big_y = big_x * 2 + 3 + np.random.normal(size=m)
    %timeit reg1.fit(big_x, big_y)
    %timeit reg2.fit(big_x, big_y)
    
    1 loop, best of 3: 984 ms per loop
    100 loops, best of 3: 18.7 ms per loop
    
    reg1.a_
    
    1.9998479120324177
    
    reg1.b_
    
    2.9989427131166595
    
    reg2.a_
    
    1.9998479120324153
    
    reg2.b_
    
    2.9989427131166604
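    %timeit is an IPython magic and does not run in a plain Python script. As a rough alternative, the sketch below times both approaches with time.perf_counter. The helper names fit_loop and fit_vectorized are hypothetical stand-ins for the two fit methods above, and the sample count is reduced so the loop finishes quickly; actual timings depend on the machine.

```python
import time
import numpy as np


def fit_loop(x, y):
    """Least squares slope and intercept, accumulated in a Python loop."""
    x_mean, y_mean = x.mean(), y.mean()
    num, d = 0.0, 0.0
    for x_i, y_i in zip(x, y):
        num += (x_i - x_mean) * (y_i - y_mean)
        d += (x_i - x_mean) ** 2
    a = num / d
    return a, y_mean - a * x_mean


def fit_vectorized(x, y):
    """Least squares slope and intercept, computed with dot products."""
    x_mean, y_mean = x.mean(), y.mean()
    a = (x - x_mean).dot(y - y_mean) / (x - x_mean).dot(x - x_mean)
    return a, y_mean - a * x_mean


rng = np.random.default_rng(0)
m = 200_000  # fewer samples than the 1,000,000 used above
big_x = rng.random(m)
big_y = big_x * 2 + 3 + rng.normal(size=m)

start = time.perf_counter()
a_loop, b_loop = fit_loop(big_x, big_y)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
a_vec, b_vec = fit_vectorized(big_x, big_y)
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f} s, vectorized: {vec_seconds:.4f} s")
print(a_loop, a_vec, b_loop, b_vec)
```

    Both versions should recover a slope near 2 and an intercept near 3, differing from each other only by floating-point rounding.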
    

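    Metrics for Evaluating Linear Regression

    Evaluation criterion

    A natural criterion is the sum of squared errors on the test set, $\sum_{i=1}^{m} (y_{test}^{(i)} - \hat{y}_{test}^{(i)})^2$. This value depends on the number of samples m: the more samples there are, the larger the accumulated error is likely to be, even though more data generally produces a better model. To remove the dependence on m, divide the sum by m, which gives the mean squared error (MSE).

    A drawback of MSE is that its units are wrong. If y is measured in units of 10,000 yuan, MSE is measured in (10,000 yuan) squared, which can be awkward to interpret. Taking the square root of MSE gives the root mean squared error (RMSE), and averaging the absolute errors gives the mean absolute error (MAE); both are in the same units as y.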
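    The formula images for this section did not survive. The definitions below are a reconstruction that matches the NumPy expressions used for mse_test, rmse_test and mae_test later in this article, with m denoting the number of test samples:

```latex
\begin{aligned}
\mathrm{MSE}  &= \frac{1}{m} \sum_{i=1}^{m} \left( y_{\text{test}}^{(i)} - \hat{y}_{\text{test}}^{(i)} \right)^2 \\[4pt]
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}}
               = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( y_{\text{test}}^{(i)} - \hat{y}_{\text{test}}^{(i)} \right)^2} \\[4pt]
\mathrm{MAE}  &= \frac{1}{m} \sum_{i=1}^{m} \left| y_{\text{test}}^{(i)} - \hat{y}_{\text{test}}^{(i)} \right|
\end{aligned}
```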

    Evaluating Regression Algorithms: MSE vs MAE

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets
    
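    Boston housing data
    ## Note: load_boston was removed in scikit-learn 1.2, so this call needs an older release
    boston = datasets.load_boston()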
    
    boston.keys()
    
    dict_keys(['data', 'target', 'feature_names', 'DESCR'])
    
    print(boston.DESCR)
    
        Boston House Prices dataset
        ===========================
        
        Notes
        ------
        Data Set Characteristics:  
        
            :Number of Instances: 506 
        
            :Number of Attributes: 13 numeric/categorical predictive
            
            :Median Value (attribute 14) is usually the target
        
            :Attribute Information (in order):
                - CRIM     per capita crime rate by town
                - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
                - INDUS    proportion of non-retail business acres per town
                - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
                - NOX      nitric oxides concentration (parts per 10 million)
                - RM       average number of rooms per dwelling
                - AGE      proportion of owner-occupied units built prior to 1940
                - DIS      weighted distances to five Boston employment centres
                - RAD      index of accessibility to radial highways
                - TAX      full-value property-tax rate per $10,000
                - PTRATIO  pupil-teacher ratio by town
                - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
                - LSTAT    % lower status of the population
                - MEDV     Median value of owner-occupied homes in $1000's
        
            :Missing Attribute Values: None
        
            :Creator: Harrison, D. and Rubinfeld, D.L.
        
        This is a copy of UCI ML housing dataset.
        http://archive.ics.uci.edu/ml/datasets/Housing
    
        This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
    
        The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
        prices and the demand for clean air', J. Environ. Economics & Management,
        vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
        ...', Wiley, 1980.   N.B. Various transformations are used in the table on
        pages 244-261 of the latter.
        
        The Boston house-price data has been used in many machine learning papers that address regression
        problems.   
             
        **References**
        
           - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
           - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
           - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
    
    
    
    boston.feature_names
    
    array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
           'TAX', 'PTRATIO', 'B', 'LSTAT'], 
          dtype='<U7')
    
    x = boston.data[:,5] ## use only the number-of-rooms feature (RM)
    
    x.shape
    
    (506,)
    
    y = boston.target
    
    y.shape
    
    (506,)
    
    plt.scatter(x, y)
    plt.show()
    
    np.max(y)
    
    50.0
    
    x = x[y < 50.0]
    y = y[y < 50.0]
    
    x.shape
    
    (490,)
    
    y.shape
    
    (490,)
    
    plt.scatter(x, y)
    plt.show()
    
    Using Simple Linear Regression
    from playML.model_selection import train_test_split
    
    x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
    
    x_train.shape
    
    (392,)
    
    y_train.shape
    
    (392,)
    
    x_test.shape
    
    (98,)
    
    y_test.shape
    
    (98,)
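    The playML.model_selection module is the author's own package, and its source is not shown in this article. For readers who want to follow along, here is a minimal sketch of what such a train_test_split might look like. The test_ratio parameter and its default of 0.2 are assumptions inferred from the 392/98 split above, and the shuffling will not reproduce the author's exact split for seed=666.

```python
import numpy as np


def train_test_split(X, y, test_ratio=0.2, seed=None):
    """Shuffle the samples, then split them into a training set and a test set."""
    assert X.shape[0] == y.shape[0], \
        "the size of X must be equal to the size of y"
    assert 0.0 <= test_ratio <= 1.0, \
        "test_ratio must be between 0 and 1"

    # A seeded generator makes the shuffle reproducible.
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(X))

    test_size = int(len(X) * test_ratio)
    test_idx, train_idx = shuffled[:test_size], shuffled[test_size:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Synthetic stand-in with the same number of samples as the filtered data set.
x_demo = np.arange(490, dtype=float)
y_demo = 2 * x_demo + 1
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, seed=666)
print(x_tr.shape, x_te.shape)
```

    The same index array is applied to both X and y, so each feature value stays paired with its own label after shuffling.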
    
    from playML.SimpleLinearRegression import SimpleLinearRegression
    
    reg = SimpleLinearRegression()
    reg.fit(x_train, y_train)
    
    SimpleLinearRegression()
    
    reg.a_
    
    7.8608543562689555
    
    reg.b_
    
    -27.459342806705543
    
    plt.scatter(x_train, y_train)
    plt.plot(x_train, reg.predict(x_train), color='r')
    plt.show()
    
    plt.scatter(x_train, y_train)
    plt.scatter(x_test, y_test, color="c")
    plt.plot(x_train, reg.predict(x_train), color='r')
    plt.show()
    
    y_predict = reg.predict(x_test)
    
    MSE
    mse_test = np.sum((y_predict - y_test)**2) / len(y_test)
    mse_test
    
    24.156602134387438
    
    RMSE
    from math import sqrt
    
    rmse_test = sqrt(mse_test)
    rmse_test
    
    4.914936635846635
    
    MAE
    mae_test = np.sum(np.absolute(y_predict - y_test))/len(y_test)
    mae_test
    
    3.5430974409463873
    
    Wrapping Our Own Evaluation Functions

    Code:

    import numpy as np
    from math import sqrt
    
    
    def accuracy_score(y_true, y_predict):
        """Compute the accuracy between y_true and y_predict"""
        assert len(y_true) == len(y_predict), \
            "the size of y_true must be equal to the size of y_predict"
    
        return np.sum(y_true == y_predict) / len(y_true)
    
    
    def mean_squared_error(y_true, y_predict):
        """Compute the MSE between y_true and y_predict"""
        assert len(y_true) == len(y_predict), \
            "the size of y_true must be equal to the size of y_predict"
    
        return np.sum((y_true - y_predict)**2) / len(y_true)
    
    
    def root_mean_squared_error(y_true, y_predict):
        """Compute the RMSE between y_true and y_predict"""
    
        return sqrt(mean_squared_error(y_true, y_predict))
    
    
    def mean_absolute_error(y_true, y_predict):
        """Compute the MAE between y_true and y_predict"""
    
        return np.sum(np.absolute(y_true - y_predict)) / len(y_true)
    
    from playML.metrics import mean_squared_error
    from playML.metrics import root_mean_squared_error
    from playML.metrics import mean_absolute_error
    
    mean_squared_error(y_test, y_predict)
    
    24.156602134387438
    
    root_mean_squared_error(y_test, y_predict)
    
    4.914936635846635
    
    mean_absolute_error(y_test, y_predict)
    
    3.5430974409463873
    
    MSE and MAE in scikit-learn
    from sklearn.metrics import mean_squared_error
    from sklearn.metrics import mean_absolute_error
    
    mean_squared_error(y_test, y_predict)
    
    24.156602134387438
    
    mean_absolute_error(y_test, y_predict)
    
    3.5430974409463873
    

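    R Squared: The Best Metric for Evaluating Linear Regression

    Limitations of RMSE and MAE

    Suppose a model that predicts house prices has an RMSE or MAE of 5, while a model that predicts student scores has an error of 10. The 5 and the 10 cannot be compared, because they are in different units and on different scales, so they do not tell us on which problem the algorithm performs better.

    The Fix: Introducing R Squared

    $R^2 = 1 - \frac{\sum_i (\hat{y}^{(i)} - y^{(i)})^2}{\sum_i (\bar{y} - y^{(i)})^2}$

    What R Squared Means

    The denominator is the error of a baseline model that ignores x and always predicts the mean $\bar{y}$; this error is large. The numerator is the error of our model, which should be smaller because our model takes the relationship between y and x into account. Dividing the two gives the fraction of the baseline error that our model still makes, and subtracting that ratio from 1 gives the fraction of the error that our model eliminates.

    Dividing both the numerator and the denominator by m shows that $R^2 = 1 - \mathrm{MSE}(\hat{y}, y) / \mathrm{Var}(y)$.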
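    A small sketch can make the baseline interpretation concrete. It assumes the usual definition of R Squared with the mean of y as the baseline model, which agrees with the 1 - MSE/Var expression used later in this article. The three sets of predictions are made up for illustration: the baseline itself, a perfect predictor, and the line y = 0.8x + 0.4 fitted earlier.

```python
import numpy as np


def r2_score(y_true, y_predict):
    """R Squared: 1 - (squared error of the model) / (squared error of the baseline)."""
    ss_model = np.sum((y_true - y_predict) ** 2)
    ss_baseline = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_model / ss_baseline


x = np.array([1., 2., 3., 4., 5.])
y_true = np.array([1., 3., 2., 3., 5.])

baseline = np.full_like(y_true, y_true.mean())  # always predict the mean
perfect = y_true.copy()                         # predict every value exactly
line = 0.8 * x + 0.4                            # the least squares line from earlier

r2_baseline = r2_score(y_true, baseline)  # the baseline removes none of its own error
r2_perfect = r2_score(y_true, perfect)    # a perfect model removes all of it
r2_line = r2_score(y_true, line)          # the fitted line lies in between

print(r2_baseline, r2_perfect, r2_line)
```

    A model that does worse than the baseline would get a negative value, which usually suggests that the data has no linear relationship worth modelling.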

    Implementing R Squared (R^2)

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets
    
    boston = datasets.load_boston()
    x = boston.data[:,5] ## use only the number-of-rooms feature (RM)
    y = boston.target
    
    x = x[y < 50.0]
    y = y[y < 50.0]
    
    from playML.model_selection import train_test_split
    
    x_train, x_test, y_train, y_test = train_test_split(x, y, seed=666)
    
    from playML.SimpleLinearRegression import SimpleLinearRegression
    
    reg = SimpleLinearRegression()
    reg.fit(x_train, y_train)
    
    SimpleLinearRegression()
    
    reg.a_
    
    7.8608543562689555
    
    reg.b_
    
    -27.459342806705543
    
    y_predict = reg.predict(x_test)
    
    R Square
    from playML.metrics import mean_squared_error
    
    1 - mean_squared_error(y_test, y_predict)/np.var(y_test)
    
    
    Wrapping Our Own R2 Score

    Code (playML/metrics.py)

    def r2_score(y_true, y_predict):
        """Compute the R Squared between y_true and y_predict"""
    
        return 1 - mean_squared_error(y_true, y_predict)/np.var(y_true)
    
    from playML.metrics import r2_score
    
    r2_score(y_test, y_predict)
    
    0.61293168039373225
    
    r2_score in scikit-learn
    from sklearn.metrics import r2_score
    
    r2_score(y_test, y_predict)
    
    0.61293168039373236
    

    The score method of LinearRegression in scikit-learn returns the r2_score: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

    Adding score to Our SimpleLinearRegression
    import numpy as np
    from .metrics import r2_score
    
    
    class SimpleLinearRegression:
    
     
    
        def score(self, x_test, y_test):
            """Evaluate the current model on the test set x_test, y_test and return its R Squared"""
    
            y_predict = self.predict(x_test)
            return r2_score(y_test, y_predict)
    
    
    
    
    reg.score(x_test, y_test)
    
    0.61293168039373225
    

    Multiple Linear Regression

    Introduction to Multiple Linear Regression and the Normal Equation
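    The formula images for this section did not survive. The summary below is a reconstruction under the assumption that the objective is again the sum of squared errors; the final expression matches the line that computes self._theta in the fit_normal method shown later. Here $X_b$ is the feature matrix with a column of ones prepended, and $\theta_0$ is the intercept.

```latex
\begin{aligned}
\hat{y}^{(i)} &= \theta_0 + \theta_1 X_1^{(i)} + \theta_2 X_2^{(i)} + \dots + \theta_n X_n^{(i)} \\[4pt]
\hat{y} &= X_b \, \theta, \qquad
X_b = \begin{pmatrix}
1 & X_1^{(1)} & \cdots & X_n^{(1)} \\
\vdots & \vdots & & \vdots \\
1 & X_1^{(m)} & \cdots & X_n^{(m)}
\end{pmatrix}, \qquad
\theta = (\theta_0, \theta_1, \dots, \theta_n)^T \\[4pt]
\min_{\theta} \; & (y - X_b \theta)^T (y - X_b \theta)
\;\Longrightarrow\;
\theta = (X_b^T X_b)^{-1} X_b^T y
\end{aligned}
```

    This closed-form solution is the normal equation. It requires no feature scaling, but forming and inverting $X_b^T X_b$ becomes expensive when the number of features is large.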

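    Note (matrix multiplication): for a matrix A with m rows and a matrix B with n columns, each entry of A·B is obtained by multiplying one row of A element-wise with one column of B and summing the products, so the result has m rows and n columns.

    Note: a 1×m row vector multiplied by an m×1 column vector gives a single number.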
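    Both notes can be checked with NumPy. The matrices below are arbitrary values chosen for illustration; the only requirement is that the number of columns of the left operand equals the number of rows of the right operand.

```python
import numpy as np

A = np.arange(6).reshape(3, 2)   # 3 rows, 2 columns
B = np.arange(8).reshape(2, 4)   # 2 rows, 4 columns
C = A.dot(B)                     # 3 rows (from A), 4 columns (from B)

# Entry (1, 2) of C is row 1 of A times column 2 of B, summed.
entry_check = np.sum(A[1, :] * B[:, 2])

row = np.array([[1., 2., 3.]])       # 1 x 3 row vector
col = np.array([[4.], [5.], [6.]])   # 3 x 1 column vector
scalar = row.dot(col)                # 1 x 1 result holding a single number

print(C.shape, C[1, 2], entry_check)
print(scalar.shape, scalar[0, 0])
```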
    Derivation of the Multiple Linear Regression Formula

    Implementing Multiple Linear Regression

    Implementing Our Own Linear Regression

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets
    
    boston = datasets.load_boston()
    
    X = boston.data
    y = boston.target
    
    X = X[y < 50.0]
    y = y[y < 50.0]
    
    X.shape
    
    (490, 13)
    
    from playML.model_selection import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
    
    Using Our Own Linear Regression

    Code: playML/LinearRegression.py

    import numpy as np
    from .metrics import r2_score
    
    
    class LinearRegression:
    
        def __init__(self):
            """Initialize the Linear Regression model"""

            ## coefficient vector (θ1, θ2, ..., θn)
            self.coef_ = None
            ## intercept (θ0)
            self.interception_ = None
            ## full θ vector
            self._theta = None

        def fit_normal(self, X_train, y_train):
            """Train the Linear Regression model on X_train, y_train with the normal equation"""
            assert X_train.shape[0] == y_train.shape[0], \
                "the size of X_train must be equal to the size of y_train"

            ## np.ones((len(X_train), 1)) builds a single column of ones with as many rows as X_train
            ## np.hstack concatenates the matrices column-wise
            X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
            ## X_b.T is the transpose of the matrix
            ## np.linalg.inv() computes the inverse of a matrix
            ## dot() is matrix multiplication
            self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)

            self.interception_ = self._theta[0]
            self.coef_ = self._theta[1:]

            return self

        def predict(self, X_predict):
            """Given the data set X_predict, return the vector of predictions for it"""
            assert self.coef_ is not None and self.interception_ is not None,\
                "must fit before predict"
            assert X_predict.shape[1] == len(self.coef_),\
                "the feature number of X_predict must be equal to X_train"

            X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
            return X_b.dot(self._theta)

        def score(self, X_test, y_test):
            """Evaluate the current model on the test set X_test, y_test and return its R Squared"""

            y_predict = self.predict(X_test)
            return r2_score(y_test, y_predict)
    
        def __repr__(self):
            return "LinearRegression()"
    
    
    from playML.LinearRegression import LinearRegression
    
    reg = LinearRegression()
    reg.fit_normal(X_train, y_train)
    
    LinearRegression()
    
    reg.coef_
    
    array([ -1.18919477e-01,   3.63991462e-02,  -3.56494193e-02,
             5.66737830e-02,  -1.16195486e+01,   3.42022185e+00,
            -2.31470282e-02,  -1.19509560e+00,   2.59339091e-01,
            -1.40112724e-02,  -8.36521175e-01,   7.92283639e-03,
            -3.81966137e-01])
    
    reg.interception_
    
    34.161435496224712
    
    reg.score(X_test, y_test)
    
    0.81298026026584658
    

    09 Regression in scikit-learn

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets
    
    boston = datasets.load_boston()
    
    X = boston.data
    y = boston.target
    
    X = X[y < 50.0]
    y = y[y < 50.0]
    
    X.shape
    
    (490, 13)
    
    from playML.model_selection import train_test_split
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, seed=666)
    
    Linear Regression in scikit-learn
    from sklearn.linear_model import LinearRegression
    
    lin_reg = LinearRegression()
    lin_reg.fit(X_train, y_train)
    
    LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
    
    lin_reg.coef_
    
    array([ -1.18919477e-01,   3.63991462e-02,  -3.56494193e-02,
             5.66737830e-02,  -1.16195486e+01,   3.42022185e+00,
            -2.31470282e-02,  -1.19509560e+00,   2.59339091e-01,
            -1.40112724e-02,  -8.36521175e-01,   7.92283639e-03,
            -3.81966137e-01])
    
    lin_reg.intercept_
    
    34.161435496246924
    
    lin_reg.score(X_test, y_test)
    
    0.81298026026584758
    
    kNN Regressor
    from sklearn.preprocessing import StandardScaler
    
    standardScaler = StandardScaler()
    standardScaler.fit(X_train, y_train)
    X_train_standard = standardScaler.transform(X_train)
    X_test_standard = standardScaler.transform(X_test)
    
    from sklearn.neighbors import KNeighborsRegressor
    
    knn_reg = KNeighborsRegressor()
    knn_reg.fit(X_train_standard, y_train)
    knn_reg.score(X_test_standard, y_test)
    
    0.84664511530389497
    
    from sklearn.model_selection import GridSearchCV
    
    param_grid = [
        {
            "weights": ["uniform"],
            "n_neighbors": [i for i in range(1, 11)]
        },
        {
            "weights": ["distance"],
            "n_neighbors": [i for i in range(1, 11)],
            "p": [i for i in range(1,6)]
        }
    ]
    
    knn_reg = KNeighborsRegressor()
    grid_search = GridSearchCV(knn_reg, param_grid, n_jobs=-1, verbose=1)
    grid_search.fit(X_train_standard, y_train)
    
    Fitting 3 folds for each of 60 candidates, totalling 180 fits
    
    
    [Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:    1.5s finished
    
    
    
    
    
    GridSearchCV(cv=None, error_score='raise',
           estimator=KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
              metric_params=None, n_jobs=1, n_neighbors=5, p=2,
              weights='uniform'),
           fit_params={}, iid=True, n_jobs=-1,
           param_grid=[{'weights': ['uniform'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, {'weights': ['distance'], 'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'p': [1, 2, 3, 4, 5]}],
           pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
           scoring=None, verbose=1)
    
    grid_search.best_params_
    
    {'n_neighbors': 5, 'p': 1, 'weights': 'distance'}
    
    grid_search.best_score_
    
    0.79917999890996905
    
    grid_search.best_estimator_.score(X_test_standard, y_test)
    
    0.88099665099417701
    

    10 Interpretability of Linear Regression Parameters

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets
    
    boston = datasets.load_boston()
    
    X = boston.data
    y = boston.target
    
    X = X[y < 50.0]
    y = y[y < 50.0]
    
    from sklearn.linear_model import LinearRegression
    
    lin_reg = LinearRegression()
    lin_reg.fit(X, y)
    
    LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
    
    lin_reg.coef_
    
    array([ -1.05574295e-01,   3.52748549e-02,  -4.35179251e-02,
             4.55405227e-01,  -1.24268073e+01,   3.75411229e+00,
            -2.36116881e-02,  -1.21088069e+00,   2.50740082e-01,
            -1.37702943e-02,  -8.38888137e-01,   7.93577159e-03,
            -3.50952134e-01])
    
    np.argsort(lin_reg.coef_)
    
    array([ 4,  7, 10, 12,  0,  2,  6,  9, 11,  1,  8,  3,  5])
    
    boston.feature_names[np.argsort(lin_reg.coef_)]
    
    array(['NOX', 'DIS', 'PTRATIO', 'LSTAT', 'CRIM', 'INDUS', 'AGE', 'TAX',
           'B', 'ZN', 'RAD', 'CHAS', 'RM'], 
          dtype='<U7')
    
    print(boston.DESCR)
    

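    RM is the average number of rooms per dwelling and has the largest positive coefficient: the more rooms, the higher the price, which is very reasonable.
    NOX is the nitric oxides concentration and has the most negative coefficient: the higher the concentration, the lower the price, which is also very reasonable.
    This shows that linear regression is interpretable. When studying a new problem, we can first fit a linear regression model and then use intuition to judge whether the coefficients match our expectations.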
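    The same sorting trick can be tried on data where the true relationship is known. The sketch below is a synthetic illustration, not the Boston data: the feature names and the coefficients 4, -6 and 0 are invented, and the fit uses np.linalg.lstsq instead of scikit-learn to stay self-contained. The three features share the same scale here; with features in different units, raw coefficient size also reflects the units, so the ranking should be read with care.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_names = np.array(["rooms", "pollution", "noise"])  # hypothetical features

# Known relationship: rooms raises y, pollution lowers it, noise has no effect.
X = rng.normal(size=(300, 3))
y = 20 + 4.0 * X[:, 0] - 6.0 * X[:, 1] + 0.0 * X[:, 2] \
    + rng.normal(scale=0.5, size=300)

# Fit a linear model with an intercept column.
X_b = np.hstack([np.ones((len(X), 1)), X])
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)
coef = theta[1:]

# Sort features from most negative to most positive coefficient.
order = np.argsort(coef)
print(feature_names[order])
print(coef[order])
```

    The feature with the strongest negative effect comes first and the one with the strongest positive effect comes last, mirroring how NOX and RM end up at the two ends of the sorted Boston feature names.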


    Article link: https://www.haomeiwen.com/subject/jmjnqctx.html