MML(skl)——C2

Author: thestarrunner | Published 2018-11-24 04:56

    Goal of regression problems: to predict the value of a continuous response (dependent) variable

    Steps: training data, model, learning algorithm, and evaluation metrics

    Theoretical Part

    Linear Regression

    Data

    training data

    Training instance | Diameter (in inches), x | Price (in dollars), y
    1                 | 6                       | 7
    2                 | 8                       | 9
    3                 | 10                      | 13
    4                 | 14                      | 17.5
    5                 | 18                      | 18
    Visualize via matplotlib
    >>> import matplotlib.pyplot as plt
    >>> X = [[6], [8], [10], [14], [18]]
    >>> y = [[7], [9], [13], [17.5], [18]]
    >>> plt.figure()
    >>> plt.title('Pizza price plotted against diameter')
    >>> plt.xlabel('Diameter in inches')
    >>> plt.ylabel('Price in dollars')
    >>> plt.plot(X, y, 'k.')
    >>> plt.axis([0, 25, 0, 25])
    >>> plt.grid(True)
    >>> plt.show()
    
    Model fitting
    >>> from sklearn.linear_model import LinearRegression
    >>> # Training data
    >>> X = [[6], [8], [10], [14], [18]]
    >>> y = [[7], [9], [13], [17.5], [18]]
    >>> # Create and fit the model
    >>> model = LinearRegression()
    >>> model.fit(X, y)
    >>> # predict() expects a 2D array of samples, hence the nested list
    >>> print('A 12" pizza should cost: $%.2f' % model.predict([[12]])[0, 0])
    A 12" pizza should cost: $13.68
    

    The sklearn.linear_model.LinearRegression class is an estimator.
    Estimators predict a value based on the observed data. In scikit-learn, all estimators implement the fit() and predict() methods.
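
    As a quick check on what fit() learned, the slope and intercept can be read back from the estimator's coef_ and intercept_ attributes. A minimal sketch, reusing the training data above:

    >>> from sklearn.linear_model import LinearRegression
    >>> X = [[6], [8], [10], [14], [18]]
    >>> y = [[7], [9], [13], [17.5], [18]]
    >>> model = LinearRegression()
    >>> model.fit(X, y)
    >>> # intercept_ and coef_ hold the learned parameters of price = intercept + coef * diameter
    >>> print(model.intercept_, model.coef_)  # roughly [1.97] and [[0.98]]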

    Comparison
    import numpy as np
    import matplotlib.pyplot as plt

    # Dense grid of diameters over which to draw the fitted line
    xs = np.linspace(0, 20, 10000)
    plt.figure()
    plt.plot(xs, model.predict(xs.reshape(-1, 1)), 'r')  # fitted regression line
    plt.plot(X, y, 'bo', markersize=10)                  # training samples
    plt.grid(True)
    plt.title('predicted vs. sample')
    plt.show()
    
    LinearR_comparison.png: fitted regression line vs. training samples
    Evaluation of model fitness

    some definitions:
    cost function / loss function := defines and measures the error of a model
    residuals or training errors := the differences between the predicted values and the observed y values in the training data
    prediction errors or test errors := the differences between the predicted values and the observed y values in the test data

    some definitions for linear regression:
    residual sum of squares (RSS) cost function
    LSE: least squares estimators

    when we have a cost function, we can find the values of our model's parameters
    that minimize it.

    Note: the unbiased estimator of a dataset's variance uses N-1 instead of N as the denominator
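
    For simple linear regression y = \alpha + \beta x, minimizing the RSS cost function has a well-known closed-form solution (stated here for reference, using the sample variance and covariance with the N-1 denominator noted above):

    SS_{res} = \sum_{i=1}^n (y_i - (\alpha + \beta x_i))^2
    \hat{\beta} = \frac{cov(x, y)}{var(x)}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}

    On the training data above this gives \hat{\beta} \approx 0.976 and \hat{\alpha} \approx 1.97, which reproduces the $13.68 prediction for a 12" pizza.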

    Evaluation

    test data

    Test instance | Diameter (in inches), x | Observed price (in dollars), y | Predicted price (in dollars), y_predicted
    1             | 8                       | 11                             | 9.7759
    2             | 9                       | 8.5                            | 10.7522
    3             | 11                      | 15                             | 12.7048
    4             | 16                      | 18                             | 17.5863
    5             | 12                      | 11                             | 13.6811

    Several measures can be used to assess our model's predictive capabilities. We will
    evaluate our pizza-price predictor using R-squared:
    R^2 = 1: the model explains the response with no errors
    R^2 = 0.5: half of the variance in the response variable can be predicted using the model
    In the case of simple linear regression, r-squared is equal to the square of the Pearson product moment correlation coefficient, or Pearson's r.

    R^2 := 1 - \frac{SS_{res}}{SS_{tot}}, where SS_{res} is the residual sum of squares (the cost function above) and SS_{tot} is the total sum of squares

    >>> from sklearn.linear_model import LinearRegression
    >>> X = [[6], [8], [10], [14], [18]]
    >>> y = [[7], [9], [13], [17.5], [18]]
    >>> X_test = [[8], [9], [11], [16], [12]]
    >>> y_test = [[11], [8.5], [15], [18], [11]]
    >>> model = LinearRegression()
    >>> model.fit(X, y)
    >>> print('R-squared: %.4f' % model.score(X_test, y_test))
    R-squared: 0.6620
    

    Multiple Linear Regression

    for Y = X\beta
    solution: \beta = (X^TX)^{-1}X^TY

    so a module for matrix inverse calculation is introduced: np.linalg (NumPy's linear algebra routines)

    >>> from numpy.linalg import inv
    >>> from numpy import dot, transpose
    
    >>> dot(inv(dot(transpose(X), X)), dot(transpose(X), y))  # computes (X^T X)^{-1} X^T y
    

    NumPy also provides a least squares solver, np.linalg.lstsq, which solves the same problem; a sketch follows.
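
    A minimal sketch of np.linalg.lstsq, assuming the pizza training data from above with a column of ones added for the intercept:

    import numpy as np

    # Design matrix with a bias column (assumed: same pizza data as above)
    X = np.array([[1, 6], [1, 8], [1, 10], [1, 14], [1, 18]])
    y = np.array([7, 9, 13, 17.5, 18])

    # Solves min ||X beta - y||^2 without forming an explicit inverse
    beta, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
    print(beta)  # [intercept, slope], roughly [1.97, 0.98]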

    Polynomial regression

    Quadratic regression

    y = \alpha + \beta_1 x + \beta_2 x^2
    e.g.

    >>> import numpy as np
    >>> import matplotlib.pyplot as plt
    >>> from sklearn.linear_model import LinearRegression
    >>> from sklearn.preprocessing import PolynomialFeatures
    >>> X_train = [[6], [8], [10], [14], [18]]
    >>> y_train = [[7], [9], [13], [17.5], [18]]
    >>> X_test = [[6], [8], [11], [16]]
    >>> y_test = [[8], [12], [15], [18]]
    >>> regressor = LinearRegression()
    >>> regressor.fit(X_train, y_train)
    >>> xx = np.linspace(0, 26, 100)
    >>> yy = regressor.predict(xx.reshape(xx.shape[0], 1))
    >>> plt.plot(xx, yy)
    
    1.png: simple linear regression line over the plotted range

    Note: here is the part that differs from simple linear regression.

    >>> quadratic_featurizer = PolynomialFeatures(degree=2)
    >>> X_train_quadratic = quadratic_featurizer.fit_transform(X_train)
    >>> X_test_quadratic = quadratic_featurizer.transform(X_test)
    """X_train_quadratic:
    array([[  1.,   6.,  36.],
           [  1.,   8.,  64.],
           [  1.,  10., 100.],
           [  1.,  14., 196.],
           [  1.,  18., 324.]])"""
    >>> regressor_quadratic = LinearRegression()
    >>> regressor_quadratic.fit(X_train_quadratic, y_train)
    >>> xx_quadratic = quadratic_featurizer.transform(xx.reshape(xx.shape[0], 1))
    

    PolynomialFeatures(degree=N).fit_transform(x) : x -> 1, x, x^2, ..., x^N
    The MAIN POINT is to transform x into multiple polynomial features; the model is still fitted with LinearRegression().

    >>> plt.plot(xx, regressor_quadratic.predict(xx_quadratic), c='r',
    linestyle='--')
    
    2.png: quadratic regression curve added as a red dashed line
    >>> plt.title('Pizza price regressed on diameter')
    >>> plt.xlabel('Diameter in inches')
    >>> plt.ylabel('Price in dollars')
    >>> plt.axis([0, 25, 0, 25])
    >>> plt.grid(True)
    >>> plt.scatter(X_train, y_train)
    >>> plt.show()
    >>> print(X_train)
    >>> print(X_train_quadratic)
    >>> print(X_test)
    >>> print(X_test_quadratic)
    >>> print('Simple linear regression r-squared', regressor.score(X_test, y_test))
    >>> print('Quadratic regression r-squared', regressor_quadratic.score(X_test_quadratic, y_test))
    
    3.png: training samples with the linear and quadratic fits, plus the printed feature matrices and R-squared scores

    R^2 increases to 0.87.

    When degree = 9, R^2 drops to -0.09,
    which is over-fitting.
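
    A minimal sketch of the degree-9 case, reusing the same train/test split as above (the exact score may vary slightly with the library version):

    >>> ninth_featurizer = PolynomialFeatures(degree=9)
    >>> X_train_ninth = ninth_featurizer.fit_transform(X_train)
    >>> X_test_ninth = ninth_featurizer.transform(X_test)
    >>> regressor_ninth = LinearRegression()
    >>> regressor_ninth.fit(X_train_ninth, y_train)
    >>> print('Degree-9 regression r-squared', regressor_ninth.score(X_test_ninth, y_test))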

    Regularization

    Regularization is a collection of techniques that can be used to prevent over-fitting.
    Regularization adds information to a problem, often in the form of a penalty against complexity.

    Occam's razor : a hypothesis with the fewest assumptions is the best

    Ridge regression (Tikhonov regularization) (L2 penalty):
    RSS_{ridge} = \sum_{i=1}^n (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^p \beta_j^2

    \lambda is a hyperparameter: a parameter of the model that is not learned automatically and must be set manually

    Least Absolute Shrinkage and Selection Operator (LASSO) (L1 penalty):
    RSS_{lasso} = \sum_{i=1}^n (y_i - x_i^T \beta)^2 + \lambda \sum_{j=1}^p |\beta_j|
    NOTE: The LASSO produces sparse parameters: most of the coefficients become zero, so the model depends on a small subset of the features, whereas ridge regression keeps most coefficients nonzero.
    When explanatory variables are correlated, the LASSO will shrink the coefficients of one variable toward zero; ridge regression will shrink them more uniformly.

    Elastic Net:
    RSS_{enet} = \sum_{i=1}^n (y_i - x_i^T \beta)^2 + \lambda_2 \sum_{j=1}^p \beta_j^2 + \lambda_1 \sum_{j=1}^p |\beta_j|
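
    All three penalized regressions are available in scikit-learn. A minimal sketch on the pizza data; alpha plays the role of \lambda above and the values used here are arbitrary:

    from sklearn.linear_model import Ridge, Lasso, ElasticNet

    X = [[6], [8], [10], [14], [18]]
    y = [7, 9, 13, 17.5, 18]

    # alpha is the regularization strength; values chosen arbitrarily for illustration
    for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
        model.fit(X, y)
        print(type(model).__name__, model.coef_, model.intercept_)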

    Down To Earth

    dataset url: https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

    Data Exploring

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    import matplotlib.pyplot as plt

    # wine.data has 14 columns: the class label (1-3) followed by 13 features. With only
    # 13 names given, pandas uses the first column as the index, here labelled 'quality'.
    data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',
                       header=None,
                       names=['Alcohol','Malic_acid','Ash','Alcalinity_of_ash','Magnesium','Total_phenols','Flavanoids','Nonflavanoid_phenols','Proanthocyanins','Color_intensity','Hue','OD280/OD315_of_diluted wines','Proline'])
    data.index.name = 'quality'
    
    plt.figure(figsize=(15,15))
    
    plt.subplot(2,2,1)
    plt.title('alcohol vs quality ')
    plt.xlabel('alcohol')
    plt.ylabel('quality')
    plt.scatter(data['Alcohol'], data.index)
    
    plt.subplot(2,2,2)
    plt.title('Ash vs quality ')
    plt.xlabel('Ash')
    plt.ylabel('quality')
    plt.scatter(data['Ash'], data.index)
    
    plt.subplot(2,2,3)
    plt.title('Proline vs quality ')
    plt.xlabel('Proline')
    plt.ylabel('quality')
    plt.scatter(data['Proline'], data.index)
    
    plt.subplot(2,2,4)
    plt.title('Hue vs quality ')
    plt.xlabel('Hue')
    plt.ylabel('quality')
    plt.scatter(data['Hue'], data.index)
    
    plt.show()
    
    4.png: scatter plots of Alcohol, Ash, Proline, and Hue against the class index ('quality')

    Model Fitting

    Hold-out validation

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    
    data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data',header=None,names=['Alcohol','Malic_acid ','Ash','Alcalinity_of_ash','Magnesium', 'Total_phenols','Flavanoids','Nonflavanoid_phenols','Proanthocyanins','Color_intensity','Hue','OD280/OD315_of_diluted wines','Proline'])
    data.index.name='quality'
    
    X = data.loc[:,['Alcohol','Ash','Proline','Hue']]
    y = data.index
    X_train, X_test, y_train, y_test = train_test_split(X, y,random_state = 40)
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    y_predictions = regressor.predict(X_test)
    print('R-squared:', regressor.score(X_test, y_test))
    
    
    R-squared: 0.630209361477557
    
    1. load the data
    2. split the data set via model_selection.train_test_split
      Note
      i. train_test_split(data, labels, stratify=y, test_size=0.25, random_state=40): test_size defaults to 0.25; this is the hold-out method with random (or, with stratify, stratified) sampling
      ii. with a stratified split the R-squared increases to 0.6554701296431691 (see the sketch after this list)
    3. train the model and evaluate it on the test set.
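
    A minimal sketch of the stratified variant, reusing X, y, and the imports from the hold-out code above:

    # Stratified hold-out split: the class proportions in y are preserved in both subsets
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=40)
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    print('R-squared:', regressor.score(X_test, y_test))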

    Cross validation

    model_selection.cross_val_score(estimator, data, target, cv=5); when cv is a number k, k-fold cross-validation is performed (a sketch follows)
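
    A minimal sketch, reusing the wine features X and target y from the hold-out example above:

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # 5-fold cross-validation: the model is fitted on 4 folds and scored on the held-out fold, 5 times
    scores = cross_val_score(LinearRegression(), X, y, cv=5)
    print('Cross validation r-squared scores:', scores, 'mean:', scores.mean())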

    GD

    reasons: lower computational complexity; the matrix X^TX may not be invertible
    Gradient Descent is an optimization algorithm that can be used to estimate the local minimum of a function. Fortunately, the residual sum of the squares cost function is convex.

    to minimize SS_{res} = \sum_{i=1}^n (y_i - f(x_i))^2
    learning rate: too large -> the algorithm bounces around the minimum without settling; too small -> convergence takes too long
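
    For a linear model f(x_i) = x_i^T \beta, both variants below apply the same update rule with learning rate \alpha; a sketch:

    \nabla_\beta SS_{res}(\beta) = -2 \sum_{i=1}^n (y_i - x_i^T \beta)\, x_i
    \beta \leftarrow \beta - \alpha \, \nabla_\beta SS_{res}(\beta)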

    (Batch) gradient descent

    uses all of the training instances to update the model parameters in each iteration

    Stochastic Gradient Descent (SGD)

    updates the parameters using only a single training instance in each iteration. The training instance is usually selected randomly.

    import numpy as np
    from sklearn.datasets import load_boston  # removed in scikit-learn >= 1.2; use an older version or another dataset
    from sklearn.linear_model import SGDRegressor
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.preprocessing import StandardScaler

    data = load_boston()
    X_train, X_test, y_train, y_test = train_test_split(data.data, data.target)
    # Scale features and target: SGD is sensitive to feature scales
    X_scaler = StandardScaler()
    y_scaler = StandardScaler()
    X_train = X_scaler.fit_transform(X_train)
    y_train = y_scaler.fit_transform(y_train.reshape(-1, 1)).ravel()  # SGDRegressor expects a 1-D target
    X_test = X_scaler.transform(X_test)
    y_test = y_scaler.transform(y_test.reshape(-1, 1)).ravel()
    regressor = SGDRegressor(loss='squared_error')  # called 'squared_loss' in older scikit-learn versions
    scores = cross_val_score(regressor, X_train, y_train, cv=5)
    print('Cross validation r-squared scores:', scores)
    print('Average cross validation r-squared score:', np.mean(scores))
    regressor.fit(X_train, y_train)  # fit(), not fit_transform(): regressors are not transformers
    print('Test set r-squared score', regressor.score(X_test, y_test))
    
    Cross validation r-squared scores: [0.59439483 0.613529   0.72415499 0.78472194 0.69196096]
    Average cross validation r-squared score: 0.6817523439301019
    
    
