美文网首页
03 - linear regression

03 - linear regression

作者: 西瓜三茶 | 来源:发表于2017-07-15 14:07 被阅读0次

    Pisa Tower Example

    import pandas
    import matplotlib.pyplot as plt
    
    pisa = pandas.DataFrame({"year": range(1975, 1988), 
                             "lean": [2.9642, 2.9644, 2.9656, 2.9667, 2.9673, 2.9688, 2.9696, 
                                      2.9698, 2.9713, 2.9717, 2.9725, 2.9742, 2.9757]})
    
    print(pisa)
    
    plt.scatter(pisa["year"], pisa["lean"])
    plt.show()
    

    增加一列截距项

    import statsmodels.api as sm
    
    y = pisa.lean # target
    X = pisa.year  # features
    X = sm.add_constant(X)  # add a column of 1's as the constant term
    
    # OLS -- Ordinary Least Squares Fit
    linear = sm.OLS(y, X)
    # fit model
    linearfit = linear.fit()
    print (linearfit.summary())
    

    计算residual value

    # Our predicted values of y
    yhat = linearfit.predict(X)
    print(yhat)
    
    residuals = yhat - y
    print (residuals)
    
    #plot residuals and if it looks similar to a bell curve then we will accept the assumption of normality
    plt.hist(residuals, bins = 5)
    plt.show()
    

    3个summary of squares

    Sum of Square Error (SSE)  
    Regression Sum of Squares (RSS) 
    Total Sum of Squares (TSS)   # total variance
    
    SSE = np.sum((y.values-yhat)**2)
    RSS = np.sum((y.mean()-yhat)**2)
    TSS = np.sum((y.values-yhat.mean())**2)
    

    TSS=RSS+SSE

    • SSE is the sum of all residuals giving us a measure between the model's prediction and the observed values.
    • RSS measures the amount of explained variance which our model accounts for.
    • A large RSS and small SSE can be an indicator of a strong model.
    • TSS = RSS + SSE - the total amount of variance in the data is captured by the variance explained by the model plus the variance missed by the model.
    R_squared = 1 - SSE / TSS = RSS / TSS
    # what the percentage of variation in the target variable is explained by our model.
    

    Show params for the regression model

    print("\n",linearfit.params)
    

    The variance of each of the coefficients is an important and powerful measure.

    # Compute SSE
    SSE = np.sum((y.values - yhat)**2)
    # Compute variance in X
    xvar = np.sum((pisa.year - pisa.year.mean())**2)
    # Compute variance in b1 
    s2b1 = SSE / ((y.shape[0] - 2) * xvar)
    

    pdf and cdf

    from scipy.stats import t
    
    # 100 values between -3 and 3
    x = np.linspace(-3,3,100)
    
    # Compute the pdf with 3 degrees of freedom
    tdist3 = t.pdf(x=x, df=3)
    tdist30 = t.pdf(x=x, df=30)
                
    plt.plot(x, tdist3)
    plt.plot(x, tdist30)
    

    计算系数的significance

    The variable s2b1 is in memory.  The variance of beta_1
    tstat = linearfit.params["year"] / np.sqrt(s2b1)
    

    t检验

    # At the 95% confidence interval for a two-sided t-test we must use a p-value of 0.975
    pval = 0.975
    
    # The degrees of freedom
    df = pisa.shape[0] - 2
    
    # The probability to test against
    p = t.cdf(tstat, df=df)
    
    beta1_test = p > pval
    

    相关文章

      网友评论

          本文标题:03 - linear regression

          本文链接:https://www.haomeiwen.com/subject/pgwshxtx.html