03 - linear regression

作者: 西瓜三茶 | 来源:发表于2017-07-15 14:07 被阅读0次

tensorflow 已经完成高级别的模型封装种类
R-建模及预测
03 - linear regression
Linear Regression（线性回归）
Regression Tree (VS Linear Regre
tensorflow学习笔记（3）|线性回归
Logistic Regression
M.L.-Classification and Represen
BA: Predictive Data
Linear Regression线性回归

Pisa Tower Example

import pandas
import matplotlib.pyplot as plt

pisa = pandas.DataFrame({"year": range(1975, 1988), 
                         "lean": [2.9642, 2.9644, 2.9656, 2.9667, 2.9673, 2.9688, 2.9696, 
                                  2.9698, 2.9713, 2.9717, 2.9725, 2.9742, 2.9757]})

print(pisa)

plt.scatter(pisa["year"], pisa["lean"])
plt.show()

增加一列截距项

import statsmodels.api as sm

y = pisa.lean # target
X = pisa.year  # features
X = sm.add_constant(X)  # add a column of 1's as the constant term

# OLS -- Ordinary Least Squares Fit
linear = sm.OLS(y, X)
# fit model
linearfit = linear.fit()
print (linearfit.summary())

计算residual value

# Our predicted values of y
yhat = linearfit.predict(X)
print(yhat)

residuals = yhat - y
print (residuals)

#plot residuals and if it looks similar to a bell curve then we will accept the assumption of normality
plt.hist(residuals, bins = 5)
plt.show()

3个summary of squares

Sum of Square Error (SSE)  
Regression Sum of Squares (RSS) 
Total Sum of Squares (TSS)   # total variance

SSE = np.sum((y.values-yhat)**2)
RSS = np.sum((y.mean()-yhat)**2)
TSS = np.sum((y.values-yhat.mean())**2)

TSS=RSS+SSE

SSE is the sum of all residuals giving us a measure between the model's prediction and the observed values.
RSS measures the amount of explained variance which our model accounts for.
A large RSS and small SSE can be an indicator of a strong model.
TSS = RSS + SSE - the total amount of variance in the data is captured by the variance explained by the model plus the variance missed by the model.

R_squared = 1 - SSE / TSS = RSS / TSS
# what the percentage of variation in the target variable is explained by our model.

Show params for the regression model

print("\n",linearfit.params)

The variance of each of the coefficients is an important and powerful measure.

# Compute SSE
SSE = np.sum((y.values - yhat)**2)
# Compute variance in X
xvar = np.sum((pisa.year - pisa.year.mean())**2)
# Compute variance in b1 
s2b1 = SSE / ((y.shape[0] - 2) * xvar)

pdf and cdf

from scipy.stats import t

# 100 values between -3 and 3
x = np.linspace(-3,3,100)

# Compute the pdf with 3 degrees of freedom
tdist3 = t.pdf(x=x, df=3)
tdist30 = t.pdf(x=x, df=30)
            
plt.plot(x, tdist3)
plt.plot(x, tdist30)

计算系数的significance

The variable s2b1 is in memory.  The variance of beta_1
tstat = linearfit.params["year"] / np.sqrt(s2b1)

t检验

# At the 95% confidence interval for a two-sided t-test we must use a p-value of 0.975
pval = 0.975

# The degrees of freedom
df = pisa.shape[0] - 2

# The probability to test against
p = t.cdf(tstat, df=df)

beta1_test = p > pval