continuous supervised learning: supervised learning whose output is a continuous variable
regression: the technique for predicting such continuous outputs
continuous: the values have an inherent order and can be compared by magnitude
1. Concept
slope: how much the prediction changes per unit change in the input
intercept: the predicted value when the input is zero
coefficient: the weight the regression learns for an input feature (the slope, in the one-feature case)
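Putting these together, the regression line is prediction = slope × input + intercept. A minimal sketch with made-up numbers (the slope and intercept here are illustrative, not fitted values):

```python
# The regression line: y_hat = slope * x + intercept
slope, intercept = 6.25, -14.35   # illustrative values only, not fitted ones
age = 27
net_worth = slope * age + intercept
print(net_worth)   # 154.4
```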
2. Coding
import numpy
import matplotlib.pyplot as plt
from ages_net_worths import ageNetWorthData
ages_train, ages_test, net_worths_train, net_worths_test = ageNetWorthData()
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(ages_train, net_worths_train)
### get Katie's net worth (she's 27)
### sklearn predictions are returned in an array, so you'll want to index into
### the output to get what you want, e.g. net_worth = predict([[27]])[0][0] (not
### exact syntax, the point is the [0] at the end). In addition, make sure the
### argument to your prediction function is in the expected format - if you get
### a warning about needing a 2d array for your data, a list of lists will be
### interpreted by sklearn as such (e.g. [[27]]).
km_net_worth = reg.predict([[27]])[0][0]
### get the slope
### again, you'll get a 2-D array, so stick the [0][0] at the end
slope = reg.coef_[0][0]
### get the intercept
### here you get a 1-D array, so stick [0] on the end to access
### the info we want
intercept = reg.intercept_[0]
### get the score on test data
test_score = reg.score(ages_test, net_worths_test)
### get the score on the training data
training_score = reg.score(ages_train, net_worths_train)
def submitFit():
    # all of the values in the returned dictionary are expected to be
    # numbers for the purpose of the grader
    return {"networth": km_net_worth,
            "slope": slope,
            "intercept": intercept,
            "stats on test": test_score,
            "stats on training": training_score}
3. Linear regression errors
The error of a prediction is the actual value minus the predicted value. The best linear regression is the one that minimizes the sum of squared errors over all training points (computed in the sketch below).
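A minimal sketch of computing that sum by hand, assuming reg, ages_train, and net_worths_train from the coding section above:

```python
import numpy as np

# Square each point's residual (actual - predicted) and sum them up.
predictions = reg.predict(ages_train)
sse = np.sum((net_worths_train - predictions) ** 2)
print(sse)
```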
4. Algorithms for minimizing the sum of squared errors
ordinary least squares (OLS): a closed-form solution, and what sklearn's LinearRegression uses
gradient descent: an iterative solution (sketched below)
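For one feature, OLS has a closed form (slope = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²), so nothing iterative is needed; gradient descent instead walks the slope and intercept downhill along the SSE gradient. A toy sketch of the latter (illustrative only, not sklearn's implementation):

```python
import numpy as np

def fit_by_gradient_descent(x, y, lr=1e-4, n_iters=10000):
    """Minimize SSE for y ~ slope * x + intercept by gradient descent.
    lr (learning rate) and n_iters are arbitrary and may need tuning
    to the scale of the data."""
    slope, intercept = 0.0, 0.0
    for _ in range(n_iters):
        residuals = y - (slope * x + intercept)
        # Partial derivatives of SSE with respect to slope and intercept
        slope -= lr * (-2.0 * np.sum(x * residuals))
        intercept -= lr * (-2.0 * np.sum(residuals))
    return slope, intercept
```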
5. The problem with SSE
The sum of squared errors (SSE) grows as more data points are added, even when the per-point quality of the fit is unchanged, so SSE cannot be compared fairly between datasets of different sizes.
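A tiny numeric illustration of that problem (made-up residuals):

```python
import numpy as np

# Two fits of identical per-point quality; the larger dataset gets 10x the SSE.
errors_small = np.array([1.0, -1.0, 0.5])      # residuals for 3 points
errors_large = np.tile(errors_small, 10)       # 30 points, same error pattern
print(np.sum(errors_small ** 2))   # 2.25
print(np.sum(errors_large ** 2))   # 22.5
```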
6. The R² metric for regression
0 < R² < 1 in the usual case; the closer R² is to 1, the better the fit. (sklearn's score can actually return a negative value on test data when the model fits worse than a horizontal line.)
Advantage: R² does not depend on the number of training points, so it is somewhat more reliable than the sum of squared errors.
In sklearn, reg.score returns R² (a hand computation is sketched below).
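A minimal sketch of what reg.score computes, assuming the fitted reg and the test arrays from the coding section:

```python
import numpy as np

# R^2 = 1 - SS_res / SS_tot
pred = reg.predict(ages_test)
ss_res = np.sum((net_worths_test - pred) ** 2)                       # residual sum of squares
ss_tot = np.sum((net_worths_test - np.mean(net_worths_test)) ** 2)   # total sum of squares
r_squared = 1.0 - ss_res / ss_tot   # should match reg.score(ages_test, net_worths_test)
```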
7. Comparing classification and regression
Classification predicts a discrete label and learns a decision boundary, typically evaluated with accuracy; regression predicts a continuous output and learns a best-fit line, typically evaluated with SSE or R².
8. Multivariate regression
Multivariate regression uses several input features at once: the model learns one coefficient per feature plus a single intercept (see the sketch below).
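A minimal multivariate sketch with made-up data (the numbers and the two-feature setup are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two input features per point (the columns of X), one continuous target.
X = np.array([[25.0, 1.0], [30.0, 3.0], [35.0, 4.0], [40.0, 8.0]])
y = np.array([150.0, 210.0, 250.0, 340.0])
reg = LinearRegression().fit(X, y)
print(reg.coef_)        # one coefficient per input feature
print(reg.intercept_)   # a single intercept
```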
9. Mini-project
#!/usr/bin/python
"""
Starter code for the regression mini-project.
Loads up/formats a modified version of the dataset
(why modified? we've removed some trouble points
that you'll find yourself removing in the outliers mini-project).
Draws a little scatterplot of the training/testing data
You fill in the regression code where indicated:
"""
import sys
import pickle
sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit
dictionary = pickle.load( open("../final_project/final_project_dataset_modified.pkl", "rb") )   # "rb": pickles must be opened in binary mode
### list the features you want to look at--first item in the
### list will be the "target" feature
features_list = ["bonus", "salary"]
data = featureFormat( dictionary, features_list, remove_any_zeroes=True)
target, features = targetFeatureSplit( data )
### training-testing split needed in regression, just like classification
from sklearn.model_selection import train_test_split   # sklearn.cross_validation was removed in newer sklearn versions
feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.5, random_state=42)
train_color = "b"
test_color = "r"   # "r" rather than "b", so test points are distinguishable (see the comment below)
### Your regression goes here!
### Please name it reg, so that the plotting code below picks it up and
### plots it correctly. Don't forget to change the test_color above from "b" to
### "r" to differentiate training points from test points.
from sklearn import linear_model
reg = linear_model.LinearRegression()
reg.fit(feature_train, target_train)
print(reg.coef_)
print(reg.intercept_)
print(reg.score(feature_train, target_train))   # score on the training data
print(reg.score(feature_test, target_test))     # score on the test data
### draw the scatterplot, with color-coded training and testing points
import matplotlib.pyplot as plt
for feature, target in zip(feature_test, target_test):
    plt.scatter(feature, target, color=test_color)
for feature, target in zip(feature_train, target_train):
    plt.scatter(feature, target, color=train_color)
### labels for the legend
plt.scatter(feature_test[0], target_test[0], color=test_color, label="test")
plt.scatter(feature_train[0], target_train[0], color=train_color, label="train")
### draw the regression line, once it's coded
try:
    plt.plot(feature_test, reg.predict(feature_test))
except NameError:
    pass
plt.xlabel(features_list[1])
plt.ylabel(features_list[0])
plt.legend()
plt.show()
(figure: scatterplot of bonus vs. salary with the fitted regression line)
Regressing bonus against long-term incentive
We have many financial features available, and for predicting a person's bonus, some of them may be more powerful than the rest. For example, suppose you thought about the data and hypothesized that the "long_term_incentive" feature (awarded to employees who contribute to the company's long-term health) should be more closely tied to bonus than salary is.
One way to confirm your hypothesis is to regress bonus against long_term_incentive and see whether that regression scores significantly higher than regressing bonus against salary. What is the score on the test data when regressing bonus against long-term incentive? Change the feature list at the top of the script:
features_list = ["bonus", "long_term_incentive"]
(figure: scatterplot of bonus vs. long_term_incentive with the fitted regression line)
Outliers break regressions
This is a preview of the next lesson, on identifying and removing outliers. Go back to the earlier setup, where you used salary to predict bonus, and rerun the code to review the data. You may notice that a few points fall well outside the main trend: someone with a high salary (over 1 million dollars!) who received a relatively small bonus. This is an example of an outlier, and we will focus on outliers in the next lesson.
A point like this can have a big effect on a regression: if it falls in the training set, it can significantly change the slope/intercept; if it falls in the test set, it can make the score much lower than it would otherwise be. As things stand here, the point falls in the test set (and is probably dragging the score down).
Now we will draw two regression lines: one fit on the test data (which has the outlier) and one fit on the training data (which does not); see the sketch below. Look at the plot now: a big difference, right? A single outlier can cause a large change.
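One way to draw that second line (a sketch; place it just before the plt.xlabel call in the mini-project code):

```python
# Refit on the test set (which contains the outlier) and overlay the new line.
reg.fit(feature_test, target_test)
plt.plot(feature_train, reg.predict(feature_train), color="b")
```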
(figure: the two regression lines, with and without the outlier's influence)