
Udacity - Linear Regression

Author: 涂大宝 | Published 2017-12-02 23:06

    continuous supervised learning: supervised learning where the target output is a continuous variable

    regression: the core technique this lesson uses for continuous supervised learning

    continuous: the values have an inherent order and can be compared in magnitude

    1. Concept

    slope: the slope of the fitted line

    intercept: where the fitted line crosses the y-axis

    coefficient: the weight learned for each input feature
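
    These three quantities fit together in the equation of the fitted line; with a single input feature:

    y = slope * x + intercept

    (With several input features there is one coefficient per feature instead of a single slope; see section 8.)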

    2. Coding

    import numpy
    import matplotlib.pyplot as plt
    
    from ages_net_worths import ageNetWorthData
    
    ages_train, ages_test, net_worths_train, net_worths_test = ageNetWorthData()
    
    
    
    from sklearn.linear_model import LinearRegression
    
    reg = LinearRegression()
    reg.fit(ages_train, net_worths_train)
    
    ### get Katie's net worth (she's 27)
    ### sklearn predictions are returned in an array, so you'll want to index into
    ### the output to get what you want, e.g. net_worth = predict([[27]])[0][0] (not
    ### exact syntax, the point is the [0] at the end). In addition, make sure the
    ### argument to your prediction function is in the expected format - if you get
    ### a warning about needing a 2d array for your data, a list of lists will be
    ### interpreted by sklearn as such (e.g. [[27]]).
    km_net_worth = reg.predict([[27]])[0][0]
    ### get the slope
    ### again, you'll get a 2-D array, so stick the [0][0] at the end
    slope = reg.coef_[0][0]
    ### get the intercept
    ### here you get a 1-D array, so stick [0] on the end to access
    ### the info we want
    intercept = reg.intercept_[0]
    
    ### get the score on test data
    test_score = reg.score(ages_test, net_worths_test)
    
    ### get the score on the training data
    training_score = reg.score(ages_train, net_worths_train)
    
    
    def submitFit():
        # all of the values in the returned dictionary are expected to be
        # numbers for the purpose of the grader.
        return {"networth":km_net_worth,
                "slope":slope,
                "intercept":intercept,
                "stats on test":test_score,
                "stats on training": training_score}
    

    3. Linear regression errors

    The best linear regression is the one that minimizes the sum of squared errors (SSE) between its predictions and the observed values.

    4. Algorithms for minimizing the sum of squared errors

    ordinary least squares (OLS)
    gradient descent
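
    As a rough illustration of what OLS computes, here is a minimal numpy sketch of the closed-form (normal equation) solution for one feature; this is for intuition only, not how sklearn actually implements LinearRegression:

    import numpy as np

    def ols_fit(x, y):
        # closed-form OLS: minimizes the sum of squared errors
        X = np.column_stack([np.ones_like(x), x])   # prepend an intercept column
        # normal equation: beta = (X^T X)^{-1} X^T y
        beta = np.linalg.solve(X.T.dot(X), X.T.dot(y))
        intercept, slope = beta
        return slope, intercept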

    5. Problems with SSE

    The sum of squared errors (SSE) grows as more data points are added, even when the quality of the fit is unchanged, so it cannot be compared across datasets of different sizes.
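
    A quick illustration of the scaling problem (the residuals here are made up purely to show the effect):

    import numpy as np

    errors = np.array([1.0, -1.0, 0.5, -0.5])   # residuals of some fit
    print(np.sum(errors ** 2))                  # SSE = 2.5
    # duplicate every point: the fit is equally good, but SSE doubles
    print(np.sum(np.tile(errors, 2) ** 2))      # SSE = 5.0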

    6. The R-squared metric for regression

    0 < R^2 < 1: the closer to 1, the better the regression explains the variation in the data (on held-out data, sklearn's score can even go negative when the fit is very poor).
    Advantage: R^2 does not depend on the number of training points, so it is somewhat more reliable than SSE.
    In sklearn, use reg.score to get R^2.
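
    For reference, R^2 is computed from the SSE and the total sum of squares; a minimal sketch of the formula that reg.score evaluates for a regression:

    import numpy as np

    def r_squared(y_true, y_pred):
        sse = np.sum((y_true - y_pred) ** 2)             # residual sum of squares
        sst = np.sum((y_true - np.mean(y_true)) ** 2)    # total sum of squares
        return 1.0 - sse / sst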

    7. Comparing classification and regression

    Classification predicts a discrete label, finds a decision boundary, and is evaluated by accuracy; regression predicts a continuous number, finds a best-fit line, and is evaluated by SSE or R^2.

    [image: comparison chart of classification vs. regression]

    8. Multivariate regression

    With more than one input feature, the regression learns one coefficient per feature plus an intercept.

    [image: multivariate regression example]
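
    sklearn's LinearRegression takes any number of input features through the same fit/predict API; a minimal sketch on made-up data (the feature values below are purely illustrative):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # two features per sample, e.g. [age, IQ]; targets are made-up net worths
    X = np.array([[25, 100], [35, 110], [45, 105], [55, 120]])
    y = np.array([100.0, 200.0, 280.0, 420.0])

    reg = LinearRegression()
    reg.fit(X, y)
    print(reg.coef_)        # one coefficient per feature
    print(reg.intercept_)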

    9. Mini-project

    #!/usr/bin/python
    
    """
        Starter code for the regression mini-project.
        
        Loads up/formats a modified version of the dataset
        (why modified?  we've removed some trouble points
        that you'll find yourself removing in the outliers mini-project).
    
        Draws a little scatterplot of the training/testing data
    
        You fill in the regression code where indicated:
    """    
    
    
    import sys
    import pickle
    sys.path.append("../tools/")
    from feature_format import featureFormat, targetFeatureSplit
    dictionary = pickle.load( open("../final_project/final_project_dataset_modified.pkl", "rb") )  # "rb": pickles are binary files
    
    ### list the features you want to look at--first item in the 
    ### list will be the "target" feature
    features_list = ["bonus", "salary"]
    data = featureFormat( dictionary, features_list, remove_any_zeroes=True)
    target, features = targetFeatureSplit( data )
    
    ### training-testing split needed in regression, just like classification
    from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older sklearn versions
    feature_train, feature_test, target_train, target_test = train_test_split(features, target, test_size=0.5, random_state=42)
    train_color = "b"
    test_color = "r"   # test points in red, training points in blue
    
    
    
    ### Your regression goes here!
    ### Please name it reg, so that the plotting code below picks it up and 
    ### plots it correctly. The test_color above has already been changed from "b"
    ### to "r" so that test points are drawn in a different color from training points.
    from sklearn import linear_model
    reg = linear_model.LinearRegression()
    reg.fit(feature_train, target_train)
    print("slope:", reg.coef_)
    print("intercept:", reg.intercept_)
    print("score on training data:", reg.score(feature_train, target_train))
    print("score on test data:", reg.score(feature_test, target_test))
    
    
    
    ### draw the scatterplot, with color-coded training and testing points
    import matplotlib.pyplot as plt
    for feature, target in zip(feature_test, target_test):
        plt.scatter( feature, target, color=test_color ) 
    for feature, target in zip(feature_train, target_train):
        plt.scatter( feature, target, color=train_color ) 
    
    ### labels for the legend
    plt.scatter(feature_test[0], target_test[0], color=test_color, label="test")
    plt.scatter(feature_train[0], target_train[0], color=train_color, label="train")
    
    
    
    
    ### draw the regression line, once it's coded
    try:
        plt.plot( feature_test, reg.predict(feature_test) )
    except NameError:
        pass
    plt.xlabel(features_list[1])
    plt.ylabel(features_list[0])
    plt.legend()
    plt.show()
    
    [figure: scatterplot of salary (x) vs. bonus (y) with the fitted regression line]

    Regressing bonus against LTI

    We have many financial features available, and for predicting someone's bonus, some of them may be more powerful than the rest. For example, suppose you think about the data and hypothesize that the "long_term_incentive" feature (awarded to employees who contribute to the company's long-term health) is more closely tied to bonus than salary is.

    One way to test your hypothesis is to regress bonus against long-term incentive, then see whether that regression scores significantly higher than regressing bonus against salary. When regressing bonus against long-term incentive, what is the score on the test data?

    features_list = ["bonus", "long_term_incentive"]
    
    [figure: scatterplot of long_term_incentive (x) vs. bonus (y) with the fitted regression line]

    Outliers break regressions

    This is a preview of the next lesson, on identifying and removing outliers. Go back to the earlier setup where you used salary to predict bonus, and rerun the code to review the data. You may notice that a small number of data points fall outside the main trend: someone gets a high salary (over one million dollars!) but a relatively small bonus. This is an example of an outlier, and we will focus on outliers in the next lesson.

    A point like this can have a big effect on a regression: if it falls in the training set, it can significantly shift the slope/intercept; if it falls in the test set, it can make the score much lower than it would be otherwise. As things stand, this point falls in the test set (and is probably lowering the score).

    Now we will draw two regression lines: one fit on the test data (which contains the outlier) and one fit on the training data (which does not). Look at the plot now: quite a difference, right? A single outlier can cause a big change.
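
    A minimal sketch of how that second line can be added to the mini-project script (reusing reg, feature_train, feature_test, and target_test from above; place it just before plt.show()):

    # refit on the test data, which contains the outlier...
    reg.fit(feature_test, target_test)
    # ...and overlay the resulting line on the same scatterplot
    plt.plot(feature_train, reg.predict(feature_train), color="b")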

    [figure: the two regression lines, showing how a single outlier changes the fit]
