Machine Learning Assignment 2: Logistic Regression

Author: 天叔 | Published 2021-02-28 01:27

I. Algorithm Requirements

Given each student's scores on two exams, predict whether the student is admitted - that is, fit a classifier on two features.

• Problem 1
  Linear fit: predict admission.
• Problem 2
  Nonlinear fit: predict admission with regularized logistic regression.
  Working through these problems shows that the choice of fit depends on the characteristics of the input data. The general workflow for a machine learning problem:
1. Plot the input data to inspect the features (feasible only up to three dimensions; beyond that it can't be plotted).
2. Design the hypothesis function $h_\theta(x)$.
3. Derive the cost function $J(\theta)$; logistic regression uses the sigmoid function.
4. Derive the gradient (the partial derivatives) used for gradient descent.
5. Train on the data to solve for the vector $\theta$.
6. Plot the fitted $h_\theta(x)$, check for overfitting or underfitting, and tune parameters accordingly.

    二、线性逻辑回归

Prepare the data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
from sklearn.metrics import classification_report  # evaluation report
    
# Load the sample data
data = pd.read_csv('ex2data1.txt', names=['exam1', 'exam2', 'admitted'])
data.head()  # show the first five rows
data.describe()
    
# Plot the sample distribution
sns.set(context="notebook", style="darkgrid",
        palette=sns.color_palette("RdBu", 2))  # "RdBu" is a palette name; 2 picks two colors
sns.palplot(sns.color_palette("RdBu", n_colors=5))  # show the color strip
sns.lmplot(x='exam1', y='exam2', hue='admitted', data=data,
           height=6,
           fit_reg=False,
           scatter_kws={"s": 50}
          )
plt.show()  # inspect the data
    
(Figures: the sample data and its feature statistics; the color palette with n_colors=5; the sample distribution)

Define the helper functions that extract X and y

def get_X(df):
    """Read the feature columns.
    Uses concat to prepend an intercept column without side effects;
    not efficient for big datasets, though.
    """
    ones = pd.DataFrame({'ones': np.ones(len(df))})  # an m-row, 1-column DataFrame of ones
    data = pd.concat([ones, df], axis=1)  # merge by column (axis=1)
    return data.iloc[:, :-1].values  # returns an ndarray, not a matrix


def get_y(df):
    """Read the labels; assume the last column is the target."""
    return np.array(df.iloc[:, -1])  # df.iloc[:, -1] is the last column


def normalize_feature(df):
    """Feature scaling, applied along the columns of the DataFrame."""
    # apply passes each column to the lambda by default (axis=0)
    return df.apply(lambda column: (column - column.mean()) / column.std())

X = get_X(data)
print(X.shape)
y = get_y(data)
print(y.shape)
# (100, 3)  (100,)
    

Logistic regression is built on the sigmoid function

$g$ denotes a commonly used logistic function, the S-shaped sigmoid: $g(z)=\frac{1}{1+e^{-z}}$

Combining it with the linear model gives the hypothesis of the logistic regression model:

$$h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$$

Implementing the sigmoid:

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))
    
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.plot(np.arange(-10, 10, step=0.01),
            sigmoid(np.arange(-10, 10, step=0.01)))
    ax.set_ylim((-0.1,1.1))
    ax.set_xlabel('z', fontsize=18)
    ax.set_ylabel('g(z)', fontsize=18)
    ax.set_title('sigmoid function', fontsize=18)
    plt.show()
    
(Figure: the sigmoid function)

Define the cost function

• maximizing $\ell(\theta)$ is the same as minimizing $-\ell(\theta)$
• so we choose $-\ell(\theta)$ as the cost function

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right] = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta(x^{(i)})\right) - \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]$$

Initialize theta

theta = np.zeros(3)  # X is (m, n), so theta is (n,)

def cost(theta, X, y):
    """The cost function is -l(theta), which we minimize."""
    return np.mean(-y * np.log(sigmoid(X @ theta)) - (1 - y) * np.log(1 - sigmoid(X @ theta)))

# X @ theta is equivalent to X.dot(theta)

cost(theta, X, y)
# initial cost: 0.6931471805599453
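As a quick sanity check (not in the original write-up): with $\theta = 0$ every prediction is $\operatorname{sigmoid}(0) = 0.5$, so the cost is $-\log 0.5 = \log 2 \approx 0.6931$, matching the value above.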
    

Define the gradient (for gradient descent)

• This is batch gradient descent.
• The partial derivative is
  $$\frac{\partial J(\theta)}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$$
  which vectorizes to $\frac{1}{m} X^T\left(\operatorname{sigmoid}(X\theta) - y\right)$.

def gradient(theta, X, y):
    """Just one batch gradient (the full gradient, not an update step)."""
    return (1 / len(X)) * X.T @ (sigmoid(X @ theta) - y)

gradient(theta, X, y)
# call once to inspect the gradient at the initial theta
# array([ -0.1       , -12.00921659, -11.26284221])
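As an optional sanity check (not part of the original assignment), the analytic gradient can be compared against a finite-difference approximation with scipy.optimize.check_grad; a small return value suggests the gradient is implemented correctly.

import scipy.optimize as opt

# check_grad compares the analytic gradient with a numerical one at theta
err = opt.check_grad(cost, gradient, theta, X, y)
print(err)  # should be tiny relative to the gradient's magnitude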
    

Fit the parameters

import scipy.optimize as opt

res = opt.minimize(fun=cost, x0=theta, args=(X, y), method='Newton-CG', jac=gradient)
# fun: the cost function to minimize; x0: the initial parameters; args: the training data;
# method: the optimization algorithm (Newton-CG here); jac: the function that returns the gradient
    
    print(res)
         fun: 0.20349770426553998
         jac: array([-2.85342794e-06, -3.50853296e-05, -1.62061639e-04])
     message: 'Optimization terminated successfully.'
        nfev: 71
        nhev: 0
         nit: 27
        njev: 178
      status: 0
     success: True
           x: array([-25.16557602,   0.20626565,   0.20150593])
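As a hedged cross-check (not in the original exercise), the same fit can be reproduced with scikit-learn; with a very weak penalty its coefficients should land close to res.x. The C value below is an assumption chosen to approximate an unregularized fit.

from sklearn.linear_model import LogisticRegression

# C is the inverse regularization strength; a large C approximates no penalty
clf = LogisticRegression(C=1e6, max_iter=10000)
clf.fit(X[:, 1:], y)  # drop the manual intercept column; sklearn fits its own
print(clf.intercept_, clf.coef_)  # expect values close to res.x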
    
    

Predict and validate on the training set

In real projects you should not predict and validate on the training set; cross-validation and the choice of test data deserve care of their own. Since this is an exercise, we keep it simple and just aim for the learning goal (a sketch of a proper split follows below).
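A minimal sketch of what a held-out split could look like, assuming scikit-learn's train_test_split (not done in this exercise):

from sklearn.model_selection import train_test_split

# hold out 30% of the data before fitting; evaluate only on the held-out part
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# fit on (X_train, y_train), then run classification_report on (X_test, y_test)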

def predict(x, theta):
    prob = sigmoid(x @ theta)
    # equivalent to checking x @ theta >= 0, since sigmoid(z) >= 0.5 exactly when z >= 0
    return (prob >= 0.5).astype(int)
    
    final_theta = res.x
    y_pred = predict(X, final_theta)
    print(classification_report(y, y_pred))
    
                  precision    recall  f1-score   support
    
               0       0.87      0.85      0.86        40
               1       0.90      0.92      0.91        60
    
        accuracy                           0.89       100
       macro avg       0.89      0.88      0.88       100
    weighted avg       0.89      0.89      0.89       100
    

Finding the decision boundary

    http://stats.stackexchange.com/questions/93569/why-is-logistic-regression-a-linear-classifier

$X\theta = 0$ (this is the line)

In other words, we solve the equation $\theta_0 + \theta_1 x + \theta_2 y = 0$.

print(res.x)  # this is the final theta
# [-25.16557602   0.20626565   0.20150593]

coef = -(res.x / res.x[2])  # coefficients of the line y = coef[0] + coef[1] * x
print(coef)
# [124.8875223   -1.02362075  -1.        ]

# from theta, generate a set of points that trace the fitted line
x = np.arange(130, step=0.1)
y = coef[0] + coef[1]*x
    
    data.describe()  # find the range of x and y
    
    sns.set(context="notebook", style="ticks", font_scale=1.5)
    
    sns.lmplot(x='exam1', y='exam2', hue='admitted', data=data,
               height=6,
               fit_reg=False, 
               scatter_kws={"s": 25}
              )
    
    plt.plot(x, y, 'grey')
    plt.xlim(0, 130)
    plt.ylim(0, 130)
    plt.title('Decision Boundary')
    plt.show()
    
(Figure: the decision boundary)

III. Nonlinear Logistic Regression

The basic workflow is the same as above; shared steps are not repeated.

Load the sample data

df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
df.head()

sns.set(context="notebook", style="ticks", font_scale=1.5)
sns.lmplot(x='test1', y='test2', hue='accepted', data=df,
           height=6,
           fit_reg=False,
           scatter_kws={"s": 50}
          )
plt.title('Regularized Logistic Regression')
plt.show()
    

The data for problem 2 is clearly not linearly separable, so a more complex polynomial fit is needed.


Feature mapping (polynomial expansion)

The logic to implement, as pseudocode:

for i in 0..power:
    for p in 0..i:
        output x^(i-p) * y^p
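For example, with power = 2 the pair $(x, y)$ maps to the six features $1, x, y, x^2, xy, y^2$; with power = 6 the mapping produces $\frac{(6+1)(6+2)}{2} = 28$ features, which matches the (118, 28) shape printed below.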
    


Expand the original two columns into n columns:

def feature_mapping(x, y, power, as_ndarray=False):
    """Return the mapped features as an ndarray or a DataFrame."""
    # Equivalent explicit loop version:
    # data = {}
    # for i in np.arange(power + 1):  # inclusive
    #     for p in np.arange(i + 1):
    #         data["f{}{}".format(i - p, p)] = np.power(x, i - p) * np.power(y, p)

    # Dict comprehension: each key "f{i-p}{p}" maps to the feature x^(i-p) * y^p
    data = {"f{}{}".format(i - p, p): np.power(x, i - p) * np.power(y, p)
                for i in np.arange(power + 1)
                for p in np.arange(i + 1)
            }

    if as_ndarray:
        return pd.DataFrame(data).values
    else:
        return pd.DataFrame(data)

x1 = np.array(df.test1)
x2 = np.array(df.test2)

data = feature_mapping(x1, x2, power=6)
print(data.shape)
data.head()
    

(Table: the expanded dataset, as printed by data.head())


Regularized cost

$$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta(x^{(i)})\right)-\left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

theta = np.zeros(data.shape[1])
X = feature_mapping(x1, x2, power=6, as_ndarray=True)
print(X.shape)

y = get_y(df)
print(y.shape)

(118, 28)
(118,)

def regularized_cost(theta, X, y, l=1):
    """Regularized cost function; theta_0 (the intercept) is not penalized."""
    theta_j1_to_n = theta[1:]
    regularized_term = (l / (2 * len(X))) * np.power(theta_j1_to_n, 2).sum()

    return cost(theta, X, y) + regularized_term

regularized_cost(theta, X, y, l=1)
0.6931471805599454
# this is the same as the unregularized cost, because theta is initialized to zeros
    

Regularized gradient

$$\frac{\partial J(\theta)}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j \quad \text{for } j\ge 1$$

def regularized_gradient(theta, X, y, l=1):
    """Still leave theta_0 alone."""
    theta_j1_to_n = theta[1:]
    regularized_theta = (l / len(X)) * theta_j1_to_n

    # prepend a 0 so the regularization term does not affect theta_0
    regularized_term = np.concatenate([np.array([0]), regularized_theta])

    return gradient(theta, X, y) + regularized_term

regularized_gradient(theta, X, y)
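The same finite-difference check sketched earlier applies here too; with the default l=1, both functions accept (X, y) as extra arguments:

# verify the regularized gradient against a numerical approximation
err = opt.check_grad(regularized_cost, regularized_gradient, theta, X, y)
print(err)  # should again be small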
    

Fit the parameters

    import scipy.optimize as opt
    
    print('init cost = {}'.format(regularized_cost(theta, X, y)))
    res = opt.minimize(fun=regularized_cost, x0=theta, args=(X, y), method='Newton-CG', jac=regularized_gradient)
    res
    init cost = 0.6931471805599454
    
         fun: 0.5290027297128722
         jac: array([ 4.64317436e-08,  1.04331373e-08, -3.61802419e-08, -3.44397841e-08,
            2.46408233e-08,  1.12246256e-08, -5.13021764e-09, -1.11582358e-08,
            1.23402171e-09,  2.62428040e-08, -3.94916642e-08,  5.21264511e-09,
           -8.98645551e-09,  1.04022216e-08,  3.24793498e-08, -5.20025250e-09,
           -5.87238749e-09,  7.46797548e-10, -4.52162093e-09, -1.47871366e-09,
            1.80405423e-08, -1.62958045e-08,  7.26901438e-10, -5.53694209e-09,
            3.57089897e-09, -3.35135784e-09,  5.51302048e-09,  2.59989451e-08])
     message: 'Optimization terminated successfully.'
        nfev: 7
        nhev: 0
         nit: 6
        njev: 55
      status: 0
     success: True
           x: array([ 1.27274054,  0.62527229,  1.18108684, -2.01996217, -0.91742229,
           -1.43166588,  0.1240061 , -0.36553467, -0.35724013, -0.1751284 ,
           -1.45815894, -0.050989  , -0.61555564, -0.27470555, -1.192815  ,
           -0.24218818, -0.20600633, -0.04473079, -0.27778484, -0.29537856,
           -0.45635706, -1.04320283,  0.02777156, -0.29243164,  0.01556705,
           -0.3273799 , -0.14388646, -0.92465161])
    

Predict

    final_theta = res.x
    y_pred = predict(X, final_theta)
    
    print(classification_report(y, y_pred))
                  precision    recall  f1-score   support
    
               0       0.90      0.75      0.82        60
               1       0.78      0.91      0.84        58
    
        accuracy                           0.83       118
       macro avg       0.84      0.83      0.83       118
    weighted avg       0.84      0.83      0.83       118
    
    

Draw the decision boundary for different values of $\lambda$ (the regularization weight, a constant)

We look for all x satisfying $X\theta = 0$:

• Instead of solving the polynomial equation, create a coordinate grid of $(x, y)$ that is dense enough, find all the points where $X\theta$ is close enough to 0, and plot them.

def draw_boundary(power, l):
    """
    power: polynomial power for the mapped features
    l: the lambda constant
    """
    density = 1000
    threshold = 2 * 10**-3

    final_theta = feature_mapped_logistic_regression(power, l)
    x, y = find_decision_boundary(density, power, final_theta, threshold)

    df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
    sns.lmplot(x='test1', y='test2', hue='accepted', data=df, height=6, fit_reg=False, scatter_kws={"s": 100})

    plt.scatter(x, y, c='red', s=10)
    plt.title('Decision boundary')
    plt.show()
    
def feature_mapped_logistic_regression(power, l):
    """For drawing purposes only; not a well-generalized logistic regression.
    power: int
        raise x1, x2 to this polynomial power
    l: int
        lambda constant for the regularization term
    """
    df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
    x1 = np.array(df.test1)
    x2 = np.array(df.test2)
    y = get_y(df)

    X = feature_mapping(x1, x2, power, as_ndarray=True)
    theta = np.zeros(X.shape[1])

    res = opt.minimize(fun=regularized_cost,
                       x0=theta,
                       args=(X, y, l),
                       method='TNC',
                       jac=regularized_gradient)
    final_theta = res.x

    return final_theta
    
    
def find_decision_boundary(density, power, theta, threshold):
    """Find the grid points that lie close to the decision boundary."""
    t1 = np.linspace(-1, 1.5, density)
    t2 = np.linspace(-1, 1.5, density)

    coordinates = [(x, y) for x in t1 for y in t2]
    # zip(*coordinates) unpacks the list of pairs into two tuples
    x_cord, y_cord = zip(*coordinates)
    mapped_cord = feature_mapping(x_cord, y_cord, power)  # this is a DataFrame

    inner_product = mapped_cord.values @ theta

    # keep the points where X @ theta is close enough to zero
    decision = mapped_cord[np.abs(inner_product) < threshold]

    return decision.f10, decision.f01  # f10 is x, f01 is y
    

Try several values of $\lambda$

draw_boundary(power=6, l=1)  # also try lambda = 0, 10, 100; a sweep sketch follows below

(Figures: $\lambda = 1$ slightly underfits; $\lambda = 0$ overfits; $\lambda = 100$ underfits; $\lambda = 10$ fits about right)
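A minimal sweep over the four values discussed above:

# draw the boundary for each lambda in turn
for l in (0, 1, 10, 100):
    draw_boundary(power=6, l=l)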


References:

https://github.com/fengdu78/Coursera-ML-AndrewNg-Notes
matplotlib.pyplot style beautification
matplotlib beautification guide
Seaborn 0.9 documentation (Chinese)
Logistic regression: the penalty & C parameters of LogisticRegression
Coursera-ML-AndrewNg assignments
