Machine Learning Assignment 2: Logistic Regression

Author: 天叔 | Published 2021-02-28 01:27

I. Algorithm Requirements

Given each student's scores on two exams, predict whether the student is admitted - that is, fit a classifier on two features.

• Problem 1
  Linear fit: predict admission.
• Problem 2
  Nonlinear fit: predict admission with regularized logistic regression.
  Working through these problems shows that the choice of fit depends on the characteristics of the input data. The general workflow for a machine learning problem:
1. Plot the input data to inspect the features (feasible only up to three dimensions; beyond that it can't be plotted).
2. Design the hypothesis function $h_\theta(x)$.
3. Derive the cost function $J(\theta)$; logistic regression uses the sigmoid function.
4. Derive the gradient (the partial derivatives) used for gradient descent.
5. Train on the data to solve for the vector $\theta$.
6. Plot the fitted $h_\theta(x)$, check for overfitting or underfitting, and tune parameters accordingly.

    二、线性逻辑回归

Prepare the data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
from sklearn.metrics import classification_report  # evaluation report
    
# Load the sample data
data = pd.read_csv('ex2data1.txt', names=['exam1', 'exam2', 'admitted'])
data.head()  # show the first five rows
data.describe()
    
# Plot the sample distribution
sns.set(context="notebook", style="darkgrid",
        palette=sns.color_palette("RdBu", 2))  # "RdBu" is a palette name; 2 picks two colors
sns.palplot(sns.color_palette("RdBu", n_colors=5))  # show the color strip
sns.lmplot(x='exam1', y='exam2', hue='admitted', data=data,
           height=6,
           fit_reg=False,
           scatter_kws={"s": 50}
          )
plt.show()  # inspect the data
    
(Figures: the sample data and its feature statistics; the color palette with n_colors=5; the sample distribution)

Define the helper functions that extract X and y

def get_X(df):
    """Read the feature columns.
    Uses concat to prepend an intercept column without side effects;
    not efficient for big datasets, though.
    """
    ones = pd.DataFrame({'ones': np.ones(len(df))})  # an m-row, 1-column DataFrame of ones
    data = pd.concat([ones, df], axis=1)  # merge by column (axis=1)
    return data.iloc[:, :-1].values  # returns an ndarray, not a matrix


def get_y(df):
    """Read the labels; assume the last column is the target."""
    return np.array(df.iloc[:, -1])  # df.iloc[:, -1] is the last column


def normalize_feature(df):
    """Feature scaling, applied along the columns of the DataFrame."""
    # apply passes each column to the lambda by default (axis=0)
    return df.apply(lambda column: (column - column.mean()) / column.std())

X = get_X(data)
print(X.shape)
y = get_y(data)
print(y.shape)
# (100, 3)  (100,)
    

Logistic regression is built on the sigmoid function

$g$ denotes a commonly used logistic function, the S-shaped sigmoid: $g(z)=\frac{1}{1+e^{-z}}$

Combining it with the linear model gives the hypothesis of the logistic regression model:

$$h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}$$

Implementing the sigmoid:

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))
    
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.plot(np.arange(-10, 10, step=0.01),
            sigmoid(np.arange(-10, 10, step=0.01)))
    ax.set_ylim((-0.1,1.1))
    ax.set_xlabel('z', fontsize=18)
    ax.set_ylabel('g(z)', fontsize=18)
    ax.set_title('sigmoid function', fontsize=18)
    plt.show()
    
(Figure: the sigmoid function)

Define the cost function

• maximizing $\ell(\theta)$ is the same as minimizing $-\ell(\theta)$
• so we choose $-\ell(\theta)$ as the cost function

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right] = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta(x^{(i)})\right) - \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]$$

Initialize theta

theta = np.zeros(3)  # X is (m, n), so theta is (n,)

def cost(theta, X, y):
    """The cost function is -l(theta), which we minimize."""
    return np.mean(-y * np.log(sigmoid(X @ theta)) - (1 - y) * np.log(1 - sigmoid(X @ theta)))

# X @ theta is equivalent to X.dot(theta)

cost(theta, X, y)
# initial cost: 0.6931471805599453
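As a quick sanity check (not in the original write-up): with $\theta = 0$ every prediction is $\operatorname{sigmoid}(0) = 0.5$, so the cost is $-\log 0.5 = \log 2 \approx 0.6931$, matching the value above.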
    

Define the gradient (for gradient descent)

• This is batch gradient descent.
• The partial derivative is
  $$\frac{\partial J(\theta)}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$$
  which vectorizes to $\frac{1}{m} X^T\left(\operatorname{sigmoid}(X\theta) - y\right)$.

def gradient(theta, X, y):
    """Just one batch gradient (the full gradient, not an update step)."""
    return (1 / len(X)) * X.T @ (sigmoid(X @ theta) - y)

gradient(theta, X, y)
# call once to inspect the gradient at the initial theta
# array([ -0.1       , -12.00921659, -11.26284221])
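As an optional sanity check (not part of the original assignment), the analytic gradient can be compared against a finite-difference approximation with scipy.optimize.check_grad; a small return value suggests the gradient is implemented correctly.

import scipy.optimize as opt

# check_grad compares the analytic gradient with a numerical one at theta
err = opt.check_grad(cost, gradient, theta, X, y)
print(err)  # should be tiny relative to the gradient's magnitude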
    

Fit the parameters

import scipy.optimize as opt

res = opt.minimize(fun=cost, x0=theta, args=(X, y), method='Newton-CG', jac=gradient)
# fun: the cost function to minimize; x0: the initial parameters; args: the training data;
# method: the optimization algorithm (Newton-CG here); jac: the function that returns the gradient
    
    print(res)
         fun: 0.20349770426553998
         jac: array([-2.85342794e-06, -3.50853296e-05, -1.62061639e-04])
     message: 'Optimization terminated successfully.'
        nfev: 71
        nhev: 0
         nit: 27
        njev: 178
      status: 0
     success: True
           x: array([-25.16557602,   0.20626565,   0.20150593])
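As a hedged cross-check (not in the original exercise), the same fit can be reproduced with scikit-learn; with a very weak penalty its coefficients should land close to res.x. The C value below is an assumption chosen to approximate an unregularized fit.

from sklearn.linear_model import LogisticRegression

# C is the inverse regularization strength; a large C approximates no penalty
clf = LogisticRegression(C=1e6, max_iter=10000)
clf.fit(X[:, 1:], y)  # drop the manual intercept column; sklearn fits its own
print(clf.intercept_, clf.coef_)  # expect values close to res.x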
    
    

Predict and validate on the training set

In real projects you should not predict and validate on the training set; cross-validation and the choice of test data deserve care of their own. Since this is an exercise, we keep it simple and just aim for the learning goal (a sketch of a proper split follows below).
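A minimal sketch of what a held-out split could look like, assuming scikit-learn's train_test_split (not done in this exercise):

from sklearn.model_selection import train_test_split

# hold out 30% of the data before fitting; evaluate only on the held-out part
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# fit on (X_train, y_train), then run classification_report on (X_test, y_test)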

def predict(x, theta):
    prob = sigmoid(x @ theta)
    # equivalent to checking x @ theta >= 0, since sigmoid(z) >= 0.5 exactly when z >= 0
    return (prob >= 0.5).astype(int)
    
    final_theta = res.x
    y_pred = predict(X, final_theta)
    print(classification_report(y, y_pred))
    
                  precision    recall  f1-score   support
    
               0       0.87      0.85      0.86        40
               1       0.90      0.92      0.91        60
    
        accuracy                           0.89       100
       macro avg       0.89      0.88      0.88       100
    weighted avg       0.89      0.89      0.89       100
    

Finding the decision boundary

    http://stats.stackexchange.com/questions/93569/why-is-logistic-regression-a-linear-classifier

$X\theta = 0$ (this is the line)

In other words, we solve the equation $\theta_0 + \theta_1 x + \theta_2 y = 0$.

print(res.x)  # this is the final theta
# [-25.16557602   0.20626565   0.20150593]

coef = -(res.x / res.x[2])  # coefficients of the line y = coef[0] + coef[1] * x
print(coef)
# [124.8875223   -1.02362075  -1.        ]

# from theta, generate a set of points that trace the fitted line
x = np.arange(130, step=0.1)
y = coef[0] + coef[1]*x
    
    data.describe()  # find the range of x and y
    
    sns.set(context="notebook", style="ticks", font_scale=1.5)
    
    sns.lmplot(x='exam1', y='exam2', hue='admitted', data=data,
               height=6,
               fit_reg=False, 
               scatter_kws={"s": 25}
              )
    
    plt.plot(x, y, 'grey')
    plt.xlim(0, 130)
    plt.ylim(0, 130)
    plt.title('Decision Boundary')
    plt.show()
    
(Figure: the decision boundary)

III. Nonlinear Logistic Regression

The basic workflow is the same as above; shared steps are not repeated.

Load the sample data

df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
df.head()

sns.set(context="notebook", style="ticks", font_scale=1.5)
sns.lmplot(x='test1', y='test2', hue='accepted', data=df,
           height=6,
           fit_reg=False,
           scatter_kws={"s": 50}
          )
plt.title('Regularized Logistic Regression')
plt.show()
    

The data for problem 2 is clearly not linearly separable, so a more complex polynomial fit is needed.


Feature mapping (polynomial expansion)

The logic to implement, as pseudocode:

for i in 0..power:
    for p in 0..i:
        output x^(i-p) * y^p
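For example, with power = 2 the pair $(x, y)$ maps to the six features $1, x, y, x^2, xy, y^2$; with power = 6 the mapping produces $\frac{(6+1)(6+2)}{2} = 28$ features, which matches the (118, 28) shape printed below.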
    


Expand the original two columns into n columns:

def feature_mapping(x, y, power, as_ndarray=False):
    """Return the mapped features as an ndarray or a DataFrame."""
    # Equivalent explicit loop version:
    # data = {}
    # for i in np.arange(power + 1):  # inclusive
    #     for p in np.arange(i + 1):
    #         data["f{}{}".format(i - p, p)] = np.power(x, i - p) * np.power(y, p)

    # Dict comprehension: each key "f{i-p}{p}" maps to the feature x^(i-p) * y^p
    data = {"f{}{}".format(i - p, p): np.power(x, i - p) * np.power(y, p)
                for i in np.arange(power + 1)
                for p in np.arange(i + 1)
            }

    if as_ndarray:
        return pd.DataFrame(data).values
    else:
        return pd.DataFrame(data)

x1 = np.array(df.test1)
x2 = np.array(df.test2)

data = feature_mapping(x1, x2, power=6)
print(data.shape)
data.head()
    

(Table: the expanded dataset, as printed by data.head())


Regularized cost

$$J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta(x^{(i)})\right)-\left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]+\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

theta = np.zeros(data.shape[1])
X = feature_mapping(x1, x2, power=6, as_ndarray=True)
print(X.shape)

y = get_y(df)
print(y.shape)

(118, 28)
(118,)

def regularized_cost(theta, X, y, l=1):
    """Regularized cost function; theta_0 (the intercept) is not penalized."""
    theta_j1_to_n = theta[1:]
    regularized_term = (l / (2 * len(X))) * np.power(theta_j1_to_n, 2).sum()

    return cost(theta, X, y) + regularized_term

regularized_cost(theta, X, y, l=1)
0.6931471805599454
# this is the same as the unregularized cost, because theta is initialized to zeros
    

Regularized gradient

$$\frac{\partial J(\theta)}{\partial \theta_j}=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}+\frac{\lambda}{m}\theta_j \quad \text{for } j\ge 1$$

def regularized_gradient(theta, X, y, l=1):
    """Still leave theta_0 alone."""
    theta_j1_to_n = theta[1:]
    regularized_theta = (l / len(X)) * theta_j1_to_n

    # prepend a 0 so the regularization term does not affect theta_0
    regularized_term = np.concatenate([np.array([0]), regularized_theta])

    return gradient(theta, X, y) + regularized_term

regularized_gradient(theta, X, y)
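The same finite-difference check sketched earlier applies here too; with the default l=1, both functions accept (X, y) as extra arguments:

# verify the regularized gradient against a numerical approximation
err = opt.check_grad(regularized_cost, regularized_gradient, theta, X, y)
print(err)  # should again be small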
    

Fit the parameters

    import scipy.optimize as opt
    
    print('init cost = {}'.format(regularized_cost(theta, X, y)))
    res = opt.minimize(fun=regularized_cost, x0=theta, args=(X, y), method='Newton-CG', jac=regularized_gradient)
    res
    init cost = 0.6931471805599454
    
         fun: 0.5290027297128722
         jac: array([ 4.64317436e-08,  1.04331373e-08, -3.61802419e-08, -3.44397841e-08,
            2.46408233e-08,  1.12246256e-08, -5.13021764e-09, -1.11582358e-08,
            1.23402171e-09,  2.62428040e-08, -3.94916642e-08,  5.21264511e-09,
           -8.98645551e-09,  1.04022216e-08,  3.24793498e-08, -5.20025250e-09,
           -5.87238749e-09,  7.46797548e-10, -4.52162093e-09, -1.47871366e-09,
            1.80405423e-08, -1.62958045e-08,  7.26901438e-10, -5.53694209e-09,
            3.57089897e-09, -3.35135784e-09,  5.51302048e-09,  2.59989451e-08])
     message: 'Optimization terminated successfully.'
        nfev: 7
        nhev: 0
         nit: 6
        njev: 55
      status: 0
     success: True
           x: array([ 1.27274054,  0.62527229,  1.18108684, -2.01996217, -0.91742229,
           -1.43166588,  0.1240061 , -0.36553467, -0.35724013, -0.1751284 ,
           -1.45815894, -0.050989  , -0.61555564, -0.27470555, -1.192815  ,
           -0.24218818, -0.20600633, -0.04473079, -0.27778484, -0.29537856,
           -0.45635706, -1.04320283,  0.02777156, -0.29243164,  0.01556705,
           -0.3273799 , -0.14388646, -0.92465161])
    

Predict

    final_theta = res.x
    y_pred = predict(X, final_theta)
    
    print(classification_report(y, y_pred))
                  precision    recall  f1-score   support
    
               0       0.90      0.75      0.82        60
               1       0.78      0.91      0.84        58
    
        accuracy                           0.83       118
       macro avg       0.84      0.83      0.83       118
    weighted avg       0.84      0.83      0.83       118
    
    

Draw the decision boundary for different values of $\lambda$ (the regularization weight, a constant)

We look for all x satisfying $X\theta = 0$:

• Instead of solving the polynomial equation, create a coordinate grid of $(x, y)$ that is dense enough, find all the points where $X\theta$ is close enough to 0, and plot them.

def draw_boundary(power, l):
    """
    power: polynomial power for the mapped features
    l: the lambda constant
    """
    density = 1000
    threshold = 2 * 10**-3

    final_theta = feature_mapped_logistic_regression(power, l)
    x, y = find_decision_boundary(density, power, final_theta, threshold)

    df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
    sns.lmplot(x='test1', y='test2', hue='accepted', data=df, height=6, fit_reg=False, scatter_kws={"s": 100})

    plt.scatter(x, y, c='red', s=10)
    plt.title('Decision boundary')
    plt.show()
    
def feature_mapped_logistic_regression(power, l):
    """For drawing purposes only; not a well-generalized logistic regression.
    power: int
        raise x1, x2 to this polynomial power
    l: int
        lambda constant for the regularization term
    """
    df = pd.read_csv('ex2data2.txt', names=['test1', 'test2', 'accepted'])
    x1 = np.array(df.test1)
    x2 = np.array(df.test2)
    y = get_y(df)

    X = feature_mapping(x1, x2, power, as_ndarray=True)
    theta = np.zeros(X.shape[1])

    res = opt.minimize(fun=regularized_cost,
                       x0=theta,
                       args=(X, y, l),
                       method='TNC',
                       jac=regularized_gradient)
    final_theta = res.x

    return final_theta
    
    
def find_decision_boundary(density, power, theta, threshold):
    """Find the grid points that lie close to the decision boundary."""
    t1 = np.linspace(-1, 1.5, density)
    t2 = np.linspace(-1, 1.5, density)

    coordinates = [(x, y) for x in t1 for y in t2]
    # zip(*coordinates) unpacks the list of pairs into two tuples
    x_cord, y_cord = zip(*coordinates)
    mapped_cord = feature_mapping(x_cord, y_cord, power)  # this is a DataFrame

    inner_product = mapped_cord.values @ theta

    # keep the points where X @ theta is close enough to zero
    decision = mapped_cord[np.abs(inner_product) < threshold]

    return decision.f10, decision.f01  # f10 is x, f01 is y
    

Try several values of $\lambda$

draw_boundary(power=6, l=1)  # also try lambda = 0, 10, 100; a sweep sketch follows below

(Figures: $\lambda = 1$ slightly underfits; $\lambda = 0$ overfits; $\lambda = 100$ underfits; $\lambda = 10$ fits about right)
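A minimal sweep over the four values discussed above:

# draw the boundary for each lambda in turn
for l in (0, 1, 10, 100):
    draw_boundary(power=6, l=l)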


References:

https://github.com/fengdu78/Coursera-ML-AndrewNg-Notes
matplotlib.pyplot style beautification
matplotlib beautification guide
Seaborn 0.9 documentation (Chinese)
Logistic regression: the penalty & C parameters of LogisticRegression
Coursera-ML-AndrewNg assignments
