线性模型

作者: 单调不减 | 来源:发表于2019-07-26 00:27 被阅读0次

这一篇的内容旨在对之前学习过的四种不同线性分类模型（Logistic 回归、softmax回归、感知器和支持向量机）做一个整理。这些模型的区别主要在于使用了不同的损失函数。

线性模型（Linear Model）是机器学习中应用最广泛的模型，指通过样本特征的线性组合来进行预测的模型。给定一个 $d$ 维样本 $[x_1,\dots, x_d]^T$ ，其线性组合函数为：

$\begin{aligned} f(\mathrm{x} ; \mathrm{w}) &=w_{1} x_{1}+w_{2} x_{2}+\cdots+w_{d} x_{d}+b \\ &=\mathrm{w}^{\mathrm{T}} \mathrm{x}+b \end{aligned}$

其中 $w=\left[w_{1}, \cdots, w_{d}\right]^{\mathrm{T}}$ 为 $d$ 维的权重向量， $b$ 为偏置。线性回归就是典型的线性模型，直接用 $f(x;w)$ 来预测输出目标 $y = f(x;w)$ 。

在分类问题中，由于输出 $y$ 是一些离散的标签， $f(x;w)$ 的值域为实数，因此无法直接用 $f(x;w)$ 来进行预测，需要引入一个非线性的决策函数（Decision Function） $g(.)$ 来预测输出目标：

$y=g(f(\mathbf{x} ; \mathbf{w}))$

其中 $f(x;w)$ 也称为判别函数（Discriminant Function）。

也就是说，一个线性分类模型（Linear Classification Model）或线性分类器（Linear Classifier），是由一个（或多个）线性的判别函数 $f(x;w) =w^Tx + b$ 和非线性的决策函数 $g(·)$ 组成。

1、二分类与多分类

二分类（Binary Classification）的类标签 $y$ 只有两种取值，通常可设为 $\{ +1,−1\}$ 。

在二分类中我们只需要一个线性判别函数 $f(\mathrm{x} ; \mathrm{w})=\mathrm{w}^{\mathrm{T}} \mathrm{x}+b$ 。所有满足 $f(x;w) = 0$ 的点组成一个分割超平面（Hyperplane），称为决策边界（Decision Boundary）或决策平面（Decision Surface）。决策边界将特征空间一分为二，划分成两个区域，每个区域对应一个类别。

线性模型试图学习到参数 $w^∗$ ，使得对于每个样本 $(x^{(n)},y^{(n)})$ 尽量满足：

$\begin{array}{ll}{f\left(\mathbf{x}^{(n)} ; \mathbf{w}^{*}\right)>0} & {\text { if } \quad y^{(n)}=1} \\ {f\left(\mathbf{x}^{(n)} ; \mathbf{w}^{*}\right)<0} & {\text { if } \quad y^{(n)}=-1}\end{array}$

上面两个公式也可以合并，即参数 $w^∗$ 尽量满足：

$y^{(n)} f\left(\mathrm{x}^{(n)} ; \mathrm{w}^{*}\right)>0, \quad \forall n \in[1, N]$

接下来我们会看到，对线性模型来说， $y^{(n)} f\left(\mathrm{x}^{(n)} ; \mathrm{w}^{*}\right)$ 这一表达式是十分重要的。

于是我们可以定义：对于训练集 $\mathcal{D}=\left\{\left(\mathbf{x}^{(n)}, y^{(n)}\right)\right\}_{n=1}^{N}$ ，若存在权重向量 $w^∗$ ，对所有样本都满足 $yf(x;w^∗) > 0$ ，那么训练集 $D$ 是线性可分的。

接下来我们考虑多分类的情形，假设一个多类分类问题的类别为 $\{1,2, \cdots, C \}$ ，常用的方式有以下三种：

“一对其余”：把多类分类问题转换为 $C$ 个“一对其余”的两类分类问题。这种方式共需要 $C$ 个判别函数，其中第 $c$ 个判别函数 $f_c$ 是将类 $c$ 的样本和不属于类 $c$ 的样本分开。
“一对一”：把多类分类问题转换为 $C(C − 1)/2$ 个“一对一”的两类分类问题。这种方式共需要 $C(C − 1)/2$ 个判别函数，其中第 $(i, j)$ 个判别函数是把类 $i$ 和类 $j$ 的样本分开。
“argmax”：这是一种改进的“一对其余”方式，共需要 $C$ 个判别函数：

$f_{c}\left(\mathrm{x} ; \mathrm{w}_{c}\right)=\mathrm{w}_{c}^{\mathrm{T}} \mathrm{x}+b_{c}, \quad c=[1, \cdots, C]$

若存在类别 $c$ ，对所有的其他类别 $\tilde{c}(\tilde{c} \neq c)$ ，都满足 $f_{c}\left(\mathbf{x} ; \mathbf{w}_{c}\right)>f_{\tilde{c}}\left(\mathbf{x}, \mathbf{w}_{\tilde{c}}\right)$ ，那么 $x$ 属于类别 $c$ 。即：

$y=\underset{\ \ \ \ \ \ \ c}{\arg \max } f_{c}\left(\mathbf{x} ; \mathbf{w}_{c}\right)$

“一对其余”和“一对一”都存在一个缺陷：特征空间中会存在一些难以确定类别的区域，而“argmax”方式很好地解决了这个问题。下图给出了用这三种方式进行三类分类的示例，其中不同颜色的区域表示预测的类别，红色直线表示判别函数 $f(·) = 0$ 的直线。在“argmax”方式中，类 $i$ 和类 $j$ 的决策边界实际上由 $f_{i}\left(\mathbf{x}, \mathbf{w}_{i}\right)-f_{j}\left(\mathbf{x}, \mathbf{w}_{j}\right)=0$ 决定，其法向量为 $w_i − w_j$ 。

2、Logistic回归

在本节中我们采用 $y ∈ \{0, 1 \}$ 以符合Logistic 回归的描述习惯。

为解决连续线性函数不适合进行分类的问题，我们引入非线性函数 $g:\mathbb{R}^{d} \rightarrow (0,1)$ 来预测类别标签的后验概率 $p(y = 1|x)$ ：

$p(y=1 | \mathbf{x})=g(f(\mathbf{x} ; \mathbf{w}))$

其中 $g(·)$ 通常称为激活函数（Activation Function），其作用是把线性函数的值域从实数区间“挤压”到了 $(0, 1)$ 之间，可以用来表示概率。

在Logistic回归中，我们使用Logistic函数来作为激活函数。标签 $y = 1$ 的后验概率为：

$\begin{aligned} p(y=1 | \mathrm{x}) &=\sigma\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}\right) \\ & \triangleq \frac{1}{1+\exp \left(-\mathbf{w}^{\mathrm{T}} \mathbf{x}\right)} \end{aligned}$

标签 $y = 0$ 的后验概率为：

$\begin{aligned} p(y=0 | \mathbf{x}) &=1-p(y=1 | \mathbf{x}) \\ &=\frac{\exp \left(-\mathbf{w}^{\mathrm{T}} \mathbf{x}\right)}{1+\exp \left(-\mathbf{w}^{\mathrm{T}} \mathbf{x}\right)} \end{aligned}$

进行变换后得到：

$\begin{aligned} \mathbf{w}^{\mathrm{T}} \mathbf{x} &=\log \frac{p(y=1 | \mathbf{x})}{1-p(y=1 | \mathbf{x})} \\ &=\log \frac{p(y=1 | \mathbf{x})}{p(y=0 | \mathbf{x})} \end{aligned}$

其中 $\frac{p(y=1 | \mathbf{x})}{p(y=0 | \mathbf{x})}$ 为样本 $x$ 为正反例后验概率的比值，称为几率（Odds），几率的对数称为对数几率。因此，Logistic 回归也称为对数几率回归（Logit Regression）。

Logistic回归采用交叉熵作为损失函数，并使用梯度下降法来对参数进行优化。其风险函数为：

$\mathcal{R}(\mathbf{w})=-\frac{1}{N} \sum_{n=1}^{N}\left(p_{r}\left(y^{(n)}=1 | \mathbf{x}^{(n)}\right) \log \hat{y}^{(n)}+p_{r}\left(y^{(n)}=0 | \mathbf{x}^{(n)}\right) \log \left(1-\hat{y}^{(n)}\right)\right)$

风险函数 $R(w)$ 是关于参数 $w$ 的连续可导的凸函数。因此Logistic回归除了梯度下降法之外还可以用高阶的优化方法，比如牛顿法，来进行优化。

3、Softmax回归

Softmax回归，也称为多项（multinomial）或多类（multi-class）的Logistic回归，是Logistic回归在多类分类问题上的推广。

对于多类问题，类别标签 $y \in\{1,2, \cdots, C\}$ 可以有 $C$ 个取值。给定一个样本x，Softmax回归预测的属于类别 $c$ 的条件概率为：

$\begin{aligned} p(y=c | \mathbf{x}) &=\operatorname{softmax}\left(\mathbf{w}_{c}^{\mathrm{T}} \mathbf{x}\right) \\ &=\frac{\exp \left(\mathbf{w}_{c}^{\mathrm{T}} \mathbf{x}\right)}{\sum_{c^{\prime}=1}^{C} \exp \left(\mathbf{w}_{c^{\prime}}^{\mathrm{T}} \mathbf{x}\right)}\\ \end{aligned}$

其中 $w_c$ 是第 $c$ 类的权重向量。

Softmax回归的决策函数可以表示为：

$\begin{aligned} \hat{y} &=\underset{c=1}{\arg \max } p(y=c | \mathbf{x}) \\ &=\underset{c=1}{\arg \max } \mathbf{w}_{c}^{\mathrm{T}} \mathbf{x}\\ \end{aligned}$

给定 $N$ 个训练样本 $\left\{\left(\mathbf{x}^{(n)}, y^{(n)}\right)\right\}_{n=1}^{N}$ ，Softmax回归使用交叉熵损失函数来学习最优的参数矩阵 $W$ 。

采用交叉熵损失函数，Softmax回归模型的风险函数为：

$\begin{aligned} \mathcal{R}(W) &=-\frac{1}{N} \sum_{n=1}^{N} \sum_{c=1}^{C} \mathbf{y}_{c}^{(n)} \log \hat{\mathbf{y}}_{c}^{(n)} \\ &=-\frac{1}{N} \sum_{n=1}^{N}\left(\mathbf{y}^{(n)}\right)^{\mathrm{T}} \log \hat{\mathbf{y}}^{(n)} \end{aligned}$

风险函数 $R(W)$ 关于 $W$ 的梯度为：

$\frac{\partial \mathcal{R}(W)}{\partial W}=-\frac{1}{N} \sum_{n=1}^{N} \mathrm{x}^{(n)}\left(\mathrm{y}^{(n)}-\hat{\mathrm{y}}^{(n)}\right)^{\mathrm{T}}$

采用梯度下降法，Softmax回归的训练过程为：

$W_{t+1} \leftarrow W_{t}+\alpha\left(\frac{1}{N} \sum_{n=1}^{N} \mathrm{x}^{(n)}\left(\mathrm{y}^{(n)}-\hat{\mathrm{y}}_{W_{t}}^{(n)}\right)^{\mathrm{T}}\right)$

4、感知器

感知器可谓是最简单的人工神经网络，只有一个神经元。

感知器是一种简单的两类线性分类模型，其分类准则为：

$\hat{y}=\operatorname{sgn}\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}\right)$

给定N 个样本的训练集： $\left\{\left(\mathrm{x}^{(n)}, y^{(n)}\right)\right\}_{n=1}^{N}$ ，其中 $y^{(n)} \in\{+1,-1 \}$ ，感知器学习算法试图找到一组参数 $w^∗$ ，使得对于每个样本 $\left(\mathbf{x}^{(n)}, y^{(n)}\right)$ 有

$y^{(n)} \mathbf{w}^{* \mathrm{T}} \mathbf{x}^{(n)}>0, \quad \forall n \in[1, N]$

感知器的学习算法是一种错误驱动的在线学习算法，先初始化一个权重向量 $w ← 0$ （通常是全零向量），然后每次分错一个样本 $(x, y)$ 时，即 $y \mathbf{w}^{\mathrm{T}} \mathbf{x}<0$ ，就用这个样本来更新权重：

$\mathbf{w} \leftarrow \mathbf{w}+y \mathbf{x}$

根据感知器的学习策略，可以反推出感知器的损失函数为：

$\mathcal{L}(\mathbf{w} ; \mathbf{x}, y)=\max \left(0,-y \mathbf{w}^{\mathrm{T}} \mathbf{x}\right)$

采用随机梯度下降，其每次更新的梯度为：

$\frac{\partial \mathcal{L}(\mathbf{w} ; \mathbf{x}, y)}{\partial \mathbf{w}}=\left\{\begin{array}{ll}{0} & {\text { if }\quad y \mathbf{w}^{\mathrm{T}} \mathbf{x}>0} \\ {-y \mathbf{x}} & {\text { if } \quad y \mathbf{w}^{\mathrm{T}} \mathbf{x}<0}\end{array}\right.$

感知器收敛性不再证明，参考之前的笔记https://www.jianshu.com/p/d83aa6c8068f。

感知器的学习到的权重向量和训练样本的顺序相关。在迭代次序上排在后面的错误样本，比前面的错误样本对最终的权重向量影响更大。比如有1000个训练样本，在迭代100个样本后，感知器已经学习到一个很好的权重向量。在接下来的899个样本上都预测正确，也没有更新权重向量。但是在最后第1000 个样本时预测错误，并更新了权重。这次更新可能反而使得权重向量变差。

为了改善这种情况，可以使用“参数平均”的策略来提高感知器的鲁棒性，也叫投票感知器（投票感知器是一种集成模型）。

投票感知器记录第 $k$ 次更新后得到的权重 $w_k$ 在之后的训练过程中正确分类样本的次数 $c_k$ 。这样最后的分类器形式为：

$\hat{y}=\operatorname{sgn}\left(\sum_{k=1}^{K} c_{k} \operatorname{sgn}\left(\mathbf{w}_{k}^{\mathrm{T}} \mathbf{x}\right)\right)$

投票感知器虽然提高了感知器的泛化能力，但是需要保存 $K$ 个权重向量。在实际操作中会带来额外的开销。因此，人们经常会使用一个简化的版本，也叫做平均感知器：

$\begin{aligned} \hat{y} &=\operatorname{sgn}\left(\sum_{k=1}^{K} c_{k}\left(\mathbf{w}_{k}^{\mathrm{T}} \mathbf{x}\right)\right) \\ &=\operatorname{sgn}\left(\left(\sum_{k=1}^{K} c_{k} \mathbf{w}_{k}\right)^{\mathrm{T}} \mathbf{x}\right) \\ &=\operatorname{sgn}\left(\overline{\mathbf{w}}^{\mathrm{T}} \mathbf{x}\right) \end{aligned}$

其中 $\overline{w}$ 为平均的权重向量。

5、支持向量机

给定N 个样本的训练集： $\left\{\left(\mathrm{x}^{(n)}, y^{(n)}\right)\right\}_{n=1}^{N}$ ，其中 $y^{(n)} \in\{+1,-1 \}$ ，如果两类样本是线性可分的，即存在一个超平面：

$\mathbf{w}^{\mathrm{T}} \mathbf{x}+b=0$

将两类样本分开，那么对于每个样本都有 $y^{(n)}\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}^{(n)}+b\right)>0$ 。

数据集 $D$ 中每个样本 $x^{(n)}$ 到分割超平面的距离为：

$\gamma^{(n)}=\frac{\left\|\mathbf{w}^{\mathrm{T}} \mathbf{x}^{(n)}+b\right\|}{\|\mathbf{w}\|}=\frac{y^{(n)}\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}^{(n)}+b\right)}{\|\mathbf{w}\|}$

定义整个数据集 $D$ 中所有样本到分割超平面的最短距离为间隔（Margin） $γ$ ：

$\gamma=\min _{n} \gamma^{(n)}$

如果间隔 $γ$ 越大，其分割超平面对两个数据集的划分越稳定，不容易受噪声等因素影响。支持向量机的目标是寻找一个超平面 $(w^∗, b^∗)$ 使得 $γ$ 最大，即：

$\begin{align}&\max _{\mathbf{w}, b} \quad \gamma \\ &s.t.\quad \frac{y^{(n)}\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}^{(n)}+b\right)}{\|\mathbf{w}\|} \geq \gamma, \forall n \end{align}$

令 $∥w∥ · γ = 1$ ，则上式等价于：

$\begin{array} {cl}{\max _{\mathbf{w}, b}} & {\frac{1}{\|\mathbf{w}\|^{2}}} \\ {\text { s.t. }} & {y^{(n)}\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}^{(n)}+b\right) \geq 1, \forall n} \end{array}$

数据集中所有满足 $y^{(n)}\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}^{(n)}+b\right)=1$ 的点，都称为支持向量。

接下来的推导不详细叙述了，之前的笔记里已经记录过。

软间隔的优化问题形式如下：

$\min _{\mathbf{w}, b} \sum_{n=1}^{N} \max \left(0,1-y^{(n)}\left(\mathbf{w}^{\mathrm{T}} \mathbf{x}^{(n)}+b\right)\right)+\frac{1}{C} \cdot \frac{1}{2}\|\mathbf{w}\|^{2}$