Part I: Deriving the Support Vector Machine (SVM) in Full Detail

Author: 炼丹师_风酒 | Published 2019-04-23 09:39

Notice: original article; when reposting, please credit or keep the source [https://www.jianshu.com/p/dbc86a2b9760] by [飞奔的野指针].

I. Strengths and weaknesses

  1. Strengths: low generalization error, modest computational cost, easily interpreted results.
  2. Weaknesses: sensitive to parameter tuning and to the choice of kernel; without modification the basic classifier only handles two-class problems.
  3. Applicable data types: numeric and nominal values.

II. The idea in brief

If the data are linearly separable, separate them directly with a hyperplane; if not, map them to a higher-dimensional space (with a kernel function) and separate them there.

III. SVM

1. Finding the hyperplane


In the plane, a line is written y=ax+b \implies ax-y+b=0.

In three dimensions, a plane is written z=ax_1+bx_2+c\implies ax_1+bx_2-z+c=0.

In higher dimensions the parameters are collected into the normal vector w=\begin{bmatrix}a_1\\a_2\\...\\a_n\end{bmatrix}, which determines the orientation of the hyperplane, while x=\begin{bmatrix}x_1\\x_2\\... \\x_n\end{bmatrix} collects the coordinates of each dimension. Then w^Tx = [a_1\;a_2...a_n]\begin{bmatrix}x_1\\x_2\\... \\x_n\end{bmatrix}=a_1x_1+a_2x_2+...+a_nx_n, the hyperplane is written w^Tx+b=0, and the parameter b determines its offset from the origin.

The hyperplane splits the data into two classes: points on one side satisfy w^Tx+b>0 and points on the other side satisfy w^Tx+b<0.

For example, take the hyperplane y=-2x+2 (i.e. -2x-y+2=0) and the two points P_1(1,1) and P_2(0,1), which lie above and below the line respectively.

\begin{cases}(1,1)\implies-2\times1-1+2=-1<0\\(0,1)\implies-2\times0-1+2=1>0\end{cases}, so points on opposite sides of the line give values of opposite sign. In SVM, positive samples are labeled 1 and negative samples -1.
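To make this concrete, here is a tiny NumPy check of the example above (a minimal sketch; the only inputs are the w and b read off from -2x-y+2=0):

```python
import numpy as np

# Hyperplane -2x - y + 2 = 0, i.e. w^T x + b = 0 with w = (-2, -1), b = 2
w = np.array([-2.0, -1.0])
b = 2.0

for p in (np.array([1.0, 1.0]), np.array([0.0, 1.0])):
    value = w @ p + b          # the sign says on which side of the hyperplane p lies
    print(p, value)            # [1. 1.] -> -1.0   and   [0. 1.] -> 1.0
```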

The target function is the optimal separating hyperplane used for classification; the unknowns we solve for are w and b:
\color{red}{w^Tx+b=0}

2. Finding the maximum margin

2.1 The maximum margin

Take two hyperplanes with no data points between them, H_1:w^Tx+b=\delta \quad H_0:w^Tx+b=-\delta (\delta is usually taken to be 1, see 2.2). The target hyperplane w^Tx+b=0 then lies exactly halfway between them.

Finding the separating hyperplane thus turns into finding these two hyperplanes, or equivalently their maximum separation; the gap m is the key to solving for w and b.
The constraints imposed by the two hyperplanes are (taking \delta=1, as justified in 2.2): \begin{cases} w^Tx_i+b\geqslant\delta & (y_i=1) \\[2ex] w^Tx_i+b\leqslant-\delta & (y_i=-1) \\ \end{cases} \implies y_i(w^Tx_i+b)\geqslant 1 \tag{2.1}

2.2 Why \delta=1

\begin{align}
&\text{Parallel lines: }\begin{cases} \text{Line 1: }wx + b_1 =0\\[2ex] \text{Line 2: }wx + b_2 =0 \end{cases} \implies \text{Midline: }wx+b_1+\frac{b_2-b_1}{2}=0 \\[2ex]
& \text{Let } t = b_2-b_1,\quad \text{Midline: }wx+b_1+\frac{t}{2}=0 \\[2ex]
& b_2 = t+b_1,\quad \text{Line 2: }wx+b_1+t=0 \\[2ex]
&\begin{cases} \text{Line 1: }wx + b_1 =0\\[2ex] \text{Line 2: }wx+b_1+t=0\\[2ex] \text{Midline: }wx+b_1+\frac{t}{2}=0 \end{cases} \implies\begin{cases} \text{Line 1: }wx + b_1 +\frac t2=\frac t2\\[2ex] \text{Line 2: }wx + b_1 +\frac t2=-\frac t2\\[2ex] \text{Midline: }wx+b_1+\frac{t}{2}=0 \end{cases}\\[2ex]
&\left.\begin{cases} \text{Line 1: }wx + b_1 +\frac t2=\frac t2\\[2ex] \text{Line 2: }wx + b_1 +\frac t2=-\frac t2\\[2ex] \text{Midline: }wx+b_1+\frac{t}{2}=0 \end{cases} \right\}\times \frac2t= \begin{cases} \text{Line 1: }\frac2twx + \frac2tb_1 +1=1\\[2ex] \text{Line 2: }\frac2twx + \frac2tb_1 +1=-1\\[2ex] \text{Midline: }\frac2twx+\frac2tb_1+1=0 \end{cases}\\[2ex]
&\text{Let }\frac2tw = W \quad \frac2tb_1+1=B\\[2ex]
&\begin{cases} \text{Line 1: }Wx + B =1\\[2ex] \text{Line 2: }Wx + B=-1\\[2ex] \text{Midline: }Wx+B=0 \end{cases}\\[2ex]
&\therefore \text{Any pair of parallel lines together with their midline can be rescaled into this form, so we may take } \delta=1.
\end{align}

2.3 Projection

[Figure: projection of a vector v onto a vector u]

\begin{align}
& \vec{u} = \langle u_1,u_2\rangle=\begin{bmatrix}u_1\\u_2\end{bmatrix} \quad \vec{v} = \langle v_1,v_2\rangle=\begin{bmatrix}v_1\\v_2\end{bmatrix} \quad ||u||= |u| = \sqrt{u_1^2+u_2^2} \quad\text{(vector norm)} \\[2ex]
& \vec{u}\cdot\vec{v}=|u||v| \cos{\theta}\\[2ex]
& \text{Projection: } proj_uv= (|v|\cos\theta)\frac{\vec{u}}{|u|} = \left(\frac{\vec{u}\cdot\vec{v}}{|u|}\right)\frac{\vec{u}}{|u|} = \frac{(\vec{u}\cdot\vec{v})\vec{u}}{|u|^2}\\[2ex]
& p=|proj_uv|=\frac{\vec{u}\cdot\vec{v}}{|u|}\\[2ex]
\end{align}
See any text on vectors for the details; we will not dwell on them here.
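As a quick sanity check, the projection formulas can be written directly in NumPy (a small illustrative sketch with made-up vectors):

```python
import numpy as np

def project(u, v):
    """Vector projection of v onto u: (u.v / |u|^2) * u."""
    return (u @ v) / (u @ u) * u

u = np.array([3.0, 0.0])
v = np.array([2.0, 2.0])
print(project(u, v))                    # [2. 0.]
print((u @ v) / np.linalg.norm(u))      # scalar projection p = u.v / |u| = 2.0
```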

2.4 Distance from a point on the margin boundary to the hyperplane

[Figures: w^T is the normal vector of the hyperplane; x_0 is a point on the margin boundary and x_1 is its projection onto the hyperplane]

\begin{align}
&\text{From the first figure, } w^T \text{ is the normal vector of the hyperplane, perpendicular to it.}\\[2ex]
&H_0 \text{ is the margin boundary, } H_1 \text{ is the hyperplane, } x_0 \text{ is a point on } H_0 \text{, and } x_1 \text{ is its projection onto } H_1\text{: } \vec{x_1x_0}\bot H_1\implies \vec{x_0x_1}\parallel w^T\\[2ex]
|w^T\cdot\vec{x_0x_1}|&=||w^T||\,|\vec{x_0x_1}|=\sqrt{a_1^2+a_2^2+...+a_n^2}\cdot m=||w^T||m \\[2ex]
w^T\cdot\vec{x_0x_1}&=a_1(x_0^1-x_1^1)+a_2(x_0^2-x_1^2)+...+a_n(x_0^n-x_1^n) \\
& =a_1x_0^1+a_2x_0^2+...+a_nx_0^n-(a_1x_1^1+a_2x_1^2+...+a_nx_1^n)\\
& =a_1x_0^1+a_2x_0^2+...+a_nx_0^n-(-b) \qquad (x_1\text{ lies on the hyperplane, so } w^Tx_1=-b)\\
& =w^T x_0+b \\[2ex]
|w^T\cdot\vec{x_0x_1}|&=|w^T x_0+b|=||w^T||m\\[4ex]
m&=\frac{|w^T x_0+b|}{||w^T||}
\end{align}

2.5 Width of the margin

\begin{align}
&\text{Distance from the upper margin boundary to the hyperplane: } m_1=\frac{|w^T x_1+b|}{||w^T||} \\[2ex]
\because & \text{ } x_1 \text{ lies on the upper boundary} \therefore w^T x_1+b =1 \\[2ex]
\therefore & \text{ } m_1=\frac{1}{||w^T||} \\[2ex]
& \text{Likewise, the signed distance from the lower boundary to the hyperplane is } m_2=\frac{-1}{||w^T||}\\[2ex]
\therefore & \text{ } m=m_1-m_2 = \frac{2}{||w^T||} \\[2ex]
\because & \text{ } ||w^T|| = ||w|| \quad\text{(transposing does not change the norm)}\\[2ex]
\therefore & \text{ } m= \frac{2}{||w||} \\[2ex]
\end{align}
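A small numeric sketch of the two results just derived, the point-to-hyperplane distance m=|w^Tx_0+b|/||w|| and the margin width 2/||w||, using made-up values for w, b and x_0:

```python
import numpy as np

w = np.array([3.0, 4.0])   # made-up normal vector, ||w|| = 5
b = -5.0
x0 = np.array([2.0, 1.0])  # made-up point, w^T x0 + b = 5

# distance from x0 to the hyperplane, m = |w^T x0 + b| / ||w||   (section 2.4)
print(abs(w @ x0 + b) / np.linalg.norm(w))   # 1.0

# width of the margin, m = 2 / ||w||   (section 2.5)
print(2 / np.linalg.norm(w))                 # 0.4
```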

3. The SVM basic form

\begin{align}
& \max(m)\implies \max_{w,b}(\frac{2}{||w||})\\[2ex]
& \max_{w,b}\frac{2}{||w||} \quad s.t. \;y_i(w^Tx_i+b)\geqslant 1 \quad(i=1,...,m) \\[2ex]
& \color{red}{\min_{w,b}\frac12||w||^2 \quad s.t. \;y_i(w^Tx_i+b)\geqslant1 \quad(i=1,...,m)} \qquad \text{This is the basic form of the SVM.} \tag{3.1}
\end{align}
In other words, to obtain the maximum margin we must solve \min_{w,b}\frac12||w||^2, and the data enter only through the constraints y_i(w^Tx_i+b)\geqslant1.

4. Soft margin and regularization

Before solving for the multipliers \alpha (introduced in section 5), we modify the optimization problem so that a small fraction of samples may be misclassified, which makes the model more robust.

4.1 Cost function

A cost function (also called a loss function) measures the discrepancy between predictions and ground truth.
\begin{align}
& \text{Let } z=y_{i}(\boldsymbol{w}^T\boldsymbol{x}_{i}+b)-1\\[2ex]
& y_{i}(\boldsymbol{w}^T\boldsymbol{x}_{i}+b)\geqslant1 \implies z\geqslant0 \quad\text{correct prediction, } \ell_{0/1}(z)=0 \\
& y_{i}(\boldsymbol{w}^T\boldsymbol{x}_{i}+b)<1 \implies z<0 \quad\text{incorrect prediction, } \ell_{0/1}(z)=1 \\[2ex]
& \ell_{0/1}(z)= \begin{cases} 0 &z\geqslant 0 & \text{correct prediction, cost } 0\\[2ex] 1 &z<0 & \text{incorrect prediction, cost } 1 \end{cases}
\end{align}

[Figure: the 0/1 loss]

However, \ell_{0/1} has poor continuity properties, so we use the following surrogate functions with better continuity. They all behave similarly; what matters is the behavior near z=1.

[Figure: the hinge, exponential, and logistic losses]

\begin{align}
& \ell_{hinge}(z)=\max(0,1-z) & \text{hinge loss}\\[2ex]
& \ell_{exp}(z)=e^{-z} & \text{exponential loss}\\[2ex]
& \ell_{log}(z)=\log(1+e^{-z}) &\text{logistic loss}\\[2ex]
\end{align}
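For reference, the three surrogate losses as plain NumPy functions of z = y_i(w^Tx_i+b) (a minimal sketch):

```python
import numpy as np

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)     # l_hinge(z) = max(0, 1 - z)

def exp_loss(z):
    return np.exp(-z)                   # l_exp(z) = e^{-z}

def log_loss(z):
    return np.log(1.0 + np.exp(-z))     # l_log(z) = log(1 + e^{-z})

z = np.linspace(-2.0, 2.0, 5)
print(hinge_loss(z))
print(exp_loss(z))
print(log_loss(z))
```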

4.2 Soft margin

We allow some samples to violate the constraint y_i(w^Tx_i+b)\geqslant 1 of (3.1) by introducing slack variables \xi_i. Each \xi_i is exactly the cost function of the previous subsection: it measures by how much the constraint is violated, and by construction \xi_i\geqslant0.
\begin{align}
& y_i(w^Tx_i+b)\geqslant 1-\xi_i \\[2ex]
&\xi_i = \ell_{hinge} = \max(0,1-y_{i}(\boldsymbol{w}^T\boldsymbol{x}_{i}+b))\\[2ex]
& \text{Note the difference: } \begin{cases} \text{in } \ell_{0/1},\; z=y_{i}(\boldsymbol{w}^T\boldsymbol{x}_{i}+b)-1 &\text{correct prediction: } z\geqslant 0\\[2ex] \text{in } \ell_{hinge},\; z=y_{i}(\boldsymbol{w}^T\boldsymbol{x}_{i}+b) &\text{correct prediction: } z\geqslant 1\\ \end{cases}\\[2ex]
& \text{Likewise, for a correct prediction } \xi_i=\ell_{hinge}=0 \text{, for an incorrect one } \xi_i=\ell_{hinge}>0
\end{align}
The new basic form of the SVM is:
\color{red}{\min_{w,b}\frac{1}{2}\|\boldsymbol{w}\|^2+C\sum_{i=1}^{m} \xi_i \quad s.t.\; y_i(w^Tx_i+b)\geqslant 1-\xi_i \quad \xi_i\geqslant 0} \tag{4.1}\\

Correctly predicted samples have \xi_i=0 and do not affect the objective; only incorrect predictions incur a "penalty" that enters the minimization.

4.3 The penalty coefficient C

In the new basic form, C is the penalty coefficient.
\begin{align}
& C\to\infty\text{ (or a very large value)}\implies \text{any wrong prediction, i.e. } \xi_i>0, \text{ makes } C\sum_{i=1}^{m} \xi_i\to \infty\\[2ex]
& \text{so minimizing the basic form forces every prediction to be correct, and } y_{i}(\boldsymbol{w}^T\boldsymbol{x}_{i}+b)-1\geqslant 0 \text{ remains a hard constraint}\\[2ex]
& C\to\infty\implies \xi_i=0 \implies\frac{1}{2}\|\boldsymbol{w}\|^2+C\sum_{i=1}^{m} \xi_i=\frac{1}{2}\|\boldsymbol{w}\|^2\\[2ex]
& \text{Only when } C \text{ is finite are some samples allowed to violate the constraint.} \\
\end{align}
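To see the effect of C empirically, one can fit a linear soft-margin SVM for several values of C. The sketch below uses scikit-learn's SVC (assuming it is installed) on synthetic data, so the exact numbers are illustrative only:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (20, 2)),      # class -1
               rng.normal(+1.0, 1.0, (20, 2))])     # class +1
y = np.array([-1] * 20 + [1] * 20)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # a smaller C tolerates more margin violations, so more points end up as support vectors
    print(C, clf.n_support_, clf.coef_.ravel(), clf.intercept_)
```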

5. Minimizing the SVM basic form

5.1 Lagrange multipliers

The method of Lagrange multipliers finds the extrema of a multivariate function under constraints:

  1. If the variables of f(x,y,z) are constrained by g(x,y,z)=0, a local extremum satisfies \nabla f= \lambda \nabla g.

  2. If the variables of f(x,y,z) are constrained by g(x,y,z)=0 and h(x,y,z)=0, a local extremum satisfies \nabla f = \lambda \nabla g+\mu \nabla h (a small symbolic check follows below).
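A small symbolic check of condition 1 with SymPy (a sketch; f = x^2 + y^2 and g = x + y - 1 are made-up examples, not anything from the SVM derivation):

```python
import sympy as sp

x, y, lam = sp.symbols("x y lambda", real=True)
f = x**2 + y**2          # objective
g = x + y - 1            # constraint g(x, y) = 0

# stationarity grad(f) = lambda * grad(g), plus the constraint itself
equations = [sp.diff(f, x) - lam * sp.diff(g, x),
             sp.diff(f, y) - lam * sp.diff(g, y),
             g]
print(sp.solve(equations, [x, y, lam]))   # {x: 1/2, y: 1/2, lambda: 1}
```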

5.2 KKT conditions

When inequality constraints are added on top of the Lagrange-multiplier setting, local extrema are characterized by the KKT conditions:
\begin{align}
\begin{cases}
\min f(x) \\[2ex]
s.t.\; h_i(x)=0 &i=1,\dots,q \\[2ex]
s.t.\; g_j(x)\leqslant 0 &j=1,\dots,p
\end{cases}
\end{align}
Lagrange multipliers and the KKT conditions are covered in detail in multivariable calculus; here we only introduce them briefly and use them.

5.3 Constrained optimization

\min f(x) \quad s.t. \begin{cases} g_{i}(x)\leqslant 0 &i=1,2, \ldots, q \\[2ex] h_{j}(x)=0 &j=1,2, \ldots, m \end{cases}\quad {x \in D}

Here f(x) is the objective function, the g_i(x) are inequality constraints, and the h_j(x) are equality constraints.

If all three are linear, the problem is a linear program; if any one of them is nonlinear, it is a nonlinear program.

If the objective is quadratic and all constraints are linear, it is a quadratic program.

If f(x) is convex, the g_i(x) are convex, and the h_j(x) are linear, the problem is a convex optimization problem. (If the inequality constraints are instead written g(x)\geqslant 0, then g must be concave.)

For a convex problem, any local extremum is also a global extremum: a local optimum is a global optimum.

5.4 The minimization

The SVM basic form (4.1) is a convex quadratic program. To find its optimum we apply Lagrange multipliers together with the KKT conditions and derive its dual problem. The two inequality constraints are
\begin{cases} y_i(w^Tx_i+b)\geqslant 1-\xi_i \implies y_i(w^Tx_i+b)+\xi_i-1\geqslant 0\\[2ex] \xi_i\geqslant 0 \end{cases}
Attach multipliers \alpha_i\geqslant0 and \mu_i\geqslant0 to the two constraints; the Lagrangian is
\begin{align} \color{red}{\mathcal{L}= \frac{1}{2}\|w\|^{2}+C\sum_{i=1}^{m} \xi_{i} -\sum_{i=1}^{m}\alpha_{i}\left(y_i(w^Tx_i+b)+\xi_i-1\right)-\sum_{i=1}^{m} \mu_{i} \xi_{i}}\tag{5.1} \end{align}
\boldsymbol{w}​b​\xi ​分别求偏导数,它们都为0:
\begin{align} & \frac{\partial L}{\partial w}=\frac{1}{2} \times 2 \times w+0- \sum_{i=1}^{m} \alpha_{i} y_{i} x_{i}-0=0 \implies & w=\sum_{i=1}^{m} \alpha_{i} y_{i} x_{i} \tag{5.2}\\ & \frac{\partial L}{\partial b}=0+0-0-\sum_{i=1}^{m} \alpha_{i} y_{i} =0 \implies &\sum_{i=1}^{m} \alpha_{i} y_{i}=0 \tag{5.3}\\ & \frac{\partial L}{\partial \xi}=0+C-\alpha_i-\mu_i =0 \implies &C =\alpha_{i}+\mu_{i} \tag{5.4} \end{align}
Using these conditions, we simplify (5.1):
\begin{align}
\mathcal{L}&= \frac{1}{2}\|w\|^{2}+C\sum_{i=1}^{m} \xi_{i} -\sum_{i=1}^{m}\alpha_{i}(y_i(w^Tx_i+b)+\xi_i-1)-\sum_{i=1}^{m} \mu_{i} \xi_{i} \\[2ex]
\langle\text{substitute (5.4) } C =\alpha_{i}+\mu_{i}\rangle\quad & = \frac{1}{2}\|w\|^{2}+\sum_{i=1}^{m}(\alpha_{i}+\mu_{i})\xi_{i}-\sum_{i=1}^{m}\alpha_{i}(y_i(w^Tx_i+b)+\xi_i-1)-\sum_{i=1}^{m} \mu_{i} \xi_{i}\\[2ex]
& = \frac{1}{2}\|w\|^{2}-\sum_{i=1}^{m}\alpha_{i}(y_i(w^Tx_i+b)+\xi_i-1)+\sum_{i=1}^{m} \alpha_{i} \xi_{i} \\[2ex]
& = \frac{1}{2}\|w\|^{2}-\sum_{i=1}^{m}\alpha_{i}(y_i(w^Tx_i+b)-1)-\sum_{i=1}^{m} \alpha_{i} \xi_{i}+\sum_{i=1}^{m} \alpha_{i} \xi_{i} \\[2ex]
& = \frac{1}{2}\|w\|^{2}-\sum_{i=1}^{m}\alpha_{i}(y_i(w^Tx_i+b)-1)\\[2ex]
& = \frac{1}{2}\|w\|^{2}-\sum_{i=1}^{m}(\alpha_iy_iw^Tx_i+\alpha_iy_ib-\alpha_i)\\[2ex]
\langle\|w\|^2 =w^Tw\rangle\quad & = \frac{1}{2} w^T w -w^T\sum_{i=1}^{m} \alpha_{i} y_i x_i-b \sum_{i=1}^{m} \alpha_iy_i+\sum_{i=1}^{m} \alpha_{i} \\[2ex]
\langle\text{substitute (5.3) }\sum_{i=1}^{m} \alpha_{i} y_{i}=0\rangle \quad & =\frac{1}{2} w^T w -w^T\sum_{i=1}^{m} \alpha_{i} y_i x_i+\sum_{i=1}^{m} \alpha_{i} \\[2ex]
\langle\text{substitute (5.2) }w=\sum_{i=1}^{m} \alpha_{i} y_{i} x_{i}\rangle \quad & =\frac{1}{2} \left(\sum_{i=1}^{m} \alpha_{i} y_i x_i\right)^T \sum_{i=1}^{m} \alpha_{i} y_i x_i-\left(\sum_{i=1}^{m} \alpha_{i} y_i x_i\right)^T\sum_{i=1}^{m} \alpha_{i} y_i x_i+\sum_{i=1}^{m} \alpha_{i} \\[2ex]
& =-\frac{1}{2} \left(\sum_{i=1}^{m} \alpha_{i} y_i x_i\right)^T \sum_{i=1}^{m} \alpha_{i} y_i x_i +\sum_{i=1}^{m} \alpha_{i} \\[2ex]
&=\sum_{i=1}^{m} \alpha_{i} -\frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m}\alpha_{i} y_{i} x_{i}^{T} \alpha_{j} y_j x_j\\[2ex]
\mathcal{L}&=\sum_{i=1}^{m}\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha_i\alpha_j y_iy_j x_i^T x_j\\[2ex]
\end{align}
Now simplify the constraints:
\begin{align}
&\left.\begin{array}{l} \alpha_{i}\geqslant 0\\[2ex] \mu_{i}\geqslant 0\\[2ex] C =\alpha_{i}+\mu_{i}\\[2ex] \end{array}\right\}\implies 0\leqslant\alpha_i\leqslant C\\[2ex]
&\quad\sum_{i=1}^m \alpha_i y_i=0\\[2ex]
& \text{Together: } s.t.\quad 0\leqslant\alpha_i\leqslant C \quad \sum_{i=1}^m \alpha_i y_i=0
\end{align}\\
Finally, the dual problem of (5.1) is
\color{red}{\max\min\mathcal{L}=\max\sum_{i=1}^{m}\alpha_i-\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^m \alpha_i\alpha_j y_iy_j x_i^T x_j \quad s.t.\; 0\leqslant\alpha_i\leqslant C \quad \sum_{i=1}^m \alpha_i y_i=0} \tag{5.5}
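The dual (5.5) is itself a quadratic program, so before turning to SMO it can in principle be handed to a generic QP solver. The sketch below builds exactly the matrices of (5.5) for cvxopt (assuming cvxopt is installed); it is only meant to make the structure of the dual concrete:

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_dual(X, y, C):
    """Solve (5.5) as a standard QP:  min 1/2 a^T P a + q^T a,  s.t. G a <= h,  A a = b."""
    m = X.shape[0]
    K = X @ X.T                                        # Gram matrix, entries x_i^T x_j
    P = matrix(np.outer(y, y) * K)                     # P_ij = y_i y_j x_i^T x_j
    q = matrix(-np.ones(m))                            # maximizing sum(a) = minimizing -sum(a)
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))     # -a_i <= 0  and  a_i <= C
    h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
    A = matrix(y.reshape(1, -1).astype(float))         # sum_i a_i y_i = 0
    b = matrix(0.0)
    sol = solvers.qp(P, q, G, h, A, b)
    return np.ravel(sol["x"])                          # the alpha_i
```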

5.5 The target function

Substituting (5.2) into the target function gives
\color{red}{f(x) =w^{T} x+b =\sum_{i=1}^m\alpha_i y_i x_i^T x+b} \tag{5.6}
The expression no longer contains w; it depends only on \alpha and b. Let us try to understand what \alpha_i means.
\begin{align} & w=\begin{bmatrix}a_1\\a_2\\...\\a_n\end{bmatrix} \text{ collects the coefficient of each variable of the multivariate function}\\ \end{align}
Each coefficient says how strongly the hyperplane cuts along that dimension. As in the figure below, when the coefficients a, b of x_1, x_2 have opposite signs, look only at the first quadrant: the relative magnitudes |a| and |b| decide how the quadrant is split. When |a|=|b| the split is even; when |a|>|b| the region near the x_1 axis is larger; when |a|<|b| the region near the x_2 axis is larger. So the coefficient in front of a variable determines how much of the space that variable claims when the space is cut.

[Figure: how the relative magnitudes of the coefficients split the first quadrant]

\begin{align}
&f(x) =\sum_{i=1}^m\alpha_i y_i x_i^T x+b\\[2ex]
\text{Ignoring } b\text{, look at } w=\sum_{i=1}^m\alpha_i y_i x_i &=\alpha_1y_1\begin{bmatrix} x_1^1\\ x_1^2\\ ...\\ x_1^n \end{bmatrix} +\alpha_2y_2\begin{bmatrix} x_2^1\\ x_2^2\\ ...\\ x_2^n \end{bmatrix}+...+ \alpha_my_m\begin{bmatrix} x_m^1\\ x_m^2\\ ...\\ x_m^n \end{bmatrix}\\[2ex]
&= \begin{bmatrix} \alpha_1y_1x_1^1+\alpha_2y_2x_2^1+...+\alpha_my_mx_m^1\\ \alpha_1y_1x_1^2+\alpha_2y_2x_2^2+...+\alpha_my_mx_m^2\\ ...\\ \alpha_1y_1x_1^n+\alpha_2y_2x_2^n+...+\alpha_my_mx_m^n\\ \end{bmatrix} =\begin{bmatrix} a_1\\ a_2\\ ...\\ a_n \end{bmatrix}\\[2ex]
\end{align}
The \alpha are the Lagrange multipliers and the a (the right-most column vector above) are the hyperplane coefficients; x_m^n denotes the n-th feature of the m-th sample and y_m the label of the m-th sample. Taking the first coefficient a_1 as an example: it depends only on the first feature of every sample, the labels of all samples, and all of the \alpha.
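Written as code, the target function (5.6) and the recovery of w via (5.2) look like this (a minimal NumPy sketch; X is the m-by-n sample matrix, y the labels, alpha the multipliers):

```python
import numpy as np

def decision_function(x, X, y, alpha, b):
    """f(x) = sum_i alpha_i y_i x_i^T x + b   -- equation (5.6)."""
    return float(np.sum(alpha * y * (X @ x)) + b)

def recover_w(X, y, alpha):
    """w = sum_i alpha_i y_i x_i   -- equation (5.2); its entries are the a_1 ... a_n above."""
    return (alpha * y) @ X
```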

6. Solving for α with SMO

Problem (5.5) is a quadratic program. To solve it, we fix all parameters except \alpha_i, optimize over \alpha_i, and express \alpha_j through \alpha_i. This is the SMO (Sequential Minimal Optimization) algorithm: a large optimization problem is broken into a sequence of small ones.

Because the constraint in (6.5) below must keep holding, the \alpha must be updated in pairs; changing a single one would give \alpha_{i} y_{i}+\alpha_{j} y_{j}\neq \epsilon. At each step two variables \alpha_i and \alpha_j are selected and all other parameters are fixed; after initialization, SMO keeps iterating until convergence.

6.1 Simplifying the dual problem

Rewrite \mathcal{L} to expose one pair of multipliers (\alpha_1 and \alpha_2):
\begin{align}
&\text{For brevity, write } X \text{ for } \alpha_i \alpha_j y_i y_j x_i^T x_j\\[2ex]
\mathcal{L} &=\sum_{i=1}^m \alpha_i-\frac12 \sum_{i=1}^m \sum_{j=1}^m \alpha_i \alpha_j y_i y_j x_i^T x_j\\[2ex]
&=\sum_{i=1}^m \alpha_i-\frac12 \sum_{i=1}^m \sum_{j=1}^m X\\[2ex]
&=\alpha_1+\alpha_2+\sum_{i=3}^m \alpha_i-\frac12 \sum_{i=1}^m\left(\sum_{j=1}^2 X+\sum_{j=3}^m X\right)\\[2ex]
&=\alpha_1+\alpha_2+\sum_{i=3}^m \alpha_i-\frac12 \sum_{i=1}^2\left(\sum_{j=1}^2 X+\sum_{j=3}^m X\right)-\frac12 \sum_{i=3}^m\left(\sum_{j=1}^2 X+\sum_{j=3}^m X\right)\\[2ex]
&=\alpha_1+\alpha_2+\sum_{i=3}^m \alpha_i-\frac12 \sum_{i=1}^2\sum_{j=1}^2 X-\frac12\sum_{i=1}^2\sum_{j=3}^m X-\frac12\sum_{i=3}^m\sum_{j=1}^2 X-\frac12\sum_{i=3}^m\sum_{j=3}^m X\\[2ex]
&=\alpha_{1}+\alpha_2+\sum_{i=3}^m \alpha_i-\frac12 \sum_{i=1}^2 \sum_{j=1}^2 X-\sum_{i=1}^2 \sum_{j=3}^m X-\frac12 \sum_{i=3}^m \sum_{j=3}^m X\\[2ex]
&=\alpha_{1}+\alpha_{2}-\frac{1}{2} \sum_{i=1}^{2} \sum_{j=1}^{2} X-\sum_{i=1}^{2} \sum_{j=3}^{m} X+\sum_{i=3}^{m} \alpha_{i}-\frac{1}{2} \sum_{i=3}^{m} \sum_{j=3}^{m} X\\[2ex]
\langle\text{expand } X \text{ again, using } y_i^2=1\rangle\quad&=\alpha_{1}+\alpha_{2}-\frac{1}{2} \alpha_{1}^{2} x_{1}^{T} x_{1}-\frac{1}{2} \alpha_{2}^{2} x_{2}^{T} x_{2}-y_{1} y_{2} \alpha_{1} \alpha_{2} x_{1}^{T} x_{2}-y_{1} \alpha_{1} \sum_{j=3}^{m} \alpha_{j} y_{j} x_{1}^{T} x_{j}\\
&\quad-y_{2} \alpha_{2} \sum_{j=3}^{m} \alpha_{j} y_{j} x_{2}^{T} x_{j}+\sum_{i=3}^{m} \alpha_{i}-\frac{1}{2} \sum_{i=3}^{m} \sum_{j=3}^{m} y_{i} y_{j} \alpha_{i} \alpha_{j} x_{i}^{T} x_{j} \tag{6.1}\\
\hline
& \because f(x) =w^T x+b =\sum_{i=1}^m \alpha_i y_i x_i^T x+b \\[2ex]
& \therefore \text{let } \upsilon_i=\sum_{j=3}^m \alpha_j y_j x_i^Tx_j=f(x_i)-\sum_{j=1}^2 \alpha_j y_j x_i^T x_j-b \tag{6.2}\\[2ex]
& \therefore \text{let } \beta=\sum_{i=3}^{m} \alpha_{i}-\frac{1}{2} \sum_{i=3}^{m} \sum_{j=3}^{m} y_{i} y_{j} \alpha_{i} \alpha_{j} x_{i}^{T} x_{j}\quad (\beta\text{ is a constant})\tag{6.3}\\[2ex]
\mathcal{L}&=\alpha_1+\alpha_2-\frac12 \alpha_1^2 x_1^T x_1-\frac12 \alpha_2^2 x_2^T x_2-y_1 y_2 \alpha_1 \alpha_2 x_1^T x_2-y_1 \alpha_1 v_1-y_2 \alpha_2 v_2+\beta\tag{6.4}
\end{align}

6.2 The relation between α_1 and α_2

Because of the constraint \sum_{i=1}^{m} \alpha_{i} y_{i}=0 in (5.5), keep only \alpha_i and \alpha_j and move every other term to the right-hand side, exposing one pair of \alpha:
\begin{align}
&\color{red}{\alpha_i y_i+\alpha_j y_j=\epsilon \quad (0\leqslant\alpha_i\leqslant C ,\;0\leqslant\alpha_j \leqslant C)} & \color{red}{\epsilon=-\sum_{k \neq i, j} \alpha_{k} y_{k} \text{ is a constant}} \tag{6.5}\\
& \text{In particular: }\alpha_1 y_1+\alpha_2 y_2=\epsilon
\end{align}
\alpha_2来表示\alpha_1\mathcal{L}将化成用\alpha_2​表示的形式。
\begin{align}
\because\; & y_1=1\text{ or }-1\quad y_2=1\text{ or }-1\\[2ex]
\therefore\; & y_1^2=1 \quad y_2^2=1 \quad y_1^2y_2^2=1\\[2ex]
&\text{let }\color{red}{\gamma=\epsilon y_1 \quad \psi=y_1y_2}\\[2ex]
& \alpha_1 y_1+\alpha_2 y_2=\epsilon\implies\alpha_1 y_1^2+\alpha_2 y_2y_1=\epsilon y_1 \tag{6.6}\\[2ex]
& \alpha_1=\epsilon y_{1}-\alpha_{2} y_{2}y_{1}\implies \alpha_1=\gamma- \psi\alpha_{2} \tag{6.7}\\[2ex]
\mathcal{L}&=\alpha_1+\alpha_2-\frac12 \alpha_1^2 x_1^T x_1-\frac12 \alpha_2^2 x_2^T x_2-y_1 y_2 \alpha_1 \alpha_2 x_1^T x_2-y_1 \alpha_1 v_1-y_2 \alpha_2 v_2+\beta \\[2ex]
&=\gamma-\psi \alpha_2+\alpha_2-\frac12(\gamma-\psi \alpha_2)^2 x_1^T x_1-\frac12\alpha_2^2 x_2^T x_2-\psi(\gamma-\psi \alpha_2) \alpha_2 x_1^T x_2-y_1(\gamma-\psi \alpha_2) v_1-y_2 \alpha_2 v_2+\beta \\[2ex]
\end{align}
At the optimum over this pair, the stationarity condition gives \frac{\partial \mathcal{L}}{\partial \alpha_{2}}=0, from which we obtain an expression for \alpha_2:
\begin{align}
\frac{\partial \mathcal{L}}{\partial \alpha_{2}}&=-\psi+1+\gamma \psi x_{1}^{T} x_{1}-\alpha_{2} x_{1}^{T} x_{1}-\alpha_{2} x_{2}^{T} x_{2} -\gamma \psi x_{1}^{T} x_{2}+2 \alpha_{2} x_{1}^{T} x_{2}+y_{2} v_{1}-y_{2} v_{2}=0 \\[2ex]
\alpha_{2}&=\frac{y_{2}(y_{2}-y_{1}+y_{1} \gamma(x_{1}^{T} x_{1}-x_{1}^{T} x_{2})+v_{1}-v_{2})}{x_{1}^{T} x_{1}+x_{2}^{T} x_{2}-2 x_{1}^{T} x_{2}}
\end{align}

6.3 Relating α_2 before and after the update

From (6.5) we can relate the values of \alpha_2 before and after the update:
\alpha_1^\text{new} y_1+\alpha_2^\text{new} y_2=\alpha_1^\text{old}y_1+\alpha_2^\text{old} y_2=\epsilon
Now derive the explicit relation between \alpha_2^\text{new} and \alpha_2^\text{old}:
\begin{align}
\frac{\partial \mathcal{L}}{\partial \alpha_2}&=-\psi+1+\gamma \psi x_1^T x_1-\alpha_2 x_1^T x_{1}-\alpha_2 x_2^T x_2 -\gamma \psi x_1^T x_2+2 \alpha_2 x_1^T x_2+y_2 v_1-y_2 v_2\\[2ex]
&=-\alpha_2(x_1^T x_1+x_2^T x_2-2x_1^T x_2)-\psi+1+\gamma \psi x_1^T x_1-\gamma \psi x_1^T x_2 + y_2v_1 - y_2 v_2\\[2ex]
\because\; & \psi=y_1y_2\\[2ex]
\frac{\partial \mathcal{L}}{\partial \alpha_2}&=-\alpha_2(x_1^T x_1+x_2^T x_2-2x_1^T x_2)-y_1 y_2+1+\gamma y_1y_2 x_1^T x_1-\gamma y_1y_2 x_1^T x_2 + y_2v_1 - y_2 v_2\\[2ex]
\because\; & \gamma=\epsilon y_1\\[2ex]
\frac{\partial \mathcal{L}}{\partial \alpha_2}&=-\alpha_2(x_1^T x_1+x_2^T x_2-2x_1^T x_2)-y_1 y_2+1+\epsilon y_1^2y_2 x_1^T x_1 -\epsilon y_1^2y_2 x_1^T x_2+ y_2v_1 - y_2 v_2\\[2ex]
&=-\alpha_2(x_1^T x_1+x_2^T x_2-2x_1^T x_2)-y_1 y_2+1+\epsilon y_2 x_1^T x_1 -\epsilon y_2 x_1^T x_2+ y_2v_1 - y_2 v_2\\[2ex]
\because\; & \epsilon=\alpha_1^\text{old}y_1+\alpha_2^\text{old}y_2\\[2ex]
\frac{\partial \mathcal{L}}{\partial \alpha_2}&=-\alpha_2(x_1^T x_1+x_2^T x_2-2x_1^T x_2)-y_1 y_2+1\\
& +(\alpha_1^\text{old}y_1+\alpha_2^\text{old}y_2)y_2 x_1^T x_1-(\alpha_1^\text{old}y_1+\alpha_2^\text{old}y_2)y_2 x_1^T x_2 + y_2v_1 - y_2 v_2\\[2ex]
\because\; & \frac{\partial \mathcal{L}}{\partial \alpha_2}=0\\[2ex]
\therefore\;\alpha_2&(x_1^T x_1+x_2^T x_2-2x_1^T x_2)\\[2ex]
&=-y_1 y_2+1+(\alpha_1^\text{old}y_1+\alpha_2^\text{old}y_2)y_2 x_1^T x_1-(\alpha_1^\text{old}y_1+\alpha_2^\text{old}y_2)y_2 x_1^T x_2 + y_2v_1 - y_2 v_2\\[2ex]
&=-y_1 y_2+1+\alpha_1^\text{old}y_1y_2 x_1^T x_1+\alpha_2^\text{old}y_2^2 x_1^T x_1 -\alpha_1^\text{old}y_1y_2 x_1^T x_2-\alpha_2^\text{old}y_2^2 x_1^T x_2+ y_2v_1 - y_2 v_2\\[2ex]
&=\alpha_2^\text{old}x_1^T x_1-\alpha_2^\text{old}x_1^T x_2 -y_1 y_2+1+\alpha_1^\text{old}y_1y_2 x_1^T x_1-\alpha_1^\text{old}y_1y_2 x_1^T x_2+ y_2v_1 - y_2 v_2\\[2ex]
\because\; & 1=y_2^2\\[2ex]
\therefore\;\alpha_2&(x_1^T x_1+x_2^T x_2-2x_1^T x_2)\\[2ex]
&=\alpha_2^\text{old}x_1^T x_1-\alpha_2^\text{old}x_1^T x_2 -y_1 y_2+y_2^2+\alpha_1^\text{old}y_1y_2 x_1^T x_1-\alpha_1^\text{old}y_1y_2 x_1^T x_2+ y_2v_1 - y_2 v_2\\[2ex]
\because\; & v_1=f(x_1)-\sum_{i=1}^2 \alpha_i y_i x_1^T x_i-b =f(x_1)-\alpha_1^\text{old} y_1 x_1^T x_1-\alpha_2^\text{old} y_2 x_1^T x_2-b\\
\because\; &v_2=f(x_2)-\sum_{i=1}^2 \alpha_i y_i x_2^T x_i-b =f(x_2)-\alpha_1^\text{old} y_1 x_2^T x_1-\alpha_2^\text{old} y_2 x_2^T x_2-b\\[2ex]
\therefore\;\alpha_2&(x_1^T x_1+x_2^T x_2-2x_1^T x_2)\\[2ex]
&=\alpha_2^\text{old}x_1^T x_1-\alpha_2^\text{old}x_1^T x_2 -y_1y_2+y_2^2+\alpha_1^\text{old}y_1y_2x_1^T x_1- \alpha_1^\text{old}y_1 y_2 x_1^T x_2\\[2ex]
&\quad+y_2\left[f(x_1)-\alpha_1^\text{old} y_1 x_1^T x_1-\alpha_2^\text{old} y_2 x_1^T x_2-b\right]-y_2\left[f(x_2)-\alpha_1^\text{old} y_1 x_2^T x_1-\alpha_2^\text{old} y_2 x_2^T x_2-b\right]\\[2ex]
&=\alpha_2^\text{old}x_1^T x_1-\alpha_2^\text{old}x_1^T x_2+\alpha_1^\text{old}y_1y_2 x_1^T x_1-\alpha_1^\text{old}y_1y_2 x_1^T x_2\\[2ex]
&\quad+y_2\left[f(x_1)-y_1 -\alpha_1^\text{old} y_1 x_1^T x_1-\alpha_2^\text{old} y_2 x_1^T x_2-b\right]-y_2\left[f(x_2)-y_2-\alpha_1^\text{old} y_1 x_2^T x_1-\alpha_2^\text{old} y_2 x_2^T x_2-b\right]\\[2ex]
&=\alpha_2^\text{old}x_1^T x_1-\alpha_2^\text{old}x_1^T x_2+\alpha_1^\text{old}y_1y_2 x_1^T x_1-\alpha_1^\text{old}y_1y_2 x_1^T x_2\\[2ex]
&\quad+y_2\left[(f(x_1)-y_1) -(f(x_2)-y_2) \right]+y_2\left[-\alpha_1^\text{old} y_1 x_1^T x_1-\alpha_2^\text{old} y_2 x_1^T x_2+\alpha_1^\text{old} y_1 x_2^T x_1+\alpha_2^\text{old} y_2 x_2^T x_2\right]\\[2ex]
&=\alpha_2^\text{old}x_1^T x_1-\alpha_2^\text{old}x_1^T x_2 +y_2\left[(f(x_1)-y_1)-(f(x_2)-y_2) \right]\\[2ex]
&\quad+y_2\left[\alpha_1^\text{old}y_1 x_1^T x_1-\alpha_1^\text{old}y_1 x_1^T x_2-\alpha_1^\text{old} y_1 x_1^T x_1-\alpha_2^\text{old} y_2 x_1^T x_2+\alpha_1^\text{old} y_1 x_2^T x_1+\alpha_2^\text{old} y_2 x_2^T x_2\right]\\[2ex]
&=\alpha_2^\text{old}x_1^T x_1-\alpha_2^\text{old}x_1^T x_2 +y_2\left[(f(x_1)-y_1) -(f(x_2)-y_2) \right]\\[2ex]
&\quad+y_2\left[-\alpha_1^\text{old}y_1 x_1^T x_2-\alpha_2^\text{old} y_2 x_1^T x_2+\alpha_1^\text{old} y_1 x_2^T x_1+\alpha_2^\text{old} y_2 x_2^T x_2\right]\\[2ex]
&=\alpha_2^\text{old}x_1^T x_1-\alpha_2^\text{old}x_1^T x_2-\alpha_1^\text{old}y_1y_2 x_1^T x_2-\alpha_2^\text{old} x_1^T x_2+\alpha_1^\text{old} y_1y_2 x_2^T x_1+\alpha_2^\text{old} x_2^T x_2\\[2ex]
&\quad+y_2\left[(f(x_1)-y_1) -(f(x_2)-y_2) \right]\\[2ex]
&=\alpha_2^\text{old}x_1^T x_1-2\alpha_2^\text{old} x_1^T x_2+\alpha_2^\text{old} x_2^T x_2\\[2ex]
&\quad+y_2\left[(f(x_1)-y_1) -(f(x_2)-y_2) \right] \qquad (\text{the two } \alpha_1^\text{old} \text{ terms cancel since } x_1^T x_2=x_2^T x_1)\\[2ex]
&\alpha_2(x_1^T x_1+x_2^T x_2-2x_1^T x_2) =\alpha_2^\text{old}(x_1^T x_1 + x_2^T x_2-2x_1^T x_2)+y_2\left[(f(x_1)-y_1) -(f(x_2)-y_2) \right]\\[2ex]
&\alpha_2 =\alpha_2^\text{old}+\frac{y_2\left[(f(x_1)-y_1) -(f(x_2)-y_2) \right]}{x_1^T x_1+x_2^T x_2-2x_1^T x_2}\\[2ex]
\end{align}

6.4 Learning rate and error term

\begin{align}
&\text{Let the learning rate be }\color{red}{\eta=x_1^T x_1+x_2^T x_2-2x_1^T x_2} \\[2ex]
&\text{Let the error term be }\color{red}{E_i=f(x_i)-y_i=\sum_{j=1}^m \alpha_j y_j x_j^T x_i+b-y_i}\\[2ex]
& \color{red}{\alpha_{2}^{\text {new}}=\alpha_{2}^{\text{old}}+\frac{y_{2}\left(E_{1}-E_{2}\right)}{\eta}} \\
\end{align}
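One SMO step for a chosen pair (i, j) then computes η, the two errors, and the unclipped update, for example as follows (a sketch; X, y, alpha, b are assumed to be NumPy arrays and a scalar):

```python
import numpy as np

def smo_unclipped_update(X, y, alpha, b, i, j):
    """eta, the errors E_i and E_j, and the unclipped new alpha_j for the pair (i, j)."""
    f = lambda x: float(np.sum(alpha * y * (X @ x)) + b)   # f(x) = sum_k alpha_k y_k x_k^T x + b
    E_i = f(X[i]) - y[i]
    E_j = f(X[j]) - y[j]
    eta = X[i] @ X[i] + X[j] @ X[j] - 2.0 * X[i] @ X[j]    # assumed > 0 here
    return alpha[j] + y[j] * (E_i - E_j) / eta             # still has to be clipped (section 6.6)
```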

6.5 When to update α

After computing a pair of \alpha, we check whether all points satisfy the optimality conditions; if not, another pair of \alpha is updated. The conditions are:
\left. \begin{array}{l} {\alpha_{i} \geqslant 0, \quad \mu_{i} \geqslant 0} \\[2ex] {y_{i} f\left(\boldsymbol{x}_{i}\right)-1+\xi_{i} \geqslant 0} \\[2ex] {\alpha_{i}\left(y_{i} f\left(\boldsymbol{x}_{i}\right)-1+\xi_{i}\right)=0} \\[2ex] {\xi_{i} \geqslant 0, \mu_{i} \xi_{i}=0} \end{array} \right\} \text{at least one of } \alpha_i=0 \text{ and } y_{i} f\left(x_{i}\right)-1+\xi_i = 0 \text{ holds} \tag{6.8}
This reflects an important property of the SVM: after training, most training samples need not be kept; only the support vectors matter.

From these conditions we can read off where each sample lies:
\begin{align}
&\begin{cases} C =\alpha_{i}+\mu_{i}\\[2ex] \xi_{i} \geqslant 0\\[2ex] \mu_{i} \xi_{i}=0\\[2ex] \alpha_i=0 \;\text{or}\; y_i f(x_i)-1+\xi_i = 0\\[2ex] \xi_i=\max(0,1-y_i(w^Tx_i+b))\\[2ex] \end{cases}\\[2ex]
&\begin{cases} \begin{aligned} &\alpha_i=0 \\ &\text{the sample has no influence} \end{aligned}\\[2ex] \begin{aligned} &\alpha_i>0 \implies y_{i} f\left(x_{i}\right) = 1-\xi_i\leqslant1\\ &\text{the sample is a support vector} \end{aligned} \begin{cases} \begin{aligned} &\alpha_i<C\implies \mu_i>0 \implies \xi_i=0\implies y_{i} f\left(x_{i}\right) = 1\\ &\text{the sample lies on the margin boundary} \end{aligned}\\[2ex] \alpha_i=C\implies \mu_i=0 \begin{cases} \begin{aligned} &0\leqslant\xi_i\leqslant1\implies 0\leqslant y_{i} f\left(x_{i}\right) = 1-\xi_i\leqslant1\\ &\text{the sample lies inside the margin} \end{aligned}\\[2ex] \begin{aligned} & \xi_i>1\implies y_{i} f\left(x_{i}\right) = 1-\xi_i<0\\ &\text{the sample is misclassified} \end{aligned} \end{cases} \end{cases} \end{cases}\\[2ex]
&\begin{cases} \alpha_{i}=0 & \Leftrightarrow y_{i} f(x_i) \geqslant 1 &\text{no influence on the model} \\[2ex] 0<\alpha_{i}<C & \Leftrightarrow y_{i} f(x_i) =1 & \text{on the margin boundary}\\[2ex] \alpha_{i}=C & \Leftrightarrow y_{i} f(x_i) \leqslant 1 & \text{inside the margin or misclassified} \end{cases}
\end{align}
Summarizing, we can identify which α_i violate the KKT conditions and therefore need updating:
\begin{align}
&\begin{cases} \alpha_{i}>0 & y_{i} f(x_i) > 1 \\[2ex] \alpha_{i}<C & y_{i} f(x_i) < 1 \end{cases}\tag{these cases need updating}\\[2ex]
&\text{Rewriting with the error term:}\\[2ex]
&y_iE_i=y_i(f(x_i)-y_i)=y_if(x_i)-1\\[2ex]
&\color{red}{\begin{cases} \alpha_{i}>0 & y_{i} f(x_i) > 1\implies y_iE_i>0 \\[2ex] \alpha_{i}<C & y_{i} f(x_i) < 1 \implies y_iE_i<0 \end{cases}}
\end{align}
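In code, the selection rule reads as follows (a sketch; the tolerance tol is an extra numerical detail common in SMO implementations, not part of the derivation above):

```python
def violates_kkt(alpha_i, y_i, E_i, C, tol=1e-3):
    """True if alpha_i breaks the conditions above and should be picked for an update."""
    return (alpha_i > 0 and y_i * E_i > tol) or (alpha_i < C and y_i * E_i < -tol)
```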

6.6 The feasible range of α

The updated \alpha must also respect a feasible range; we first work out the constraints on \alpha.
\begin{align}
&\begin{cases} 0\leqslant\alpha_{i}\leqslant C\\ \alpha_{i} y_{i}+\alpha_{j} y_{j}=\epsilon\implies \begin{cases} \text{if } y_1\neq y_2 \text{ (opposite labels)}\implies\alpha_1 -\alpha_2 =\pm\epsilon\\ \text{if } y_1= y_2 \text{ (same label)}\implies\alpha_1 +\alpha_2 =\pm\epsilon\\ \end{cases}\\[2ex] \end{cases}\\[2ex]
\end{align}

\begin{align}
&\text{Let the feasible range of }\alpha_2\text{ be }L\leqslant\alpha_2\leqslant H \\[2ex]
&\epsilon=-\sum_{k \neq i, j} \alpha_{k} y_{k} \text{ is a constant} \\[2ex]\hline
& y_1 \neq y_2 \begin{cases} &\alpha_1-\alpha_2=\epsilon\implies \alpha_1=\alpha_2+\epsilon\\[2ex] &\alpha_2-\alpha_1=\epsilon\implies \alpha_1=\alpha_2-\epsilon\\[2ex] \end{cases}\\[2ex]
&\left.\begin{array}{l} 0\leqslant\alpha_2+\epsilon\leqslant C\\[2ex] 0\leqslant\alpha_2-\epsilon\leqslant C\\[2ex] 0\leqslant\alpha_1\leqslant C\\[2ex] 0\leqslant\alpha_2\leqslant C\\[2ex] \end{array}\right\}\implies \color{red}{\begin{cases} L=\max(0,-\epsilon )\\[2ex] H=\min(C,C-\epsilon ) \end{cases}}\\[2ex]
\hline
& y_1 = y_2 \begin{cases} &\alpha_1+\alpha_2=\epsilon\implies \alpha_1=-\alpha_2+\epsilon\\[2ex] &-\alpha_1-\alpha_2=\epsilon\implies \alpha_1=-\alpha_2-\epsilon\\[2ex] \end{cases}\\[2ex]
&\left.\begin{array}{l} 0\leqslant-\alpha_2+\epsilon\leqslant C\\[2ex] 0\leqslant-\alpha_2-\epsilon\leqslant C\\[2ex] 0\leqslant\alpha_1\leqslant C\\[2ex] 0\leqslant\alpha_2\leqslant C\\[2ex] \end{array}\right\} \implies \color{red}{\begin{cases} L=\max(0,\epsilon -C)\\[2ex] H=\min(C,\epsilon) \end{cases}}
\end{align}
Given this range, if the updated value exceeds the maximum it is clipped to the maximum, and if it falls below the minimum it is clipped to the minimum:
\begin{align} \alpha_2^\text{new}= \begin{cases} {H} & {\alpha_2^\text{temp}>H} \\[2ex] {\alpha_{2}^\text{temp}} & {L \leq \alpha_{2}^{\text{temp}} \leq H} \\[2ex] {L} & {\alpha_{2}^{\text{temp}}<L} \end{cases} \end{align}
After updating \alpha_2 we can update \alpha_1 through the two-variable form of (6.5):
\begin{align}
&\begin{cases} \alpha_{1}^{\text{old}} =\gamma-\psi \alpha_{2}^\text{old} \\[2ex] \alpha_{1}^{\text{new}} =\gamma-\psi \alpha_{2}^\text{new} \end{cases} \implies \alpha_1^\text{new}=\alpha_1^\text{old}+\psi \alpha_2^\text{old}-\psi \alpha_{2}^\text{new}\\[2ex]
&\because \psi=y_1y_2 \\[2ex]
&\color{red}{\alpha_1^\text{new}=\alpha_1^\text{old}+y_{1} y_{2}\left(\alpha_{2}^{\text{old}}-\alpha_2^\text{new}\right)}\\
\end{align}
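Putting 6.6 together: compute L and H, clip the unclipped α_2, then update α_1. A minimal sketch (the bounds are written directly in terms of the old α values, which is equivalent to the ±ε cases above; alpha2_unclipped is the value from 6.4):

```python
import numpy as np

def clip_alpha2(alpha2_unclipped, alpha1_old, alpha2_old, y1, y2, C):
    """Clip the unclipped alpha_2 into the box [L, H] implied by 0 <= alpha <= C."""
    if y1 != y2:                                   # alpha_1 - alpha_2 is fixed
        L = max(0.0, alpha2_old - alpha1_old)
        H = min(C, C + alpha2_old - alpha1_old)
    else:                                          # alpha_1 + alpha_2 is fixed
        L = max(0.0, alpha1_old + alpha2_old - C)
        H = min(C, alpha1_old + alpha2_old)
    return float(np.clip(alpha2_unclipped, L, H))

def update_alpha1(alpha1_old, alpha2_old, alpha2_new, y1, y2):
    """alpha_1^new = alpha_1^old + y_1 y_2 (alpha_2^old - alpha_2^new)."""
    return alpha1_old + y1 * y2 * (alpha2_old - alpha2_new)
```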

7. Solving for b

Having solved for \alpha, we still need to solve for b.

Each time a pair of \alpha is updated, b must be updated as well (b enters f(x) and hence the E_i), so that the margin stays maximal.
\begin{align}
&0< \alpha_1 < C \implies y_1 f(x_1)=y_{1}(w^T x_1+b)=y_1\left(\sum_{i=1}^m \alpha_i y_i x_i^T x_1+b\right)=1\quad\text{(the sample is a support vector on the boundary)}\\
&y_1^2\left(\sum_{i=1}^m \alpha_i y_i x_i^T x_1+b\right)=y_1\\
&y_1^2=1 \implies \sum_{i=1}^m \alpha_i y_i x_i^T x_1+b = y_1\\
&b_1^\text{new}=y_1-\sum_{i=3}^m \alpha_i y_i x_i^T x_1-\alpha_1^\text{new} y_1 x_1^T x_1-\alpha_2^\text{new} y_2 x_2^T x_1 \\
&E_1=\sum_{i=1}^{m} \alpha_{i} y_{i} \boldsymbol{x}_{i}^{\mathrm{T}} \boldsymbol{x_1}+b-y_1\\
&y_1=-E_1+\sum_{i=1}^{m} \alpha_{i} y_{i} \boldsymbol{x}_{i}^{\mathrm{T}} \boldsymbol{x_1}+b\\
&y_{1}-\sum_{i=3}^{m} \alpha_{i} y_{i} \boldsymbol{x}_{i}^{T} \boldsymbol{x}_{1}=-E_1+\sum_{i=1}^{2} \alpha_{i} y_{i} \boldsymbol{x}_{i}^{\mathrm{T}} \boldsymbol{x_1}+b=-E_{1}+\alpha_{1}^{\text{old}} y_{1} \boldsymbol{x}_{1}^{T} \boldsymbol{x}_{1}+\alpha_{2}^{\text{old}} y_{2} \boldsymbol{x}_{2}^{T} \boldsymbol{x}_{1}+b^{\text{old}}\\
& \color{red}{ \begin{cases} b_{1}^{\text{new}}=b^{\text{old}}-E_{1}-y_{1}\left(\alpha_{1}^{\text{new}}-\alpha_{1}^{\text{old}}\right) x_{1}^{T} x_{1}-y_{2}\left(\alpha_{2}^{\text{new}}-\alpha_{2}^{\text{old}}\right) x_{2}^{T} x_{1} \\[2ex] b_{2}^{\text{new}}=b^{\text{old}}-E_{2}-y_{1}\left(\alpha_{1}^{\text{new}}-\alpha_{1}^{\text{old}}\right) x_{1}^{T} x_{2}-y_{2}\left(\alpha_{2}^{\text{new}}-\alpha_{2}^{\text{old}}\right) x_{2}^{T} x_{2} \end{cases}}
\end{align}
b_1​b_2​都有效时,相等
b^{\text { new }}=b_{1}^{\text { new }}=b_{2}^{\text {new}}
When neither multiplier is strictly inside (0, C), any threshold between b_1 and b_2 is consistent with the KKT conditions; in that case SMO takes their midpoint as the new threshold:
b= \begin{cases} b_{1} & {0<\alpha_{1}^{\text{new}}<C} \\ b_{2} & {0<\alpha_{2}^{\text{new}}<C} \\ {\left(b_{1}+b_{2}\right) / 2} & {\text{otherwise}} \end{cases}
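The corresponding update of the threshold, following the two formulas and the case rule above (a sketch; x1 and x2 are NumPy vectors, E1 and E2 the error terms from 6.4):

```python
def update_b(b_old, E1, E2, y1, y2, a1_old, a1_new, a2_old, a2_new, x1, x2, C):
    """New threshold b after the pair (alpha_1, alpha_2) has been updated."""
    b1 = b_old - E1 - y1 * (a1_new - a1_old) * (x1 @ x1) - y2 * (a2_new - a2_old) * (x2 @ x1)
    b2 = b_old - E2 - y1 * (a1_new - a1_old) * (x1 @ x2) - y2 * (a2_new - a2_old) * (x2 @ x2)
    if 0 < a1_new < C:
        return b1
    if 0 < a2_new < C:
        return b2
    return (b1 + b2) / 2.0
```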


