一、Logistic Regression
(1) Build the prediction function
- Step 1: build a prediction function (it outputs a probability, so the classification problem becomes a probability problem)
- Linear regression function: $f(x) = \theta^T x$
- where $\theta$ and $x$ are vectors
- A vector is one-dimensional; a matrix is two-dimensional
- In general, when a vector is given, it is <font color = red>a column vector by default</font>
- The original coefficient vector $\theta$ is a column vector; $\theta^T$ (the transpose) denotes a row vector
- Row vector $\theta^T$: a row vector dotted with the column vector $x$ gives a scalar ($\theta^T x$)
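As a quick illustration of the notation above, here is a minimal NumPy sketch (the coefficient and feature values are made up, not taken from any data set):
import numpy as np
theta = np.array([0.5, -1.2, 0.8])   # hypothetical coefficients θ (a column vector by convention)
x = np.array([1.0, 2.0, 3.0])        # one sample's feature vector
z = theta.dot(x)                     # θᵀx: row vector dotted with column vector gives a scalar
print(z)                             # 0.5*1.0 + (-1.2)*2.0 + 0.8*3.0 = 0.5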
(2) Build the loss function
![](https://img.haomeiwen.com/i4956968/968e8aaedaed2a79.png)
(3) Use gradient descent to find the minimum
![](https://img.haomeiwen.com/i4956968/ce2ee4b448a1870f.png)
Although its name says "regression", it is an algorithm for classification, not for regression.
The loss function of logistic regression is based on maximum likelihood.
What is maximum likelihood?
Example:
- Suppose there is a jar containing black and white balls. We do not know how many balls there are, nor the ratio of the two colors. We want to know the proportion of white balls to black balls in the jar, but we cannot take all the balls out and count them. Instead, we can repeatedly draw one ball at a time from the well-shaken jar, record its color, and put it back. Using the recorded colors we can estimate the proportion of black and white balls. If, among the first one hundred draws, seventy were white, what is the most likely proportion of white balls in the jar?
- Maximum likelihood estimation: the calculation
- Let the probability of drawing a white ball be $p$; then a black ball has probability $1-p$ (every ball in the jar is either black or white)
- Draw one ball from the jar: what is the probability that it is white? $p$
- Draw two balls and both are white: what is the probability? $p^2$
- Draw 5 balls and all are white: what is the probability? $p^5$
- Draw 10 balls, 9 white and 1 black: what is the probability? $C_{10}^{9}\,p^{9}(1-p)$
- Draw 100 balls, 70 white and 30 black: what is the probability? $C_{100}^{70}\,p^{70}(1-p)^{30}$
- The binomial coefficient is just a constant; multiplying by a constant makes the value a bit bigger but does not move the maximum, so it can be dropped
- Maximum likelihood estimation: for which value of $p$ is $P(p) = p^{70}(1-p)^{30}$ largest?
- Set the derivative to 0:
- $P'(p) = 70\,p^{69}(1-p)^{30} - 30\,p^{70}(1-p)^{29} = 0$
- $p^{69}(1-p)^{29}\bigl[70(1-p) - 30p\bigr] = 0$
- $70 - 100p = 0$
- p = 70%
From the example above we obtain the likelihood function: it is just the probability function, i.e. the probability of each sample taking its true label.
![](https://img.haomeiwen.com/i4956968/85b0913c9aa94cd2.png)
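A quick numerical check of the derivation above (a minimal sketch; the grid of candidate values for p is arbitrary):
import numpy as np
p = np.linspace(0.01, 0.99, 99)   # candidate values of p
L = p**70 * (1 - p)**30           # likelihood of 70 white + 30 black draws (constant factor dropped)
print(p[np.argmax(L)])            # ≈ 0.7, matching the closed-form answer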
二、Principle and Formulas
Logistic regression
- Logistic regression, despite its name, is a linear model for classification rather than regression.
- General logistic function: $f(x) = \dfrac{L}{1 + e^{-k(x - x_0)}}$
- Let $L = 1$, let $k = 1$, let $x_0 = 0$
- The vertical axis is uniformly scaled down by half
- This gives $f(x) = \dfrac{1}{1 + e^{-x}}$
- which is exactly the Sigmoid function
- So the logistic function and the Sigmoid function are unified: they are the same curve
Using logistic regression:
- Import the package
- Instantiate an object lr
- lr.fit(X_train, y_train) to train
- lr.predict(X_test) to use the trained model
- In real company work, whether the task is complex or simple, the workflow looks roughly like this (see the minimal sketch below)
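A minimal, self-contained sketch of that workflow (using the iris data that also appears later in this post):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression   # 1. import the package

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

lr = LogisticRegression(max_iter=1000)   # 2. instantiate the object lr
lr.fit(X_train, y_train)                 # 3. train
y_pred = lr.predict(X_test)              # 4. use the trained model on new data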
The principle of logistic regression
Step 1: the prediction function
- from sklearn.linear_model import LogisticRegression
- Linear regression (a first-degree equation in four unknowns, i.e. four attributes): $f(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4$
- Sigmoid equation: $g(z) = \dfrac{1}{1 + e^{-z}}$
- sigmoid.png (the S-shaped sigmoid curve)
- Logistic regression = linear regression + sigmoid
- A composite function: the linear regression is plugged into the logistic function
- Prediction function: <font color =red>$h_\theta(x) = g(\theta^T x) = \dfrac{1}{1 + e^{-\theta^T x}}$</font>
- The prediction function above is a probability function; its range is 0 ~ 1
- For classification, the computer (which only follows rules) compares the probabilities and picks the larger one!!!
- Why plug the linear regression into the logistic function???
- Classification: how do we actually classify?
- Turn the classification problem into a probability problem
- To hand a classification problem to a computer, we find a way to turn it into a probability problem and compare magnitudes
- The logistic function is a probability function: whatever value it is given, large or small, it maps it into the range 0 ~ 1, a probability
- That is the <font color =red>clever</font> part of the logistic (Sigmoid) function (a small sketch follows this list)
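A small sketch of the composition, with made-up coefficients just to show the shape of the computation (not values learned from any data):
import numpy as np

theta = np.array([0.4, -0.3, 1.8, 0.8])   # hypothetical linear-regression coefficients
b = -8.7                                   # hypothetical intercept
x = np.array([5.1, 3.5, 1.4, 0.2])         # one sample with four attributes

z = theta.dot(x) + b           # the linear part: can be any real number
h = 1 / (1 + np.exp(-z))       # the sigmoid squashes it into (0, 1): a probability
print(z, h)                    # a negative z gives a probability close to 0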
Step 2: the cost (loss) function
- Earlier, linear regression used: least squares
- Logistic regression uses: maximum likelihood
- Prediction function: $h_\theta(x) = g(\theta^T x) = \dfrac{1}{1 + e^{-\theta^T x}}$
- The prediction function is a probability function; its range is 0 ~ 1
- Likelihood function = probability function:
- The likelihood function is just the probability function: the probability of each sample taking its true label
- What is the likelihood function???
- Logistic regression first solves binary classification, with classes 0 and 1
- As an extension, logistic regression can also solve multi-class classification.
- The binary case:
- Case 1: y = 1, with probability $P(y=1\mid x;\theta) = h_\theta(x)$
- Case 2: y = 0, with probability $P(y=0\mid x;\theta) = 1 - h_\theta(x)$
- Define the probability of a white ball as p; the probability of a black ball is 1 - p
- Map black and white balls to the target value y:
- white ball: y = 1
- black ball: y = 0
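Putting the two cases together gives the standard likelihood and cost function of binary logistic regression (written here in the usual textbook form, which is presumably what the original formula images showed):

$$P(y\mid x;\theta) = h_\theta(x)^{\,y}\,\bigl(1-h_\theta(x)\bigr)^{\,1-y},\qquad y\in\{0,1\}$$

$$L(\theta) = \prod_{i=1}^{m} h_\theta\bigl(x^{(i)}\bigr)^{y^{(i)}}\bigl(1-h_\theta\bigl(x^{(i)}\bigr)\bigr)^{1-y^{(i)}}$$

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Bigl[y^{(i)}\log h_\theta\bigl(x^{(i)}\bigr) + \bigl(1-y^{(i)}\bigr)\log\bigl(1-h_\theta\bigl(x^{(i)}\bigr)\bigr)\Bigr]$$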
三、Simple Usage of Logistic Regression in Code
(一、Basic usage of the code)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn import metrics
from sklearn.model_selection import train_test_split
# iris: the iris flower dataset, subdivided into species
# Different natural environments produce different subspecies
# Different iris species have different sepal and petal lengths and widths
iris = datasets.load_iris()
iris
# Four attributes: sepal length, sepal width, petal length, petal width
X = iris['data']
# Classification problem: the target values
y = iris['target']
display(X.shape,y.shape)
(150, 4)
(150,)
# test_size = 0.2 means 20%
# The test data is 20% of the samples, the training data 80%
# Build a model, learn the patterns in the data, then use them to predict
# train_test_split shuffles the data randomly, so everyone's split will be different
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 0.2,random_state = 512)
lr = LogisticRegression(max_iter=1000)
# The model learns from X_train and y_train
# X_train is the data ------> y_train is the target
# The algorithm searches for the relationship, the equation
lr.fit(X_train,y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=1000,
multi_class='auto', n_jobs=None, penalty='l2',
random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
warm_start=False)
# X_test is the test data; y_test is the ground truth
# X_test is brand-new data as far as our algorithm is concerned
y_ = lr.predict(X_test)
print('Ground truth:\n', y_test)
print('Prediction:\n', y_)
# The algorithm predicted 30 samples; 28 correct, 2 wrong
print('Accuracy:', 28/30)
Ground truth:
[0 1 1 1 2 0 0 2 0 2 1 1 1 0 2 0 2 0 1 1 0 1 0 1 0 2 1 1 1 2]
Prediction:
[0 1 1 2 2 0 0 2 0 2 1 1 1 0 2 0 2 0 1 1 0 1 0 2 0 2 1 1 1 2]
Accuracy: 0.9333333333333333
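Since `metrics` was imported above, the accuracy can also be computed without counting by hand (a minimal sketch that assumes the cells above have been run):
print('Accuracy:', metrics.accuracy_score(y_test, y_))   # same 0.933... as the manual count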
(二、Code implementation of the principle)
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(-10,10)
sigmoid = lambda x : 1/(1 + np.e**(-x))
y = sigmoid(x)
plt.plot(x,y)
plt.title('Sigmoid-Logistic')
Text(0.5, 1.0, 'Sigmoid-Logistic')
![](https://img.haomeiwen.com/i4956968/08c97dfe1b911f69.png)
(三、Logistic regression for handwritten digit recognition)
Import packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.model_selection import train_test_split
Load the data and split it
data = pd.read_csv('./digits.csv')
data
X = data.iloc[:,1:]
y = data['label']
X.shape
(42000, 784)
Visualize a handwritten digit
# The image is 28 pixels high and 28 pixels wide
28*28
784
plt.imshow(X.loc[1024].values.reshape(28,28))
<matplotlib.image.AxesImage at 0x1ab22bd1b88>
![](https://img.haomeiwen.com/i4956968/f2a8b6a4ab5f8095.png)
Split the data into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size = 1000)
print(X_train.shape,X_test.shape)
(41000, 784) (1000, 784)
# Reduce the data: 5000 training samples is still plenty for finding the pattern
X_train = X.iloc[:5000]
y_train =y.iloc[:5000]
X_test = X.iloc[-1000:]
y_test = y.iloc[-1000:]
Use the algorithm: train and predict
import warnings
warnings.filterwarnings('ignore')
%%time
lr = LogisticRegression(max_iter=200)
lr.fit(X_train,y_train)
y_ = lr.predict(X_test)
print('True digits:\n', y_test[:50].values)
print('Logistic regression predictions:\n', y_[:50])
True digits:
[2 8 1 8 0 1 3 8 1 0 8 1 5 7 3 8 7 6 9 7 2 5 8 4 1 6 4 2 4 9 4 1 1 2 7 7 3
6 1 3 5 0 9 5 2 9 1 5 9 4]
Logistic regression predictions:
[2 8 1 8 0 1 3 8 1 0 8 1 5 7 3 3 9 6 9 7 5 5 8 4 1 6 4 2 4 9 0 1 1 2 7 4 3
6 1 8 5 0 9 5 2 9 1 2 9 4]
Wall time: 3.62 s
Compute the accuracy
(y_test == y_).mean()
0.851
lr.score(X_test,y_test)
0.851
Logistic regression: the classification problem becomes a probability problem
lr.predict(X_test)[:5]
array([2, 8, 1, 8, 0], dtype=int64)
lr.predict_proba(X_test)[:5]
array([[3.99944549e-56, 3.13802567e-40, 9.99999991e-01, 8.11029507e-33,
9.55758950e-31, 1.33407904e-36, 1.63559956e-35, 1.36167810e-34,
3.94873126e-13, 9.17542857e-09],
[3.47691937e-32, 9.45659216e-12, 2.15691146e-15, 7.76272853e-17,
4.60336112e-47, 1.95280205e-14, 2.86509150e-14, 3.42799723e-47,
1.00000000e+00, 8.09887934e-30],
[2.95355219e-48, 1.00000000e+00, 3.78128277e-14, 9.53249317e-37,
3.37867878e-48, 1.33987574e-35, 1.18501330e-30, 9.42662502e-55,
7.71758466e-16, 7.02293528e-47],
[1.04645342e-27, 2.44542774e-06, 5.74072289e-19, 6.70739001e-29,
7.67253219e-51, 4.90497849e-13, 8.16572961e-27, 6.46372239e-62,
9.99997555e-01, 4.55192317e-40],
[1.00000000e+00, 1.47237626e-77, 1.61003550e-34, 9.10565041e-46,
8.07064824e-64, 1.78965234e-36, 1.98884531e-31, 6.01113728e-52,
3.11702498e-25, 3.17750910e-48]])
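The predicted class is just the column with the largest probability in each row; because the digit labels here are 0 through 9, the column index equals the label (a small sketch):
lr.predict_proba(X_test)[:5].argmax(axis=1)   # should reproduce lr.predict(X_test)[:5] above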
(四、Binary classification with logistic regression)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
Get the data and filter the classes; keep 2 classes: a binary classification problem
X, y = datasets.load_iris(return_X_y=True)
# 150 means 150 samples; 4 means 4 attributes/features
# Data mining, machine learning, AI -------> turn real problems into math (an equation, then solve it)
print(X.shape, y.shape)
# There are three classes
# When deriving logistic regression we assume binary classification: 0, 1
# The filter below removes class 1, keeping only classes 0 and 2
# (You could equally remove class 2 to keep 0 and 1, or remove class 0 to keep 1 and 2)
cond = y!=1
X = X[cond]
y = y[cond]
print(X.shape)
print(y)
(150, 4) (150,)
(100, 4)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
Train the algorithm and predict
Once the algorithm has found the pattern, use it to predict
lr = LogisticRegression()
lr.fit(X, y)  # the algorithm trains on the data, looking for the relationship (equation) between X and y
y_ = lr.predict(X)  # once the pattern is found, use it to compute predictions
# Are the computed y_ and the true y exactly the same???
# Here the results are identical: the data is easy (the two classes kept here are well separated,
# so the accuracy is 100%, meaning they are easy to distinguish)
# Running the same code on a harder pair (removing class 0 and keeping classes 1 and 2) gives about 96% accuracy,
# with 4 wrong predictions; conversely, some individual samples are not so easy to classify
(y_ == y).mean()
1.0
Probability computation by the algorithm
# The computed probabilities
proba_ = lr.predict_proba(X)  # probability
# The probabilities can then be converted back into class labels
proba_[:10]
array([[0.99153216, 0.00846784],
[0.9908928 , 0.0091072 ],
[0.99355185, 0.00644815],
[0.99086387, 0.00913613],
[0.99219814, 0.00780186],
[0.98268555, 0.01731445],
[0.99252002, 0.00747998],
[0.9899949 , 0.0100051 ],
[0.99259338, 0.00740662],
[0.99028393, 0.00971607]])
Compute the probabilities by hand, in code
'''calculate the probability
of each class assuming it to be positive using the logistic function.
and normalize these values across all the classes.'''
# w_ = lr.coef_ (defined below under "View the equation"); three equivalent ways to compute the linear scores:
w_[0].dot(X.T)
array([ 4.00608449, 4.07951914, 3.73156718, 4.08271965, 3.92349967,
4.73031193, 3.88104351, 4.17445465, 3.871113 , 4.14484998,
4.26618964, 4.26023998, 3.94765549, 3.15681607, 3.80957767,
4.33981114, 3.99918266, 4.08944965, 4.82652859, 4.16997299,
4.73401326, 4.28742447, 2.99837642, 4.87269957, 4.80858694,
4.49358227, 4.52396728, 4.2373653 , 4.08866931, 4.27991414,
4.36249896, 4.53517893, 3.94948219, 3.96147417, 4.22821513,
3.69428034, 4.01729615, 3.79163602, 3.65424436, 4.22295314,
3.85816884, 4.0247123 , 3.5860717 , 4.65661126, 4.98446742,
4.1143858 , 4.26939015, 3.86585101, 4.21769114, 4.02575865,
14.98162437, 12.79841805, 14.95562835, 13.8032843 , 14.56522023,
16.47759705, 11.16668003, 15.56774547, 14.49918822, 15.49863414,
13.05084102, 13.45497365, 14.07900359, 12.71867505, 13.1811575 ,
13.61800264, 13.68341264, 16.5195524 , 17.37751813, 12.54960373,
14.59162438, 12.38513525, 16.69368536, 12.59198072, 14.29381076,
14.86864104, 12.32661358, 12.39272475, 14.13596459, 14.40451874,
15.36813081, 15.90147212, 14.21932975, 12.67336356, 13.47508567,
15.77891426, 14.1330436 , 13.60082782, 12.16144394, 13.91063344,
14.42929656, 13.52901679, 12.79841805, 14.90869053, 14.62727138,
13.64888845, 12.92630085, 13.301796 , 13.63561532, 12.6612924 ])
X.dot(w_[0].T)
array([ 4.00608449, 4.07951914, 3.73156718, 4.08271965, 3.92349967,
4.73031193, 3.88104351, 4.17445465, 3.871113 , 4.14484998,
4.26618964, 4.26023998, 3.94765549, 3.15681607, 3.80957767,
4.33981114, 3.99918266, 4.08944965, 4.82652859, 4.16997299,
4.73401326, 4.28742447, 2.99837642, 4.87269957, 4.80858694,
4.49358227, 4.52396728, 4.2373653 , 4.08866931, 4.27991414,
4.36249896, 4.53517893, 3.94948219, 3.96147417, 4.22821513,
3.69428034, 4.01729615, 3.79163602, 3.65424436, 4.22295314,
3.85816884, 4.0247123 , 3.5860717 , 4.65661126, 4.98446742,
4.1143858 , 4.26939015, 3.86585101, 4.21769114, 4.02575865,
14.98162437, 12.79841805, 14.95562835, 13.8032843 , 14.56522023,
16.47759705, 11.16668003, 15.56774547, 14.49918822, 15.49863414,
13.05084102, 13.45497365, 14.07900359, 12.71867505, 13.1811575 ,
13.61800264, 13.68341264, 16.5195524 , 17.37751813, 12.54960373,
14.59162438, 12.38513525, 16.69368536, 12.59198072, 14.29381076,
14.86864104, 12.32661358, 12.39272475, 14.13596459, 14.40451874,
15.36813081, 15.90147212, 14.21932975, 12.67336356, 13.47508567,
15.77891426, 14.1330436 , 13.60082782, 12.16144394, 13.91063344,
14.42929656, 13.52901679, 12.79841805, 14.90869053, 14.62727138,
13.64888845, 12.92630085, 13.301796 , 13.63561532, 12.6612924 ])
X.dot(w_[0])
array([ 4.00608449, 4.07951914, 3.73156718, 4.08271965, 3.92349967,
4.73031193, 3.88104351, 4.17445465, 3.871113 , 4.14484998,
4.26618964, 4.26023998, 3.94765549, 3.15681607, 3.80957767,
4.33981114, 3.99918266, 4.08944965, 4.82652859, 4.16997299,
4.73401326, 4.28742447, 2.99837642, 4.87269957, 4.80858694,
4.49358227, 4.52396728, 4.2373653 , 4.08866931, 4.27991414,
4.36249896, 4.53517893, 3.94948219, 3.96147417, 4.22821513,
3.69428034, 4.01729615, 3.79163602, 3.65424436, 4.22295314,
3.85816884, 4.0247123 , 3.5860717 , 4.65661126, 4.98446742,
4.1143858 , 4.26939015, 3.86585101, 4.21769114, 4.02575865,
14.98162437, 12.79841805, 14.95562835, 13.8032843 , 14.56522023,
16.47759705, 11.16668003, 15.56774547, 14.49918822, 15.49863414,
13.05084102, 13.45497365, 14.07900359, 12.71867505, 13.1811575 ,
13.61800264, 13.68341264, 16.5195524 , 17.37751813, 12.54960373,
14.59162438, 12.38513525, 16.69368536, 12.59198072, 14.29381076,
14.86864104, 12.32661358, 12.39272475, 14.13596459, 14.40451874,
15.36813081, 15.90147212, 14.21932975, 12.67336356, 13.47508567,
15.77891426, 14.1330436 , 13.60082782, 12.16144394, 13.91063344,
14.42929656, 13.52901679, 12.79841805, 14.90869053, 14.62727138,
13.64888845, 12.92630085, 13.301796 , 13.63561532, 12.6612924 ])
### View the equation
w_ = lr.coef_
b_ = lr.intercept_
print('Equation coefficients', lr.coef_)
print('Equation intercept', lr.intercept_)
def fun(X):  # the linear equation, computed in batch with matrix operations
    return X.dot(w_[0]) + b_[0]
def sigmoid(x):  # x is the value returned by fun, i.e. the linear equation
    return 1/(1 + np.e**-x)
Equation coefficients [[ 0.48498493 -0.34086327 1.8278232 0.83365156]]
Equation intercept [-8.76905997]
proba_[:5]
array([[0.99153216, 0.00846784],
[0.9908928 , 0.0091072 ],
[0.99355185, 0.00644815],
[0.99086387, 0.00913613],
[0.99219814, 0.00780186]])
f = fun(X)
p_1 = sigmoid(f)
p_0 = 1 - p_1
p_ = np.c_[p_0,p_1]
p_[:10]
array([[0.99153216, 0.00846784],
[0.9908928 , 0.0091072 ],
[0.99355185, 0.00644815],
[0.99086387, 0.00913613],
[0.99219814, 0.00780186],
[0.98268555, 0.01731445],
[0.99252002, 0.00747998],
[0.9899949 , 0.0100051 ],
[0.99259338, 0.00740662],
[0.99028393, 0.00971607]])
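As a sanity check (a minimal sketch), the hand-computed probabilities should match sklearn's output up to floating-point tolerance:
print(np.allclose(p_, proba_))   # expected: True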
(五、Multi-class classification with logistic regression)
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
Load the data (a three-class problem) and shuffle the order; everyone's shuffle will be different
# y is a three-class target
X, y = datasets.load_iris(return_X_y=True)
index = np.arange(150)  # 0, 1, 2, ..., 149
np.random.shuffle(index)
X = X[index]
y = y[index]
print(y)
[2 1 2 1 2 0 2 1 0 2 0 0 0 2 0 0 1 0 1 1 2 0 1 0 1 1 0 0 2 0 2 0 0 1 0 0 2
2 1 2 0 0 1 1 1 1 0 0 1 0 1 2 0 2 2 2 1 0 0 2 2 1 0 0 0 2 1 0 1 1 2 2 0 0
2 2 0 2 2 1 0 2 1 2 0 0 0 2 0 2 0 0 1 1 2 0 1 2 2 0 2 2 1 2 2 2 1 1 1 1 1
0 2 1 2 2 1 2 1 0 0 2 0 1 1 2 0 1 1 1 2 1 1 2 0 1 1 1 2 1 2 2 2 0 1 0 1 1
2 0]
Use the algorithm to train on the data and build the model; the equation coefficients come out
lr = LogisticRegression(max_iter = 200)
lr.fit(X,y)
print('Equation slopes\n', lr.coef_)
print('Equation intercepts\n', lr.intercept_)
w_ = lr.coef_
b_ = lr.intercept_
Equation slopes
[[-0.42423735 0.9676256 -2.51686784 -1.07948854]
[ 0.5345059 -0.32152121 -0.2063666 -0.94389713]
[-0.11026856 -0.64610438 2.72323444 2.02338567]]
Equation intercepts
[ 9.85170372 2.23620276 -12.08790648]
Use the model to predict classes and probabilities
y_ = lr.predict(X)  # predicted classes
proba_ = lr.predict_proba(X)  # predicted probabilities
print(y_[:10])
print(proba_[:10])
print(proba_[:10].argmax(axis=1))  # convert probabilities back into classes
[2 1 2 1 2 0 2 1 0 2]
[[6.26707685e-05 1.88637260e-01 8.11300070e-01]
[1.47766771e-01 8.49178375e-01 3.05485403e-03]
[1.62200903e-03 4.40387095e-01 5.57990896e-01]
[3.71141547e-02 9.55351158e-01 7.53468681e-03]
[2.48080106e-06 2.55761731e-02 9.74421346e-01]
[9.68771637e-01 3.12283317e-02 3.17614803e-08]
[9.96351290e-05 1.20620400e-01 8.79279965e-01]
[9.07597639e-03 9.76586059e-01 1.43379645e-02]
[9.86783707e-01 1.32162727e-02 1.99839847e-08]
[3.73964002e-06 1.74498680e-02 9.82546392e-01]]
[2 1 2 1 2 0 2 1 0 2]
Compute the probabilities by hand
''' softmax function is used to find the predicted probability of
each class.'''
# softmax turns the values into probabilities (all the probabilities sum to 1)
' softmax function is used to find the predicted probability of\neach class.'
a = np.array([-3, 1, 3])
# softmax converts values into probabilities; relatively, large values get even larger shares and small values even smaller shares
np.e**a/((np.e**a).sum())
array([0.00217852, 0.11894324, 0.87887824])
def softmax(x):
    return np.e**x / ((np.e**x).sum(axis=1).reshape(-1, 1))
c = np.random.randint(1,10,size = (3,4))
c
array([[3, 7, 6, 1],
[7, 6, 8, 5],
[8, 8, 5, 7]])
c_s = c.sum(axis = 1)
# c_s.reshape(-1,1)
# Divide each element of c by its row sum; this fails because shapes (3,4) and (3,) do not broadcast (c_s needs reshape(-1,1))
c/c_s
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-10-efea26b8ac04> in <module>
      1 # Divide each element of c by its row sum
----> 2 c/c_s
ValueError: operands could not be broadcast together with shapes (3,4) (3,)
# w_ and b_ are the slopes and intercepts of the equations
def linear(x):
    y = x.dot(w_.T) + b_  # matrix operation: the shapes must line up!!!
    return y
# y_pred is the linear value predicted by the linear function
# The linear values are then converted into probabilities
y_pred = linear(X)
y_proba = softmax(y_pred)  # softmax converts them into probabilities
y_proba[:10]
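As with the binary case, a quick sanity check (a minimal sketch; this assumes the fitted model is multinomial, which is the default for three classes with the lbfgs solver in recent scikit-learn versions):
print(np.allclose(y_proba, proba_))   # expected: True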