Classification

Author: Recalcitrant | Published: 2019-09-27 22:16 | Views: 0


Worked example: recognizing handwritten digit images

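I. Data preparation

1. Download the MNIST dataset

from sklearn.datasets import fetch_openml
# as_frame=False returns NumPy arrays, so X[0] below selects the first image
mnist = fetch_openml('mnist_784', version=1, as_frame=False)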

2. Get the data and labels

X, y = mnist['data'], mnist['target']

3. Plot a digit

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

some_digit = X[0]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary)
plt.axis('off')
plt.show()

4. Split the data and labels into training and test sets

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

5. Shuffle the training set

import numpy as np
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

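II. Training the model

The example is a binary classifier: 5 versus not-5.

1. Rebuild the labels

For binary classification, rebuild the labels as True (the digit is 5) and False (any other digit). The labels loaded from OpenML are strings, so the comparison is against '5'.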

y_train_5 = (y_train == '5')
y_test_5 = (y_test == '5')

2. Train the models

(1) Stochastic gradient descent (SGD) classifier

from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

(2) Decision tree classifier

from sklearn.tree import DecisionTreeClassifier
tree_clf = DecisionTreeClassifier()
tree_clf.fit(X_train, y_train_5)

III. Model evaluation

1. Cross-validation

Use cross-validation to measure the decision tree's performance.

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
# shuffle=True is required for random_state to take effect; recent
# scikit-learn versions raise an error if it is set without shuffling
skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(tree_clf)
    # Select the rows that belong to this stratified fold
    X_train_folds = X_train[train_index]
    y_train_folds = y_train_5[train_index]
    X_test_folds = X_train[test_index]
    y_test_folds = y_train_5[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_folds)

    # Count the correct predictions
    n_correct = sum(y_pred == y_test_folds)

    # Print the accuracy on this fold
    print(n_correct / len(y_pred))

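Note: accuracy alone is not a reliable measure of a classifier. Only about 10% of the images are 5s, so a model that always predicts "not 5" would still score about 90%.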
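To make the warning concrete, here is a minimal sketch in plain Python. The label list is made up (100 positives, 900 negatives) to mimic the 5 versus not-5 imbalance; it is not the MNIST data, and the "classifier" is just a constant answer.

```python
# Hypothetical skewed labels: 1 positive ("5") for every 9 negatives.
y_true = [True] * 100 + [False] * 900

# A "classifier" that ignores its input and always answers "not 5".
y_pred = [False] * len(y_true)

# Fraction of predictions that match the labels.
accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

# Number of actual 5s that were detected.
true_positives = sum(p and t for p, t in zip(y_pred, y_true))

print(accuracy)        # 0.9
print(true_positives)  # 0
```

The score looks respectable even though not a single 5 was found, which is why the sections that follow turn to the confusion matrix instead.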

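2. Confusion matrix

(1) First-level metrics

Confusion matrix: the counts of true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN).

(2) Second-level metrics

True positive rate TPR = TP / (TP + FN) (this is the recall)

True negative rate TNR = TN / (TN + FP)

False positive rate FPR = FP / (FP + TN)

False negative rate FNR = FN / (FN + TP)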

Precision and recall

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

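Second-level metric	What it measures
Accuracy (A)	The classifier's judgment over the whole sample: labeling positives as positive and negatives as negative
Precision (P)	The share of predicted positives that are truly positive (how often a positive prediction is right)
Recall (R)	The share of actual positives that are predicted positive (how many of the positives are found)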
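The formulas above can be checked by hand on a tiny example. The sketch below counts the four confusion-matrix cells in plain Python; the helper name `confusion_counts` and the ten labels are made up for illustration and are not part of Scikit-Learn.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for boolean labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)              # positive, predicted positive
    fp = sum((not t) and p for t, p in pairs)        # negative, predicted positive
    fn = sum(t and (not p) for t, p in pairs)        # positive, predicted negative
    tn = sum((not t) and (not p) for t, p in pairs)  # negative, predicted negative
    return tp, fp, fn, tn

# Hypothetical labels and predictions: 4 positives, 6 negatives.
y_true = [True, True, True, True, False, False, False, False, False, False]
y_pred = [True, True, True, False, True, False, False, False, False, False]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, fp, fn, tn)               # 3 1 1 5
print(accuracy, precision, recall)  # 0.8 0.75 0.75
```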

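(3) Third-level metrics

F1 score

F1 score (precision and recall weighted equally) = 2 / (1 / Precision + 1 / Recall) = 2 * (Precision * Recall) / (Precision + Recall)

Fβ score (recall weighted β times as much as precision) = (1 + β²) * (Precision * Recall) / (β² * Precision + Recall)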

# Confusion matrix (y_true: label vector, y_pred: prediction vector)
from sklearn.metrics import confusion_matrix
confusion_matrix(y_true, y_pred)

# Precision, recall, and F1 score
from sklearn.metrics import precision_score, recall_score, f1_score
precision_score(y_true, y_pred)
recall_score(y_true, y_pred)
f1_score(y_true, y_pred)
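Because F1 is a harmonic mean, it stays high only when precision and recall are both high. The following plain-Python sketch shows this with made-up precision/recall values; `f_beta` is a hypothetical helper written for this note (Scikit-Learn ships its own `fbeta_score`, which works on label vectors instead).

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score from precision and recall; beta=1 gives the F1 score."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when nothing was detected
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

balanced = f_beta(0.5, 0.5)                # precision and recall equal
lopsided = f_beta(0.9, 0.1)                # high precision, poor recall
recall_heavy = f_beta(0.9, 0.1, beta=2.0)  # same values, recall weighted more

print(round(balanced, 3))      # 0.5
print(round(lopsided, 3))      # 0.18
print(round(recall_heavy, 3))  # 0.122
```

Both pairs have the same arithmetic mean (0.5), yet the lopsided pair scores far lower, and it drops further once β > 1 puts more weight on the weak recall.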

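3. The precision/recall trade-off

(1) Decision scores

Get the decision scores

from sklearn.model_selection import cross_val_predict
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method='decision_function')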
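A decision score only becomes a prediction once it is compared with a threshold (a linear classifier's predict() corresponds to a threshold of 0). The sketch below uses seven made-up scores and labels, not real SGDClassifier output, to show how moving the threshold changes precision and recall; `precision_recall_at` is a hypothetical helper.

```python
# Hypothetical decision scores and true labels for seven images.
scores = [-3.2, -1.5, -0.4, 0.3, 0.9, 2.1, 4.0]
y_true = [False, False, True, False, True, True, True]

def precision_recall_at(threshold):
    """Precision and recall when 'score > threshold' means positive."""
    y_pred = [s > threshold for s in scores]
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # no positive predictions
    recall = tp / (tp + fn)
    return precision, recall

for threshold in (-1.0, 0.0, 1.0):
    print(threshold, precision_recall_at(threshold))
# -1.0 (0.8, 1.0)
# 0.0 (0.75, 0.75)
# 1.0 (1.0, 0.5)
```

Recall can only fall as the threshold rises, while precision generally rises but may dip along the way (0.8 to 0.75 here) before it reaches 1.0.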

(2) Precision-recall curve (PRC)

The precision_recall_curve() function returns the precisions, recalls, and thresholds.

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

Plot precision and recall against the threshold

%matplotlib inline
import matplotlib.pyplot as plt

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], 'g--', label='Precision')
    plt.plot(thresholds, recalls[:-1], 'b--', label='Recall')
    plt.xlabel('Threshold')
    plt.ylim([0, 1])
    plt.legend(loc='upper left')
    
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)

Plot precision against recall to look for the best trade-off

def plot_precision_recall(precisions, recalls):
    plt.plot(recalls[:-1], precisions[:-1], 'b-')
    # plt.plot([0, 1], [0, 1], 'k--')
    plt.xlabel("recall")
    plt.ylabel("precision")
    plt.show()
    
plot_precision_recall(precisions, recalls)
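Once the curve is plotted, a common next step is to fix a target precision and read off the lowest threshold that reaches it. A minimal sketch follows, assuming arrays shaped like those returned by precision_recall_curve() (increasing thresholds, with one extra trailing entry in precisions and recalls). The numbers are invented and `lowest_threshold_for_precision` is a hypothetical helper.

```python
# Hypothetical arrays shaped like the output of precision_recall_curve().
precisions = [0.57, 0.67, 0.80, 0.75, 1.00, 1.00, 1.00, 1.00]
recalls = [1.00, 1.00, 1.00, 0.75, 0.75, 0.50, 0.25, 0.00]
thresholds = [-3.2, -1.5, -0.4, 0.3, 0.9, 2.1, 4.0]

def lowest_threshold_for_precision(precisions, recalls, thresholds, target):
    """Return (threshold, precision, recall) for the first threshold whose
    precision reaches the target, or None if no threshold does."""
    # zip() stops at the shortest list, which drops the extra trailing entry.
    for p, r, th in zip(precisions, recalls, thresholds):
        if p >= target:
            return th, p, r
    return None

choice = lowest_threshold_for_precision(precisions, recalls, thresholds, 0.90)
print(choice)  # (0.9, 1.0, 0.75)
```

With real scores, the chosen threshold would then be applied as `y_scores >= threshold` in place of calling predict().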

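(3) Receiver operating characteristic (ROC) curve

The ROC curve plots the true positive rate against the false positive rate at every threshold.

Vertical axis: TPR (true positive rate), the probability that a positive is classified correctly = TP / (TP + FN)
Horizontal axis: FPR (false positive rate), the probability that a negative is classified incorrectly = FP / (FP + TN)

Plot the ROC curve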

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel("FPR")
    plt.ylabel("TPR")
    plt.show()
    
plot_roc_curve(fpr, tpr)

The dashed line is the ROC curve of a purely random classifier.
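A ROC curve is often summarized by the area under it (AUC): 0.5 for the random diagonal and 1.0 for a perfect classifier. The sketch below builds the ROC points from scratch on made-up scores and integrates them with the trapezoidal rule. `roc_points` and `auc` are hypothetical helpers written for this note; with real data, Scikit-Learn's `roc_auc_score` does the same job.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) points, sweeping the threshold from high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for th in sorted(set(scores), reverse=True):
        tp = sum(t and s >= th for t, s in zip(y_true, scores))
        fp = sum((not t) and s >= th for t, s in zip(y_true, scores))
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under a piecewise-linear curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical decision scores and labels (4 positives, 3 negatives).
scores = [-3.2, -1.5, -0.4, 0.3, 0.9, 2.1, 4.0]
y_true = [False, False, True, False, True, True, True]

print(round(auc(roc_points(y_true, scores)), 4))  # 0.9167
print(auc([(0.0, 0.0), (1.0, 1.0)]))              # 0.5, the random diagonal
```

The value 0.9167 is 11/12: of the 12 positive/negative pairs, 11 are ranked in the right order.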

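IV. Multiclass classifiers

There are two strategies:

1. One-versus-all (OvA), also called one-versus-rest (OvR): for digit recognition, train 10 binary classifiers (0 vs. not-0, 1 vs. not-1, 2 vs. not-2, ...), each able to detect a single digit. To classify an image, run all 10 classifiers and pick the class whose classifier outputs the highest score.

2. One-versus-one (OvO): train a binary classifier for every pair of digits (0 vs. 1, 0 vs. 2, 0 vs. 3, ..., 7 vs. 9, 8 vs. 9), which gives 10 × 9 / 2 = 45 classifiers. To classify an image, run all 45 classifiers and pick the class that wins the most duels.

Advantage of OvO: each classifier is trained only on the examples of the two classes it distinguishes, not on the whole training set.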
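The two decision rules can be sketched without training anything. In the plain-Python example below, the per-class scores and the duel outcomes are invented stand-ins for what trained binary classifiers would return, and `ovr_predict` and `ovo_predict` are hypothetical helpers.

```python
from collections import Counter
from itertools import combinations

classes = list(range(10))

# OvO needs one classifier per unordered pair of classes.
pairs = list(combinations(classes, 2))
print(len(pairs))  # 45

def ovr_predict(score_per_class):
    """OvR/OvA rule: the class whose detector gives the highest score wins."""
    return max(score_per_class, key=score_per_class.get)

def ovo_predict(duel_winners):
    """OvO rule: the class that wins the most pairwise duels wins."""
    return Counter(duel_winners).most_common(1)[0][0]

# Hypothetical detector scores for one image: the "5" detector is most confident.
ovr_scores = {digit: -1.0 for digit in classes}
ovr_scores[5] = 1.7
print(ovr_predict(ovr_scores))  # 5

# Hypothetical duel results: 5 wins every duel it takes part in,
# and the smaller digit wins every other duel.
winners = [5 if 5 in pair else pair[0] for pair in pairs]
print(ovo_predict(winners))  # 5
```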

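To force Scikit-Learn to use the one-versus-one or one-versus-rest strategy, use the OneVsOneClassifier or OneVsRestClassifier class.

Create an instance and pass a binary classifier to its constructor.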

from sklearn.multiclass import OneVsOneClassifier
ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)

Article link: https://www.haomeiwen.com/subject/lqinsctx.html
