Chapter 2: SVM (Support Vector Machine) - Coding

Posted by: iOSDevLog | Published 2019-04-16 17:32


Author: Savan Patel
Date: May 5, 2017
Original: https://medium.com/machine-learning-101/chapter-2-svm-support-vector-machine-coding-edd8f1cf8f2d

It's not a bug - it's just an undocumented new feature

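How does a Support Vector Machine compare with Naive Bayes? Is it slow to train? Let's explore these questions in this coding exercise. This is the second part of Chapter 2: Support Vector Machine, or Support Vector Classifier. If you haven't read the theory (Part 1), I recommend reading it here. Understanding the basics behind the SVM classifier is strongly advised.

Although reading alone will give you a fair idea of the implementation, I strongly recommend opening your editor and coding along with the tutorial. It will give you better insight and longer-lasting learning.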

What shall we do?

Don't forget to ❤. :)

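The coding exercise is an extension of the earlier Naive Bayes classifier program, which classifies emails as spam or non-spam. Don't worry if you haven't gone through Naive Bayes (Chapter 1) yet, although I suggest you complete it first. The same code snippets are discussed here in an abstract way as well.

We will reduce the training time by shrinking the training dataset to 10% of its original size. Then we will vary the tuning parameters to improve accuracy. We will see how changing the kernel, C, and gamma affects accuracy and timing.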

Side effects!

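1. Downloading

I have created a git repository for the dataset and the sample code. You can download it from here (use the Chapter 2 folder). If you get stuck, you can use or refer to my version (classifier.py in the Chapter 2 folder) to understand how it works. Ignore the plot.py file.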

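2. A little bit about cleaning

If you have already coded the Naive Bayes part, you can skip this section. (It is for readers who have jumped directly here.)

Before we can apply the sklearn classifiers, we must clean the data. Cleaning involves removing stop words, extracting the most common words from the text, and so on. For the details, please refer once again to the coding part of Chapter 1 here. In the accompanying code sample we perform the following steps:

Build a dictionary of words from the email documents in the training set.
Consider the 3000 most common words.
For each document in the training set, create a frequency matrix for these dictionary words, along with the corresponding labels. (Spam email file names start with the prefix "spmsg".)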

The code snippet below does this:
import os
from collections import Counter

import numpy as np


def make_Dictionary(root_dir):
    all_words = []
    emails = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]
    for mail in emails:
        with open(mail) as m:
            for line in m:
                all_words += line.split()
    dictionary = Counter(all_words)
    # Copy the keys first: deleting entries while iterating over the
    # Counter itself raises a RuntimeError on Python 3.
    for item in list(dictionary):
        # Remove tokens that are not purely alphabetic or are one character long.
        if not item.isalpha() or len(item) == 1:
            del dictionary[item]
    # Keep only the 3000 most common words, as (word, count) pairs.
    return dictionary.most_common(3000)


def extract_features(mail_dir):
    # Relies on the module-level `dictionary` built by make_Dictionary.
    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files), 3000))
    train_labels = np.zeros(len(files))
    # Map each word to its column once instead of scanning the list per word.
    word_index = {entry[0]: i for i, entry in enumerate(dictionary)}
    for docID, fil in enumerate(files):
        with open(fil) as fi:
            for line_no, line in enumerate(fi):
                if line_no == 2:  # only the third line of each file is read
                    words = line.split()
                    for word in words:
                        if word in word_index:
                            features_matrix[docID, word_index[word]] = words.count(word)
        # Spam file names start with the prefix "spmsg".
        if os.path.basename(fil).startswith("spmsg"):
            train_labels[docID] = 1
    return features_matrix, train_labels
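
To make the three cleaning steps concrete, here is a minimal sketch that runs on a few in-memory strings instead of email files. The sample messages, the `VOCAB_SIZE` of 5, and the helper names `build_vocab` and `to_features` are hypothetical and chosen only for illustration; they are not part of the author's repository.

```python
from collections import Counter

# Hypothetical toy corpus standing in for the email bodies.
docs = [
    "win money now win prize",
    "meeting schedule for project",
    "win free prize money",
]
VOCAB_SIZE = 5  # the article uses 3000; 5 keeps the output readable


def build_vocab(texts, size):
    """Return the `size` most common alphabetic words longer than one character."""
    counts = Counter(
        w for text in texts for w in text.split() if w.isalpha() and len(w) > 1
    )
    return [word for word, _ in counts.most_common(size)]


def to_features(texts, vocab):
    """Return one row per document holding the count of each vocabulary word."""
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for text in texts:
        row = [0] * len(vocab)
        for word in text.split():
            if word in index:
                row[index[word]] += 1
        rows.append(row)
    return rows


vocab = build_vocab(docs, VOCAB_SIZE)
features = to_features(docs, vocab)
print(vocab)
print(features)
```

Each row of `features` plays the role of one row of `features_matrix`: column j holds how often vocabulary word j occurs in that document.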

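3. Entering the world of SVC

The code for using svc is similar to that for Naive Bayes. We first import svc from the library. Next, we extract the training features and labels. Finally, we ask the model to predict the labels of the test set. The basic code block looks like this: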

from sklearn import svm
from sklearn.metrics import accuracy_score

TRAIN_DIR = "../train-mails"
TEST_DIR = "../test-mails"

dictionary = make_Dictionary(TRAIN_DIR)

print("reading and processing emails from file.")
features_matrix, labels = extract_features(TRAIN_DIR)
test_feature_matrix, test_labels = extract_features(TEST_DIR)

model = svm.SVC()

print("Training model.")
# train model
model.fit(features_matrix, labels)

predicted_labels = model.predict(test_feature_matrix)

print("FINISHED classifying. accuracy score : ")
print(accuracy_score(test_labels, predicted_labels))

Putting it all together:

import os
from collections import Counter

import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score


def make_Dictionary(root_dir):
    all_words = []
    emails = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]
    for mail in emails:
        with open(mail) as m:
            for line in m:
                all_words += line.split()
    dictionary = Counter(all_words)
    # Iterate over a copy of the keys so entries can be deleted safely.
    for item in list(dictionary):
        if not item.isalpha() or len(item) == 1:
            del dictionary[item]
    return dictionary.most_common(3000)


def extract_features(mail_dir):
    # Relies on the module-level `dictionary` built by make_Dictionary.
    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files), 3000))
    train_labels = np.zeros(len(files))
    word_index = {entry[0]: i for i, entry in enumerate(dictionary)}
    for docID, fil in enumerate(files):
        with open(fil) as fi:
            for line_no, line in enumerate(fi):
                if line_no == 2:  # only the third line of each file is read
                    words = line.split()
                    for word in words:
                        if word in word_index:
                            features_matrix[docID, word_index[word]] = words.count(word)
        # Spam file names start with the prefix "spmsg".
        if os.path.basename(fil).startswith("spmsg"):
            train_labels[docID] = 1
    return features_matrix, train_labels


TRAIN_DIR = "../train-mails"
TEST_DIR = "../test-mails"

dictionary = make_Dictionary(TRAIN_DIR)

print("reading and processing emails from file.")
features_matrix, labels = extract_features(TRAIN_DIR)
test_feature_matrix, test_labels = extract_features(TEST_DIR)

model = svm.SVC()

print("Training model.")
# train model
model.fit(features_matrix, labels)

predicted_labels = model.predict(test_feature_matrix)

print("FINISHED classifying. accuracy score : ")
print(accuracy_score(test_labels, predicted_labels))

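This is a very basic implementation. It uses scikit-learn's default values for the tuning parameters (kernel = rbf and C = 1; gamma defaults to "scale" in recent releases and to "auto" in older ones).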

What accuracy do you get in this case?
What is the training time? Is it faster or slower than Naive Bayes?
How does the accuracy compare with Naive Bayes?
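
One way to answer the training-time question is to wrap `fit` in a timer. The sketch below is only an illustration under stated assumptions: it uses synthetic count data rather than the email features, the helper name `fit_seconds` is hypothetical, and `MultinomialNB` stands in for the Naive Bayes classifier, since this section does not say which variant Chapter 1 uses.

```python
import time

import numpy as np
from sklearn import svm
from sklearn.naive_bayes import MultinomialNB

# Synthetic word-count-like data (an assumption, not the email dataset).
rng = np.random.RandomState(0)
X = rng.poisson(1.0, size=(300, 50))
y = rng.randint(0, 2, size=300)


def fit_seconds(model, X, y):
    """Fit the model and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


svc_time = fit_seconds(svm.SVC(), X, y)
nb_time = fit_seconds(MultinomialNB(), X, y)
print("SVC fit: %.4f s, MultinomialNB fit: %.4f s" % (svc_time, nb_time))
```

To time the real exercise, pass `features_matrix` and `labels` in place of `X` and `y`. The absolute numbers depend on the machine, so compare the two timings from the same run.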

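Hmm... how can we reduce the training time?

One way is to reduce the size of the training set. We will shrink it to 1/10 of its original size and then check the accuracy. It will, of course, drop.

Here we have 702 emails for training. 1/10 of that means 70 emails, which is very few. (Still, let's see what wonders we can achieve.)

Add the following lines before training the model. (They reduce features_matrix and labels to 1/10 of their size.)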

# Integer division (//) keeps the slice index an int on Python 3 as well.
features_matrix = features_matrix[:len(features_matrix) // 10]
labels = labels[:len(labels) // 10]
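
The slice above keeps the first tenth of the rows in whatever order `os.listdir` returned the files, so the mix of spam and non-spam in that slice is not guaranteed. A hedged alternative is to draw the tenth at random with a fixed seed. In the sketch below the arrays are stand-ins with the article's shapes, the even 351/351 label split is an assumption made only for the demo, and `small_features` and `small_labels` are hypothetical names.

```python
import numpy as np

# Stand-in arrays with the article's shapes (702 emails, 3000 words).
# The 351/351 label split is an assumption for this demo.
features_matrix = np.zeros((702, 3000))
labels = np.array([0] * 351 + [1] * 351)

rng = np.random.RandomState(42)  # fixed seed for repeatable runs
keep = rng.permutation(len(labels))[: len(labels) // 10]

small_features = features_matrix[keep]
small_labels = labels[keep]
print(small_features.shape, small_labels.shape)
```

Indexing both arrays with the same `keep` array keeps each feature row paired with its label.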

What are the training time and accuracy now?

Parameter tuning


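I guess you will get around 56% accuracy. That is too low.

Now, keeping the training set at 1/10, let's try tuning three parameters: kernel, C, and gamma.

1. Kernel

Set the kernel to rbf, i.e. add the kernel parameter to model = SVC(). (rbf is also the default kernel of SVC in scikit-learn, so this makes the choice explicit.)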

model = svm.SVC(kernel="rbf", C = 1)

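2. C

Next, vary C (the regularization parameter) over 10, 100, 1000, and 10000. Does the accuracy increase or decrease?

You will notice that at C = 100 the accuracy rises to 85.38% and then stays almost the same.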
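
Rather than editing the script once per value, the C values can be swept in a loop. This is a minimal sketch under stated assumptions: it uses synthetic data from `make_classification` instead of the email features, so its accuracies will not match the 85.38% quoted above, and C = 1 is included only as a baseline.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data (an assumption; swap in the email features).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

scores = {}
for C in (1, 10, 100, 1000, 10000):
    model = svm.SVC(kernel="rbf", C=C)
    model.fit(X_train, y_train)
    scores[C] = accuracy_score(y_test, model.predict(X_test))
    print("C = %-6d accuracy = %.4f" % (C, scores[C]))
```

For the exercise itself, replace the four synthetic arrays with `features_matrix`, `test_feature_matrix`, `labels`, and `test_labels`.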

3. Gamma

Finally, let's play with gamma. Add one more parameter, gamma = 1.0

model = svm.SVC(kernel="rbf", C=100, gamma=1)

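Oops! The accuracy dropped, right? Try a higher value, gamma = 10. It drops further, right? Now try reducing it. Use the values 0.1, 0.01, and 0.001. What is the accuracy now? Is it increasing?

You will notice that in this exercise, low gamma values give us high accuracy. (Intuition: this suggests the data points are sparse and lie well away from the decision boundary.)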
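
The role of gamma can be seen from the RBF kernel itself, which scores the similarity of two points as exp(-gamma * ||x - z||^2). The sketch below evaluates that formula in plain Python for the gamma values tried above. The two points and the function name `rbf_kernel` are hypothetical and chosen only for illustration.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF similarity exp(-gamma * ||x - z||^2) between two points."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Two hypothetical points whose squared distance is 1 + 4 = 5.
x, z = [0.0, 0.0], [1.0, 2.0]
for gamma in (10, 1, 0.1, 0.01, 0.001):
    print("gamma = %-6s similarity = %.6f" % (gamma, rbf_kernel(x, z, gamma)))
```

With a large gamma the similarity between distinct points collapses toward 0, so each training point influences only its immediate neighbourhood. With a small gamma, points that are far apart still count as similar, which is one way to read why low gamma values work better on sparse data like this.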

In this case, we see that even with the reduced training set we can reach 85.4% accuracy. (PS: what was the accuracy score for Naive Bayes?)

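Running the script faster [optional]

You may have noticed that on every run the script spends a lot of time cleaning and reading the data (features and labels) from the emails. You can speed this up by saving the data extracted during the first run.

This will save you time so that you can focus on learning how to tune the parameters.

Use the following snippet in your code to save and load the data.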

import gzip

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3


def load(file_name):
    # load a previously saved object
    with gzip.open(file_name, "rb") as stream:
        return pickle.load(stream)


def save(file_name, obj):
    # save an object, compressed with gzip
    with gzip.open(file_name, "wb") as stream:
        pickle.dump(obj, stream)


# To save
save("/tmp/features_matrix", features_matrix)
save("/tmp/labels", labels)
save("/tmp/test_feature_matrix", test_feature_matrix)
save("/tmp/test_labels", test_labels)

# To load
features_matrix = load("/tmp/features_matrix")
labels = load("/tmp/labels")
test_feature_matrix = load("/tmp/test_feature_matrix")
test_labels = load("/tmp/test_labels")

Note: see classifier.py and classifier-fast.py for reference.
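
The save and load steps can be folded into one helper that extracts the data only when no cached copy exists. This is a sketch for Python 3, not the contents of classifier-fast.py: the helper `load_or_compute`, the stand-in `slow_extract`, and the temporary cache path are all hypothetical.

```python
import gzip
import os
import pickle
import tempfile


def load_or_compute(path, compute):
    """Return the pickled object at `path`, computing and saving it if absent."""
    if os.path.exists(path):
        with gzip.open(path, "rb") as stream:
            return pickle.load(stream)
    value = compute()
    with gzip.open(path, "wb") as stream:
        pickle.dump(value, stream)
    return value


# Hypothetical stand-in for the slow extract_features(TRAIN_DIR) call.
calls = []


def slow_extract():
    calls.append(1)
    return [[0, 1, 2], [3, 4, 5]]


cache_path = os.path.join(tempfile.mkdtemp(), "features_matrix.pkl.gz")
first = load_or_compute(cache_path, slow_extract)   # computes and saves
second = load_or_compute(cache_path, slow_extract)  # served from the cache
print(first == second, len(calls))
```

In the exercise, the `compute` argument could be `lambda: extract_features(TRAIN_DIR)`. Remember to delete the cache file whenever the dictionary or the cleaning code changes, since a stale cache would silently be reused.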

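Final thoughts

In general, SVC takes more time to train than Naive Bayes but is faster at prediction. In this coding exercise, Naive Bayes outperformed SVC. However, which one performs best depends entirely on the scenario and the dataset.

Even with the training data reduced to 1/10, fairly high accuracy can be achieved.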

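But why do we need to reduce the training set?

Compared with Naive Bayes, SVC takes longer to train, typically about three times as long. There are also applications where fast predictions matter more than accuracy.

Think of credit card transactions. When flagging a transaction as fraudulent, a quick response matters much more than 99% accuracy. 90% accuracy can be tolerated here.
On the other hand, simply marking emails as spam or non-spam can tolerate a delay, so we can work on improving accuracy.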

Do we always need to tune the parameters?

Not really. The sklearn toolkit has built-in functionality that can help us. We will explore it in a later post.
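
The article does not name the built-in functionality it has in mind. One scikit-learn helper that fits the description is `GridSearchCV`, which tries every combination in a parameter grid with cross-validation. Treating it as the intended tool is an assumption, and the sketch below uses synthetic data and an illustrative grid rather than the email features.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic data (an assumption; swap in features_matrix and labels).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Illustrative grid covering the three parameters tuned by hand above.
param_grid = {
    "kernel": ["rbf"],
    "C": [1, 10, 100],
    "gamma": [0.1, 0.01, 0.001],
}
search = GridSearchCV(svm.SVC(), param_grid, cv=3)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy: %.4f" % search.best_score_)
```

The grid multiplies quickly: 3 values of C times 3 values of gamma times 3 folds is already 27 fits, so keep the grid small when training is slow.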

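I hope this tutorial has given you a basic idea of coding with SVC, and of how we can tune the parameters to reach fair accuracy even with a small dataset. (We had only 70 emails in the training set and reached 85% accuracy on a test set of 350 emails) 😊.

What's next?

In the next chapter we will learn about decision trees.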
