Chapter 1: Supervised Learning and Naive Bayes Classification - Part 2 (Coding)

Posted by: iOSDevLog | Published: 2019-04-15 15:03


Author: Savan Patel
Date: May 3, 2017
Original: https://medium.com/machine-learning-101/chapter-1-supervised-learning-and-naive-bayes-classification-part-2-coding-5966f25f1475

The question of whether a computer can think is no more interesting than the question of whether a submarine can swim.

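Note: if you have not yet read Part 1, which covers the theory of Naive Bayes, I suggest reading it first (a 4-minute read) here.

In this part we will explore the sklearn library. sklearn, a Python library, provides popular machine learning algorithms such as Naive Bayes. With it, you do not have to write your own Naive Bayes implementation by hand.

Teach a person to use a program, frustrate them for a day. Teach a person to write programs, frustrate them for a lifetime.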

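Coding Exercise

In this exercise we will train a model on a set of emails labelled as spam or non-spam. There are 702 emails, split between the spam and non-spam categories. Next, we will test the model on 260 emails. We will ask the model to predict the category of each of these emails and compare its predictions with the correct classifications, which we already know.

This is a classic example of text data mining.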

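Prerequisites

This tutorial assumes the coding exercise is done on a Debian-based Linux. The installation instructions may differ for your operating system, but the Python code stays the same.

Install Python.
Install pip.
Install sklearn: pip install scikit-learn
Install numpy: pip install numpy
Install SciPy: pip install scipy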

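0. Download

I have created a git repository for the dataset and the sample code. You can download it from here (use the Chapter 1 folder). It is the same dataset discussed in this chapter. You can use or refer to my version to see how things work.

1. Cleaning and Preparing the Data

We have two folders, test-mails and train-mails. We will use train-mails to train the model. A sample email from the dataset looks like this: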

Subject: re : 2 . 882 s - > np np
> deat : sun , 15 dec 91 2 : 25 : 2 est > : michael < mmorse @ vm1 . yorku . ca > > subject : re : 2 . 864 query > > wlodek zadrozny ask " anything interest " > construction " s > np np " . . . second , > much relate : consider construction form > discuss list late reduplication ? > logical sense " john mcnamara name " tautologous thus , > level , indistinguishable " , , here ? " . ' john mcnamara name ' tautologous support those logic-base semantics irrelevant natural language . sense tautologous ? supplies value attribute follow attribute value . fact value name-attribute relevant entity ' chaim shmendrik ' , ' john mcnamara name ' false . tautology , . ( reduplication , either . )

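The first line is the subject, and the content starts from the third line.

If you navigate to either train-mails or test-mails, you will see file names that follow two patterns:

number-numbermsg[number].txt : for example 3-1msg1.txt (these are non-spam emails)

or

spmsg[Number].txt : for example spmsga162.txt (these are spam emails).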
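
The naming convention above is what later lets the code derive a label from each file. As a minimal sketch (the helper name is_spam is my own, introduced only for illustration), the rule can be written as:

```python
import os

def is_spam(path):
    # Hypothetical helper: a file is treated as spam when its base name
    # starts with "spmsg", following the naming pattern described above.
    return os.path.basename(path).startswith("spmsg")

print(is_spam("../train-mails/spmsga162.txt"))  # spam pattern
print(is_spam("../train-mails/3-1msg1.txt"))    # non-spam pattern
```

Using os.path.basename rather than splitting on "/" keeps the check independent of how the directory part of the path is written.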

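The first step in any text data mining task is to clean and prepare the data for the model. In cleaning, we remove unwanted words, expressions and symbols from the text.

Consider the following text: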

“Hi, this is Alice. Hope you are doing well and enjoying your vacation.”

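Here, words such as is, this and are do not really contribute to the analysis. Such words are also called stop words. So, in this exercise, we only consider a dictionary of the 3000 most common words from the emails. The make_Dictionary function later in this section builds that dictionary.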
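
To make the idea of stop words concrete, here is a minimal sketch that filters them out of the sentence above. The STOP_WORDS set is a tiny hand-picked list assumed for illustration; it is not part of the original exercise, which instead keeps the 3000 most common words.

```python
import re

# Hypothetical, hand-picked stop-word list for illustration only.
STOP_WORDS = {"is", "this", "are", "you", "and", "your"}

text = "Hi, this is Alice. Hope you are doing well and enjoying your vacation."

# Lower-case the text and keep alphabetic tokens only.
words = re.findall(r"[a-z]+", text.lower())

# Drop every token that appears in the stop-word list.
content_words = [w for w in words if w not in STOP_WORDS]
print(content_words)
```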

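After cleaning, what we need for each email document is some matrix representation of its word frequencies.

For example, if a document contains the text: “Hi, this is Alice. Happy Birthday Alice”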

word      :   Hi this is Alice Happy Birthday
frequency :   1   1    1  2      1      1
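
The table above can be reproduced with collections.Counter, the same class the exercise code relies on. This is only a sketch on the example sentence; the regular expression used to strip punctuation is my own assumption and is not in the original code.

```python
import re
from collections import Counter

text = "Hi, this is Alice. Happy Birthday Alice"

# Keep alphabetic tokens only, so "Alice." and "Alice" count as the same word.
words = re.findall(r"[A-Za-z]+", text)

# Counter maps each distinct word to the number of times it occurs.
frequency = Counter(words)
for word, count in frequency.items():
    print(word, count)
```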

We need to do this for every file. The extract_features function (section 2) does this, and then drops the less common words of each document.

import os
from collections import Counter

def make_Dictionary(root_dir):
    all_words = []
    emails = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]
    for mail in emails:
        with open(mail) as m:
            for line in m:
                words = line.split()
                all_words += words
    dictionary = Counter(all_words)
    # Iterate over a copy of the keys so entries can be deleted
    # while looping (required on Python 3).
    list_to_remove = list(dictionary)
    for item in list_to_remove:
        # Remove tokens that are not purely alphabetic (numbers, punctuation).
        if not item.isalpha():
            del dictionary[item]
        # Remove single-character words.
        elif len(item) == 1:
            del dictionary[item]
    # Keep only the 3000 most common words, as a list of (word, count) pairs.
    dictionary = dictionary.most_common(3000)
    return dictionary

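make_Dictionary reads the email files from a folder and builds a dictionary of all the words. Next, we remove words of length 1 and words that are not purely alphabetic.

Finally, we keep only the 3000 most common words.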
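
The same three steps (count, filter, keep the most common) can be tried on a few in-memory lines without touching the dataset. This is a sketch under my own assumptions: the function name build_vocabulary and the sample lines are hypothetical.

```python
from collections import Counter

def build_vocabulary(lines, size):
    # Step 1: count every whitespace-separated token.
    counts = Counter(word for line in lines for word in line.split())
    # Step 2: keep purely alphabetic tokens longer than one character.
    kept = Counter({w: c for w, c in counts.items()
                    if w.isalpha() and len(w) > 1})
    # Step 3: return the `size` most common words as (word, count) pairs.
    return kept.most_common(size)

# Hypothetical sample lines standing in for email contents.
sample = ["free money now !", "meeting at 10 am", "free free offer a 1"]
vocab = build_vocabulary(sample, 3)
print(vocab)
```

Note that most_common returns a list of (word, count) tuples rather than a dict, which is why the feature-extraction code later looks at the first element of each entry.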

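2. Extracting Features and the Corresponding Label Matrix

Next, with the help of the dictionary, we generate the labels and the word frequency matrix.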

word      :   Hi this is Alice Happy Birthday
frequency :   1   1    1  2      1      1

import os
import numpy as np

def extract_features(mail_dir):
    # Relies on the module-level `dictionary` built by make_Dictionary.
    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files), 3000))
    train_labels = np.zeros(len(files))
    docID = 0
    for fil in files:
        with open(fil) as fi:
            for i, line in enumerate(fi):
                # The message body is on the third line (index 2).
                if i == 2:
                    words = line.split()
                    for word in words:
                        # Find the column of this word in the dictionary.
                        for wordID, d in enumerate(dictionary):
                            if d[0] == word:
                                features_matrix[docID, wordID] = words.count(word)
        # Spam file names start with "spmsg"; all other files keep label 0.
        if os.path.basename(fil).startswith("spmsg"):
            train_labels[docID] = 1
        docID = docID + 1
    return features_matrix, train_labels
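
The function above scans the whole dictionary once for every word in an email. As a sketch of the same counting idea on a single in-memory document, the lookup can be done through a word-to-column mapping instead. This is an alternative I am introducing for illustration, not the author's code; the name vectorize and the three-word dictionary are hypothetical.

```python
def vectorize(text, dictionary):
    # Map each dictionary word to its column index for constant-time lookup.
    column_of = {word: idx for idx, (word, _count) in enumerate(dictionary)}
    row = [0] * len(dictionary)
    for word in text.split():
        idx = column_of.get(word)
        # Words that are not in the dictionary are ignored.
        if idx is not None:
            row[idx] += 1
    return row

# Hypothetical dictionary in the (word, count) format returned by most_common.
dictionary = [("free", 3), ("money", 1), ("meeting", 1)]
row = vectorize("free money free unknown", dictionary)
print(row)  # [2, 1, 0]
```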

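3. Training and Predicting with sklearn Naive Bayes

The sklearn Naive Bayes documentation (here) clearly explains the usage and the parameters.

Basically, sklearn Naive Bayes offers three options for model training:

Gaussian: used for classification; it assumes that the features follow a normal distribution.
Multinomial: used for discrete counts. For example, say we have a text classification problem. Here we can take the Bernoulli trials one step further: instead of "word occurs in the document", we "count how often the word occurs in the document". You can think of it as "the number of times outcome number x_i is observed over the n trials".
Bernoulli: the Bernoulli model is useful if your feature vectors are binary (i.e. zeros and ones). One application is text classification with a "bag of words" model, where the 1s and 0s mean "word occurs in the document" and "word does not occur in the document" respectively.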
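
All three variants share the same fit and predict interface in sklearn.naive_bayes, so they can be swapped freely. The sketch below trains each one on a tiny made-up count matrix; the data and the column meanings are assumptions for illustration and have nothing to do with the email dataset.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Toy word-count matrix: columns stand for "free", "meeting", "hello".
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 2, 1],
              [0, 3, 0]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = non-spam

# A new document that contains the word "free" once.
new_doc = np.array([[1, 0, 0]])

predictions = {}
for model in (GaussianNB(), MultinomialNB(), BernoulliNB()):
    model.fit(X, y)
    predictions[type(model).__name__] = int(model.predict(new_doc)[0])

print(predictions)
```

BernoulliNB binarizes its input by default, so it only sees whether a word is present, while MultinomialNB uses the counts themselves.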

In this exercise we will use Gaussian. A sample snippet looks like this:

TRAIN_DIR = "../train-mails"
TEST_DIR = "../test-mails"
dictionary = make_Dictionary(TRAIN_DIR)
# using functions mentioned above.
features_matrix, labels = extract_features(TRAIN_DIR)
test_feature_matrix, test_labels = extract_features(TEST_DIR)
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
#train model
model.fit(features_matrix, labels)
#predict
predicted_labels = model.predict(test_feature_matrix)

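4. Accuracy

Next, we compute the accuracy score of the predicted labels. Accuracy is simply the percentage of predictions that are correct. Here too, sklearn provides a neat implementation of the accuracy calculation.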

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(test_labels, predicted_labels)
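
To show what that number means, here is a minimal pure-Python sketch of the same calculation: the fraction of predictions that match the true labels. The function name accuracy and the example labels are made up for illustration.

```python
def accuracy(true_labels, predicted_labels):
    # Both sequences must describe the same documents, in the same order.
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

# Hypothetical labels: 4 of the 5 predictions are correct.
true_labels = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 0]
print(accuracy(true_labels, predicted))  # 0.8
```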

5. Putting It All Together

import os
import numpy as np
from collections import Counter
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

def make_Dictionary(root_dir):
    all_words = []
    emails = [os.path.join(root_dir, f) for f in os.listdir(root_dir)]
    for mail in emails:
        with open(mail) as m:
            for line in m:
                words = line.split()
                all_words += words
    dictionary = Counter(all_words)
    # Copy the keys so entries can be deleted while looping (Python 3).
    list_to_remove = list(dictionary)
    for item in list_to_remove:
        # Drop non-alphabetic tokens and single-character words.
        if not item.isalpha():
            del dictionary[item]
        elif len(item) == 1:
            del dictionary[item]
    # Keep the 3000 most common words as (word, count) pairs.
    dictionary = dictionary.most_common(3000)
    return dictionary

def extract_features(mail_dir):
    # Relies on the module-level `dictionary` built below.
    files = [os.path.join(mail_dir, fi) for fi in os.listdir(mail_dir)]
    features_matrix = np.zeros((len(files), 3000))
    train_labels = np.zeros(len(files))
    docID = 0
    for fil in files:
        with open(fil) as fi:
            for i, line in enumerate(fi):
                # The message body is on the third line (index 2).
                if i == 2:
                    words = line.split()
                    for word in words:
                        for wordID, d in enumerate(dictionary):
                            if d[0] == word:
                                features_matrix[docID, wordID] = words.count(word)
        # Spam file names start with "spmsg"; all other files keep label 0.
        if os.path.basename(fil).startswith("spmsg"):
            train_labels[docID] = 1
        docID = docID + 1
    return features_matrix, train_labels

TRAIN_DIR = "../train-mails"
TEST_DIR = "../test-mails"
dictionary = make_Dictionary(TRAIN_DIR)
print("reading and processing emails from file.")
features_matrix, labels = extract_features(TRAIN_DIR)
test_feature_matrix, test_labels = extract_features(TEST_DIR)
model = GaussianNB()
print("Training model.")
# train model
model.fit(features_matrix, labels)
# predict
predicted_labels = model.predict(test_feature_matrix)
print("FINISHED classifying. accuracy score : ")
print(accuracy_score(test_labels, predicted_labels))

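You can find this code in the repository I mentioned in the Download section (link here).

Tasks

Try the other models, Multinomial and Bernoulli, and compare the accuracy you get.
Try changing the number of most common words from 3000 to larger and smaller values, and plot the accuracy you get.

Conclusion

Naive Bayes assumes that features are independent. For example, it assumes that the occurrence of one word/feature is independent of the other words/features. In real life this may not hold (for instance, "morning" occurs very often after "good"). I hope Chapter 1 (the theory part and this one) has given you a good deal of insight into Naive Bayes.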
