Naive Bayes Classification
Naive Bayes computes class probabilities and conditional probabilities using Bayes' theorem, under the assumption that each attribute influences the classification result independently.
Why is it called "naive"?
Because naive Bayes assumes that all attributes are equally important and mutually independent.
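Formally, for a sample x = (x₁, …, x_d) and a class c, Bayes' theorem combined with the independence assumption gives:

```latex
P(c \mid \mathbf{x})
  = \frac{P(c)\, P(\mathbf{x} \mid c)}{P(\mathbf{x})}
  \;\propto\; P(c) \prod_{i=1}^{d} P(x_i \mid c)
```

Since P(x) is the same for every class, prediction simply picks the class c that maximizes the right-hand side.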
Example
Gaussian Naive Bayes
Three different Naive Bayes classifiers
Bernoulli, Multinomial, Gaussian
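The three scikit-learn variants differ only in how they model the per-class likelihood P(xᵢ | c): binary features for Bernoulli, counts for Multinomial, and continuous values for Gaussian. A minimal sketch on synthetic data (the data itself is made up for illustration):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)            # two classes

X_binary = rng.integers(0, 2, size=(100, 5))   # 0/1 features  -> BernoulliNB
X_counts = rng.integers(0, 10, size=(100, 5))  # count features -> MultinomialNB
X_real   = rng.normal(size=(100, 5))           # continuous     -> GaussianNB

for model, X in [(BernoulliNB(), X_binary),
                 (MultinomialNB(), X_counts),
                 (GaussianNB(), X_real)]:
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))
```

The choice of variant should match the feature type: using GaussianNB on raw word counts, for example, usually hurts accuracy.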
Laplacian Correction
If an attribute value never appears in the training set, its estimated conditional probability is 0, which wipes out the entire product of probabilities. To avoid this, probability estimates are usually "smoothed"; the most common method is the Laplacian correction.
Example
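In scikit-learn the correction is the `alpha` parameter of the discrete NB classifiers (`alpha=1.0` is add-one/Laplace smoothing). A small sketch with made-up counts, where one feature never occurs in class 0:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy count data: feature 1 never occurs in class 0, so its
# unsmoothed conditional probability estimate would be 0.
X = np.array([[3, 0],
              [2, 0],
              [1, 4]])
y = np.array([0, 0, 1])

# alpha=1.0 is the Laplacian correction (add-one smoothing).
model = MultinomialNB(alpha=1.0)
model.fit(X, y)

# Smoothed estimate by hand for class 0, feature 1:
# (count + alpha) / (total count + alpha * n_features) = (0 + 1) / (5 + 2)
print(np.exp(model.feature_log_prob_[0, 1]))  # 1/7 ≈ 0.1429
```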
Code
# GaussianNB
from sklearn import datasets
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
dataset = datasets.load_iris()
model = GaussianNB()
model.fit(dataset.data, dataset.target)
# Evaluate on the training data (no train/test split in this example)
expected = dataset.target
predicted = model.predict(dataset.data)
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
Output
Text Classification
##### Text Classification
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, confusion_matrix
categories = ['alt.atheism', 'talk.religion.misc',
              'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test',
                                     categories=categories)
X_train = newsgroups_train.data
X_test = newsgroups_test.data
y_train = newsgroups_train.target
y_test = newsgroups_test.target
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
Spam Filtering
An example written by someone else:
https://github.com/Surya-Murali/Email-Spam-Classifier-Using-Naive-Bayes/blob/master/SpamClassifier.py
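The linked script trains a classifier on a labeled email corpus. The same idea in a minimal self-contained sketch, using a handful of made-up messages in place of a real dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training messages; a real spam filter would be trained
# on a labeled email corpus such as the one in the linked repo.
messages = [
    "win a free prize now, click here",
    "limited offer, free money guaranteed",
    "meeting moved to 3pm tomorrow",
    "please review the attached report",
]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts + multinomial NB (with default Laplace smoothing)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

model = MultinomialNB()
model.fit(X, labels)

print(model.predict(vectorizer.transform(["claim your free prize"])))
```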