Machine Learning in Action: Naive Bayes

Author: mov觉得高数好难 | Published 2017-04-24 18:03

    In the previous two chapters we asked the classifier to make a hard decision. A classifier sometimes gets that decision wrong, producing an incorrect result; in those cases we can instead ask it for its best guess of the class, together with a probability estimate for that guess.
    This chapter presents methods that use probability theory for classification. We start with the simplest probabilistic classifier and then introduce a few assumptions to arrive at the naive Bayes classifier. We call it "naive" because the whole formalization makes only the most primitive, simplest assumptions.

    Naive Bayes
    Pros: still effective with small amounts of data; can handle multiple classes
    Cons: sensitive to how the input data is prepared
    Works with: nominal values

    In document classification, the whole document is the instance, and certain elements of it (the individual words) constitute the features.

    # -*- coding:utf-8 -*-
    #4-1 word-list to vector conversion functions
    def loadDataSet():
        postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                     ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                     ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                     ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                     ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                     ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
        classVec = [0,1,0,1,0,1]
        return postingList,classVec # tokenized documents, class labels
    
    def createVocabList(dataSet):
        vocabSet = set([])
        for document in dataSet:
            vocabSet = vocabSet | set(document) # union of the two sets
        return list(vocabSet)
    
    def setOfWords2Vec(vocabList, inputSet):
        returnVec = [0]*len(vocabList) # create a vector of all zeros
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] = 1
            else: print 'the word: %s is not in my Vocabulary' % word
        return returnVec
    
    #bayes-1.py
    import bayes
    listOPosts,listClasses = bayes.loadDataSet()
    myVocabList = bayes.createVocabList(listOPosts)
    
    >>> myVocabList
    ['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my']
    >>> bayes.setOfWords2Vec(myVocabList, listOPosts[0])
    [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
    >>> bayes.setOfWords2Vec(myVocabList, listOPosts[3])
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
    

    Computing probabilities from word vectors
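
    The classifier rests on Bayes' rule: for a word vector w and class ci, p(ci|w) = p(w|ci)p(ci)/p(w), and we choose the class with the larger posterior. The "naive" independence assumption lets us expand p(w|ci) into the product p(w0|ci)p(w1|ci)...p(wN|ci), so training reduces to counting per-word frequencies within each class, which is exactly what trainNB0 below does.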

    #4-2 Naive Bayes classifier training function
    import numpy as np 
    def trainNB0(trainMatrix, trainCategory):
        m = len(trainMatrix)  #numTrainDocs
        n = len(trainMatrix[0]) #numWords
        p0Num = np.zeros(n); p1Num = np.zeros(n)
        p0Denom = 0.0; p1Denom = 0.0 # initialize probabilities
        for i in range(m):
            if trainCategory[i] == 1:
                p1Num += trainMatrix[i]
                p1Denom += sum(trainMatrix[i])
            else:
                p0Num += trainMatrix[i]
                p0Denom += sum(trainMatrix[i])
    # I debugged this for a long time because of an indentation mistake here...
        p1Vect = p1Num/p1Denom
        p0Vect = p0Num/p0Denom
        pAbusive = sum(trainCategory)/float(m)
        return p0Vect,p1Vect,pAbusive
    
    #bayes-1.py
    import bayes
    from numpy import *
    reload(bayes)
    listOPosts,listClasses = bayes.loadDataSet()
    myVocabList = bayes.createVocabList(listOPosts)
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bayes.setOfwords2Vec(myVocabList, postinDoc))
    p0V,p1V,pAb = bayes.trainNB0(trainMat,listClasses)
    
    
    >>> pAb
    0.5
    >>> p0V
    array([ 0.04166667,  0.04166667,  0.04166667,  0.        ,  0.        ,
            0.04166667,  0.04166667,  0.04166667,  0.        ,  0.04166667,
            0.04166667,  0.04166667,  0.04166667,  0.        ,  0.        ,
            0.08333333,  0.        ,  0.        ,  0.04166667,  0.        ,
            0.04166667,  0.04166667,  0.        ,  0.04166667,  0.04166667,
            0.04166667,  0.        ,  0.04166667,  0.        ,  0.04166667,
            0.04166667,  0.125     ])
    >>> p1V
    array([ 0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
            0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
            0.        ,  0.        ,  0.        ,  0.05263158,  0.05263158,
            0.05263158,  0.05263158,  0.05263158,  0.        ,  0.10526316,
            0.        ,  0.05263158,  0.05263158,  0.        ,  0.10526316,
            0.        ,  0.15789474,  0.        ,  0.05263158,  0.        ,
            0.        ,  0.        ])
    >>> 
    

    Looking at p1V, the largest value, 0.15789474, sits at the vocabulary index of 'stupid', so 'stupid' is the word most indicative of the abusive class.
    When classifying a document with the Bayes classifier, we multiply many probabilities together to get the probability that the document belongs to a class.
    If any one of those probabilities is 0, the final product is 0 as well.
    To lessen this effect, we initialize every word's count to 1 and the denominators to 2.
    The other problem is underflow: multiplying many very small numbers can round down to 0. The fix is to take logarithms, which turns the product into a sum.

    p0Num = np.ones(n); p1Num = np.ones(n)
    p0Denom = 2.0; p1Denom = 2.0
    p1Vect = np.log(p1Num/p1Denom)
    p0Vect = np.log(p0Num/p0Denom)
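
    Putting both fixes into listing 4-2, the revised trainNB0 reads:

    #4-2 (revised) training function with add-one smoothing and log probabilities
    import numpy as np
    def trainNB0(trainMatrix, trainCategory):
        m = len(trainMatrix)    # number of training documents
        n = len(trainMatrix[0]) # vocabulary size
        p0Num = np.ones(n); p1Num = np.ones(n) # counts start at 1 so no probability is 0
        p0Denom = 2.0; p1Denom = 2.0
        for i in range(m):
            if trainCategory[i] == 1:
                p1Num += trainMatrix[i]
                p1Denom += sum(trainMatrix[i])
            else:
                p0Num += trainMatrix[i]
                p0Denom += sum(trainMatrix[i])
        p1Vect = np.log(p1Num/p1Denom) # logs keep many small factors from underflowing
        p0Vect = np.log(p0Num/p0Denom)
        pAbusive = sum(trainCategory)/float(m)
        return p0Vect,p1Vect,pAbusive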
    

    Now we put NumPy's vector operations to work. classifyNB compares log p(w|c1) + log p(c1) against log p(w|c0) + log p(c0); because we took logs, the product over words becomes the element-wise multiply-and-sum below.

    def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1): # vec2Classify is the vector to classify
        p1 = sum(vec2Classify * p1Vec) + np.log(pClass1) # element-wise product, then sum of log probabilities
        p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
        if p1 > p0:
            return 1
        else:
            return 0
    
    def testingNB():
        listOPosts,listClasses = loadDataSet()
        myVocabList = createVocabList(listOPosts)
        trainMat=[]
        for postinDoc in listOPosts:
            trainMat.append(setOfWords2Vec(myVocabList,postinDoc))
        p0V,p1V,pAb = trainNB0(np.array(trainMat),np.array(listClasses))
        testEntry = ['love','my','dalmation']
        thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
        print testEntry,'classified as:',classifyNB(thisDoc,p0V,p1V,pAb)
        testEntry = ['stupid','garbage']
        thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))
        print testEntry,'classified as:', classifyNB(thisDoc,p0V,p1V,pAb)
    
    
    #bayes-1.py
    import bayes
    from numpy import *
    reload(bayes)
    listOPosts,listClasses = bayes.loadDataSet()
    myVocabList = bayes.createVocabList(listOPosts)
    trainMat=[]
    for postinDoc in listOPosts:
        trainMat.append(bayes.setOfWords2Vec(myVocabList, postinDoc))
    p0V,p1V,pAb = bayes.trainNB0(trainMat,listClasses)
    bayes.testingNB()
    

    Testing these two entries:

    ['love', 'my', 'dalmation'] classified as: 0
    ['stupid', 'garbage'] classified as: 1
    

    If a word appears more than once in a document, that carries information which mere presence or absence in the document cannot express. The approach that counts every occurrence is called the bag-of-words model, in contrast to the set-of-words model used so far.

    #4-4 Naive Bayes bag-of-words model
    def bagOfWords2VecMN(vocabList, inputSet):
        returnVec = [0]*len(vocabList)
        for word in inputSet:
            if word in vocabList:
                returnVec[vocabList.index(word)] += 1 # count occurrences instead of flagging presence
        return returnVec
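
    A quick check with a toy vocabulary (a hypothetical example, not from the book) shows how repeated words now accumulate counts:

    vocab = ['dog', 'stupid', 'my']
    print bagOfWords2VecMN(vocab, ['dog', 'dog', 'stupid']) # prints [2, 1, 0]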
    

    Next we look at a famous application of naive Bayes: filtering spam e-mail.
    First we prepare the data by splitting the text into tokens.

    >>> mySent='This book is the best book on Python or M.L. I have ever laid eyes upon.'
    >>> mySent.split()
    ['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M.L.', 'I', 'have', 'ever', 'laid', 'eyes', 'upon.']
    

    But punctuation is treated as part of the words. We can instead split the sentence with a regular expression, where the delimiter is any run of characters other than letters and digits.

    >>> import re
    >>> regEx = re.compile('\\W*')
    >>> listOfTokens = regEx.split(mySent)
    >>> listOfTokens
    ['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon', '']
    

    Next we drop the empty strings:

    >>> [tok for tok in listOfTokens if len(tok) > 0]
    ['This', 'book', 'is', 'the', 'best', 'book', 'on', 'Python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']
    

    Then we convert all the letters to lowercase:

    >>> [tok.lower() for tok in listOfTokens if len(tok) > 0]
    ['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']
    

    Now let's look at how a complete e-mail from the data set actually comes out of this processing.

    >>> emailText = open('E:/上学/机器学习实战/4.朴素贝叶斯/email/ham/6.txt').read()
    >>> listOfTokens=regEx.split(emailText)
    >>> listOfTokens
    ['Hello', 'Since', 'you', 'are', 'an', 'owner', 'of', 'at', 'least', 'one', 'Google', 'Groups', 'group', 'that', 'uses', 'the', 'customized', 'welcome', 'message', 'pages', 'or', 'files', 'we', 'are', 'writing', 'to', 'inform', 'you', 'that', 'we', 'will', 'no', 'longer', 'be', 'supporting', 'these', 'features', 'starting', 'February', '2011', 'We', 'made', 'this', 'decision', 'so', 'that', 'we', 'can', 'focus', 'on', 'improving', 'the', 'core', 'functionalities', 'of', 'Google', 'Groups', 'mailing', 'lists', 'and', 'forum', 'discussions', 'Instead', 'of', 'these', 'features', 'we', 'encourage', 'you', 'to', 'use', 'products', 'that', 'are', 'designed', 'specifically', 'for', 'file', 'storage', 'and', 'page', 'creation', 'such', 'as', 'Google', 'Docs', 'and', 'Google', 'Sites', 'For', 'example', 'you', 'can', 'easily', 'create', 'your', 'pages', 'on', 'Google', 'Sites', 'and', 'share', 'the', 'site', 'http', 'www', 'google', 'com', 'support', 'sites', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '174623', 'with', 'the', 'members', 'of', 'your', 'group', 'You', 'can', 'also', 'store', 'your', 'files', 'on', 'the', 'site', 'by', 'attaching', 'files', 'to', 'pages', 'http', 'www', 'google', 'com', 'support', 'sites', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '90563', 'on', 'the', 'site', 'If', 'you', 're', 'just', 'looking', 'for', 'a', 'place', 'to', 'upload', 'your', 'files', 'so', 'that', 'your', 'group', 'members', 'can', 'download', 'them', 'we', 'suggest', 'you', 'try', 'Google', 'Docs', 'You', 'can', 'upload', 'files', 'http', 'docs', 'google', 'com', 'support', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '50092', 'and', 'share', 'access', 'with', 'either', 'a', 'group', 'http', 'docs', 'google', 'com', 'support', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '66343', 'or', 'an', 'individual', 'http', 'docs', 'google', 'com', 'support', 'bin', 'answer', 'py', 'hl', 'en', 'answer', '86152', 'assigning', 'either', 'edit', 'or', 'download', 'only', 'access', 'to', 'the', 'files', 'you', 'have', 'received', 'this', 'mandatory', 'email', 'service', 'announcement', 'to', 'update', 'you', 'about', 'important', 'changes', 'to', 'Google', 'Groups', '']
    

    Text parsing can get quite involved, so next we build a very simple parsing function.

    def textParse(bigString):
        import re
        listOfTokens = re.split(r'\W*',bigString)
        return [tok.lower() for tok in listOfTokens if len(tok) > 2]
    
    def spamTest():
        docList=[]; classList = []; fullText = []
        for i in range(1,26):
            wordList = textParse(open('email/spam/%d.txt'%i).read())
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(1)
            wordList = textParse(open('email/ham/%d.txt'%i).read())
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(0)
        vocabList = createVocabList(docList)
        trainingSet = range(50); testSet=[]
        for i in range(10): # randomly hold out 10 e-mails as the test set
            randIndex = int(np.random.uniform(0,len(trainingSet)))
            testSet.append(trainingSet[randIndex])
            del(trainingSet[randIndex])
        trainMat=[]; trainClasses = []
        for docIndex in trainingSet:
            trainMat.append(bagOfWords2VecMN(vocabList,docList[docIndex]))
            trainClasses.append(classList[docIndex])
        p0V,p1V,pSpam = trainNB0(np.array(trainMat),np.array(trainClasses))
        errorCount = 0
        for docIndex in testSet:
            wordVector = bagOfWords2VecMN(vocabList,docList[docIndex])
            if classifyNB(np.array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
                errorCount += 1
                print "classification error",docList[docIndex]
        print 'the error rate is: ',float(errorCount)/len(testSet)
    

    Because the 10 test e-mails are chosen at random, the outcome differs from run to run. (Randomly reserving part of the data for testing while training on the rest is known as hold-out cross-validation.) When a document is misclassified, the function prints its word list so you can inspect which document went wrong.

    bayes.spamTest()
    classification error ['yeah', 'ready', 'may', 'not', 'here', 'because', 'jar', 'jar', 'has', 'plane', 'tickets', 'germany', 'for']
    the error rate is:  0.1
    
    bayes.spamTest()
    the error rate is:  0.0
    
    bayes.spamTest()
    classification error ['benoit', 'mandelbrot', '1924', '2010', 'benoit', 'mandelbrot', '1924', '2010', 'wilmott', 'team', 'benoit', 'mandelbrot', 'the', 'mathematician', 'the', 'father', 'fractal', 'mathematics', 'and', 'advocate', 'more', 'sophisticated', 'modelling', 'quantitative', 'finance', 'died', '14th', 'october', '2010', 'aged', 'wilmott', 'magazine', 'has', 'often', 'featured', 'mandelbrot', 'his', 'ideas', 'and', 'the', 'work', 'others', 'inspired', 'his', 'fundamental', 'insights', 'you', 'must', 'logged', 'view', 'these', 'articles', 'from', 'past', 'issues', 'wilmott', 'magazine']
    the error rate is:  0.1
    

    Next we use the naive Bayes classifier to reveal region-specific wording in personal ads.
    I am currently using Anaconda; first install the feedparser package (e.g. pip install feedparser), then build a function similar to spamTest().

    def calcMostFreq(vocabList,fullText): # count how often each word occurs
        import operator
        freqDict = {}
        for token in vocabList:
            freqDict[token]=fullText.count(token)
        sortedFreq = sorted(freqDict.iteritems(),key=operator.itemgetter(1),reverse=True) # sort descending by count
        return sortedFreq[:30] # keep the 30 most frequent words
    
    def localWords(feed1,feed0):
        import feedparser
        docList=[];classList = []; fullText = []
        minLen = min(len(feed1['entries']),len(feed0['entries']))
        for i in range(minLen):
            wordList = textParse(feed1['entries'][i]['summary'])
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(1)
            wordList = textParse(feed0['entries'][i]['summary'])
            docList.append(wordList)
            fullText.extend(wordList)
            classList.append(0)
        vocabList = createVocabList(docList)
        top30Words = calcMostFreq(vocabList,fullText) # remove the most frequent words
        for pairW in top30Words:
            if pairW[0] in vocabList: vocabList.remove(pairW[0])
        trainingSet = range(2*minLen); testSet=[]
        for i in range(20): # randomly hold out 20 entries as the test set
            randIndex = int(np.random.uniform(0,len(trainingSet)))
            testSet.append(trainingSet[randIndex])
            del(trainingSet[randIndex])
        trainMat=[]; trainClasses = []
        for docIndex in trainingSet:
            trainMat.append(bagOfWords2VecMN(vocabList,docList[docIndex]))
            trainClasses.append(classList[docIndex])
        p0V,p1V,pSpam = trainNB0(np.array(trainMat),np.array(trainClasses))
        errorCount = 0
        for docIndex in testSet:
            wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
            if classifyNB(np.array(wordVector),p0V,p1V,pSpam) != classList[docIndex]:
                errorCount += 1
        print 'the error rate is:',float(errorCount)/len(testSet)
        return vocabList,p0V,p1V
    

    Testing in IPython:

    import bayes
    import feedparser
    ny=feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
    sf=feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
    vocabList,pSF,pNY=bayes.localWords(ny,sf)
    
    the error rate is: 0.5
    
    import bayes
    import feedparser
    ny=feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
    sf=feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
    vocabList,pSF,pNY=bayes.localWords(ny,sf)
    
    the error rate is: 0.55
    

    To get an accurate estimate of the error rate, run the experiment several times and average the results, as in the sketch below.
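
    A minimal sketch of that repetition, assuming localWords is modified to also return its error rate (the version above only prints it):

    # assumes localWords ends with: return vocabList,p0V,p1V,errorRate
    def multiTest(ny, sf, numTrials=10):
        errorSum = 0.0
        for i in range(numTrials):
            vocabList, p0V, p1V, errRate = localWords(ny, sf)
            errorSum += errRate
        print 'average error rate over %d trials: %f' % (numTrials, errorSum/numTrials)
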
    Next, we display the region-related words.

    def getTopWords(ny,sf):
        import operator
        vocabList,p0V,p1V=localWords(ny,sf)
        topNY=[];topSF=[]
        for i in range(len(p0V)):
            if p0V[i] > -4.5 : topSF.append((vocabList[i],p0V[i]))
            if p1V[i] > -4.5 : topNY.append((vocabList[i],p1V[i]))
        sortedSF = sorted(topSF,key=lambda pair:pair[1],reverse=True)
        print 'SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**'
        for item in sortedSF:
            print item[0]
        sortedNY = sorted(topNY,key=lambda pair:pair[1], reverse=True)
        print "NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**"
        for item in sortedNY:
            print item[0]
    

    Unlike before, where we returned the top X words, here we return every word whose log conditional probability exceeds a threshold; these tuples are then sorted by their conditional probabilities.

    import bayes
    import feedparser
    ny=feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
    sf=feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
    bayes.getTopWords(ny,sf)
    
    the error rate is: 0.4
    SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**
    meet
    all
    61514
    great
    open
    any
    movie
    about
    area
    NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**
    show
    contact
    info
    very
    guy
    massage
    talk
    nyc
    first
    need
    only
    off
    
