A First Try at NLP with the Naive Bayes Algorithm

Author: fred_33c7 | Published: 2019-07-25 21:07

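    Naive Bayes is a commonly used model in NLP. Here we walk through a simple example of using it for text classification. (The focus is practical; for the theory behind the algorithm, please look it up separately.)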

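    As with most model building, there are a few main steps:

    1. Preprocess the data
    2. Label the dataset by class
    3. Extract features, build the model, and train it
    4. Test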
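
The four steps above can be sketched end to end on a few in-memory toy sentences. This is a minimal sketch under stated assumptions, not the full script: the sentences are made up for illustration and are already segmented with spaces (so `jieba` is not needed here), and `make_pipeline` is used only to chain the same three scikit-learn components that the full script calls one by one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Steps 1 and 2: pre-segmented toy documents (hypothetical) and their labels.
train_texts = [
    "宾馆 房间 价格 入住",  # hotel
    "宾馆 前台 服务 房间",  # hotel
    "旅游 景点 线路 门票",  # travel
    "旅游 景区 行程 导游",  # travel
]
train_labels = ["宾馆", "宾馆", "旅游", "旅游"]

# Step 3: word counts -> normalized term frequencies -> multinomial Naive Bayes.
model = make_pipeline(
    CountVectorizer(),
    TfidfTransformer(use_idf=False),
    MultinomialNB(),
)
model.fit(train_texts, train_labels)

# Step 4: classify unseen text; the pipeline reuses the fitted vocabulary.
predictions = model.predict(["这家 宾馆 房间 干净", "周末 旅游 景点 推荐"])
print(predictions.tolist())
```

Wrapping the steps in one pipeline object means the test texts automatically go through the same fitted vectorizer as the training texts, which the full script has to arrange by hand.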

    This small example uses sklearn. There are two text sets, hotel and travel: one consists entirely of texts about hotels, the other entirely of travel information.


    The full code is as follows:

    import os
    import jieba
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    
    """
    1. Data preprocessing
    """
    
    
    def preprocess(path):
        # Read the file, segment it with jieba, and join the words with spaces
        # so that CountVectorizer can tokenize the result.
        with open(path, "r", encoding="utf8") as f:
            textfile = f.read()
        return " ".join(jieba.cut(textfile))
    
    
    """
    2. Label the dataset by class
    """
    
    
    def loadtrainset(path, classtag):
        # Preprocess every file under `path` and tag each one with `classtag`.
        allfiles = os.listdir(path)
        processed_textset = []
        allclasstags = []
        for thisfile in allfiles:
            path_name = os.path.join(path, thisfile)
            processed_textset.append(preprocess(path_name))
            allclasstags.append(classtag)
        return processed_textset, allclasstags
    
    
    # Class labels: "宾馆" means hotel, "旅游" means travel
    processed_textdata1, class1 = loadtrainset("/Users/fengyang/PycharmProjects/NLP/dataset/train/hotel", "宾馆")
    processed_textdata2, class2 = loadtrainset("/Users/fengyang/PycharmProjects/NLP/dataset/train/travel", "旅游")
    
    train_data = processed_textdata1 + processed_textdata2
    classtags_list = class1 + class2
    # Convert the words in each text into a matrix of word counts
    count_vector = CountVectorizer()
    vecot_matrix = count_vector.fit_transform(train_data)
    
    """
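    3. Feature extraction and training
    """
    # TF-IDF step; with use_idf=False only normalized term frequencies are used
    # Extract features
    train_tfidf = TfidfTransformer(use_idf=False).fit_transform(vecot_matrix)
    # Train the classifier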
    clf = MultinomialNB().fit(train_tfidf, classtags_list)
    """
    4. Testing
    """
    testset = []
    
    path = "/Users/fengyang/PycharmProjects/NLP/dataset/test/hotel"
    allfiles = os.listdir(path)
    
    hotel = 0
    travel = 0
    
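    for thisfile in allfiles:
        path_name = os.path.join(path, thisfile)
        new_count_vector = count_vector.transform([preprocess(path_name)])
        # With use_idf=False the transformer keeps no fitted state, so a fresh
        # instance produces the same features as the one used for training.
        new_tfidf = TfidfTransformer(use_idf=False).fit_transform(new_count_vector)
        predict_result = clf.predict(new_tfidf)
        print(predict_result)
        print(thisfile)

        # predict() returns an array; compare its single element to the label
        if predict_result[0] == "宾馆":
            hotel += 1
        if predict_result[0] == "旅游":
            travel += 1

    print("宾馆" + str(hotel))
    print("旅游" + str(travel))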
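
To make the feature-extraction step less of a black box, here is a rough pure-Python sketch of what `CountVectorizer()` and `TfidfTransformer(use_idf=False)` compute with their default settings: text is lowercased, tokens are matched with the default pattern `(?u)\b\w\w+\b`, counted, and each document's count vector is scaled to unit L2 length. One practical consequence for Chinese text is that single-character words produced by `jieba` are silently dropped by that default pattern. The helper names `tokenize` and `l2_normalized_tf` and the sample sentence are illustrative assumptions, not part of the original script.

```python
import math
import re
from collections import Counter

# Default token pattern of scikit-learn's CountVectorizer: words of 2+ characters.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text_with_space):
    """Lowercase and tokenize text the way CountVectorizer does by default."""
    return DEFAULT_TOKEN_PATTERN.findall(text_with_space.lower())


def l2_normalized_tf(text_with_space):
    """Return {token: count / L2 norm of the count vector} for one document."""
    counts = Counter(tokenize(text_with_space))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}  # no token survived the pattern
    return {token: c / norm for token, c in counts.items()}


sample = "我 住 在 宾馆 宾馆 的 房间 很 好"  # hypothetical segmented sentence
print(tokenize(sample))  # single-character words are dropped
print(l2_normalized_tf(sample))
```

If single-character words matter for a corpus, `CountVectorizer` accepts a `token_pattern` argument, for example `token_pattern=r"(?u)\b\w+\b"`, which keeps them.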
    

    Result:

    ['宾馆']
    三亚市春节宾馆房价不乱涨价违者将受到严处_seg_pos.txt
    ['宾馆']
    住宿-宾馆名录_seg_pos.txt
    ['宾馆']
    nj7_seg_pos.txt
    ['宾馆']
    dali09_seg_pos.txt
    ['宾馆']
    bj6_seg_pos.txt
    ['宾馆']
    xm7_seg_pos.txt
    ['宾馆']
    dujiangyan09_seg_pos.txt
    ['宾馆']
    wuyishan12_seg_pos.txt
    ['宾馆']
    zhuhai06_seg_pos.txt
    ['宾馆']
    kuerle01_seg_pos.txt
    ['宾馆']
    xm3_seg_pos.txt
    宾馆11
    旅游0
    

    The results show that all 11 test texts were classified correctly.
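
The run above only covers the hotel test folder and tallies the predictions by hand. As a minimal sketch of how the same check could be done across both classes, the function below computes overall and per-class accuracy from lists of expected and predicted labels. The function name `accuracy_report` and the sample labels are hypothetical and are not results from the dataset.

```python
from collections import Counter


def accuracy_report(expected_labels, predicted_labels):
    """Return (overall accuracy, {label: accuracy for that label})."""
    if len(expected_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not expected_labels:
        raise ValueError("at least one label is required")
    total = Counter(expected_labels)
    correct = Counter(
        e for e, p in zip(expected_labels, predicted_labels) if e == p
    )
    per_class = {label: correct[label] / n for label, n in total.items()}
    overall = sum(correct.values()) / len(expected_labels)
    return overall, per_class


# Hypothetical labels for illustration: 3 hotel files and 2 travel files.
expected = ["宾馆", "宾馆", "宾馆", "旅游", "旅游"]
predicted = ["宾馆", "宾馆", "宾馆", "旅游", "宾馆"]
overall, per_class = accuracy_report(expected, predicted)
print(overall)
print(per_class)
```

Reporting accuracy per class makes it visible when a classifier simply favors one label, which a single-folder test cannot show.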

    Full code and dataset: https://github.com/fredfeng0326/NLP/tree/master/nb_test
