Text Classification (Part 1): Text Classification with Traditional Machine Learning Methods

Author: 致Great | Published 2018-07-16 16:37

    Introduction

    I recently entered the "Daguan Cup" text intelligent processing challenge (达观杯文本智能处理挑战赛) and spent most of last week on it, reading some papers and other material and digging around on GitHub. I picked up a lot of new things at once, so I'm striking while the iron is hot and organizing them here.

    Following the previous article, 20 newsgroups数据介绍以及文本分类实例 (an introduction to the 20 newsgroups dataset with a text-classification example), let's continue looking at text classification methods. Text classification is one of the most classic NLP tasks, and by now industry and academia have accumulated many methods for it. They fall into two broad categories:

    • Text classification with traditional machine learning
    • Text classification with deep learning

    The traditional machine learning approach usually extracts TF-IDF or bag-of-words features and feeds them to a classifier for training; many models can be used here, such as logistic regression, naive Bayes, or SVM. The deep learning approach mainly relies on CNNs, RNNs, LSTMs, attention mechanisms, and so on.

    Text Classification with Traditional Machine Learning and Deep Learning

    • Text classification with traditional machine learning methods
      The basic idea: extract TF-IDF features, then feed them to various classification models for training.
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC,LinearSVC,LinearSVR
    from sklearn.linear_model import SGDClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier
    
    # Select the following 8 categories
    selected_categories = [
        'comp.graphics',
        'rec.motorcycles',
        'rec.sport.baseball',
        'misc.forsale',
        'sci.electronics',
        'sci.med',
        'talk.politics.guns',
        'talk.religion.misc']
    
    # Load the training and test splits of the dataset
    newsgroups_train=fetch_20newsgroups(subset='train',
                                        categories=selected_categories,
                                        remove=('headers','footers','quotes'))
    newsgroups_test=fetch_20newsgroups(subset='test',
                                        categories=selected_categories,
                                        remove=('headers','footers','quotes'))
    
    train_texts=newsgroups_train['data']
    train_labels=newsgroups_train['target']
    test_texts=newsgroups_test['data']
    test_labels=newsgroups_test['target']
    print(len(train_texts),len(test_texts))
    
    # Naive Bayes
    text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                       ('clf',MultinomialNB())])
    text_clf=text_clf.fit(train_texts,train_labels)
    predicted=text_clf.predict(test_texts)
    print("MultinomialNB准确率为:",np.mean(predicted==test_labels))
    
    # SGD
    text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                       ('clf',SGDClassifier())])
    text_clf=text_clf.fit(train_texts,train_labels)
    predicted=text_clf.predict(test_texts)
    print("SGDClassifier准确率为:",np.mean(predicted==test_labels))
    
    # LogisticRegression
    text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                       ('clf',LogisticRegression())])
    text_clf=text_clf.fit(train_texts,train_labels)
    predicted=text_clf.predict(test_texts)
    print("LogisticRegression准确率为:",np.mean(predicted==test_labels))
    
    # SVM
    text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                       ('clf',SVC())])
    text_clf=text_clf.fit(train_texts,train_labels)
    predicted=text_clf.predict(test_texts)
    print("SVC准确率为:",np.mean(predicted==test_labels))
    
    text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                       ('clf',LinearSVC())])
    text_clf=text_clf.fit(train_texts,train_labels)
    predicted=text_clf.predict(test_texts)
    print("LinearSVC准确率为:",np.mean(predicted==test_labels))
    
    # LinearSVR is a regressor, included here only for comparison; accuracy is not a meaningful metric for it
    text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                       ('clf',LinearSVR())])
    text_clf=text_clf.fit(train_texts,train_labels)
    predicted=text_clf.predict(test_texts)
    print("LinearSVR accuracy:",np.mean(predicted==test_labels))
    
    # MLPClassifier
    text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                       ('clf',MLPClassifier())])
    text_clf=text_clf.fit(train_texts,train_labels)
    predicted=text_clf.predict(test_texts)
    print("MLPClassifier准确率为:",np.mean(predicted==test_labels))
    
    # KNeighborsClassifier
    text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                       ('clf',KNeighborsClassifier())])
    text_clf=text_clf.fit(train_texts,train_labels)
    predicted=text_clf.predict(test_texts)
    print("KNeighborsClassifier准确率为:",np.mean(predicted==test_labels))
    
    # RandomForestClassifier
    text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                       ('clf',RandomForestClassifier(n_estimators=8))])
    text_clf=text_clf.fit(train_texts,train_labels)
    predicted=text_clf.predict(test_texts)
    print("RandomForestClassifier准确率为:",np.mean(predicted==test_labels))
    
    # GradientBoostingClassifier
    text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                       ('clf',GradientBoostingClassifier())])
    text_clf=text_clf.fit(train_texts,train_labels)
    predicted=text_clf.predict(test_texts)
    print("GradientBoostingClassifier准确率为:",np.mean(predicted==test_labels))
    
    # AdaBoostClassifier
    text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                       ('clf',AdaBoostClassifier())])
    text_clf=text_clf.fit(train_texts,train_labels)
    predicted=text_clf.predict(test_texts)
    print("AdaBoostClassifier准确率为:",np.mean(predicted==test_labels))
    
    # DecisionTreeClassifier
    text_clf=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                       ('clf',DecisionTreeClassifier())])
    text_clf=text_clf.fit(train_texts,train_labels)
    predicted=text_clf.predict(test_texts)
    print("DecisionTreeClassifier准确率为:",np.mean(predicted==test_labels))
    
    

    The output:

    MultinomialNB accuracy: 0.8960196779964222
    SGDClassifier accuracy: 0.9724955277280859
    LogisticRegression accuracy: 0.9304561717352415
    SVC accuracy: 0.13372093023255813
    LinearSVC accuracy: 0.9749552772808586
    LinearSVR accuracy: 0.00022361359570661896
    MLPClassifier accuracy: 0.9758497316636852
    KNeighborsClassifier accuracy: 0.45840787119856885
    RandomForestClassifier accuracy: 0.9680232558139535
    GradientBoostingClassifier accuracy: 0.9186046511627907
    AdaBoostClassifier accuracy: 0.5916815742397138
    DecisionTreeClassifier accuracy: 0.9758497316636852
    

    As the results above show, different classifiers perform very differently on this dataset, so it is worth trying several methods when doing text classification; you may get a pleasant surprise. In addition, components such as TfidfVectorizer and LogisticRegression expose many parameters that also strongly affect the results. For example, the ngram_range parameter of TfidfVectorizer directly determines which features are extracted, so this is another place where practice pays off.
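    To illustrate parameter tuning, here is a minimal sketch (not from the original post) that uses GridSearchCV to search over ngram_range and the regularization strength C of LogisticRegression; the grid values below are arbitrary choices for demonstration.

    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import GridSearchCV
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # Same TF-IDF + classifier pipeline as above
    pipeline=Pipeline([('tfidf',TfidfVectorizer(max_features=10000)),
                       ('clf',LogisticRegression(max_iter=1000))])

    # Hypothetical grid: unigrams vs. unigrams+bigrams, and two values of C
    param_grid={
        'tfidf__ngram_range':[(1,1),(1,2)],
        'clf__C':[1.0,10.0],
    }

    # 3-fold cross-validated grid search on the training split only
    grid=GridSearchCV(pipeline,param_grid,cv=3,n_jobs=-1)
    grid.fit(train_texts,train_labels)
    print("best params:",grid.best_params_)
    print("test accuracy:",grid.score(test_texts,test_labels))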
    For more, see: https://github.com/yanqiangmiffy/20newsgroups-text-classification

