Introduction to NLP - News Text Classification, Task 3

Author: 正在学习的Yuki | Published: 2020-07-25 16:02

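    Task 3: Text Classification with Machine Learning

    Learning objectives

    • Understand how TF-IDF works and how to use it
    • Use sklearn machine learning models to classify text

    Text representation methods, Part 1

    Methods that represent text as numbers or vectors a computer can operate on are generally called word embedding methods: they map variable-length text into a fixed-length space.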

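    1. One-hot
    Each word is represented by a discrete vector: every character/word is assigned an index, and the position at that index is set to 1.
    e.g.,
    Sentence 1: 我 爱 北 京 天 安 门
    Sentence 2: 我 喜 欢 上 海

    • First, index the characters of all sentences:
      { '我': 1, '爱': 2, '北': 3, '京': 4, '天': 5, '安': 6, '门': 7,
      '喜': 8, '欢': 9, '上': 10, '海': 11}
    • Each character then becomes an 11-dimensional sparse vector: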
      我:[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
      爱:[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
      ...
      海:[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
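
    The indexing and encoding steps above can be sketched in a few lines of plain Python. This is only an illustration: the names index and one_hot are made up here, and indices are assigned in order of first appearance, starting at 1, to match the table above.

```python
sentences = ["我 爱 北 京 天 安 门", "我 喜 欢 上 海"]

# Give each distinct character a 1-based index in order of first appearance.
index = {}
for sentence in sentences:
    for ch in sentence.split():
        index.setdefault(ch, len(index) + 1)


def one_hot(ch):
    """Return a vector of zeros with a single 1 at the character's index."""
    vec = [0] * len(index)
    vec[index[ch] - 1] = 1
    return vec


print(index)
print(one_hot("我"))
print(one_hot("海"))
```

    The vector length equals the vocabulary size, which is why one-hot vectors become very sparse on a real corpus.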

    2. Bag of Words / Count Vectors
    Each document is represented by the number of times each character/word occurs in it.
    Sentence 1: 我 爱 北 京 天 安 门 -> [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
    Sentence 2: 我 喜 欢 上 海 -> [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
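
    As a rough sketch of how these two vectors come about, the snippet below counts characters with collections.Counter. The vocabulary order (first appearance) and the helper name count_vector are assumptions made for this illustration.

```python
from collections import Counter

sentences = ["我 爱 北 京 天 安 门", "我 喜 欢 上 海"]

# Vocabulary in order of first appearance, so positions match the example above.
vocab = []
for sentence in sentences:
    for ch in sentence.split():
        if ch not in vocab:
            vocab.append(ch)


def count_vector(sentence):
    """Count how often each vocabulary entry occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[ch] for ch in vocab]


for sentence in sentences:
    print(sentence, "->", count_vector(sentence))
```

    Unlike one-hot encoding, a repeated character raises its count, so a document such as "我 爱 我" would have 2 in the first position.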

    • This can be done with sklearn's CountVectorizer:
    from sklearn.feature_extraction.text import CountVectorizer
    corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?',
    ]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)
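    print(X.toarray())  # term counts per document
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    print(vectorizer.get_feature_names_out().tolist())  # vocabulary of the bag of words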
    

    Result:

    [[0 1 1 1 0 0 1 0 1]
     [0 2 0 1 0 1 1 0 1]
     [1 0 0 1 1 0 1 1 1]
     [0 1 1 1 0 0 1 0 1]]
    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
    

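    3. N-gram
    Similar to Count Vectors, except that N adjacent characters/words are joined into new tokens, which are then counted.
    e.g., with N = 2 the sentences become:
    Sentence 1: 我爱 爱北 北京 京天 天安 安门
    Sentence 2: 我喜 喜欢 欢上 上海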
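
    A minimal sketch of the bigram step, assuming the sentences are already split into characters; char_ngrams is a hypothetical helper name, not a library function.

```python
from collections import Counter


def char_ngrams(tokens, n):
    """Join every run of n adjacent tokens into one new token."""
    return ["".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


s1 = "我 爱 北 京 天 安 门".split()
s2 = "我 喜 欢 上 海".split()

bigrams1 = char_ngrams(s1, 2)
bigrams2 = char_ngrams(s2, 2)
print(bigrams1)
print(bigrams2)

# Counting the new tokens works exactly as for single characters.
counts = Counter(bigrams1 + bigrams2)
print(counts.most_common(3))
```

    A sentence of k tokens produces k - N + 1 N-grams, so the vocabulary grows quickly as N increases.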

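    4. TF-IDF

    • Term frequency-inverse document frequency: a statistical measure used to evaluate how important a word is to a document in a collection or corpus.
      TF(t) = number of times term t occurs in the current document / total number of terms in the current document
      IDF(t) = log_e(total number of documents / number of documents containing term t)
      tfidf_{i,j} = tf_{i,j} \times idf_{i}, where i indexes the term and j the document
    • This can be done with sklearn's TfidfVectorizer: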
    from sklearn.feature_extraction.text import TfidfVectorizer
    corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?',
    ]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)
    print(X.toarray())
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    print(vectorizer.get_feature_names_out().tolist())
    

    Result:

    [[0.         0.46979139 0.58028582 0.38408524 0.         0.
      0.38408524 0.         0.38408524]
     [0.         0.6876236  0.         0.28108867 0.         0.53864762
      0.28108867 0.         0.28108867]
     [0.51184851 0.         0.         0.26710379 0.51184851 0.
      0.26710379 0.51184851 0.26710379]
     [0.         0.46979139 0.58028582 0.38408524 0.         0.
      0.38408524 0.         0.38408524]]
    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
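
    The numbers printed above do not follow the plain TF × IDF formula exactly. With its default settings, TfidfVectorizer uses the raw count as TF, a smoothed IDF of ln((1 + n) / (1 + df(t))) + 1, and then L2-normalizes each row. The pure-Python sketch below reproduces the output under those assumptions; tokenize and tfidf_row are made-up helper names, and the regular expression mimics the default token pattern (lowercased tokens of two or more word characters).

```python
import math
import re

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


def tokenize(doc):
    # Lowercase, then keep tokens of at least two word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())


docs = [tokenize(d) for d in corpus]
vocab = sorted({t for doc in docs for t in doc})
n_docs = len(docs)

# Document frequency and smoothed IDF for every term.
df = {t: sum(t in doc for doc in docs) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_row(doc):
    # Raw count times IDF, followed by L2 normalization of the row.
    raw = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


rows = [tfidf_row(doc) for doc in docs]
print(vocab)
for row in rows:
    print([round(v, 8) for v in row])
```

    Because of the "+ 1" term, words that occur in every document ('is', 'the', 'this') keep a non-zero weight instead of dropping to 0 as they would under the plain formula.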
    

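    Text classification with machine learning

    Compare the accuracy of different text representation methods by building a local validation set and computing the F1 score.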
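
    The scripts in this section call f1_score with average='macro', i.e. the F1 score is computed per class and the per-class scores are averaged with equal weight. As a small worked example of that definition (a sketch with toy labels, not the competition data; macro_f1 is a name chosen here for illustration):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); taken as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

    Here the per-class scores are 0.5, 0.8 and about 0.667. Macro averaging gives rare classes the same weight as frequent ones, which matters when the label distribution is imbalanced.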

    1. Count Vectors + RidgeClassifier

    import pandas as pd
    
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import RidgeClassifier
    from sklearn.metrics import f1_score
    
    # Load the first 15,000 rows of the tab-separated training set
    train_df = pd.read_csv('data/train_set.csv', sep='\t', nrows=15000)
    
    # Count vectors limited to the 3,000 most frequent terms.
    # Note: the vocabulary is fit on all 15,000 rows, validation rows included.
    vectorizer = CountVectorizer(max_features=3000)
    train_test = vectorizer.fit_transform(train_df['text'])
    
    # Train on the first 10,000 rows, validate on the remaining 5,000
    clf = RidgeClassifier()
    clf.fit(train_test[:10000], train_df['label'].values[:10000])
    val_pred = clf.predict(train_test[10000:])
    
    print("Count Vectors + RidgeClassifier: f1_score =", end=' ')
    print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
    # 0.74
    

    2. TF-IDF + RidgeClassifier

    import pandas as pd
    
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import RidgeClassifier
    from sklearn.metrics import f1_score
    
    # Load the first 15,000 rows of the tab-separated training set
    train_df = pd.read_csv('data/train_set.csv', sep='\t', nrows=15000)
    
    # TF-IDF over 1- to 3-grams, keeping the 3,000 most frequent features
    tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=3000)
    train_test = tfidf.fit_transform(train_df['text'])
    
    # Train on the first 10,000 rows, validate on the remaining 5,000
    clf = RidgeClassifier()
    clf.fit(train_test[:10000], train_df['label'].values[:10000])
    val_pred = clf.predict(train_test[10000:])
    
    print("TF-IDF \t\t  + RidgeClassifier: f1_score =", end=' ')
    print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
    # 0.87
    

    Result:

    Count Vectors + RidgeClassifier: f1_score = 0.7406241569237678
    TF-IDF        + RidgeClassifier: f1_score = 0.8721598830546126
    

    Homework for this chapter

    1. Try changing the TF-IDF parameters and check the effect on accuracy
      tfidf = TfidfVectorizer(ngram_range=(1, 1), max_features=None)
    • Parameter meanings:
      ngram_range=(min, max) - extract every n-gram of min to max consecutive words from the text
      For example, for 'Python is useful', ngram_range=(1,3) yields 'Python', 'is', 'useful', 'Python is', 'is useful' and 'Python is useful' (sklearn lowercases tokens by default); ngram_range=(1,1) yields only the single words 'Python', 'is' and 'useful'
      max_features: int - build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.
      One way to choose it is to set a threshold on word frequency. e.g., with threshold=50 and a corpus of 100 distinct words, if 20 of those words occur fewer than 50 times, set max_features=80.
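
    To make the two parameters concrete, here is a rough pure-Python sketch of what they control. It is not how sklearn implements them: the function names are invented for this example, the text is split on whitespace only, and case is preserved.

```python
from collections import Counter


def ngrams_in_range(text, min_n, max_n):
    """List all word n-grams with min_n <= n <= max_n, shortest first."""
    words = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams


def top_features(texts, min_n, max_n, max_features):
    """Keep only the max_features most frequent n-grams across all texts."""
    counts = Counter(g for t in texts for g in ngrams_in_range(t, min_n, max_n))
    return [gram for gram, _ in counts.most_common(max_features)]


print(ngrams_in_range("Python is useful", 1, 3))
print(ngrams_in_range("Python is useful", 1, 1))
print(top_features(["a b a", "a c"], 1, 1, 1))
```

    A wider ngram_range adds many more candidate features, so it usually makes sense to tune it together with max_features.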

    • Varying max_features

    for max_features in range(1000, 10000 , 1000):
        tfidf = TfidfVectorizer(ngram_range=(1,3), max_features=max_features)
        train_test = tfidf.fit_transform(train_df['text'])
    
        clf = RidgeClassifier()
        clf.fit(train_test[:10000], train_df['label'].values[:10000])
    
        val_pred = clf.predict(train_test[10000:])
        print("max_features =",max_features,": f1_score =", end=' ')
        print(f1_score(train_df['label'].values[10000:], val_pred, average='macro'))
    

    Result:

    max_features = 1000 : f1_score = 0.8270776630718544
    max_features = 2000 : f1_score = 0.8603842642428617
    max_features = 3000 : f1_score = 0.8721598830546126
    max_features = 4000 : f1_score = 0.8753945850878357
    max_features = 5000 : f1_score = 0.8850817067811825
    max_features = 6000 : f1_score = 0.8901406771892212
    max_features = 7000 : f1_score = 0.8920634181410882
    max_features = 8000 : f1_score = 0.8897593080180294
    max_features = 9000 : f1_score = 0.89142965492415
    

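    max_features=7000 gives the highest f1_score (0.8921) among the values tested.

    • With max_features=7000 fixed, vary ngram_range
    for ngram_max in range(1, 6):  # upper n-gram length 1..5, matching the results below
        tfidf = TfidfVectorizer(ngram_range=(1,ngram_max), max_features=7000)
        train_test = tfidf.fit_transform(train_df['text'])
        # then fit RidgeClassifier, predict and print f1_score as in the max_features loop above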
    

    Result:

    ngram_range=(1,1), f1_score = 0.8603325900148268
    ngram_range=(1,2), f1_score = 0.8875087923712194
    ngram_range=(1,3), f1_score = 0.8920634181410882
    ngram_range=(1,4), f1_score = 0.891603038208734
    ngram_range=(1,5), f1_score = 0.8914513820822496
    
    2. Try other machine learning models for training and validation
    • Logistic Regression classifier (0.8589)
    from sklearn.linear_model import LogisticRegression
    clf = LogisticRegression(max_iter=10000)
    
    • Linear Support Vector Classification (0.8964)
    from sklearn.svm import LinearSVC
    clf = LinearSVC(max_iter=10000)
    

    Best result: 0.8964 (LinearSVC)

