
A Survey of Text Classification

Author: Ydrivemecrazy | Published: 2017-09-15 16:47

    (Continuously updated)

    Introduction

    1. Definition

    What is text classification? Put simply, it means assigning a piece of text to one or more predefined categories. It falls within the broader scope of document classification.
    Input:
    a document d
    a fixed set of classes C = {c1, c2, ... , cn}
    Output:
    a predicted class ci from C

    2. Some simple applications

    1. spam detection
    2. authorship attribution
    3. age/gender identification
    4. sentiment analysis
    5. assigning subject categories, topics or genres
      ......

    Traditional methods

    1. Naive Bayes

    two assumptions:

    1. Bag-of-words assumption:
      position doesn't matter
    2. Conditional independence:
      the feature probabilities P(x_i | c) are independent given the class c

    to compute these probabilities:
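As a hedged sketch, assuming the standard multinomial Naive Bayes setup: with N_c the number of training documents in class c, N_doc the total number of training documents, count(w, c) the number of times word w occurs in documents of class c, and V the vocabulary (this notation is an assumption, not taken from the original), the maximum-likelihood estimates and the decision rule can be written as:

```latex
\hat{P}(c) = \frac{N_c}{N_{doc}}, \qquad
\hat{P}(w_i \mid c) = \frac{\operatorname{count}(w_i, c)}{\sum_{w \in V} \operatorname{count}(w, c)}

c_{NB} = \arg\max_{c \in C} \; \hat{P}(c) \prod_{i} \hat{P}(x_i \mid c)
```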

    add-one (Laplace) smoothing, to avoid a zero probability when a word never occurs with a class in the training data (a constant other than 1 can be added as well):
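A hedged sketch of the smoothed estimate, assuming count(w, c) is the number of occurrences of word w in training documents of class c and |V| is the vocabulary size (notation assumed here); the second form shows the general case with an arbitrary constant α > 0:

```latex
\hat{P}(w_i \mid c) = \frac{\operatorname{count}(w_i, c) + 1}{\sum_{w \in V} \operatorname{count}(w, c) + |V|}

\hat{P}_{\alpha}(w_i \mid c) = \frac{\operatorname{count}(w_i, c) + \alpha}{\sum_{w \in V} \operatorname{count}(w, c) + \alpha |V|}
```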

    to deal with unknown words (words that never appear in the training data):

    main features:

    1. very fast, low storage requirements
    2. robust to irrelevant features
    3. good in domains with many equally important features
    4. optimal if the independence assumptions hold
    5. generally less accurate than stronger classifiers, but a dependable baseline
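The Naive Bayes steps described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names `train_nb` and `predict_nb`, the toy training data, and the choice to reserve one extra vocabulary slot for an unknown-word token are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label) pairs. Returns (log_prior, word_counts, vocab)."""
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)  # word_counts[c][w] = count(w, c)
    for tokens, label in docs:
        word_counts[label].update(tokens)  # bag of words: positions are ignored
    vocab = {w for tokens, _ in docs for w in tokens}
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    return log_prior, word_counts, vocab

def predict_nb(model, tokens):
    """Return the class with the highest smoothed log-probability."""
    log_prior, word_counts, vocab = model
    v = len(vocab) + 1  # assumption: one extra slot for an unknown-word token
    scores = {}
    for c, prior in log_prior.items():
        total = sum(word_counts[c].values())
        # Add-one smoothing: an unseen word has count 0, so it gets 1 / (total + v).
        scores[c] = prior + sum(
            math.log((word_counts[c][w] + 1) / (total + v)) for w in tokens
        )
    return max(scores, key=scores.get)

# Toy training data (hypothetical), two classes "c" and "j".
train = [
    ("Chinese Beijing Chinese".split(), "c"),
    ("Chinese Chinese Shanghai".split(), "c"),
    ("Chinese Macao".split(), "c"),
    ("Tokyo Japan Chinese".split(), "j"),
]
model = train_nb(train)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.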

    2. SVM

    1. Cost function of SVM:
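A hedged sketch of the objective, assuming the formulation used in the Coursera machine learning course listed in the references: m training examples (x^(i), y^(i)) with y in {0, 1}, parameters θ, and hinge-shaped costs cost_1 and cost_0 (the exact hinge form below is an assumption):

```latex
\min_{\theta} \; C \sum_{i=1}^{m} \left[ y^{(i)} \, \mathrm{cost}_1\!\left(\theta^{T} x^{(i)}\right)
  + \left(1 - y^{(i)}\right) \mathrm{cost}_0\!\left(\theta^{T} x^{(i)}\right) \right]
  + \frac{1}{2} \sum_{j=1}^{n} \theta_j^{2}

\mathrm{cost}_1(z) = \max(0,\, 1 - z), \qquad \mathrm{cost}_0(z) = \max(0,\, 1 + z)
```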

    2. SVM decision boundary
    when C is very large:
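As a sketch under the same assumed formulation (labels y in {0, 1}, parameters θ): when C is very large, the optimizer is pushed to make the cost term zero, so the problem behaves like a large-margin problem with hard constraints:

```latex
\min_{\theta} \; \frac{1}{2} \sum_{j=1}^{n} \theta_j^{2}
\quad \text{s.t.} \quad
\theta^{T} x^{(i)} \ge 1 \ \text{if } y^{(i)} = 1, \qquad
\theta^{T} x^{(i)} \le -1 \ \text{if } y^{(i)} = 0
```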

    about kernel:
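One commonly used kernel is the Gaussian (RBF) kernel, which measures how similar an input x is to a landmark point. The sketch below is only an illustration; the function name `gaussian_kernel` and the default `sigma` are assumptions.

```python
import math

def gaussian_kernel(x, landmark, sigma=1.0):
    """Similarity exp(-||x - l||^2 / (2 * sigma^2)) between x and a landmark l."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, landmark))
    return math.exp(-sq_dist / (2 * sigma ** 2))
```

The value is 1 when x coincides with the landmark and decays toward 0 as x moves away; a larger `sigma` makes the decay slower.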

    So far, the SVM seems applicable only to two-class classification; multi-class problems are usually handled by combining several binary SVMs, for example with a one-vs-all scheme.

    Comparison with logistic regression:

    When applying SVM or logistic regression to text classification, all you need is labeled data and a suitable way to represent each text as a vector (for example a one-hot representation, word2vec, or doc2vec).
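As a minimal sketch of the "represent the texts with vectors" step, the code below builds a count-based bag-of-words vector, one simple option alongside the representations mentioned above. The helper names `build_vocab` and `bow_vector` and the whitespace tokenization are assumptions for illustration.

```python
def build_vocab(texts):
    """Map each distinct lowercase token to a column index, in order of first appearance."""
    vocab = {}
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def bow_vector(text, vocab):
    """Count-based bag-of-words vector; tokens outside the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for tok in text.lower().split():
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] += 1
    return vec
```

The resulting fixed-length vectors, paired with the labels, can then be fed to an SVM or logistic regression trainer.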

    Neural network methods

    1. CNN

    (1) The paper "Convolutional Neural Networks for Sentence Classification" (EMNLP 2014)
    (2) The paper "A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification"

    The model uses multiple filters to obtain multiple features. These features form the penultimate layer and are passed to a fully connected softmax layer whose output is the probability distribution over labels.
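The forward pass described above can be sketched in plain Python: each filter slides over windows of word vectors to produce a feature map, max-over-time pooling keeps one feature per filter, and the pooled features go through a softmax layer. This is an illustration under assumptions, not the paper's code: the function names, the ReLU non-linearity, and the list-of-lists data layout are all chosen for the example.

```python
import math

def conv_feature_map(sentence, filt, bias=0.0):
    """sentence: n word vectors of dimension k; filt: h x k weights covering h words.
    Assumes the sentence has at least h words."""
    h = len(filt)
    feature_map = []
    for i in range(len(sentence) - h + 1):
        window = sentence[i:i + h]
        z = bias + sum(
            w * x
            for f_row, x_row in zip(filt, window)
            for w, x in zip(f_row, x_row)
        )
        feature_map.append(max(0.0, z))  # ReLU non-linearity (an assumption)
    return feature_map

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cnn_forward(sentence, filters, W_out, b_out):
    """Return (penultimate layer, label probabilities)."""
    # Max-over-time pooling: one feature per filter.
    penultimate = [max(conv_feature_map(sentence, f)) for f in filters]
    logits = [
        sum(w * x for w, x in zip(row, penultimate)) + b
        for row, b in zip(W_out, b_out)
    ]
    return penultimate, softmax(logits)
```

Filters with different window sizes h can be mixed freely, because pooling reduces every feature map to a single number.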

    For regularization, the authors employ dropout on the penultimate layer together with a constraint on the l2-norms of the weight vectors. Dropout prevents co-adaptation of hidden units by randomly dropping out (setting to zero) a proportion of the hidden units during training.
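The two regularizers can be sketched as follows. This is a minimal illustration under assumptions: `dropout` and `clip_l2_norm` are hypothetical helper names, `p` is taken to be the probability of dropping a unit, and `s` is the maximum allowed l2-norm of a weight vector.

```python
import math
import random

def dropout(units, p, rng):
    """Training-time dropout: zero each unit independently with probability p."""
    return [0.0 if rng.random() < p else u for u in units]

def clip_l2_norm(weights, s):
    """Rescale a weight vector so that its l2-norm is at most s."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm > s:
        return [w * s / norm for w in weights]
    return list(weights)
```

In a training loop, `dropout` would be applied to the penultimate layer on every forward pass, and `clip_l2_norm` to each weight vector after a gradient step.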

    Pre-trained Word Vectors
    The authors use the publicly available word2vec vectors that were trained on 100 billion words from Google News.

    Results

    A simplified implementation using TensorFlow is available on GitHub: https://github.com/dennybritz/cnn-text-classification-tf

    2. RNN

    The paper "Hierarchical Attention Networks for Document Classification" (NAACL 2016)

    The paper tests the hypothesis that better representations can be obtained by incorporating knowledge of document structure into the model architecture.

    1. Different words and sentences in a document are differentially informative.
    2. Moreover, the importance of words and sentences is highly context dependent,
      i.e. the same word or sentence may be differentially important in different contexts.

    Attention brings two benefits: not only does it often result in better performance, but it also provides insight into which words and sentences contribute to the classification decision, which can be of value in applications and analysis.

    Hierarchical Attention Network

    To learn more about attention mechanisms, see: http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/

    The model uses a GRU-based sequence encoder.
    1. Word Encoder:

    2. Word Attention:

    3. Sentence Encoder:

    4. Sentence Attention:

    5. Document Classification:
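    Because the document vector v is a high-level representation of document d, it can be used as the feature vector for classification.

    In the training loss (the negative log-likelihood of the correct labels), j is the label of document d.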
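The attention and classification steps can be sketched in plain Python. This is an illustration under assumptions, not the paper's code: the encoder outputs are taken as given lists of vectors, and the names `attention_pool`, `classify` and `nll`, as well as the parameter names, are hypothetical. The same `attention_pool` function serves both levels: over word annotations to build a sentence vector, and over sentence annotations to build the document vector v.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden, W, b, context):
    """hidden: list of annotation vectors h_t. Returns (weights alpha, pooled vector).
    u_t = tanh(W h_t + b); alpha = softmax(u_t . context); pooled = sum_t alpha_t * h_t."""
    scores = []
    for h in hidden:
        u = [
            math.tanh(sum(w * x for w, x in zip(row, h)) + bi)
            for row, bi in zip(W, b)
        ]
        scores.append(sum(ui * ci for ui, ci in zip(u, context)))
    alpha = softmax(scores)
    dim = len(hidden[0])
    pooled = [sum(a * h[d] for a, h in zip(alpha, hidden)) for d in range(dim)]
    return alpha, pooled

def classify(v, W_c, b_c):
    """Label distribution p = softmax(W_c v + b_c) for a document vector v."""
    return softmax([sum(w * x for w, x in zip(row, v)) + b for row, b in zip(W_c, b_c)])

def nll(p, j):
    """Negative log-likelihood of the correct label j for one document."""
    return -math.log(p[j])
```

The weights `alpha` are what make the model inspectable: they show which words or sentences contributed most to the pooled vector.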

    Results

    A simplified implementation written in Python is available on GitHub: https://github.com/richliao/textClassifier

    References

    https://www.cs.cmu.edu/%7Ediyiy/docs/naacl16.pdf
    https://www.coursera.org/learn/machine-learning/home/
    https://www.youtube.com/playlist?list=PL6397E4B26D00A269
