BMI598: Natural Language Process


    Author: Zongwei Zhou | 周纵苇
    Weibo: @MrGiovanni
    Email: zongweiz@asu.edu

    1. Token Features


    1.1 token features

    • case folding
    • punctuation
    • prefix/stem patterns
    • word shape
    • character n-grams
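
    A minimal sketch of what these per-token features might look like in Python; the exact feature set and the helper name below are illustrative, not taken from the course materials.

    import re

    def token_features(token):
        # illustrative per-token features: case folding, punctuation, affixes, shape, character n-grams
        shape = re.sub(r'[A-Z]', 'X', token)
        shape = re.sub(r'[a-z]', 'x', shape)
        shape = re.sub(r'[0-9]', 'd', shape)
        return {
            'lower': token.lower(),                           # case folding
            'is_punct': not any(c.isalnum() for c in token),  # punctuation
            'prefix3': token[:3],                             # prefix pattern
            'suffix3': token[-3:],                            # stem/suffix pattern
            'shape': shape,                                   # word shape, e.g. 'Aspirin81' -> 'Xxxxxxxdd'
            'char_3grams': [token[i:i + 3] for i in range(max(1, len(token) - 2))],
        }

    token_features('Aspirin81')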

    1.2 context features

    • token feature from n tokens before and n tokens after
    • word n-grams, n=2,3,4
    • skip-n-grams
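
    A short sketch of these context features using NLTK's n-gram and skip-gram helpers; the example sentence and window size are made up.

    from nltk import ngrams
    from nltk.util import skipgrams

    tokens = ['the', 'patient', 'denies', 'chest', 'pain']

    list(ngrams(tokens, 2))        # word bigrams; use n=3,4 for longer n-grams
    list(skipgrams(tokens, 2, 1))  # 2-grams allowing up to 1 skipped token

    # token features from n tokens before and n tokens after position i
    def context_window(tokens, i, n=2):
        return tokens[max(0, i - n):i], tokens[i + 1:i + 1 + n]

    context_window(tokens, 2)      # (['the', 'patient'], ['chest', 'pain'])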

    1.3 sentence features

    • sentence length
    • case-folding patterns
    • presence of digits
    • enumeration tokens at the start
    • a colon at the end
    • whether verbs indicate past or future tense
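
    A rough sketch of these sentence-level features (the verb-tense feature is omitted because it needs a POS tagger); the feature names are my own.

    import re

    def sentence_features(sentence):
        tokens = sentence.split()
        return {
            'length': len(tokens),                            # sentence length
            'is_upper': sentence.isupper(),                   # case-folding pattern
            'has_digit': any(c.isdigit() for c in sentence),  # presence of digits
            'starts_enum': bool(re.match(r'\s*(\d+[.)]|[-*•])\s', sentence)),  # enumeration token at the start
            'ends_colon': sentence.rstrip().endswith(':'),    # a colon at the end
        }

    sentence_features('1. Take 81 mg of aspirin daily:')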

    1.4 section features

    • headings
    • subsections

    1.5 document features

    • case pattern across the document
    • document length indicator

    1.6 normalization

    The difference between stemming and lemmatization
    Stemming: rule-based

    from nltk.stem.porter import PorterStemmer
    porter_stemmer = PorterStemmer()
    porter_stemmer.stem('wolves')
    # the rule-based stemmer just strips the 'es' suffix, leaving a non-word stem
    u'wolv'
    

    Lemmatization: dictionary-based

    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    lemmatizer.lemmatize('wolves')
    # the dictionary lookup returns the correct lemma
    u'wolf'
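
    One caveat worth noting (my addition, not from the notes): WordNetLemmatizer assumes the word is a noun unless you pass a part-of-speech tag.

    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    lemmatizer.lemmatize('running')           # default pos='n', so nothing changes
    u'running'
    lemmatizer.lemmatize('running', pos='v')  # telling it the word is a verb
    u'run'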
    

    2. Word Embedding


    2.1 tf-idf

    The feature vector length equals the number of words in the vocabulary.
    Key idea: counting (term frequency weighted by inverse document frequency).
    See the worked example at https://en.wikipedia.org/wiki/Tf%E2%80%93idf
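
    As a concrete illustration (the library choice is mine, not from the notes), scikit-learn's TfidfVectorizer turns each document into a vector whose length is the vocabulary size:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ['the patient denies chest pain', 'chest pain resolved after treatment']
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)   # shape: (number of documents, vocabulary size)
    vectorizer.get_feature_names_out()   # the vocabulary (scikit-learn >= 1.0)
    X.toarray()                          # tf-idf weight of each vocabulary word in each document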

    2.2 word2vec

    The feature length is fixed and usually small (a few hundred dimensions).

    Start with V random 300-dimensional vectors as initial embeddings
    Use logistic regression, the second most basic classifier used in machine learning after naïve Bayes

    • Take a corpus and take pairs of words that co-occur as positive examples
    • Take pairs of words that don't co-occur as negative examples
    • Train the classifier to distinguish these by slowly adjusting all the embeddings to improve the classifier performance
    • Throw away the classifier code and keep the embeddings.

    Pre-trained models are available for download
    https://code.google.com/archive/p/word2vec/
    You can use gensim (in python) to access the models
    http://nlp.stanford.edu/projects/glove/
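
    For example, the Google News vectors from the first link can be loaded with gensim roughly like this (the file name assumes you have downloaded and unzipped that archive):

    from gensim.models import KeyedVectors

    wv = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    wv['aspirin'].shape                  # (300,) -- one fixed-length vector per word
    wv.most_similar('aspirin', topn=3)   # nearest neighbours by cosine similarity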

    Brilliant insight: Use running text as implicitly supervised training data!

    Setup
    Let's represent words as vectors of some length (say 300), randomly initialized.
    So we start with 300 * V random parameters, where V is the number of words in the vocabulary.
    Over the entire training set, we’d like to adjust those word vectors such that we

    • Maximize the similarity of the target word, context word pairs (t,c) drawn from the positive data
    • Minimize the similarity of the (t,c) pairs drawn from the negative data.

    Learning the classifier
    Iterative process.
    We’ll start with 0 or random weights
    Then adjust the word weights to

    • make the positive pairs more likely
    • and the negative pairs less likely over the entire training set:
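
    A bare-bones numpy sketch of one such update (skip-gram with negative sampling); the dimensions, learning rate, and word indices are toy values, and real implementations add subsampling and frequency-based negative sampling.

    import numpy as np

    V, d, lr = 10000, 300, 0.025
    W = np.random.rand(V, d) - 0.5    # target-word embeddings
    C = np.random.rand(V, d) - 0.5    # context-word embeddings
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def sgns_step(t, c, negatives):
        # push sigma(t.c) toward 1 for the positive pair and toward 0 for the negative pairs
        for ctx, label in [(c, 1.0)] + [(n, 0.0) for n in negatives]:
            w, v = W[t].copy(), C[ctx].copy()
            g = sigmoid(w @ v) - label    # gradient of the logistic loss wrt the dot-product score
            W[t] -= lr * g * v
            C[ctx] -= lr * g * w

    sgns_step(t=42, c=817, negatives=[3, 1999, 250])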

    3. Sentence vectors


    Distributed Representations of Sentences and Documents

    PV-DM (Distributed Memory model of Paragraph Vectors)

    • Paragraph as a pseudo word
    • The algorithm learns a matrix D of paragraph vectors, one per paragraph
    • in addition to the word-vector matrix W
    • Contexts are fixed length
    • Sampled from a sliding window over the paragraph
    • PV and WV are trained using Stochastic Gradient Descent

    What about unseen paragraphs?

    • Add more columns to D (the paragraph vectors matrix)
    • Learn the new D, while holding U, b, and W fixed
    • We use D as features in a standard classifier

    PV-DBOW (Distributed Bag of Words version of Paragraph Vectors)

    • Works by using a sliding window on a paragraph
    • then predict words randomly sampled from the paragraph
    • prediction: a classification task of the random word given the PV
    When predicting the sentiment of a sentence, use the paragraph vector instead of a single word embedding.
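
    In practice these paragraph vectors can be trained with gensim's Doc2Vec; a toy sketch (gensim 4.x names, tiny made-up corpus and sizes):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [TaggedDocument(['the', 'drug', 'caused', 'severe', 'nausea'], tags=[0]),
              TaggedDocument(['no', 'adverse', 'events', 'were', 'reported'], tags=[1])]

    model = Doc2Vec(corpus, dm=1, vector_size=50, window=3, min_count=1, epochs=40)  # dm=1: PV-DM, dm=0: PV-DBOW

    model.dv[0]                                            # learned vector for paragraph 0
    model.infer_vector(['patient', 'reported', 'nausea'])  # vector for an unseen paragraph (D is extended, W held fixed)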

    4. Neural Network


    \sigma(z)=\frac{1}{1+e^{-z}}
    softmax(z_i)=\frac{e^{z_i}}{\sum_{j=1}^{d}e^{z_j}}, \quad 1\leq i\leq d

    import numpy as np
    z = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
    softmax = lambda z: np.exp(z) / np.sum(np.exp(z))
    # the outputs are all positive and sum to 1
    softmax(z)
    array([0.02364054, 0.06426166, 0.1746813 , 0.474833  , 0.02364054, 0.06426166, 0.1746813 ])
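
    For completeness, the sigmoid from the first formula behaves like this (same numpy setup as above):

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    sigmoid(0.0), sigmoid(4.0)   # (0.5, 0.98201...) -- squashes any real number into (0, 1)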
    

    http://colah.github.io/posts/2015-08-Understanding-LSTMs/
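
    Tying the pieces together, a minimal Keras sketch (assuming TensorFlow 2.x) of the kind of BiLSTM tagger mentioned in the highlights below; the vocabulary size, tag count, and layer dimensions are placeholders.

    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers

    model = tf.keras.Sequential([
        layers.Embedding(input_dim=5000, output_dim=100, mask_zero=True),  # token ids -> embeddings
        layers.Bidirectional(layers.LSTM(64, return_sequences=True)),      # read the sentence left-to-right and right-to-left
        layers.Dense(10, activation='softmax'),                            # one tag distribution per token
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

    dummy = np.array([[4, 87, 251, 3, 0]])   # one padded sentence of token ids
    model(dummy).shape                       # (1, 5, 10): a distribution over 10 tags for each of the 5 tokens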

    5. Highlight summary


    • I2b2 challenge – concepts, relations
    • Vector semantics – long vectors
    • Vector semantics – Word embeddings
    • Vector semantics – how to compute word embeddings
    • Vector semantics – Paragraph vectors
    • UMLS and Metamap lite (max match algorithm)
    • Neuron and math behind it
    • Feed forward neural network model - math behind it
    • Example FFN for predicting the next word
    • Keras – Intro and validation
    • Keras examples – simple solutions to concept extraction and relations
    • Data preparation for concept extraction and relation classification
    • IBM MADE 1.0 paper: concepts/relations using BiLSTM CRF/Attention
    • Recurrent neural networks and LSTM
