2. NLP with Probabilistic Models

Author: Kevin不会创作 | Published 2020-12-03 08:33

Table of Contents

  • Autocorrect
  • Part of Speech Tagging and Hidden Markov Models
  • Autocomplete and Language Models

Autocorrect

  • What is autocorrect?

    Autocorrect is an application that changes misspelled words into the correct ones.

  • How does it work? (a minimal sketch of these steps appears after this list)

    1. Identify a misspelled word
    2. Find strings n edit distance away
    3. Filter candidates
    4. Calculate word probabilities
  • Identify a misspelled word

if word not in vocab:      # vocab: the set of known, correctly spelled words
    misspelled = True
  • Find strings n edit distance away

    • Edit: an operation performed on a string to change it
    • Edit distance counts the number of edit operations
    • n is usually one to three edits
    • Insert (add a letter)
      to: top, two
    • Delete (remove a letter)
      hat: ha, at, ht
    • Switch (swap 2 adjacent letters)
      eta: eat, tea
    • Replace (change 1 letter to another)
      jaw: jar, paw
  • Filter candidates

    Compare these strings to a known dictionary or vocabulary. If the string does not appear in the dictionary, remove it from the list of candidates.

  • Calculate word probabilities

    P(w)=\frac{C(w)}{V}

    • P(w) Probability of a word
    • C(w) Number of times the word appears
    • V Total size of the corpus
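
A minimal end-to-end sketch of the four steps above, using a toy corpus; the names (word_counts, edits_one, correct) and the restriction to single-edit candidates are illustrative choices, not part of the original notes:

# Minimal autocorrect sketch covering the four steps above (illustrative names).
from collections import Counter

corpus = "I am happy because I am learning".lower().split()
word_counts = Counter(corpus)          # C(w)
total = sum(word_counts.values())      # V, total size of the corpus
letters = "abcdefghijklmnopqrstuvwxyz"

def edits_one(word):
    """All strings one edit (insert, delete, switch, replace) away from word."""
    splits   = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    inserts  = [l + c + r               for l, r in splits for c in letters]
    deletes  = [l + r[1:]               for l, r in splits if r]
    switches = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:]           for l, r in splits if r for c in letters]
    return set(inserts + deletes + switches + replaces)

def correct(word):
    if word in word_counts:                           # step 1: not misspelled
        return word
    candidates = edits_one(word) & set(word_counts)   # steps 2-3: generate, then filter
    if not candidates:
        return word
    # step 4: pick the candidate with the highest probability P(w) = C(w) / V
    return max(candidates, key=lambda w: word_counts[w] / total)

print(correct("hapy"))   # -> 'happy'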

Minimum edit distance

  • How to evaluate similarity between 2 strings?

  • Minimum number of edits needed to transform 1 string into the other

  • Applications

    • Spelling correction
    • Document similarity
    • Machine translation
    • DNA sequencing
  • Edits (Edit cost)

    • Insert (1)
    • Delete (1)
    • Replace (2)
  • Example

    Source: play
    Target: stay

    • p \rightarrow s: replace
    • l \rightarrow t: replace
    • Edit distance = 2 + 2 = 4 (a dynamic programming sketch of this computation follows this list)
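
A minimal dynamic programming sketch of this computation, with the edit costs listed above (insert 1, delete 1, replace 2); the function name and table layout are illustrative:

def min_edit_distance(source, target, ins_cost=1, del_cost=1, rep_cost=2):
    """Minimum edit distance with the costs listed above (tabular DP)."""
    n, m = len(source), len(target)
    # D[i][j] = minimum cost of turning source[:i] into target[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            replace = 0 if source[i - 1] == target[j - 1] else rep_cost
            D[i][j] = min(D[i - 1][j] + del_cost,      # delete from source
                          D[i][j - 1] + ins_cost,      # insert into source
                          D[i - 1][j - 1] + replace)   # replace (free if letters match)
    return D[n][m]

print(min_edit_distance("play", "stay"))   # -> 4: two replacements at cost 2 each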

Part of Speech Tagging and Hidden Markov Models

  • What is part of speech tagging?

    Part of speech refers to the category of words or the lexical terms in the language. Examples of these lexical terms in the English language would be noun, verb, adjective, adverb, pronoun, preposition, etc. The process of assigning these tags to the words of a sentence or your corpus is referred to as parts of speech tagging, or POS tagging for short.

Markov Chains

A Markov chain is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

You can use a table called a transition matrix to store the states and transition probabilities. Each row in the matrix holds the transition probabilities from one state to all other states, and all the outgoing transition probabilities from a given state must sum to one. To assign a part of speech tag to the first word in a sentence, you can introduce an initial state and include its transition probabilities in the table as well.

Transition Matrix
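
A small made-up transition matrix for the tags NN, VB, and O, with the initial state π as an extra row; the numbers are purely illustrative and only show that every row of outgoing probabilities sums to one:

import numpy as np

# Row 0 is the initial state pi; the other rows are the tags NN, VB, O.
states = ["pi", "NN", "VB", "O"]
A = np.array([
    [0.4, 0.1, 0.5],   # pi -> NN, VB, O
    [0.2, 0.2, 0.6],   # NN -> NN, VB, O
    [0.4, 0.3, 0.3],   # VB -> NN, VB, O
    [0.2, 0.3, 0.5],   # O  -> NN, VB, O
])
for state, row in zip(states, A):
    print(state, row, "sum =", row.sum())   # each sum is 1.0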

Hidden Markov Model

The name hidden Markov model implies that states are hidden or not directly observable. Going back to the Markov model that has the states for the parts of speech, such as noun, verb, or other, you can now think of these as hidden states because these are not directly observable from the text data.

The hidden Markov model also has additional probabilities known as emission probabilities. These describe the transition from the hidden states of your hidden Markov model, which are parts of speech seen here as circles for noun, verb, and other, to the observables or the words of your corpus.

Emission Matrix

Calculation

  • Calculate Transition Probabilities

    1. Count occurrences of tag pairs C(t_{i-1},t_i)

    2. Calculate probabilities using the counts
      P(t_i|t_{i-1})=\frac{C(t_{i-1},t_i)}{\sum_{j=1}^{N}C(t_{i-1},t_{j})}

  • Populating the Transition Matrix

    Transition Matrix

    To avoid a division by zero, you can change your formula slightly
    P(t_i|t_{i-1})=\frac{C(t_{i-1},t_i)+\epsilon}{\sum_{j=1}^{N}C(t_{i-1},t_{j})+N*{\epsilon}}

  • Populating the Emission Matrix

    Emission Matrix

    P(w_i|t_i)=\frac{C(t_{i},w_i)+\epsilon}{\sum_{j=1}^{V}C(t_{i},w_{j})+V*{\epsilon}}=\frac{C(t_{i},w_i)+\epsilon}{C(t_i)+V*{\epsilon}}

    where V is the size of the vocabulary. A counting sketch for both matrices follows this list.
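
A minimal counting sketch for both matrices, assuming a toy list of (word, tag) pairs; the tiny tagged corpus, the start pseudo-tag, and the variable names are illustrative:

from collections import defaultdict

# Tiny tagged corpus of (word, tag) pairs -- purely illustrative.
tagged = [("I", "O"), ("love", "VB"), ("to", "O"), ("learn", "VB"), ("NLP", "NN")]
epsilon = 0.001
tags = sorted({t for _, t in tagged})
vocab = sorted({w for w, _ in tagged})

transition_counts = defaultdict(int)   # C(t_{i-1}, t_i)
emission_counts = defaultdict(int)     # C(t_i, w_i)
tag_counts = defaultdict(int)          # C(t_i)
prev_tag = "--s--"                     # start-of-sentence pseudo-tag
for word, tag in tagged:
    transition_counts[(prev_tag, tag)] += 1
    emission_counts[(tag, word)] += 1
    tag_counts[tag] += 1
    prev_tag = tag

N, V = len(tags), len(vocab)

def transition_prob(prev_t, t):
    """P(t | prev_t) with add-epsilon smoothing over the N tags."""
    denom = sum(transition_counts[(prev_t, t2)] for t2 in tags) + N * epsilon
    return (transition_counts[(prev_t, t)] + epsilon) / denom

def emission_prob(t, w):
    """P(w | t) with add-epsilon smoothing over the V vocabulary words."""
    return (emission_counts[(t, w)] + epsilon) / (tag_counts[t] + V * epsilon)

print(transition_prob("O", "VB"), emission_prob("VB", "learn"))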

Viterbi Algorithm

If you're given a sentence like "Why not learn something?", what is the most likely sequence of parts of speech tags given the sentence in your model? The sequence can be computed using the Viterbi algorithm.

The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states—called the Viterbi path—that results in a sequence of observed events, especially in the context of Markov information sources and hidden Markov models (HMM).

Viterbi Algorithm
  • Initialization

    Given your transition and emission probabilities, you first populate and then use two auxiliary matrices, C and D. The matrix C holds the intermediate optimal probabilities, and matrix D holds the indices of the visited states.

    In the initialization step, the first column of each of the matrices C and D is populated. The first column of C holds, for each tag t_i, the probability of transitioning from the start state \pi in the graph to t_i, multiplied by the emission probability of the first word w_1 from that tag.

    Initialization

    In the D matrix, you store the labels that represent the different states you're traversing when finding the most likely sequence of parts of speech tags for the given sequence of words, W_1 all the way to W_k.

    Initialization
  • Forward Pass

    Forward

    Each remaining column is filled in with the recurrence c_{i,j}=\max_{k} c_{k,j-1}\,a_{k,i}\,b_{i,cindex(w_j)} (and d_{i,j} stores the maximizing k), where:
    • b_{i,cindex(w_j)} is the emission probability from tag t_i to word w_j
    • a_{k,i} is the transition probability from the parts of speech tag t_k to the current tag t_i
    • c_{k,j-1} is the probability of the best path you've traversed up to the preceding word
  • Backward Pass

    You'll retrieve the most likely sequence of parts of speech tags for your given sequence of words using the backward pass: starting from the highest-probability entry in the last column of C, you follow the indices stored in D back to the first word. A compact sketch of the whole algorithm follows this list.

    Backward
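
A compact sketch of the three steps (initialization, forward pass, backward pass), assuming the transition matrix A (with the initial state in row 0), the emission matrix B, and integer word indices are already available; all names and the example matrices are illustrative:

import numpy as np

def viterbi(A, B, observations):
    """A: (N+1) x N transition matrix, row 0 is the initial state pi.
    B: N x V emission matrix. observations: list of word (column) indices.
    Returns the most likely sequence of tag indices."""
    N = B.shape[0]                       # number of tags
    K = len(observations)
    C = np.zeros((N, K))                 # best probabilities
    D = np.zeros((N, K), dtype=int)      # back-pointers to the best previous tag

    # Initialization: transition from the start state times emission of w_1.
    C[:, 0] = A[0, :] * B[:, observations[0]]

    # Forward pass: c_{i,j} = max_k c_{k,j-1} * a_{k,i} * b_{i,cindex(w_j)}
    for j in range(1, K):
        for i in range(N):
            scores = C[:, j - 1] * A[1:, i] * B[i, observations[j]]
            D[i, j] = np.argmax(scores)
            C[i, j] = scores[D[i, j]]

    # Backward pass: follow the back-pointers from the best final state.
    path = [int(np.argmax(C[:, K - 1]))]
    for j in range(K - 1, 0, -1):
        path.append(int(D[path[-1], j]))
    return path[::-1]

# Made-up matrices for 2 tags and a 3-word vocabulary:
A = np.array([[0.6, 0.4],
              [0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.2, 0.7]])
print(viterbi(A, B, [0, 2, 1]))   # indices of the observed words

In practice the products are usually computed as sums of log probabilities to avoid numerical underflow on long sentences.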

Autocomplete and Language Models

  • What can a language model do?

    • Autocomplete a sentence

      • Estimate probability of word sequences
      • Estimate probability of a word following a sequence of words
    • Speech recognition
      Example: P(I saw a van) > P(eyes awe of an)

    • Spelling correction
      Example: P(entered the shop to buy) > P(entered the ship to buy)

    • Augmentative communication
      Predict the most likely word from a menu, for people who are unable to physically talk or sign

N-grams Language Model

  • What is N-gram?

    An N-gram is a sequence of N words.

    • Example

      Corpus: I am happy because I am learning
      Unigrams: {I, am, happy, because, learning}
      Bigrams: {I am, am happy, happy because, ...}
      Trigrams: {I am happy, am happy because, ...}

  • Probability

    • Unigram probability
      P(w)=\frac{C(w)}{m}

    • Bigram probability
      P(y|x)=\frac{C(xy)}{\sum_{w}C(xw)}=\frac{C(xy)}{C(x)}

    • Trigram probability
      P(w_3|w_1^2)=\frac{C(w_1^2{w_3})}{C(w_1^2)}

    • N-gram probability
      P(w_N|w_1^{N-1})=\frac{C(w_1^{N-1}w_N)}{C(w_1^{N-1})}

  • Sequence Probability

    The probability of a whole sequence can be approximated by chaining N-gram probabilities, e.g. P(w_1^n)\approx\prod_{i=2}^{n}P(w_i|w_{i-1}) for a bigram model; a counting sketch for these formulas follows.
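
A minimal sketch that extracts n-grams from the example corpus above, computes bigram probabilities with the count formula, and approximates a sequence probability as a product of bigram probabilities; the function names are illustrative:

from collections import Counter

corpus = "I am happy because I am learning".split()

def ngrams(words, n):
    """All n-grams (as tuples) in the word list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

unigram_counts = Counter(ngrams(corpus, 1))
bigram_counts = Counter(ngrams(corpus, 2))

def bigram_prob(x, y):
    """P(y | x) = C(x y) / C(x)."""
    return bigram_counts[(x, y)] / unigram_counts[(x,)]

# Sequence probability, approximated as a product of bigram probabilities:
# P(w_1 ... w_n) ~= P(w_2|w_1) * P(w_3|w_2) * ... * P(w_n|w_{n-1})
sentence = "I am learning".split()
p = 1.0
for x, y in zip(sentence, sentence[1:]):
    p *= bigram_prob(x, y)
print(p)   # P(am|I) * P(learning|am) = 1.0 * 0.5 = 0.5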
