
3. NLP with Sequence Models

Author: Kevin不会创作 | Published 2020-12-06 09:02

Table of Contents

  • Neural Networks for Sentiment Analysis
    • Neural Networks in Trax
  • RNN for Language Modeling
    • Recurrent Neural Networks
    • Gated Recurrent Unit
    • Bi-directional RNNs
    • Deep RNNs
  • LSTMs and Named Entity Recognition
    • LSTMs
    • Named Entity Recognition
  • Siamese Networks

Neural Networks for Sentiment Analysis

Neural Networks in Trax

For a simple architecture like a 3-layer NN, you can use a serial model.

from trax import layers as tl

# A 3-layer feed-forward network: two hidden Dense layers with sigmoid
# activations, followed by a Dense output layer with a softmax over 3 classes.
model = tl.Serial(tl.Dense(4), tl.Sigmoid(),
                  tl.Dense(4), tl.Sigmoid(),
                  tl.Dense(3), tl.Softmax())
  • Advantages of using frameworks

    • Run fast on CPUs, GPUs and TPUs
    • Parallel computing
    • Record algebraic computations for gradient evaluation
  • Layers

    • Dense Layer
      z^{[i]}=w^{[i]}a^{[i-1]}

    • ReLU Layer
      a^{[i]}=g(z^{[i]})=\max(0, z^{[i]})

    • Embedding Layer

      An embedding layer takes the index assigned to each word in your vocabulary and maps it to a fixed-dimension representation of that word (its embedding).

      [Figure: Embedding Layer]
    • Mean Layer

      If you just use the embedding layer, you might end up with lots of parameters to train. As an alternative, you can take the mean of each feature across the embeddings, and that's exactly what the mean layer does in Trax (see the sketch after this list).

      [Figure: Mean Layer]
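
    As a minimal sketch, an embedding layer and a mean layer can be combined with a dense output layer into a simple Trax sentiment classifier. The vocabulary size, embedding dimension, and number of classes below are illustrative placeholders, not values from the original notes.

    from trax import layers as tl

    vocab_size = 9000      # assumed vocabulary size (placeholder)
    embedding_dim = 256    # assumed embedding dimension (placeholder)
    n_classes = 2          # positive / negative sentiment

    sentiment_model = tl.Serial(
        tl.Embedding(vocab_size=vocab_size, d_feature=embedding_dim),  # word index -> embedding vector
        tl.Mean(axis=1),      # average the embeddings over the sequence axis (the mean layer)
        tl.Dense(n_classes),  # one unit per class
        tl.LogSoftmax()       # log-probabilities over the classes
    )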

RNN for Language Modeling

  • Disadvantages of the traditional N-gram language model

    • Need large N-grams to capture dependencies between distant words
    • Need a lot of space and RAM

Recurrent Neural Networks

  • RNNs Basic Structure

    Compared to the traditional N-gram language model, RNNs look at every previous word.

    In RNNs, many of the computations also share parameters.

    [Figure: RNNs]
  • Applications of RNNs

    • Caption generation
    • Sentiment analysis
    • Machine translation
  • Math in RNNs

    [Figure: RNNs]
    • Cost function

      Cross Entropy Loss
      J=-\frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{K}y_j^{<t>}\log \hat{y}_j^{<t>}

      where K is the number of categories or classes and T is the total number of time steps.
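
    To make the recurrence and the cost concrete, here is a minimal NumPy sketch of one vanilla RNN step and the cross-entropy cost above. The parameterization (separate Whh and Whx matrices) and all shapes are illustrative assumptions, not taken from the original notes.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def rnn_step(x_t, h_prev, Whh, Whx, bh, Wyh, by):
        # One recurrent step; the same weights are shared across all time steps.
        h_t = np.tanh(Whh @ h_prev + Whx @ x_t + bh)   # new hidden state
        y_hat_t = softmax(Wyh @ h_t + by)              # prediction over K classes
        return h_t, y_hat_t

    def cross_entropy(y_true, y_hat):
        # y_true, y_hat: lists of length T with one-hot targets and predicted distributions.
        T = len(y_true)
        return -sum(np.sum(y * np.log(p)) for y, p in zip(y_true, y_hat)) / T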

Gated Recurrent Unit

A Gated Recurrent Unit (GRU) has additional parameters (gates) that allow you to control how much information to forget from the past and how much information to extract from the current input.

[Figure: GRU]
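
A minimal NumPy sketch of a single GRU step, using one common parameterization (gates computed from the concatenation of the previous hidden state and the current input). The weight names and shapes are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wr, Wu, Wh, br, bu, bh):
    concat = np.concatenate([h_prev, x_t])
    r = sigmoid(Wr @ concat + br)    # relevance (reset) gate: how much of the past to use
    u = sigmoid(Wu @ concat + bu)    # update gate: how much to overwrite
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)  # candidate hidden state
    h_t = (1.0 - u) * h_prev + u * h_tilde   # mix the old state and the candidate
    return h_t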

Bi-directional RNNs

In bi-directional RNNs, the outputs take information from both the past and the future.

[Figure: Bi-directional RNNs]

Deep RNNs

Deep RNNs have more than one layer, which helps in complex tasks.

[Figure: Deep RNNs]

LSTMs and Named Entity Recognition

  • Advantages of RNNs

    Capture dependencies within a short range

  • Disadvantages of RNNs

    • Struggles with longer sequences
    • Prone to vanishing or exploding gradients
  • Solving for vanishing or exploding gradients

    [Figure: Solutions]

Long short-term memory

Long short-term memory (LSTM) units are the best-known solution to the vanishing gradient problem.

  • Applications of LSTMs

    • Next-character prediction
    • Chatbots
    • Music composition
    • Image captioning
    • Speech recognition
  • LSTM Architecture

    A typical LSTM consists of a cell state and a hidden state, which holds the outputs from the cell. You can think of the cell as the memory of your network carrying all the relevant information down the sequence. As the cell travels, each gate adds or removes information from the cell state.

    [Figure: Cell and Hidden States]

    The gates make up the hidden states of your LSTM. They contain activation functions and element-wise operations. LSTMs typically have three gates: the forget gate, the input gate, and the output gate.

    • Forget Gate

      The forget gate decides which information from the previous hidden state and the current input should be kept or tossed out. It does this with a sigmoid function, which squeezes each value to between zero and one: values close to zero are forgotten, and values close to one are kept.

      [Figure: Forget Gate]
    • Input Gate

      The input gate updates the cell state. It is actually two layers, a sigmoid layer and a tanh layer. The sigmoid takes the previous hidden state and the current input and chooses which values to update by assigning each a number between zero and one; the closer to one, the higher its importance.

      The tanh layer also takes the hidden states and current inputs and squeezes the values between negative one and one. This helps to regulate the flow of information in your network.

      [Figure: Input Gate]
    • Output Gate

      The output gate will decide what your next hidden state should be.

      [Figure: Output Gate]
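
    Putting the three gates together, here is a minimal NumPy sketch of a single LSTM step under the standard formulation. The weight names and shapes are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
        concat = np.concatenate([h_prev, x_t])
        f = sigmoid(Wf @ concat + bf)        # forget gate: what to drop from the cell state
        i = sigmoid(Wi @ concat + bi)        # input gate (sigmoid part): what to update
        c_tilde = np.tanh(Wc @ concat + bc)  # input gate (tanh part): candidate values
        c_t = f * c_prev + i * c_tilde       # new cell state
        o = sigmoid(Wo @ concat + bo)        # output gate: what to expose as the hidden state
        h_t = o * np.tanh(c_t)               # new hidden state
        return h_t, c_t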

Named Entity Recognition

Named entity recognition (NER) is a fast and efficient way to scan text for certain kinds of information. NER systems locate and extract named entities from texts. Named entities can be anything from a place to an organization, to a person's name. They can even be times and dates.

  • Example of a labeled sentence

    [Figure: NER]

    An NER system would extract Sharon and classify it as a person's name, Miami as a geographical entity, and Friday as a time indicator. All the other words are labeled O, for filler word.

  • Applications of NER systems

    • Search engine efficiency
    • Recommendation engines
    • Customer service
    • Automatic trading
  • Processing data for NERs

    • Assign each class a number
    • Assign each word a number
    • Set sequence length to a certain number
    • Use the PAD token to fill empty spaces
  • Training the NER

    1. Create a tensor for each input and its corresponding number
    2. Put them in a batch
    3. Feed it into an LSTM unit
    4. Run the output through a dense layer
    5. Predict using a log softmax over K classes
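
    These steps amount to an embedding layer followed by an LSTM, a dense layer, and a log softmax. Here is a minimal Trax sketch of such a model; the vocabulary size, number of tag classes, and dimensions are illustrative placeholders, not values from the original notes.

    from trax import layers as tl

    vocab_size = 35000   # assumed size of the word vocabulary (placeholder)
    n_tags = 17          # assumed number of entity classes, including O (placeholder)
    d_model = 50         # assumed embedding / hidden dimension (placeholder)

    ner_model = tl.Serial(
        tl.Embedding(vocab_size=vocab_size, d_feature=d_model),  # word indices -> embeddings
        tl.LSTM(n_units=d_model),   # LSTM unit over the padded sequence
        tl.Dense(n_tags),           # one score per entity class
        tl.LogSoftmax()             # log softmax over the K classes
    )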

Siamese Networks

A Siamese network is a neural network made up of two identical sub-networks that are merged at the end.

  • Applications of Siamese Networks

    [Figure: Applications]
  • Architecture

    [Figure: Architecture]

    Notice that the two sub-networks in a Siamese network share the same parameters; that is, the learned parameters of each sub-network are exactly the same. So you actually only need to train one set of weights, not two.

  • Cost Function

    How old are you? \rightarrow Anchor (A)
    What is your age? \rightarrow Positive (P)
    Where are you from? \rightarrow Negative (N)

    To train your model, you'll compare the vectors output by each sub-network using a similarity score.
    s(A,P)\approx 1

    s(A,N)\approx -1

    diff=s(A,N)-s(A,P)

    • Triplets

      Having these three components, an anchor used in conjunction with a positive and a negative example, is what gives rise to the name triplet. Accordingly, triplet loss is the name for a loss function that uses these three components.

      To make sure the model doesn't update itself in a way that makes it worse, you can modify the loss so that whenever diff is less than zero, the loss is just zero. When the loss is zero, we're effectively not asking the model to update its weights, because it is performing as expected for that training example.
      L=\begin{cases} 0, & if\ diff\leq 0\\ diff, & if\ diff>0 \end{cases}

      Notice that the non-linearity happens at the origin of this line chart. But what happens when the model is correct, but only by a tiny bit? The model is still correct if the difference is a tiny negative number. What if you want the model to still learn from such an example, and push it toward a wider difference? You can shift the loss function a little to the left, by a margin that we'll refer to as alpha.
      L=\begin{cases} 0, & if\ diff+\alpha\leq 0\\ diff+\alpha, & if\ diff+\alpha>0 \end{cases}

    • Triplet Selection

      • Random: Easy to satisfy. Little to learn.
      • Hard: Harder to train. More to learn.
        s(A,N)\approx s(A,P)
    • Calculation

      First, prepare the data in batches. Notice that each row holds a pair of duplicate questions, one in each column, but within any single column no question is a duplicate of another question in that column.

      [Figure: Preparation]

      Then get vectors for these two batches. Each question in batch 1 is a duplicate of its corresponding question in batch 2, but none of the questions in batch 1 are duplicates of each other.

      [Figure: Obtain Vectors]

      The last step is to combine the two branches of the Siamese network by calculating the similarity between all vector pair combinations of v_1 with v_2.

      [Figure: Compute Similarity]

      Now you could just stop here and use these similarities with the triplet loss function you already know. The overall cost for your Siamese network would then be the sum of these individual losses over the training set, where the superscript i refers to a specific training example and there are m observations.
      diff=s(A,N)-s(A,P)

      L(A,P,N)=max(diff+\alpha,\ 0)

      J=\sum_{i=1}^{m}L(A^{(i)},P^{(i)},N^{(i)})

    • Modification

      You can use this off-diagonal information to modify the loss function and really improve your model's performance; a code sketch follows after the formulas below.

      • Mean negative: Mean of off-diagonal values in each row.
      • Closest negative: off-diagonal value closest to (but less than) the value on diagonal in each row.

      The mean negative helps the model converge faster during training by reducing noise: rather than training on each off-diagonal example individually, you train on their average. Why does averaging several observations usually reduce noise? We define noise as a small value drawn from a distribution centered around zero, so taking the average of several examples tends to cancel out the individual noise of those observations.

      The closest negative creates a slightly larger penalty by diminishing the effect of the otherwise more negative similarity s(A,N) that it replaces. You can think of the closest negative as the negative example that results in the smallest difference between the two cosine similarities. If you add that small difference to alpha, you generate the largest loss among all of the examples in that row. By focusing the training on the examples that produce higher loss values, you make the model update its weights more.

      L_{Original}=max(s(A,N)-s(A,P)+\alpha,0)

      L_1=max(mean\_neg-s(A,P)+\alpha,0)

      L_2=max(closest\_neg-s(A,P)+\alpha,0)

      L_{Full}(A,P,N)=L_1+L_2

      J=\sum_{i=1}^{m}L_{Full}(A^{(i)},P^{(i)},N^{(i)})
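
      A minimal NumPy sketch of this modified loss, computed from the two batches of output vectors v_1 and v_2. The batch size, margin alpha, and function names are illustrative assumptions, not part of the original notes.

      import numpy as np

      def similarity_matrix(v1, v2):
          # Cosine similarity between every vector in v1 and every vector in v2.
          v1 = v1 / np.linalg.norm(v1, axis=1, keepdims=True)
          v2 = v2 / np.linalg.norm(v2, axis=1, keepdims=True)
          return v1 @ v2.T

      def triplet_loss_hard_negatives(sim, alpha=0.25):
          # sim: (b, b) matrix of cosine similarities between batch 1 and batch 2.
          # Diagonal entries are s(A, P); off-diagonal entries are s(A, N).
          b = sim.shape[0]
          sim_ap = np.diag(sim)                              # s(A, P) per row
          off_diag = sim * (1.0 - np.eye(b))                 # zero out the diagonal
          mean_neg = np.sum(off_diag, axis=1) / (b - 1)      # mean of off-diagonal values per row
          # Closest negative: largest off-diagonal similarity still below s(A, P).
          mask = (1.0 - np.eye(b)).astype(bool) & (sim < sim_ap.reshape(-1, 1))
          closest_neg = np.where(mask, sim, -2.0).max(axis=1)   # -2 is below any cosine similarity
          l1 = np.maximum(mean_neg - sim_ap + alpha, 0.0)
          l2 = np.maximum(closest_neg - sim_ap + alpha, 0.0)
          return np.sum(l1 + l2)                             # cost J over the batch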

  • One Shot Learning

    In one-shot learning, you don't need to retrain the entire system when a new sample comes in. Instead, you just learn a similarity function that can be used to calculate a similarity score, which in turn can be used to decide whether two samples are the same (a short sketch follows this list).

    • Classification vs. One Shot Learning

      [Figure: Comparison]
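
    At inference time, one-shot classification reduces to thresholding the similarity score. A minimal sketch; the threshold tau is an assumed value you would tune on validation data.

    import numpy as np

    def is_duplicate(v1, v2, tau=0.7):
        # v1, v2: vectors produced by the two branches of the trained Siamese network.
        cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        return cos_sim > tau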
