
3. NLP with Sequence Models

Author: Kevin不会创作 | Published 2020-12-06 09:02

Table of Contents

  • Neural Networks for Sentiment Analysis
    • Neural Networks in Trax
  • RNN for Language Modeling
    • Recurrent Neural Networks
    • Gated Recurrent Unit
    • Bi-directional RNNs
    • Deep RNNs
  • LSTMs and Named Entity Recognition
    • LSTMs
    • Named Entity Recognition
  • Siamese Networks

Neural Networks for Sentiment Analysis

Neural Networks in Trax

For a simple architecture like a 3-layer NN, you can use a serial model.

from trax import layers as tl

# A 3-layer feed-forward network: two hidden Dense layers with sigmoid
# activations, followed by a Dense output layer with a softmax over 3 classes.
model = tl.Serial(tl.Dense(4), tl.Sigmoid(),
                  tl.Dense(4), tl.Sigmoid(),
                  tl.Dense(3), tl.Softmax())
  • Advantages of using frameworks

    • Run fast on CPUs, GPUs and TPUs
    • Parallel computing
    • Record algebraic computations for gradient evaluation
  • Layers

    • Dense Layer
      z^{[i]}=w^{[i]}a^{[i-1]}

    • ReLU Layer
      a^{[i]}=g(z^{[i]})=\max(0, z^{[i]})

    • Embedding Layer

      An embedding layer takes the index assigned to each word in your vocabulary and maps it to a fixed-dimension representation of that word (its embedding).

      [Figure: Embedding Layer]
    • Mean Layer

      If you just use the embedding layer, you might end up with lots of parameters to train. As an alternative, you can take the mean of each feature across the embeddings, and that's exactly what the mean layer does in Trax (see the sketch after this list).

      [Figure: Mean Layer]
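
    As a minimal sketch, an embedding layer and a mean layer can be combined with a dense output layer into a simple Trax sentiment classifier. The vocabulary size, embedding dimension, and number of classes below are illustrative placeholders, not values from the original notes.

    from trax import layers as tl

    vocab_size = 9000      # assumed vocabulary size (placeholder)
    embedding_dim = 256    # assumed embedding dimension (placeholder)
    n_classes = 2          # positive / negative sentiment

    sentiment_model = tl.Serial(
        tl.Embedding(vocab_size=vocab_size, d_feature=embedding_dim),  # word index -> embedding vector
        tl.Mean(axis=1),      # average the embeddings over the sequence axis (the mean layer)
        tl.Dense(n_classes),  # one unit per class
        tl.LogSoftmax()       # log-probabilities over the classes
    )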

RNN for Language Modeling

  • Disadvantages of the traditional N-gram language model

    • Need large N-grams to capture dependencies between distant words
    • Need a lot of space and RAM

Recurrent Neural Networks

  • RNNs Basic Structure

    Compared to the traditional N-gram language model, RNNs look at every previous word.

    In RNNs, many of the computations also share parameters.

    [Figure: RNNs]
  • Applications of RNNs

    • Caption generation
    • Sentiment analysis
    • Machine translation
  • Math in RNNs

    [Figure: RNNs]
    • Cost function

      Cross Entropy Loss
      J=-\frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{K}y_j^{<t>}\log \hat{y}_j^{<t>}

      where K is the number of categories or classes and T is the total number of time steps.
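
    To make the recurrence and the cost concrete, here is a minimal NumPy sketch of one vanilla RNN step and the cross-entropy cost above. The parameterization (separate Whh and Whx matrices) and all shapes are illustrative assumptions, not taken from the original notes.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def rnn_step(x_t, h_prev, Whh, Whx, bh, Wyh, by):
        # One recurrent step; the same weights are shared across all time steps.
        h_t = np.tanh(Whh @ h_prev + Whx @ x_t + bh)   # new hidden state
        y_hat_t = softmax(Wyh @ h_t + by)              # prediction over K classes
        return h_t, y_hat_t

    def cross_entropy(y_true, y_hat):
        # y_true, y_hat: lists of length T with one-hot targets and predicted distributions.
        T = len(y_true)
        return -sum(np.sum(y * np.log(p)) for y, p in zip(y_true, y_hat)) / T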

Gated Recurrent Unit

A Gated Recurrent Unit (GRU) has additional parameters (gates) that allow you to control how much information to forget from the past and how much information to extract from the current input.

[Figure: GRU]
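
A minimal NumPy sketch of a single GRU step, using one common parameterization (gates computed from the concatenation of the previous hidden state and the current input). The weight names and shapes are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wr, Wu, Wh, br, bu, bh):
    concat = np.concatenate([h_prev, x_t])
    r = sigmoid(Wr @ concat + br)    # relevance (reset) gate: how much of the past to use
    u = sigmoid(Wu @ concat + bu)    # update gate: how much to overwrite
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)  # candidate hidden state
    h_t = (1.0 - u) * h_prev + u * h_tilde   # mix the old state and the candidate
    return h_t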

Bi-directional RNNs

In bi-directional RNNs, the outputs take information from both the past and the future.

[Figure: Bi-directional RNNs]

Deep RNNs

Deep RNNs have more than one layer, which helps in complex tasks.

[Figure: Deep RNNs]

LSTMs and Named Entity Recognition

  • Advantages of RNNs

    Capture dependencies within a short range

  • Disadvantages of RNNs

    • Struggles with longer sequences
    • Prone to vanishing or exploding gradients
  • Solving for vanishing or exploding gradients

    [Figure: Solutions]

Long short-term memory

Long short-term memory (LSTM) units are the best-known solution to the vanishing gradient problem.

  • Applications of LSTMs

    • Next-character prediction
    • Chatbots
    • Music composition
    • Image captioning
    • Speech recognition
  • LSTM Architecture

    A typical LSTM consists of a cell state and a hidden state, which holds the outputs from the cell. You can think of the cell as the memory of your network carrying all the relevant information down the sequence. As the cell travels, each gate adds or removes information from the cell state.

    [Figure: Cell and Hidden States]

    The gates make up the hidden states of your LSTM. They contain activation functions and element-wise operations. LSTMs typically have three gates: the forget gate, the input gate, and the output gate.

    • Forget Gate

      The forget gate decides which information from the previous hidden state and the current input should be kept or tossed out. It does this with a sigmoid function, which squeezes each value to between zero and one: values close to zero are forgotten, and values close to one are kept.

      [Figure: Forget Gate]
    • Input Gate

      The input gate updates the cell state. It is actually two layers, a sigmoid layer and a tanh layer. The sigmoid takes the previous hidden state and the current input and chooses which values to update by assigning each a number between zero and one; the closer to one, the higher its importance.

      The tanh layer also takes the hidden states and current inputs and squeezes the values between negative one and one. This helps to regulate the flow of information in your network.

      [Figure: Input Gate]
    • Output Gate

      The output gate will decide what your next hidden state should be.

      [Figure: Output Gate]
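
    Putting the three gates together, here is a minimal NumPy sketch of a single LSTM step under the standard formulation. The weight names and shapes are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
        concat = np.concatenate([h_prev, x_t])
        f = sigmoid(Wf @ concat + bf)        # forget gate: what to drop from the cell state
        i = sigmoid(Wi @ concat + bi)        # input gate (sigmoid part): what to update
        c_tilde = np.tanh(Wc @ concat + bc)  # input gate (tanh part): candidate values
        c_t = f * c_prev + i * c_tilde       # new cell state
        o = sigmoid(Wo @ concat + bo)        # output gate: what to expose as the hidden state
        h_t = o * np.tanh(c_t)               # new hidden state
        return h_t, c_t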

Named Entity Recognition

Named entity recognition (NER) is a fast and efficient way to scan text for certain kinds of information. NER systems locate and extract named entities from texts. Named entities can be anything from a place to an organization, to a person's name. They can even be times and dates.

  • Example of a labeled sentence

    [Figure: NER]

    An NER system would extract Sharon and classify it as a person's name, Miami as a geographical entity, and Friday as a time indicator. All the other words are labeled O, for filler word.

  • Applications of NER systems

    • Search engine efficiency
    • Recommendation engines
    • Customer service
    • Automatic trading
  • Processing data for NERs

    • Assign each class a number
    • Assign each word a number
    • Set sequence length to a certain number
    • Use the PAD token to fill empty spaces
  • Training the NER

    1. Create a tensor for each input and its corresponding number
    2. Put them in a batch
    3. Feed it into an LSTM unit
    4. Run the output through a dense layer
    5. Predict using a log softmax over K classes
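
    These steps amount to an embedding layer followed by an LSTM, a dense layer, and a log softmax. Here is a minimal Trax sketch of such a model; the vocabulary size, number of tag classes, and dimensions are illustrative placeholders, not values from the original notes.

    from trax import layers as tl

    vocab_size = 35000   # assumed size of the word vocabulary (placeholder)
    n_tags = 17          # assumed number of entity classes, including O (placeholder)
    d_model = 50         # assumed embedding / hidden dimension (placeholder)

    ner_model = tl.Serial(
        tl.Embedding(vocab_size=vocab_size, d_feature=d_model),  # word indices -> embeddings
        tl.LSTM(n_units=d_model),   # LSTM unit over the padded sequence
        tl.Dense(n_tags),           # one score per entity class
        tl.LogSoftmax()             # log softmax over the K classes
    )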

Siamese Networks

A Siamese network is a neural network made up of two identical sub-networks that are merged at the end.

  • Applications of Siamese Networks

    [Figure: Applications]
  • Architecture

    [Figure: Architecture]

    Notice that the two sub-networks in a Siamese network share the same parameters; that is, the learned parameters of each sub-network are exactly the same. So you actually only need to train one set of weights, not two.

  • Cost Function

    How old are you? \rightarrow Anchor (A)
    What is your age? \rightarrow Positive (P)
    Where are you from? \rightarrow Negative (N)

    To train your model, you'll compare the vectors output by each sub-network using a similarity score.
    s(A,P)\approx 1

    s(A,N)\approx -1

    diff=s(A,N)-s(A,P)

    • Triplets

      Having these three components, an anchor used in conjunction with a positive and a negative example, is what gives rise to the name triplet. Accordingly, triplet loss is the name for a loss function that uses these three components.

      To make sure the model doesn't update itself in a way that makes it worse, you can modify the loss so that whenever diff is less than zero, the loss is just zero. When the loss is zero, we're effectively not asking the model to update its weights, because it is performing as expected for that training example.
      L=\begin{cases} 0, & if\ diff\leq 0\\ diff, & if\ diff>0 \end{cases}

      Notice that the non-linearity happens at the origin of this line chart. But what happens when the model is correct, but only by a tiny bit? The model is still correct if the difference is a tiny negative number. What if you want the model to still learn from such an example, and push it toward a wider difference? You can shift the loss function a little to the left, by a margin that we'll refer to as alpha.
      L=\begin{cases} 0, & if\ diff+\alpha\leq 0\\ diff+\alpha, & if\ diff+\alpha>0 \end{cases}

    • Triplet Selection

      • Random: Easy to satisfy. Little to learn.
      • Hard: Harder to train. More to learn.
        s(A,N)\approx s(A,P)
    • Calculation

      First, prepare the data in batches. Notice that each row holds a pair of duplicate questions, one in each column, but within any single column no question is a duplicate of another question in that column.

      [Figure: Preparation]

      Then get vectors for these two batches. Each question in batch 1 is a duplicate of its corresponding question in batch 2, but none of the questions in batch 1 are duplicates of each other.

      [Figure: Obtain Vectors]

      The last step is to combine the two branches of the Siamese network by calculating the similarity between all vector pair combinations of v_1 with v_2.

      [Figure: Compute Similarity]

      Now you could just stop here and use these similarities with the triplet loss function you already know. The overall cost for your Siamese network would then be the sum of these individual losses over the training set, where the superscript i refers to a specific training example and there are m observations.
      diff=s(A,N)-s(A,P)

      L(A,P,N)=max(diff+\alpha,\ 0)

      J=\sum_{i=1}^{m}L(A^{(i)},P^{(i)},N^{(i)})

    • Modification

      You can use this off-diagonal information to modify the loss function and really improve your model's performance; a code sketch follows after the formulas below.

      • Mean negative: Mean of off-diagonal values in each row.
      • Closest negative: off-diagonal value closest to (but less than) the value on diagonal in each row.

      The mean negative helps the model converge faster during training by reducing noise: rather than training on each off-diagonal example individually, you train on their average. Why does averaging several observations usually reduce noise? We define noise as a small value drawn from a distribution centered around zero, so taking the average of several examples tends to cancel out the individual noise of those observations.

      The closest negative creates a slightly larger penalty by diminishing the effect of the otherwise more negative similarity s(A,N) that it replaces. You can think of the closest negative as the negative example that results in the smallest difference between the two cosine similarities. If you add that small difference to alpha, you generate the largest loss among all of the examples in that row. By focusing the training on the examples that produce higher loss values, you make the model update its weights more.

      L_{Original}=max(s(A,N)-s(A,P)+\alpha,0)

      L_1=max(mean\_neg-s(A,P)+\alpha,0)

      L_2=max(closest\_neg-s(A,P)+\alpha,0)

      L_{Full}(A,P,N)=L_1+L_2

      J=\sum_{i=1}^{m}L_{Full}(A^{(i)},P^{(i)},N^{(i)})
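
      A minimal NumPy sketch of this modified loss, computed from the two batches of output vectors v_1 and v_2. The batch size, margin alpha, and function names are illustrative assumptions, not part of the original notes.

      import numpy as np

      def similarity_matrix(v1, v2):
          # Cosine similarity between every vector in v1 and every vector in v2.
          v1 = v1 / np.linalg.norm(v1, axis=1, keepdims=True)
          v2 = v2 / np.linalg.norm(v2, axis=1, keepdims=True)
          return v1 @ v2.T

      def triplet_loss_hard_negatives(sim, alpha=0.25):
          # sim: (b, b) matrix of cosine similarities between batch 1 and batch 2.
          # Diagonal entries are s(A, P); off-diagonal entries are s(A, N).
          b = sim.shape[0]
          sim_ap = np.diag(sim)                              # s(A, P) per row
          off_diag = sim * (1.0 - np.eye(b))                 # zero out the diagonal
          mean_neg = np.sum(off_diag, axis=1) / (b - 1)      # mean of off-diagonal values per row
          # Closest negative: largest off-diagonal similarity still below s(A, P).
          mask = (1.0 - np.eye(b)).astype(bool) & (sim < sim_ap.reshape(-1, 1))
          closest_neg = np.where(mask, sim, -2.0).max(axis=1)   # -2 is below any cosine similarity
          l1 = np.maximum(mean_neg - sim_ap + alpha, 0.0)
          l2 = np.maximum(closest_neg - sim_ap + alpha, 0.0)
          return np.sum(l1 + l2)                             # cost J over the batch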

  • One Shot Learning

    In one-shot learning, you don't need to retrain the entire system when a new sample comes in. Instead, you just learn a similarity function that can be used to calculate a similarity score, which in turn can be used to decide whether two samples are the same (a short sketch follows this list).

    • Classification vs. One Shot Learning

      [Figure: Comparison]
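
    At inference time, one-shot classification reduces to thresholding the similarity score. A minimal sketch; the threshold tau is an assumed value you would tune on validation data.

    import numpy as np

    def is_duplicate(v1, v2, tau=0.7):
        # v1, v2: vectors produced by the two branches of the trained Siamese network.
        cos_sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        return cos_sim > tau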
