Table of Contents
- Neural Networks for Sentiment Analysis
- Neural Networks in Trax
- RNN for Language Modeling
- Recurrent Neural Networks
- Gated Recurrent Unit
- Bi-directional RNNs
- Deep RNNs
- LSTMs and Named Entity Recognition
- LSTMs
- Named Entity Recognition
- Siamese Networks
Neural Networks for Sentiment Analysis
Neural Networks in Trax
For simple architectures like a 3-layer NN, you will use a serial model.
from trax import layers as tl

# Three dense layers: sigmoid activations on the hidden layers, softmax on the output
model = tl.Serial(tl.Dense(4), tl.Sigmoid(),
                  tl.Dense(4), tl.Sigmoid(),
                  tl.Dense(3), tl.Softmax())
-
Advantages of using frameworks
- Run fast on CPUs, GPUs and TPUs
- Parallel computing
- Record algebraic computations for gradient evaluation
-
Layers
-
Dense Layer
-
ReLU Layer
-
Embedding Layer
An embedding layer takes an index assigned to each word from your vocabulary and maps it to a representation of that word with a determined dimension (embeddings).
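To make the lookup concrete, here is a minimal sketch in plain Python rather than Trax. The vocabulary, the embedding dimension, and the helper names `make_embedding_table` and `embed` are assumptions for illustration. In a real model the table entries are trainable weights, not fixed random numbers.

```python
import random

def make_embedding_table(vocab_size, d_model, seed=0):
    """Create one row of d_model numbers per vocabulary index (random here, learned in practice)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_model)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Map each word index to its row in the embedding table."""
    return [table[i] for i in token_ids]

# Hypothetical toy vocabulary: each word is assigned an index
vocab = {"<PAD>": 0, "i": 1, "am": 2, "happy": 3}
table = make_embedding_table(len(vocab), d_model=2)

sentence_ids = [vocab[w] for w in ["i", "am", "happy"]]
embedded = embed(sentence_ids, table)  # one 2-dimensional vector per word
```

The table holds vocabulary size times embedding dimension values, which is where the parameter count of this layer comes from.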
-
Mean Layer
If you just use the embedding layer, you might end up with lots of parameters to train. As an alternative, you could take the mean of each feature across the word embeddings, which is exactly what the mean layer does in Trax.
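As a rough illustration of the averaging, the sketch below computes the mean of each feature over the words of one sentence. It is plain Python, not the Trax layer itself, and the function name `mean_over_words` and the input numbers are made up for the example.

```python
def mean_over_words(embedded):
    """Average each embedding feature over all words in the sentence."""
    n_words = len(embedded)
    d_model = len(embedded[0])
    return [sum(vec[j] for vec in embedded) / n_words for j in range(d_model)]

# Three words, each with a 2-dimensional embedding (toy values)
embedded = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sentence_vector = mean_over_words(embedded)  # one vector for the whole sentence
```

The averaging itself has no trainable parameters, and its output has a fixed size regardless of sentence length.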
-
RNN for Language Modeling
-
Disadvantages of the traditional N-gram language model
- Need large N-grams to capture dependencies between distant words
- Need a lot of space and RAM
Recurrent Neural Networks
-
RNNs Basic Structure
Compared to the traditional N-gram language model, RNNs look at every previous word.
Also in RNNs a lot of computations share parameters.
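The sketch below shows both ideas in plain Python, assuming the common vanilla RNN update h_t = tanh(W_hh h_(t-1) + W_hx x_t + b_h). The weight names and toy values are assumptions, not taken from the original figures. The same three parameters are reused at every step, and each hidden state depends on all earlier inputs.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(h_prev, x, W_hh, W_hx, b_h):
    """One recurrent step: combine the previous hidden state with the current input."""
    from_past = matvec(W_hh, h_prev)
    from_input = matvec(W_hx, x)
    return [math.tanh(a + b + c) for a, b, c in zip(from_past, from_input, b_h)]

def rnn_forward(xs, h0, W_hh, W_hx, b_h):
    """Run the same step (same parameters) over every element of the sequence."""
    h, states = h0, []
    for x in xs:
        h = rnn_step(h, x, W_hh, W_hx, b_h)
        states.append(h)
    return states

# Toy parameters: hidden size 2, input size 1
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_hx = [[1.0], [-1.0]]
b_h = [0.0, 0.0]

xs = [[1.0], [0.0], [0.0]]  # only the first input is non-zero
states = rnn_forward(xs, [0.0, 0.0], W_hh, W_hx, b_h)
```

Even though the second and third inputs are zero, the later hidden states are non-zero because information from the first input is carried forward.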
-
Applications of RNNs
- Caption generation
- Sentiment analysis
- Machine translation
-
Math in RNNs
-
Cost function
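Cross Entropy Loss
$$J = -\frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{K} y_j^{\langle t \rangle} \log \hat{y}_j^{\langle t \rangle}$$
where $K$ is the number of categories or classes and $T$ is the total number of steps.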
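A small worked example may help. The sketch assumes the usual form of this cost: the cross entropy between one-hot targets and predicted probabilities, summed over the K classes and averaged over the T steps. The function name and the toy numbers are illustrative.

```python
import math

def cross_entropy(y_true, y_pred):
    """Average over T steps of the cross entropy summed over K classes."""
    T = len(y_true)
    total = 0.0
    for targets, probs in zip(y_true, y_pred):
        # Only classes with a non-zero target contribute to the sum
        total += sum(y * math.log(p) for y, p in zip(targets, probs) if y > 0)
    return -total / T

# T = 2 steps, K = 3 classes; targets are one-hot
y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]
loss = cross_entropy(y_true, y_pred)  # equals log(2), since each correct class gets probability 0.5
```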
-
Gated Recurrent Unit
Gated Recurrent Unit (GRU) has some parameters which allow you to control how much information to forget from the past and how much information to extract from the current input.
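The sketch below uses a scalar hidden state and a scalar input so that the gates are easy to read. It assumes the commonly used GRU formulation with a relevance gate and an update gate. The parameter names are hypothetical and the weights are hand-picked, not learned.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(h_prev, x, p):
    """One GRU step with scalar state and input."""
    # Relevance gate: how much of the past is used when proposing a new state
    r = sigmoid(p["w_rh"] * h_prev + p["w_rx"] * x + p["b_r"])
    # Update gate: how much of the old state is replaced by the candidate
    u = sigmoid(p["w_uh"] * h_prev + p["w_ux"] * x + p["b_u"])
    # Candidate state built from the (gated) past and the current input
    h_cand = math.tanh(p["w_hh"] * (r * h_prev) + p["w_hx"] * x + p["b_h"])
    return (1.0 - u) * h_prev + u * h_cand

base = dict(w_rh=0.0, w_rx=0.0, b_r=0.0,
            w_uh=0.0, w_ux=0.0, b_u=0.0,
            w_hh=0.0, w_hx=1.0, b_h=0.0)

keep_past = dict(base, b_u=-100.0)  # update gate near 0: keep the old state
use_input = dict(base, b_u=100.0)   # update gate near 1: overwrite with the candidate

h_keep = gru_step(0.8, 1.0, keep_past)
h_new = gru_step(0.8, 1.0, use_input)
```

With the update gate closed the state stays at its previous value, and with the gate open the state is replaced by the candidate computed from the current input.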
Bi-directional RNNs
In bi-directional RNNs, the outputs take information from both the past and the future.
Deep RNNs
Deep RNNs have more than one layer, which helps in complex tasks.
LSTMs and Named Entity Recognition
-
RNNs Advantages
- Captures dependencies within a short range
-
RNNs Disadvantages
- Struggles with longer sequences
- Prone to vanishing or exploding gradients
-
Solving for vanishing or exploding gradients
Long short-term memory
Long short-term memory (LSTM) networks are the best-known solution to the vanishing gradient problem.
-
Applications of LSTMs
- Next-character prediction
- Chatbots
- Music composition
- Image captioning
- Speech recognition
-
LSTM Architecture
A typical LSTM consists of a cell state and a hidden state, which holds the outputs from the cell. You can think of the cell state as the memory of your network, carrying all the relevant information down the sequence. As the cell state travels, each gate adds or removes information from it.
The gates make up the hidden states of your LSTM. They contain activation functions and element-wise operations. LSTMs typically have three gates: the forget gate, the input gate, and the output gate.
-
Forget Gate
The forget gate decides which information from the previous cell state and current input should be kept or tossed out. It does this with a sigmoid function, which squeezes each value between zero and one. Values close to zero are forgotten and values close to one are kept.
-
Input Gate
The input gate updates the cell state. It is actually two layers, a sigmoid layer and a tanh layer. The sigmoid takes the previous hidden state and current input and chooses which values to update by assigning each one a number between zero and one. The closer to one, the higher its importance.
The tanh layer also takes the hidden states and current inputs and squeezes the values between negative one and one. This helps to regulate the flow of information in your network.
-
Output Gate
The output gate will decide what your next hidden state should be.
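The three gates can be put together in a short sketch. It assumes the standard LSTM update with a scalar cell state, hidden state and input so that each gate is a single number. The parameter layout `(w_h, w_x, b)` and the hand-picked values are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pre_activation(params, h_prev, x):
    """Weighted sum of previous hidden state and current input, plus bias."""
    w_h, w_x, b = params
    return w_h * h_prev + w_x * x + b

def lstm_step(c_prev, h_prev, x, p):
    """One LSTM step with scalar states."""
    f = sigmoid(pre_activation(p["forget"], h_prev, x))       # what to keep from the old cell state
    i = sigmoid(pre_activation(p["input"], h_prev, x))        # which new values to write
    g = math.tanh(pre_activation(p["candidate"], h_prev, x))  # candidate values in (-1, 1)
    c = f * c_prev + i * g                                    # updated cell state
    o = sigmoid(pre_activation(p["output"], h_prev, x))       # what to expose as hidden state
    h = o * math.tanh(c)
    return c, h

# Forget gate open, input gate closed: the cell state is carried over unchanged
params = {
    "forget": (0.0, 0.0, 100.0),
    "input": (0.0, 0.0, -100.0),
    "candidate": (0.0, 1.0, 0.0),
    "output": (0.0, 0.0, 100.0),
}
c, h = lstm_step(c_prev=2.0, h_prev=0.0, x=1.0, p=params)
```

Because the cell state is updated by gated addition and element-wise scaling, information can pass through many steps with little change, which is the property that helps with vanishing gradients.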
-
Named Entity Recognition
Named entity recognition (NER) is a fast and efficient way to scan text for certain kinds of information. NER systems locate and extract named entities from texts. Named entities can be anything from a place to an organization, to a person's name. They can even be times and dates.
-
Example of a labeled sentence
An NER system would extract and classify Sharon as a personal name, Miami as a geographical entity, and Friday as a time indicator. All the other words are classified as O, for filler words.
-
Applications of NER systems
- Search engine efficiency
- Recommendation engines
- Customer service
- Automatic trading
-
Processing data for NERs
- Assign each class a number
- Assign each word a number
- Set sequence length to a certain number
- Use the PAD token to fill empty spaces
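The steps above can be sketched in plain Python. The example sentence, the tag names, the `<PAD>` and `<UNK>` token spellings, and the sequence length of 8 are all assumptions chosen for the illustration.

```python
def build_vocab(tokens, specials=("<PAD>", "<UNK>")):
    """Assign each distinct token a number, reserving the first indices for special tokens."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode(items, mapping, max_len, pad="<PAD>", unk="<UNK>"):
    """Convert items to numbers, truncate to max_len, and pad the remainder."""
    ids = [mapping.get(item, mapping[unk]) for item in items[:max_len]]
    return ids + [mapping[pad]] * (max_len - len(ids))

# Hypothetical labeled sentence
words = ["Sharon", "flew", "to", "Miami", "last", "Friday"]
tags = ["B-per", "O", "O", "B-geo", "O", "B-tim"]

word_vocab = build_vocab(words)
tag_map = {"<PAD>": 0, "O": 1, "B-per": 2, "B-geo": 3, "B-tim": 4}

word_ids = encode(words, word_vocab, max_len=8)
tag_ids = encode(tags, tag_map, max_len=8, unk="O")
```

Padding the labels with the same pad index as the words keeps the two sequences aligned, so padded positions can be masked out when computing the loss.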
-
Training the NER
- Create a tensor for each input and its corresponding number
- Put them in a batch
- Feed it into an LSTM unit
- Run the output through a dense layer
- Predict using a log softmax over K classes
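The last step in the list can be illustrated on its own. The sketch below applies a numerically stable log softmax to the scores that a dense layer might produce for one word, with K = 3 classes. The scores are made-up values, not outputs of a trained model.

```python
import math

def log_softmax(logits):
    """Log of the softmax probabilities, computed stably by subtracting the maximum."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

# Hypothetical dense-layer output for one word over K = 3 classes
logits = [2.0, 0.5, -1.0]
log_probs = log_softmax(logits)

# The predicted class is the one with the highest log probability
predicted_class = max(range(len(log_probs)), key=log_probs.__getitem__)
```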
Siamese Networks
A Siamese network is a neural network made up of two identical subnetworks which are merged at the end.
-
Applications of Siamese Networks
-
Architecture
Notice that the two subnetworks in a Siamese network share the same parameters. That is, the learned parameters of each subnetwork are exactly the same, so you only need to train one set of weights, not two.
-
Cost Function
- Anchor (A): How old are you?
- Positive (P): What is your age?
- Negative (N): Where are you from?
To train your model, you'll compare the vectors output by each subnetwork using a similarity measure such as cosine similarity.
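For reference, here is a minimal cosine similarity in plain Python. The three vectors stand in for subnetwork outputs for an anchor, a positive and a negative question. They are made-up values, not real model outputs.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical output vectors
anchor = [1.0, 0.0]
positive = [2.0, 0.0]   # same direction as the anchor
negative = [0.0, 3.0]   # orthogonal to the anchor

sim_ap = cosine_similarity(anchor, positive)
sim_an = cosine_similarity(anchor, negative)
```

The score ranges from -1 to 1 and depends only on the direction of the vectors, not their length.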
-
Triplets
Having the three components here is what gives rise to the name triplets, which is to say, an anchor being used in conjunction with a positive and negative pairing. Accordingly, triplet loss is the name for a loss function that uses three components.
Let diff be the similarity between the anchor and the negative minus the similarity between the anchor and the positive, diff = s(A, N) - s(A, P). A model that performs well produces a negative diff. To make sure that the model doesn't update itself to do worse, you can modify the loss so that whenever the diff is less than zero, the loss is just zero. When the loss is zero, we're effectively not asking the model to update its weights, because it is performing as expected for that training example.
Notice that the non-linearity happens at the origin of this line chart. But you might also wonder what happens when the model is correct, but only by a tiny bit. The model is still correct if the difference is a tiny number that is less than zero. What if you want the model to still learn from this example and predict a wider difference for it? You can think of shifting this loss function a little to the left by a margin that we'll refer to as alpha, so that the loss becomes max(diff + alpha, 0).
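A short sketch of this loss, taking the two similarity scores as inputs. The function name and the margin value of 0.25 are arbitrary choices for the example.

```python
def triplet_loss(sim_ap, sim_an, alpha=0.25):
    """Triplet loss with margin: zero once the positive beats the negative by at least alpha."""
    diff = sim_an - sim_ap
    return max(diff + alpha, 0.0)

# Clearly correct: the positive is far more similar than the negative
easy = triplet_loss(sim_ap=0.9, sim_an=0.1)

# Correct by a tiny amount: without a margin there is nothing to learn...
barely_no_margin = triplet_loss(sim_ap=0.5, sim_an=0.4, alpha=0.0)

# ...but with a margin the example still produces a loss
barely_with_margin = triplet_loss(sim_ap=0.5, sim_an=0.4, alpha=0.25)
```

The margin is what makes the barely-correct example contribute to training.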
-
Triplet Selection
- Random: Easy to satisfy. Little to learn.
- Hard: Harder to train. More to learn.
-
Calculation
First, prepare the data in batches. Notice that in each row, all of the sentences in the columns are duplicates. But for any column, none of the rows in that column contain a sentence that is a duplicate of another sentence in that column.
Then get vectors for these two batches. Each question in batch 1 is a duplicate of its corresponding question in batch 2, but none of the questions in batch 1 are duplicates of each other.
The last step is to combine the two branches of the Siamese network by calculating the similarity between all vector pair combinations of v_1 with v_2.
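One way to sketch this step is to normalize every vector and then take all pairwise dot products, which gives a matrix of cosine similarities. The helper names and the tiny batches are assumptions for illustration. Entry (i, j) compares question i of batch 1 with question j of batch 2, so the diagonal holds the duplicate pairs.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def similarity_matrix(batch_1, batch_2):
    """Cosine similarity between every vector in batch_1 and every vector in batch_2."""
    n1 = [normalize(v) for v in batch_1]
    n2 = [normalize(v) for v in batch_2]
    return [[sum(a * b for a, b in zip(u, v)) for v in n2] for u in n1]

# Toy batches: row i of v_1 and row i of v_2 point in the same direction
v_1 = [[1.0, 0.0], [0.0, 2.0]]
v_2 = [[3.0, 0.0], [0.0, 1.0]]
sim = similarity_matrix(v_1, v_2)
```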
Now, you could stop here and use these similarities with the triplet loss function you already know, L(A, P, N) = max(s(A, N) - s(A, P) + alpha, 0), where s is the similarity. The overall cost for your Siamese network is then the sum of these individual losses over the training set, J = sum over i = 1..m of L(A^(i), P^(i), N^(i)). Here the superscript (i) refers to a specific training example, and there are m observations.
-
Modification
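In the similarity matrix, the diagonal holds the similarities of duplicate pairs and the off-diagonal entries hold the similarities of non-duplicates. You can use this off-diagonal information to make some modifications to the loss function and improve your model's performance.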
- Mean negative: Mean of off-diagonal values in each row.
- Closest negative: off-diagonal value closest to (but less than) the value on diagonal in each row.
The mean negative helps the model converge faster during training by reducing noise. It reduces noise by training on the average of several observations, rather than on each of the off-diagonal examples individually. Why does taking the average of several observations usually reduce noise? We define noise to be a small value that comes from a distribution centered around 0, so averaging several examples tends to cancel out the individual noise from those observations.
The closest negative helps create a slightly larger penalty by diminishing the effects of the otherwise more negative similarity of A and N that it replaces. You can think of the closest negative as finding the negative example that results in the smallest difference between the two cosine similarities. If you add that small difference to alpha, you generate the largest loss among all of the other examples in that row. By focusing the training on the examples that produce higher loss values, you make the model update its weights more.
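Both quantities can be read off a similarity matrix row by row. The sketch below is plain Python on a made-up 3x3 matrix. Adding the two loss terms per row and skipping the closest-negative term when no off-diagonal value is below the diagonal are assumptions of this example, not something the text above specifies.

```python
def row_negatives(sim):
    """For each row, return the mean negative and the closest negative (or None)."""
    mean_negs, closest_negs = [], []
    for i, row in enumerate(sim):
        positive = row[i]
        negatives = [s for j, s in enumerate(row) if j != i]
        mean_negs.append(sum(negatives) / len(negatives))
        below = [s for s in negatives if s < positive]
        closest_negs.append(max(below) if below else None)
    return mean_negs, closest_negs

def row_losses(sim, alpha=0.25):
    """Per-row loss: one triplet term with the mean negative, one with the closest negative."""
    mean_negs, closest_negs = row_negatives(sim)
    losses = []
    for i, row in enumerate(sim):
        positive = row[i]
        loss_mean = max(mean_negs[i] - positive + alpha, 0.0)
        if closest_negs[i] is None:
            loss_closest = 0.0  # assumption: no qualifying negative, no penalty
        else:
            loss_closest = max(closest_negs[i] - positive + alpha, 0.0)
        losses.append(loss_mean + loss_closest)
    return losses

# Hypothetical similarity matrix: diagonal = duplicates, off-diagonal = non-duplicates
sim = [[0.9, 0.3, -0.1],
       [0.2, 0.8, 0.6],
       [0.1, -0.4, 0.7]]

mean_negs, closest_negs = row_negatives(sim)
losses = row_losses(sim)
```

In the second row the closest negative (0.6) is within the margin of the diagonal value (0.8), so it is the only entry that produces a non-zero loss, while the mean negative for the same row does not.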
-
One Shot Learning
In one-shot learning, you don't need to re-train the entire system when there's a new sample coming in. Instead, you just learn a similarity function that can be used to calculate a similarity score. That can in turn be used to identify whether two samples are the same.
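As a rough sketch of how the score is used, the function below takes similarity scores between a new sample and a few stored references and decides whether any of them is a match. The reference names, the scores, and the threshold of 0.7 are hypothetical. In practice the scores would come from the learned similarity function.

```python
def best_match(scores, threshold=0.7):
    """Return the reference with the highest similarity, or None if no score reaches the threshold."""
    name, score = max(scores.items(), key=lambda item: item[1])
    return name if score >= threshold else None

# Similarity of a new sample to each stored reference (made-up values)
known = best_match({"alice": 0.91, "bob": 0.35})
unknown = best_match({"alice": 0.40, "bob": 0.35})

# Adding a new class only means storing one more reference; nothing is re-trained
with_new_reference = best_match({"alice": 0.40, "bob": 0.35, "carol": 0.88})
```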
-
Classification vs. One Shot Learning