ADVENTURES IN MACHINE LEARNING
LEARN AND EXPLORE MACHINE LEARNING
Keras LSTM tutorial – How to easily build a powerful deep learning language model
February 3, 2018 Andy Deep learning, Keras, LSTMs 8
In previous posts, I introduced Keras for building convolutional neural networks and performing word embedding. The next natural step is to talk about implementing recurrent neural networks in Keras. In a previous tutorial of mine, I gave a very comprehensive introduction to recurrent neural networks and long short term memory (LSTM) networks, implemented in TensorFlow. In this tutorial, I’ll concentrate on creating LSTM networks in Keras, briefly giving a recap or overview of how LSTMs work. In this Keras LSTM tutorial, we’ll implement a sequence-to-sequence text prediction model by utilizing a large text data set called the PTB corpus. All the code in this tutorial can be found on this site’s Github repository.
Recommended online course: If you are more of a video course learner, I’d recommend this inexpensive Udemy course to learn more about Keras and LSTM networks: Zero to Deep Learning with Python and Keras
A brief introduction to LSTM networks
Recurrent neural networks
An LSTM network is a kind of recurrent neural network. A recurrent neural network is a neural network that attempts to model time or sequence dependent behaviour – such as language, stock prices, electricity demand and so on. This is performed by feeding back the output of a neural network layer at time t to the input of the same network layer at time t + 1. It looks like this:
Recurrent neural network diagram with nodes shown
Recurrent neural networks are “unrolled” programmatically during training and prediction, so we get something like the following:
Unrolled recurrent neural network
Here you can see that at each time step, a new word is being supplied – the output of the previous F (i.e. $h_{t-1}$) is supplied to the network at each time step also. If you’re wondering what those example words are referring to, it is an example sentence I used in my previous LSTM tutorial in TensorFlow: “A girl walked into a bar, and she said ‘Can I have a drink please?’. The bartender said ‘Certainly’”.
The problem with vanilla recurrent neural networks, constructed from regular neural network nodes, is that as we try to model dependencies between words or sequence values that are separated by a significant number of other words, we experience the vanishing gradient problem (and also sometimes the exploding gradient problem) – to learn more about the vanishing gradient problem, see my post on the topic. This is because small gradients or weights (values less than 1) are multiplied many times over through the multiple time steps, and the gradients shrink asymptotically to zero. This means the weights of those earlier layers won’t be changed significantly and therefore the network won’t learn long-term dependencies.
LSTM networks are a way of solving this problem.
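To get a feel for how quickly repeated multiplication shrinks (or blows up) a gradient signal, here is a minimal arithmetic sketch. The per-time-step factors 0.9 and 1.1 and the helper name gradient_scale are illustrative assumptions, not values taken from a real network.

```python
def gradient_scale(factor, steps):
    """Scale applied to a gradient after passing back through `steps` time steps,
    assuming (hypothetically) the same multiplicative factor at every step."""
    return factor ** steps


for steps in (1, 10, 50, 100):
    shrinking = gradient_scale(0.9, steps)   # factor below 1: vanishing
    growing = gradient_scale(1.1, steps)     # factor above 1: exploding
    print(steps, shrinking, growing)
```

Even a factor as mild as 0.9 leaves almost nothing of the signal after 100 time steps, which is why words far apart in a sequence are hard for a vanilla recurrent network to connect.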
LSTM networks
As mentioned previously, in this Keras LSTM tutorial we will be building an LSTM network for text prediction. An LSTM network is a recurrent neural network that has LSTM cell blocks in place of our standard neural network layers. These cells have various components called the input gate, the forget gate and the output gate – these will be explained more fully later. Here is a graphical representation of the LSTM cell:
LSTM cell diagram
Notice first, on the left hand side, we have our new word/sequence value $x_t$ being concatenated to the previous output from the cell $h_{t-1}$. The first step for this combined input is for it to be squashed via a tanh layer. The second step is that this input is passed through an input gate. An input gate is a layer of sigmoid activated nodes whose output is multiplied by the squashed input. These input gate sigmoids can act to “kill off” any elements of the input vector that aren’t required. A sigmoid function outputs values between 0 and 1, so the weights connecting the input to these nodes can be trained to output values close to zero to “switch off” certain input values (or, conversely, outputs close to 1 to “pass through” other values).
The next step in the flow of data through this cell is the internal state / forget gate loop. LSTM cells have an internal state variable $s_t$. This variable, lagged one time step i.e. $s_{t-1}$, is added to the input data to create an effective layer of recurrence. This addition operation, instead of a multiplication operation, helps to reduce the risk of vanishing gradients. However, this recurrence loop is controlled by a forget gate – this works the same as the input gate, but instead helps the network learn which state variables should be “remembered” or “forgotten”.
Finally, we have an output layer tanh squashing function, the output of which is controlled by an output gate. This gate determines which values are actually allowed as an output from the cell $h_t$.
The mathematics of the LSTM cell looks like this:
Input
First, the input is squashed between -1 and 1 using a tanh activation function. This can be expressed by:
$$g = \tanh(b^g + x_t U^g + h_{t-1} V^g)$$
Where $U^g$ and $V^g$ are the weights for the input and previous cell output, respectively, and $b^g$ is the input bias. Note that the superscripts g are not a raised power, but rather signify that these are the input weights and bias values (as opposed to the input gate, forget gate, output gate etc.).
This squashed input is then multiplied element-wise by the output of the input gate, which, as discussed above, is a series of sigmoid activated nodes:
$$i = \sigma(b^i + x_t U^i + h_{t-1} V^i)$$
The output of the input section of the LSTM cell is then given by:
$$g \circ i$$
Where the $\circ$ operator expresses element-wise multiplication.
Forget gate and state loop
The forget gate output is expressed as:
$$f = \sigma(b^f + x_t U^f + h_{t-1} V^f)$$
The output of the element-wise product of the previous state and the forget gate is expressed as $s_{t-1} \circ f$. The output from the forget gate / state loop stage is:
$$s_t = s_{t-1} \circ f + g \circ i$$
Output gate
The output gate is expressed as:
$$o = \sigma(b^o + x_t U^o + h_{t-1} V^o)$$
So the final output of the cell, with the tanh squashing, can be shown as:
$$h_t = \tanh(s_t) \circ o$$
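The cell equations above can be sketched as a single forward step in plain Python. This is only an illustration under stated assumptions: the helper names (affine, lstm_step, random_params), the tiny sizes and the random weights are all hypothetical, and lists are used instead of a tensor library so the sketch runs without dependencies.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(b, x, U, h_prev, V):
    """Compute b + x U + h_prev V for row vectors x and h_prev."""
    return [
        b[j]
        + sum(x[k] * U[k][j] for k in range(len(x)))
        + sum(h_prev[k] * V[k][j] for k in range(len(h_prev)))
        for j in range(len(b))
    ]


def lstm_step(x, h_prev, s_prev, params):
    """One time step of the LSTM cell; params maps 'g', 'i', 'f', 'o' to (b, U, V)."""
    pre = {name: affine(b, x, U, h_prev, V) for name, (b, U, V) in params.items()}
    g = [math.tanh(z) for z in pre["g"]]  # squashed input
    i = [sigmoid(z) for z in pre["i"]]    # input gate
    f = [sigmoid(z) for z in pre["f"]]    # forget gate
    o = [sigmoid(z) for z in pre["o"]]    # output gate
    # s_t = s_{t-1} o f + g o i (element-wise products)
    s = [sp * fj + gj * ij for sp, fj, gj, ij in zip(s_prev, f, g, i)]
    # h_t = tanh(s_t) o o
    h = [math.tanh(sj) * oj for sj, oj in zip(s, o)]
    return h, s


def random_params(n_in, n_hid, rng):
    """Hypothetical small random weights and zero biases for the four blocks."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {name: ([0.0] * n_hid, mat(n_in, n_hid), mat(n_hid, n_hid)) for name in "gifo"}


rng = random.Random(0)
n_in, n_hid = 3, 4
params = random_params(n_in, n_hid, rng)
h, s = [0.0] * n_hid, [0.0] * n_hid
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]):  # two toy input vectors
    h, s = lstm_step(x, h, s, params)
print(h)
```

Each block (g, i, f, o) has its own weights and bias, matching the superscript notation in the equations, and the state s is carried forward by addition rather than by repeated multiplication.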
LSTM word embedding and hidden layer size
It should be remembered that in all of the mathematics above we are dealing with vectors i.e. the input $x_t$ and $h_{t-1}$ are not single valued scalars, but rather vectors of a certain length. Likewise, all the weights and bias values are matrices and vectors respectively. Now, you may be wondering, how do we represent words to input them to a neural network? The answer is word embedding. I’ve written about this extensively in previous tutorials, in particular Word2Vec word embedding tutorial in Python and TensorFlow and A Word2Vec Keras tutorial. Basically it involves taking a word and finding a vector representation of that word which captures some meaning of the word. In Word2Vec, this meaning is usually quantified by context – i.e. word vectors which are close together in vector space are those words which appear in sentences close to the same words.
The word vectors can be learnt separately, as in this tutorial, or they can be learnt during the training of your Keras LSTM network. In the example to follow, we’ll be setting up what is called an embedding layer, to convert each word into a meaningful word vector. We have to specify the size of the embedding layer – this is the length of the vector each word is represented by – this is usually in the region of 100 to 500. In other words, if the embedding layer size is 250, each word will be represented by a 250-length vector i.e. [$x_1, x_2, x_3, \ldots, x_{250}$].
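Conceptually, an embedding layer is a lookup table with one row per word. The sketch below shows that idea with a toy table; the sizes, the random values and the names embedding_table and word_ids are assumptions for illustration, and a real embedding layer would learn its rows during training.

```python
import random

rng = random.Random(0)
toy_vocabulary = 10   # hypothetical tiny vocabulary (the tutorial uses 10,000)
embedding_size = 4    # hypothetical vector length (the tutorial uses 500)

# One row of `embedding_size` numbers per word id.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(embedding_size)]
    for _ in range(toy_vocabulary)
]

word_ids = [3, 7, 3]                              # a short sequence of integer word ids
vectors = [embedding_table[w] for w in word_ids]  # shape: (3 words, embedding_size)
print(vectors)
```

The same word id always maps to the same vector, so the first and third entries of vectors are identical here.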
LSTM hidden layer size
We usually match up the size of the embedding layer output with the number of nodes in the hidden layers of the LSTM cell (although the two don’t have to be the same). You might be wondering where the hidden layers in the LSTM cell come from. In my LSTM overview diagram, I simply showed “data rails” through which our input data flowed. However, each sigmoid, tanh or hidden state layer in the cell is actually a set of nodes, whose number is equal to the hidden layer size. Therefore each of the “nodes” in the LSTM cell is actually a cluster of normal neural network nodes, as in each layer of a densely connected neural network.
The Keras LSTM architecture
This section will illustrate what a full LSTM architecture looks like, and show the architecture of the network that we are building in Keras. This will further illuminate some of the ideas expressed above, including the embedding layer and the tensor sizes flowing around the network. The proposed architecture looks like the following:
Keras LSTM tutorial architecture
The text data, once it has passed through the embedding layer, has the following shape: (batch size, number of time steps, hidden size). In other words, for each batch sample and each word in the number of time steps, there is a 500 length embedding word vector to represent the input word. These embedding vectors will be learnt as part of the overall model learning. The input data is then fed into two “stacked” layers of LSTM cells (of 500 length hidden size) – in the diagram above, the LSTM network is shown as unrolled over all the time steps. The output from these unrolled cells is still (batch size, number of time steps, hidden size).
This output data is then passed to a Keras layer called TimeDistributed, which will be explained more fully below. Finally, the output layer has a softmax activation applied to it. This output is compared to the training y data for each batch, and the error and gradient back propagation is performed from there in Keras. The training y data in this case is the input x words advanced one time step – in other words, at each time step the model is trying to predict the very next word in the sequence. However, it does this at every time step – hence the output layer has the same number of time steps as the input layer. This will be made more clear later.
Building the Keras LSTM model
In this section, each line of code to create the Keras LSTM architecture shown above will be stepped through and discussed. However, I’ll only briefly discuss the text preprocessing code which mostly uses the code found on the TensorFlow site here. The complete code for this Keras LSTM tutorial can be found at this site’s Github repository and is called keras_lstm.py. Note, you first have to download the Penn Tree Bank (PTB) dataset which will be used as the training and validation corpus. You’ll need to change the data_path variable in the Github code to match the location of this downloaded data.
The text preprocessing code
In order to get the text data into the right shape for input into the Keras LSTM model, each unique word in the corpus must be assigned a unique integer index. Then the text corpus needs to be re-constituted in order, but rather than text words we have the integer identifiers in order. The three functions which do this in the code are read_words, build_vocab and file_to_word_ids. I won’t go into these functions in detail, but basically, they first split the given text file into separate words and sentence based characters (i.e. the end-of-sentence marker <eos>). Then, each unique word is identified and assigned a unique integer. Finally, the original text file is converted into a list of these unique integers, where each word is substituted with its new integer identifier. This allows the text data to be consumed in the neural network.
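The three preprocessing steps can be sketched on a small in-memory string. This is not the repository’s code: the function names split_words, build_vocab_from_words and words_to_ids are hypothetical stand-ins, the toy text is made up, and the <eos> end-of-sentence marker and the most-frequent-first ordering are assumptions.

```python
import collections


def split_words(text):
    # Replace each line break with an end-of-sentence marker, then split on whitespace.
    return text.replace("\n", " <eos> ").split()


def build_vocab_from_words(words):
    counter = collections.Counter(words)
    # Most frequent words get the smallest ids; ties are broken alphabetically.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: idx for idx, (word, _) in enumerate(ordered)}


def words_to_ids(words, word_to_id):
    # Words missing from the vocabulary are silently dropped in this sketch.
    return [word_to_id[w] for w in words if w in word_to_id]


text = "the cat sat\nthe cat ran\n"
words = split_words(text)
word_to_id = build_vocab_from_words(words)
ids = words_to_ids(words, word_to_id)
id_to_word = {idx: word for word, idx in word_to_id.items()}

print(ids)
print(" ".join(id_to_word[i] for i in ids))
```

The reverse mapping id_to_word plays the same role as a reversed dictionary: it turns a list of integers back into readable text.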
The load_data function which I created to run these functions is shown below:
def load_data():
    # get the data paths
    train_path = os.path.join(data_path, "ptb.train.txt")
    valid_path = os.path.join(data_path, "ptb.valid.txt")
    test_path = os.path.join(data_path, "ptb.test.txt")

    # build the complete vocabulary, then convert text data to list of integers
    word_to_id = build_vocab(train_path)
    train_data = file_to_word_ids(train_path, word_to_id)
    valid_data = file_to_word_ids(valid_path, word_to_id)
    test_data = file_to_word_ids(test_path, word_to_id)
    vocabulary = len(word_to_id)
    reversed_dictionary = dict(zip(word_to_id.values(), word_to_id.keys()))

    print(train_data[:5])
    print(word_to_id)
    print(vocabulary)
    print(" ".join([reversed_dictionary[x] for x in train_data[:10]]))
    return train_data, valid_data, test_data, vocabulary, reversed_dictionary
To call this function, we can run:
train_data, valid_data, test_data, vocabulary, reversed_dictionary = load_data()
The first three outputs from this function are the training data, validation data and test data from the data set, respectively, but with each word represented as an integer in a list. Some information is printed out during the running of load_data(), one item of which is print(train_data[:5]) – this produces the following output:
[9970, 9971, 9972, 9974, 9975]
As you can observe, the training data is comprised of a list of integers, as expected.
Next, the output vocabulary is simply the size of our vocabulary, i.e. the number of unique words kept from the text corpus. When words are incorporated into the training data, every single unique word is not considered – rather, in natural language processing, the text data is usually limited to a certain number N of the most common words. In this case N = vocabulary = 10,000.
Finally, reversed_dictionary is a Python dictionary where the key is the unique integer identifier of a word, and the associated value is the word in text. This allows us to work backwards from predicted integer words that our model will produce, and translate them back to real text. For instance, the following code converts the integers in train_data back to text which is then printed: print(" ".join([reversed_dictionary[x] for x in train_data[100:110]])). This code snippet produces:
workers exposed to it more than N years ago researchers
That’s about all the explanation required with regard to the text pre-processing, so let’s progress to setting up the input data generator which will feed samples into our Keras LSTM model.
Creating the Keras LSTM data generators
When training neural networks, we generally feed data into them in small batches, called mini-batches or just “batches” (for more information on mini-batch gradient descent, see my tutorial here). Keras has some handy functions which can extract training data automatically from a pre-supplied Python iterator/generator object and input it to the model. One of these Keras functions is called fit_generator. The first argument to fit_generator is the Python iterator function that we will create, and it will be used to extract batches of data during the training process. This function in Keras will handle all of the data extraction, input into the model, executing gradient steps, logging metrics such as accuracy and executing callbacks (these will be discussed later). The Python iterator function needs to have a form like:
while True:
    # do some things to create a batch of data (x, y)
    yield x, y
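As a small illustration of this while True / yield pattern, here is a toy generator that never runs out of batches. The function name toy_batches, the doubling target and the sample values are all hypothetical; the only point is that each call to next() resumes the loop and hands back one more (x, y) pair.

```python
import itertools


def toy_batches(values, batch_size):
    start = 0
    while True:
        # create one batch of data (x, y), wrapping around at the end of `values`
        x = [values[(start + k) % len(values)] for k in range(batch_size)]
        y = [v * 2 for v in x]  # toy target: double each input
        start = (start + batch_size) % len(values)
        yield x, y


gen = toy_batches([1, 2, 3, 4, 5], batch_size=2)
first_three = list(itertools.islice(gen, 3))  # pull exactly three batches
print(first_three)
```

Because the loop is infinite, the consumer decides how many batches to draw, which is why a training routine that reads from a generator also needs to be told how many steps make up one epoch.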
In this case, I have created a generator class which contains a method which implements such a structure. The initialization of this class looks like:
class KerasBatchGenerator(object):

    def __init__(self, data, num_steps, batch_size, vocabulary, skip_step=5):
        self.data = data
        self.num_steps = num_steps
        self.batch_size = batch_size
        self.vocabulary = vocabulary
        # this will track the progress of the batches sequentially through the
        # data set - once the data reaches the end of the data set it will reset
        # back to zero
        self.current_idx = 0
        # skip_step is the number of words which will be skipped before the next
        # batch is skimmed from the data set
        self.skip_step = skip_step
Here the KerasBatchGenerator object takes our data as the first argument. Note, this data can be either training, validation or test data – multiple instances of the same class can be created and used in the various stages of our machine learning development cycle – training, validation tuning, test. The next argument supplied is called num_steps – this is the number of words that we will feed into the time distributed input layer of the network. In other words (pun intended), this is the set of words that the model will learn from to predict the words coming after. The argument batch_size is pretty self-explanatory, and we’ve discussed vocabulary already (it is equal to 10,000 in this case). Finally skip_step is the number of words we want to skip over between training samples within each batch. To make this a bit clearer, consider the following sentence:
“The cat sat on the mat, and ate his hat. Then he jumped up and spat”
If num_steps is set to 5, the data consumed as the input data for a given sample would be “The cat sat on the”. In this case, because we are predicting the very next word in the sequence via our model, for each time step, the matching output y or target data would be “cat sat on the mat”. Finally, skip_step is the number of words to skip over before the next sample is taken. If, in this example, skip_step=num_steps, the next 5 input words would be “mat and ate his hat”. Hopefully that makes sense.
One final item in the initialization of the class needs to be discussed. This is the variable current_idx which is initialized at zero. This variable is required to track the extraction of data through the full data set – once the full data set has been consumed in the training, we need to reset current_idx to zero so that the data consumption starts from the beginning of the data set again. In other words it is basically a data set location pointer.
Ok, now we need to discuss the generate method that will be called during fit_generator:
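The windowing described here can be checked on the example sentence itself. The helper make_samples below is a hypothetical, simplified stand-in that works on words rather than integer ids and ignores batching; punctuation is dropped from the sentence for simplicity.

```python
def make_samples(data, num_steps, skip_step, num_samples):
    """Return (input, target) windows; the target is the input advanced by one word."""
    samples = []
    idx = 0
    for _ in range(num_samples):
        if idx + num_steps >= len(data):
            idx = 0  # not enough words left for a full window: start over
        x = data[idx:idx + num_steps]
        y = data[idx + 1:idx + num_steps + 1]
        samples.append((x, y))
        idx += skip_step
    return samples


words = "The cat sat on the mat and ate his hat Then he jumped up and spat".split()
samples = make_samples(words, num_steps=5, skip_step=5, num_samples=2)
for x, y in samples:
    print(" ".join(x), "->", " ".join(y))
```

With skip_step equal to num_steps the input windows do not overlap; a smaller skip_step would produce overlapping windows and therefore more samples from the same text.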
    def generate(self):
        x = np.zeros((self.batch_size, self.num_steps))
        y = np.zeros((self.batch_size, self.num_steps, self.vocabulary))
        while True:
            for i in range(self.batch_size):
                if self.current_idx + self.num_steps >= len(self.data):
                    # reset the index back to the start of the data set
                    self.current_idx = 0
                x[i, :] = self.data[self.current_idx:self.current_idx + self.num_steps]
                temp_y = self.data[self.current_idx + 1:self.current_idx + self.num_steps + 1]
                # convert all of temp_y into a one hot representation
                y[i, :, :] = to_categorical(temp_y, num_classes=self.vocabulary)
                self.current_idx += self.skip_step
            yield x, y
In the first couple of lines our x and y output arrays are created. The size of variable x is fairly straightforward to understand – its first dimension is the number of samples we specify in the batch. The second dimension is the number of words we are going to base our predictions on. The size of variable y is a little more complicated. First it has the batch size as the first dimension, then it has the number of time steps as the second, as discussed above. However, y has an additional third dimension, equal to the size of our vocabulary, in this case 10,000.
The reason for this is that the output layer of our Keras LSTM network will be a standard softmax layer, which will assign a probability to each of the 10,000 possible words. The one word with the highest probability will be the predicted word – in other words, the Keras LSTM network will predict one word out of 10,000 possible categories. Therefore, in order to train this network, we need to create a training sample for each word that has a 1 in the location of the true word, and zeros in all the other 9,999 locations. It will look something like this: (0, 0, 0, …, 1, 0, …, 0, 0) – this is called a one-hot representation, or alternatively, a categorical representation. Therefore, for each target word, there needs to be a 10,000 length vector with only one of the elements in this vector set to 1.
Ok, now onto the while True: yield x, y paradigm that was discussed earlier for the generator. Inside the while loop, we first enter a for loop of size batch_size, to populate all the data in the batch. Next, there is a condition to test regarding whether we need to reset the current_idx pointer. Remember that for each training sample we consume num_steps words. Therefore, if the current index point plus num_steps is greater than or equal to the length of the data set, then the current_idx pointer needs to be reset to zero to start over with the data set.
After this check is performed, the input data is consumed into the x array. The data indices consumed are pretty straightforward to understand – they run from the current index to the current-index-plus-num_steps number of words. Next, a temporary y variable is populated which works in pretty much the same way – the only difference is that the starting point and the end point of the data consumption is advanced by 1 (i.e. + 1). If this is confusing, please refer to the “cat sat on the mat etc.” example discussed above.
The final step is converting each of the target words in each sample into the one-hot or categorical representation that was discussed previously. To do this, you can use the Keras to_categorical function. This function takes a series of integers as its first argument and adds an additional dimension to the vector of integers – this dimension is the one-hot representation of each integer. Its size is specified by the second argument passed to the function. So say we have a series of integers with a shape (100, 1) and we pass it to the to_categorical function and specify the size to be equal to 10,000 – the returned shape will be (100, 10000). For instance, let’s say the series / vector of integers looked like: (0, 1, 2, 3, ….), the to_categorical output would look like:
(1, 0, 0, 0, 0, ….)
(0, 1, 0, 0, 0, ….)
(0, 0, 1, 0, 0, ….)
and so on…
Here the “…” represents a whole lot of zeroes ensuring that the total number of elements associated with each integer is 10,000. Hopefully that makes sense.
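The one-hot conversion itself is simple enough to write out by hand. The helper one_hot below is a hypothetical plain-Python stand-in for illustration only (it is not the Keras function), and it uses 5 classes instead of 10,000 so the output is readable.

```python
def one_hot(word_ids, num_classes):
    """Return one row per integer, with a 1 at that integer's index and 0 elsewhere."""
    rows = []
    for word_id in word_ids:
        row = [0] * num_classes
        row[word_id] = 1
        rows.append(row)
    return rows


encoded = one_hot([0, 1, 2, 3], num_classes=5)
for row in encoded:
    print(row)
```

Four integers go in and a 4 x 5 table comes out, which mirrors how a (num_steps,) vector of targets becomes a (num_steps, vocabulary) array in the generator.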
The final two lines of the generator function are straightforward – first, the current_idx pointer is incremented by skip_step whose role was discussed previously. The last line yields the batch of x and y data.
Now that the generator class has been created, we need to create instances of it. As mentioned previously, we can set up instances of the same class to correspond to the training and validation data. In the code, this looks like the following:
train_data_generator = KerasBatchGenerator(train_data, num_steps, batch_size, vocabulary,
                                           skip_step=num_steps)
valid_data_generator = KerasBatchGenerator(valid_data, num_steps, batch_size, vocabulary,
                                           skip_step=num_steps)
Now that the input data for our Keras LSTM code is all set up and ready to go, it is time to create the LSTM network itself.
Creating the Keras LSTM structure
In this example, the Sequential way of building deep learning networks will be used. This way of building networks was introduced in my Keras tutorial – build a convolutional neural network in 11 lines. The alternate way of building networks in Keras is the Functional API, which I used in my Word2Vec Keras tutorial. Basically, the sequential methodology allows you to easily stack layers into your network without worrying too much about all the tensors (and their shapes) flowing through the model. However, you still have to keep your wits about you for some of the more complicated layers, as will be discussed below. In this example, it looks like the following:
model = Sequential()
model.add(Embedding(vocabulary, hidden_size, input_length=num_steps))
model.add(LSTM(hidden_size, return_sequences=True))
model.add(LSTM(hidden_size, return_sequences=True))
if use_dropout:
    model.add(Dropout(0.5))
model.add(TimeDistributed(Dense(vocabulary)))
model.add(Activation('softmax'))
The first step involves creating a Keras model with the Sequential() constructor. The first layer in the network, as per the architecture diagram shown previously, is a word embedding layer. This will convert our words (referenced by integers in the data) into meaningful embedding vectors. This Embedding() layer takes the size of the vocabulary as its first argument, then the size of the resultant embedding vector that you want as the next argument. Finally, because this layer is the first layer in the network, we must specify the “length” of the input i.e. the number of steps/words in each sample.
It’s worthwhile keeping track of the Tensor shapes in the network – in this case, the input to the embedding layer is (batch_size, num_steps) and the output is (batch_size, num_steps, hidden_size). Note that Keras, in the Sequential model, always maintains the batch size as the first dimension. It receives the batch size from the Keras fitting function (i.e. fit_generator in this case), and therefore it is rarely (never?) included in the definitions of the Sequential model layers.
The next layer is the first of our two LSTM layers. To specify an LSTM layer, first you have to provide the number of nodes in the hidden layers within the LSTM cell, e.g. the number of nodes in the forget gate layer, the tanh squashing input layer and so on. The next argument that is specified in the code above is the return_sequences=True argument. What this does is ensure that the LSTM cell returns all of the outputs from the unrolled LSTM cell through time. If this argument is left out, the LSTM cell will simply provide the output of the LSTM cell from the last time step. The diagram below shows what I mean:
Keras LSTM return sequences argument comparison
As can be observed in the diagram above, there is only one output when return_sequences=False – $h_t$. However, when return_sequences=True all of the unrolled outputs from the LSTM cells are returned: $h_0 \ldots h_t$. In this case, we want the latter arrangement. Why? Well, in this example we are trying to predict the very next word in the sequence. However, if we are trying to train the model, it is best to be able to compare the LSTM cell output at each time step with the very next word in the sequence – in this way we get num_steps sources to correct errors in the model (via back-propagation) rather than just one for each sample.
Therefore, for both stacked LSTM layers, we want to return all the sequences. The output shape of each LSTM layer is (batch_size, num_steps, hidden_size).
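The difference between the two settings can be mimicked with a toy recurrence. The function run_recurrent and the simple step rule below are hypothetical and have nothing to do with the real LSTM maths; the sketch only shows which outputs are kept in each case.

```python
def run_recurrent(inputs, step, h0, return_sequences):
    """Unroll `step` over `inputs`; keep every output or only the last one."""
    h = h0
    outputs = []
    for x in inputs:
        h = step(x, h)
        outputs.append(h)
    return outputs if return_sequences else outputs[-1]


def toy_step(x, h_prev):
    # hypothetical recurrence: keep half of the previous output and add the new input
    return 0.5 * h_prev + x


inputs = [1.0, 2.0, 3.0]
all_outputs = run_recurrent(inputs, toy_step, 0.0, return_sequences=True)
last_output = run_recurrent(inputs, toy_step, 0.0, return_sequences=False)
print(all_outputs)
print(last_output)
```

With one output per time step there is one target to compare against per time step, which is what lets the model be corrected num_steps times per sample.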
The next layer in our Keras LSTM network is an (optional) dropout layer to prevent overfitting. After that, there is a special Keras layer for use in recurrent neural networks called TimeDistributed. This wrapper applies a layer to every time step in the recurrent model. So, for instance, if we have 10 time steps in a model, a TimeDistributed layer operating on a Dense layer applies that Dense layer, with the same weights, to each of the 10 time steps independently, producing one output per time step. The activation for this dense layer is set to be softmax in the final layer of our Keras LSTM model.
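To keep track of the tensor shapes and get a rough idea of the model size, the arithmetic can be written out. The helper functions below are hypothetical; the counts assume each of the four LSTM blocks (squashed input plus three gates) has its own U, V and bias as in the equations earlier, and that the Dense layer inside TimeDistributed is counted once because its weights are reused at every time step. The framework’s own model.summary() remains the authoritative count.

```python
def embedding_params(vocabulary, embedding_size):
    # one embedding vector per word in the vocabulary
    return vocabulary * embedding_size


def lstm_params(input_size, hidden_size):
    # four blocks (g, i, f, o), each with weights U (input), V (recurrent) and a bias b
    per_block = input_size * hidden_size + hidden_size * hidden_size + hidden_size
    return 4 * per_block


def dense_params(input_size, output_size):
    return input_size * output_size + output_size


vocabulary, hidden_size = 10000, 500
batch_size, num_steps = 20, 30

layers = [
    ("Embedding", (batch_size, num_steps, hidden_size),
     embedding_params(vocabulary, hidden_size)),
    ("LSTM 1", (batch_size, num_steps, hidden_size),
     lstm_params(hidden_size, hidden_size)),
    ("LSTM 2", (batch_size, num_steps, hidden_size),
     lstm_params(hidden_size, hidden_size)),
    ("TimeDistributed(Dense)", (batch_size, num_steps, vocabulary),
     dense_params(hidden_size, vocabulary)),
]
total_params = sum(count for _, _, count in layers)

for name, shape, count in layers:
    print(name, shape, count)
print("total", total_params)
```

Under these assumptions the embedding table and the output layer each hold about 5 million weights, more than the two LSTM layers combined, because both scale with the 10,000 word vocabulary.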
Compiling and running the Keras LSTM model
The next step in Keras, once you’ve completed your model, is to run the compile command on the model. It looks like this:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
In this command, the type of loss that Keras should use to train the model needs to be specified. In this case, we are using ‘categorical_crossentropy’ which is cross entropy applied in cases where there are many classes or categories, of which only one is true. Next, in this example, the optimizer that will be used is the Adam optimizer – an effective “all round” optimizer with adaptive stepping. Finally, a metric is specified – ‘categorical_accuracy’, which can let us see how the accuracy is improving during training.
The next line of code involves creating a Keras callback – callbacks are certain functions which Keras can optionally call, usually after the end of a training epoch. For more on callbacks, see my Keras tutorial. The callback that is used in this example is a model checkpoint callback – this callback saves the model after each epoch, which can be handy for when you are running long-term training.
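For a single time step, categorical cross entropy reduces to the negative log of the probability the model assigned to the true word. The sketch below works this through with a hypothetical three-word vocabulary and made-up scores; the function names are illustrative and this is not how Keras implements the loss internally.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probs):
    """-sum(t * log(p)); only the true class (t = 1) contributes."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t > 0)


probs = softmax([2.0, 1.0, 0.1])   # hypothetical scores for a 3-word vocabulary
target = [1, 0, 0]                 # the true word is word 0
loss = categorical_crossentropy(target, probs)
print(probs, loss)
```

The loss is small when the true word gets a probability close to 1 and grows without bound as that probability approaches zero.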
checkpointer = ModelCheckpoint(filepath=data_path + '/model-{epoch:02d}.hdf5', verbose=1)
Note that the model checkpoint function can include the epoch in its naming of the model, which is good for keeping track of things.
The final step in training the Keras LSTM model is to call the aforementioned fit_generator function. The line below shows you how to do this:
model.fit_generator(train_data_generator.generate(), len(train_data)//(batch_size*num_steps), num_epochs,
                    validation_data=valid_data_generator.generate(),
                    validation_steps=len(valid_data)//(batch_size*num_steps), callbacks=[checkpointer])
The first argument to fit_generator is our generator function that was explained earlier. The next argument is the number of iterations to run for each training epoch. The value given, len(train_data)//(batch_size*num_steps), ensures that the whole data set is run through the model in each epoch. Likewise, a generator for the smaller validation data set is called, with the same argument for the number of iterations to run. At the end of each epoch, the validation data will be run through the model and the accuracy will be returned. Finally, the model checkpoint callback explained above is supplied via the callbacks argument in fit_generator. Now the model is good to go!
Before some results are presented – some caveats are required. First the PTB data set is a serious text data set – not a toy problem to demonstrate how good LSTM models are. Therefore, in order to get good results, you’ll likely have to run the model over many epochs, and the model will need to have a significant level of complexity. Therefore, it is likely to take a long time on a CPU machine, and I’d suggest running it on a machine with a good GPU if you want to try and replicate things. If you don’t have a GPU machine yourself, you can create an Amazon EC2 instance as shown in my Amazon AWS tutorial. I’m in the latter camp, and wasn’t looking to give too many dollars to Amazon to train, optimize learning parameters and so on. However, I’ve run the model up to 40 epochs and gotten some reasonable initial results. My model parameters for the results presented below are as follows:
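The steps-per-epoch expression is just integer division of the corpus length by the number of words consumed per batch. In the sketch below the corpus length of 1,000,000 word ids is a hypothetical round number, not the real PTB size, and the function name steps_per_epoch is illustrative.

```python
def steps_per_epoch(num_words, batch_size, num_steps):
    """Number of batches needed to pass once over the data when
    each batch consumes batch_size * num_steps words (skip_step == num_steps)."""
    return num_words // (batch_size * num_steps)


batch_size, num_steps = 20, 30
hypothetical_num_words = 1_000_000  # assumption: not the actual length of train_data

words_per_batch = batch_size * num_steps
steps = steps_per_epoch(hypothetical_num_words, batch_size, num_steps)
print(words_per_batch, steps)
```

The floor division drops any final partial batch, so a few hundred words at the end of the data may go unused in a given epoch.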
num_steps=30
batch_size=20
hidden_size=500
After 40 epochs, training data set accuracy was around 40%, while validation set accuracy reached approximately 20-25%. This is the sort of output you’ll see while running the training session:
Keras LSTM tutorial – example training output
The Keras LSTM results
In order to test the trained Keras LSTM model, one can compare the predicted word outputs against what the actual word sequences are in the training and test data set. The code below is a snippet of how to do this, where the comparison is against the predicted model output and the training data set (the same can be done with the test_data data).
# use a forward slash here, matching the checkpoint filepath used during training
model = load_model(data_path + "/model-40.hdf5")
dummy_iters = 40
example_training_generator = KerasBatchGenerator(train_data, num_steps, 1, vocabulary,
                                                 skip_step=1)
print("Training data:")
for i in range(dummy_iters):
    dummy = next(example_training_generator.generate())
num_predict = 10
true_print_out = "Actual words: "
pred_print_out = "Predicted words: "
for i in range(num_predict):
    data = next(example_training_generator.generate())
    prediction = model.predict(data[0])
    predict_word = np.argmax(prediction[:, num_steps-1, :])
    true_print_out += reversed_dictionary[train_data[num_steps + dummy_iters + i]] + " "
    pred_print_out += reversed_dictionary[predict_word] + " "
print(true_print_out)
print(pred_print_out)
In the code above, first the model is reloaded from the trained data (in the example above, it is the checkpoint from the 40th epoch of training). Then another KerasBatchGenerator class is created, as was discussed previously – in this case, a batch of length 1 is used, as we only want one num_steps worth of text data to compare. Then a loop of dummy data extractions from the generator is created – this is to control where in the data-set the comparison sentences are drawn from. The second loop, from 0 to num_predict is where the interesting stuff is happening.
First, a batch of data is extracted from the generator and this is passed to the model.predict() method. This returns num_steps worth of predicted words – however, each word is represented by a categorical output from the softmax layer. In other words, each word is represented by a vector of 10,000 probabilities, one for each word in the vocabulary, and the index of the largest probability is the integer representation of the predicted English word. To extract this index, we can use the np.argmax() function, which identifies the index where the maximum value occurs in a vector. Note that the code above only uses the prediction at the final time step (num_steps-1) of each sample.
Once the index has been identified, it can be translated into an actual English word by using the reversed_dictionary that was constructed during the data pre-processing. This English word is then added to the predicted words string, and finally the actual and predicted words are printed.
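The argmax-then-lookup step can be shown on a tiny made-up example. The four-word dictionary and the probabilities below are hypothetical, and the plain-Python argmax helper stands in for the array version used in the snippet above.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda idx: values[idx])


# hypothetical reverse dictionary and softmax output for the last time step
id_to_word = {0: "the", 1: "cat", 2: "sat", 3: "mat"}
last_step_probs = [0.1, 0.2, 0.6, 0.1]

predicted_id = argmax(last_step_probs)
predicted_word = id_to_word[predicted_id]
print(predicted_id, predicted_word)
```

Picking the single most probable word at each step is the simplest decoding choice; the probabilities themselves carry more information than the one index that survives.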
The output below is the comparison between the actual and predicted words after 10 epochs of training on the training data set:
Comparison on the training data set after 10 epochs of training
As can be observed, while some words match, after 10 epochs of training the match is pretty poor. By the way “<unk>” refers to words not included in the 10,000 length vocabulary of the data set. Alternatively, if we look at the comparison after 40 epochs of training (again, just on the training data set):
Comparison on the training data set after 40 epochs of training
It can be observed that the match is quite good between the actual and predicted words in the training set.
However, when we look at the test data set, the match after 40 epochs of training isn’t quite as good:
Comparison on the test data set after 40 epochs of training
Despite there not being a perfect correspondence between the predicted and actual words, you can see that there is a rough correspondence and the predicted sub-sentence at least makes some grammatical sense. So not so bad after all. However, in order to train a Keras LSTM network which can perform well on this realistic, large text corpus, more training and optimization is required. I will leave it up to you, the reader, to experiment further if you desire. However, the current code is sufficient for you to gain an understanding of how to build a Keras LSTM network, along with an understanding of the theory behind LSTM networks.
I hope this (large) tutorial is a help to you in understanding Keras LSTM networks, and LSTM networks in general.
8 COMMENTS
shoujun
Your blogs are very helpful!
Li
Thanks a lot for your awesome explanation, very helpful
Sally
Thank you very much for your detailed explanation!
neha
The Best Description found on lstms.
Could you also explain, encoder-decoder architectures and attention mechanisms?
Andy
Thanks neha. I hope to do a tutorial on these networks soon
Kien tran
thanks you for this useful tutorial. Do we really need the size of hidden layer equal to the dimension of the input (in this case, hidden layer size is 500)?
Andy
Thanks for the comment. No, they don’t have to be the same
This is really nice blog..They way you explained concepts with the help of elaborative diagrams and simple mathematical formulas is highly commendable.