When processing very long sequences, RNNs can face the problems of vanishing and exploding gradients.
There are methods to deal with this. First, it helps to understand why these methods are needed: backpropagation through time becomes very hard precisely because of the problems mentioned above.
Yes, the introduction of the LSTM has reduced this by a very large margin, but when the sequence is long enough you can still face such problems.
So one way is clipping the gradients. That means you set an upper bound on the gradient norm. Refer to this Stack Overflow question.
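As a minimal sketch of clipping by global norm (the function name and threshold here are illustrative, not a specific library API): compute the combined L2 norm of all gradients and rescale them if it exceeds the bound.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Combined L2 norm over all gradient arrays.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # If the norm exceeds the bound, scale every gradient down uniformly
    # so the direction is preserved but the magnitude is capped.
    if global_norm > max_norm:
        grads = [g * (max_norm / global_norm) for g in grads]
    return grads

grads = [np.array([3.0, 4.0])]  # global norm = 5.0
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(np.linalg.norm(clipped[0]))  # 1.0: capped at the bound
```

TensorFlow exposes the same idea as `tf.clip_by_global_norm`.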
Another way is truncated backpropagation through time (truncated BPTT). There are a number of ways to implement it. The simple ideas are:
1. Calculate the gradients only for a given number of time steps. That means if your sequence is 200 time steps and you only give 10, gradients are calculated only over those 10 time steps, and then the stored memory value from that window is passed to the next chunk (as the initial cell state). This is the method TensorFlow uses to compute truncated BPTT.
2. Take the full sequence, but backpropagate gradients only for a given number of time steps from a selected block of time. This is a continuous (sliding) variant.
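Method 1 above can be sketched with a tiny vanilla RNN in NumPy. All names and sizes here (`W`, `U`, the learning rate, the chunk length) are illustrative assumptions. The key point is that the final hidden state of one chunk is carried forward as a plain number, so no gradient ever flows across the chunk boundary.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, steps, chunk = 4, 200, 10
W = rng.normal(scale=0.1, size=(hidden, hidden))  # recurrent weights
U = rng.normal(scale=0.1, size=(hidden, 1))       # input weights
lr = 0.01

x = rng.normal(size=(steps, 1))       # input sequence
y = rng.normal(size=(steps, hidden))  # dummy per-step targets
h = np.zeros(hidden)                  # hidden state carried across chunks

for start in range(0, steps, chunk):
    xs, hs = x[start:start + chunk], [h]
    # Forward pass through one chunk only.
    for t in range(len(xs)):
        hs.append(np.tanh(W @ hs[-1] + U @ xs[t]))
    # Backward pass truncated at the chunk boundary: hs[0] is treated
    # as a constant, so gradients never reach earlier chunks.
    dW, dU, dh = np.zeros_like(W), np.zeros_like(U), np.zeros(hidden)
    for t in reversed(range(len(xs))):
        dh = dh + (hs[t + 1] - y[start + t])  # loss = 0.5 * ||h - y||^2
        dz = dh * (1 - hs[t + 1] ** 2)        # tanh derivative
        dW += np.outer(dz, hs[t])
        dU += np.outer(dz, xs[t])
        dh = W.T @ dz                         # flows back within the chunk
    W -= lr * dW
    U -= lr * dU
    h = hs[-1]  # carry the memory forward, with no gradient attached
```

In frameworks like PyTorch, the equivalent of "carry the state with no gradient attached" is calling `.detach()` on the hidden state between chunks.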
How does LSTM help prevent the vanishing (and exploding) gradient problem in a recurrent neural network?
There are two factors that affect the magnitude of gradients - the weights and the activation functions (or more precisely, their derivatives) that the gradient passes through.
If either of these factors is smaller than 1, then the gradients may vanish in time; if larger than 1, then exploding might happen. For example, the tanh derivative is < 1 for all inputs except 0; sigmoid is even worse and is always ≤ 0.25.
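These bounds are easy to confirm numerically: tanh'(x) = 1 - tanh(x)² peaks at 1 (only at x = 0), and sigmoid'(x) = σ(x)(1 - σ(x)) peaks at 0.25.

```python
import numpy as np

# Evaluate both derivatives on a symmetric grid that includes x = 0.
x = np.linspace(-10, 10, 10001)
tanh_grad = 1 - np.tanh(x) ** 2
sig = 1 / (1 + np.exp(-x))
sig_grad = sig * (1 - sig)

print(tanh_grad.max())  # 1.0, reached at x = 0
print(sig_grad.max())   # 0.25, reached at x = 0
```

Multiplying many such factors ≤ 1 together is exactly why the gradient shrinks exponentially over long time spans.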
In the recurrency of the LSTM, the activation function is the identity function, with a derivative of 1.0. So the backpropagated gradient neither vanishes nor explodes when passing through; it remains constant.
The effective weight of the recurrency is equal to the forget gate activation. So, if the forget gate is on (activation close to 1.0), then the gradient does not vanish. Since the forget gate activation is never > 1.0, the gradient can't explode either.
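A quick numerical sketch of this point, using the cell-state recurrence c_t = f_t · c_{t-1} + i_t · g_t: because the update is additive, the gradient of c_t with respect to c_{t-1} is just f_t, so over many steps the gradient factor is the product of the forget activations (the specific values below are illustrative).

```python
import numpy as np

# Forget gate held close to "on" (1.0) for 100 steps: the product of
# per-step gradient factors stays usefully large and can never exceed 1.
f_open = np.full(100, 0.99)
grad_open = np.prod(f_open)
print(grad_open)  # ~0.366 after 100 steps

# A half-closed gate, by contrast, forgets - and its gradient vanishes.
f_closed = np.full(100, 0.5)
print(np.prod(f_closed))  # ~7.9e-31
```

This is the sense in which the LSTM "chooses" whether to keep gradient flowing: an open forget gate preserves both the memory and its gradient.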
So that's why the LSTM is so good at learning long-range dependencies.