2018-12-06
Let's look at the LSTM implementation code from Udacity's deep learning course.
RNN and LSTM
Suppose you have a sequence of events unfolding over time, and you want to make a prediction at some time step while also taking earlier events into account. Since it is impractical to pass the state of every previous time step to the current one, an RNN instead produces, at each time step, a summary of everything that came before it and passes that summary forward; this lets the network learn from the state of every node in the sequence.
(figure: RNN-rolled)
(figure: RNN-unrolled)
The two figures above are equivalent.
Note that the sequence is fed into the RNN one element at a time, not all at once.
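To make the recurrence concrete, here is a minimal numpy sketch of a vanilla RNN unrolled over a short sequence (the names W_xh, W_hh, b_h are illustrative, not from the course code):

```python
import numpy as np

hidden_size, input_size, seq_len = 4, 3, 5
rng = np.random.RandomState(0)

# The same weights are reused at every time step; this sharing is
# what makes RNN backpropagation unstable over long sequences.
W_xh = rng.randn(input_size, hidden_size) * 0.1
W_hh = rng.randn(hidden_size, hidden_size) * 0.1
b_h = np.zeros(hidden_size)

h = np.zeros(hidden_size)            # initial state
xs = rng.randn(seq_len, input_size)  # one input vector per time step

for x in xs:                         # inputs are read one at a time
    h = np.tanh(x @ W_xh + h @ W_hh + b_h)  # summarize the past into h
print(h)  # the final state summarizes the whole sequence
```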
Problems
RNN backpropagation: because an RNN shares the same weights across all time steps, weight updates are very unstable, and gradients can explode or vanish.
Solutions
- Gradient clipping (for exploding gradients)
  (figure: gradient clipping)
- LSTM (Long Short-Term Memory; for vanishing gradients)
  (figure: memory cell)
Code
Reading the data
Still text8.zip, the same dataset as before.
Create a small validation set:
valid_size = 1000
valid_text = text[:valid_size]
train_text = text[valid_size:]
train_size = len(train_text)
print(train_size, train_text[:64])
print(valid_size, valid_text[:64])
99999000 ons anarchists advocate social relations based upon voluntary as
1000 anarchism originated as a term of abuse first used against earl
Building the letter-to-id mapping
vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
first_letter = ord(string.ascii_lowercase[0])
def char2id(char):
  if char in string.ascii_lowercase:
    return ord(char) - first_letter + 1
  elif char == ' ':
    return 0
  else:
    print('Unexpected character: %s' % char)
    return 0

def id2char(dictid):
  if dictid > 0:
    return chr(dictid + first_letter - 1)
  else:
    return ' '
print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
print(id2char(1), id2char(26), id2char(0))
Unexpected character: ï
1 26 0 0
a z
Building training data for the model
batch_size=64
num_unrollings=10
class BatchGenerator(object):
  def __init__(self, text, batch_size, num_unrollings):
    self._text = text
    self._text_size = len(text)
    self._batch_size = batch_size
    self._num_unrollings = num_unrollings
    segment = self._text_size // batch_size
    self._cursor = [offset * segment for offset in range(batch_size)]
    self._last_batch = self._next_batch()

  def _next_batch(self):
    """Generate a single batch from the current cursor position in the data."""
    batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=np.float)
    for b in range(self._batch_size):
      batch[b, char2id(self._text[self._cursor[b]])] = 1.0
      # wrap around so data can be drawn cyclically
      self._cursor[b] = (self._cursor[b] + 1) % self._text_size
    return batch

  def next(self):
    """Generate the next array of batches from the data. The array consists of
    the last batch of the previous array, followed by num_unrollings new ones.
    """
    # "batches" is really a sequence; with the carried-over batch its
    # length is num_unrollings + 1
    batches = [self._last_batch]
    for step in range(self._num_unrollings):
      batches.append(self._next_batch())
    # each call starts from the last batch of the previous call
    self._last_batch = batches[-1]
    return batches
train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
valid_batches = BatchGenerator(valid_text, 1, 1)
batch_size is the batch size and num_unrollings is the sequence length.
To guarantee that the characters in successive batches line up, the generator keeps a cursor per batch row.
For example, take the string 'abcdefghij' (length 10), with a batch size of 2 and a sequence length of 2.
In the output below, each array is one batch, and the number of arrays is the number of steps in the sequence.
To be precise, the batch size determines how many chunks the text is cut into: with a batch size of 2, the text splits into 'abcde' and 'fghij', and the batches are then ('a','f'), ('b','g'), and so on. One way to think of it: there are as many rows per batch as there are chunks, and each row walks through its own chunk.
Because each call also returns the last batch of the previous call, every call yields num_unrollings + 1 = 3 batches.
test = BatchGenerator('abcdefghij', 2, 2)
test.next()
[array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., #a
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., #f
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]),
array([[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., #b
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., #g
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]),
array([[0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., #c
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., #h
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])]
Utility functions
- Display the most probable characters
def characters(probabilities):
  """Turn a 1-hot encoding or a probability distribution over the possible
  characters back into its (most likely) character representation."""
  return [id2char(c) for c in np.argmax(probabilities, 1)]
- Convert a sequence of batches back to strings
def batches2string(batches):
  """Convert a sequence of batches back into their (most likely) string
  representation."""
  s = [''] * batches[0].shape[0]
  for b in batches:
    s = [''.join(x) for x in zip(s, characters(b))]
  return s
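As a quick sanity check, these helpers compose with the toy generator from above (a fresh generator; the expected output follows from the BatchGenerator logic):

```python
test = BatchGenerator('abcdefghij', 2, 2)
print(batches2string(test.next()))  # ['abc', 'fgh']
print(batches2string(test.next()))  # ['cde', 'hij'] -- continues from the previous last batch
```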
A simple LSTM model
num_nodes = 64
graph = tf.Graph()
with graph.as_default():
num_nodes is the number of hidden units in the LSTM cell (the dimensionality of its state).
Defining the variables
# Parameters:
# Input gate: input, previous output, and bias.
ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
ib = tf.Variable(tf.zeros([1, num_nodes]))
# Forget gate: input, previous output, and bias.
fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
fb = tf.Variable(tf.zeros([1, num_nodes]))
# Memory cell: input, state and bias.
cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
cb = tf.Variable(tf.zeros([1, num_nodes]))
# Output gate: input, previous output, and bias.
ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
ob = tf.Variable(tf.zeros([1, num_nodes]))
# Variables saving state across unrollings.
saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
# Classifier weights and biases.
w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
b = tf.Variable(tf.zeros([vocabulary_size]))
Let's pull up the LSTM diagram again as a reminder:
(figure: lstm cell)
The code above defines the following:
- input gate: ix, im, ib
- forget gate: fx, fm, fb
- memory cell: cx, cm, cb
- output gate: ox, om, ob
- saved_output, saved_state: the initial h_t and c_t
- classifier: w, b, the weights and bias used for the final classification
Defining the LSTM cell
# Definition of the cell computation.
def lstm_cell(i, o, state):
  """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
  Note that in this formulation, we omit the various connections between the
  previous state and the gates."""
  input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
  forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
  output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
  update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
  state = forget_gate * state + input_gate * tf.tanh(update)
  return output_gate * tf.tanh(state), state
(figure: LSTM)
Matching the code to the diagram:
- input_gate: i
- forget_gate: f
- output_gate: o
- update: g
The three inputs are:
- state: c_{t-1}
- o: h_{t-1}
- i: x_t
The outputs are h_t and c_t.
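Written out, lstm_cell computes the following (a direct transcription of the code above; $\odot$ is elementwise multiplication):

$$
\begin{aligned}
i_t &= \sigma(x_t W_{ix} + h_{t-1} W_{im} + b_i) && \text{input\_gate} \\
f_t &= \sigma(x_t W_{fx} + h_{t-1} W_{fm} + b_f) && \text{forget\_gate} \\
o_t &= \sigma(x_t W_{ox} + h_{t-1} W_{om} + b_o) && \text{output\_gate} \\
g_t &= x_t W_{cx} + h_{t-1} W_{cm} + b_c && \text{update} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(g_t) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$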
Defining the input placeholders
# Input data.
train_data = list()
for _ in range(num_unrollings + 1):
  train_data.append(
    tf.placeholder(tf.float32, shape=[batch_size, vocabulary_size]))
train_inputs = train_data[:num_unrollings]
train_labels = train_data[1:]  # labels are inputs shifted by one time step.
The labels are the inputs shifted one step to the right: the label for each character is simply the next character in the text.
Unrolled LSTM training loop
# Unrolled LSTM loop.
outputs = list()
output = saved_output
state = saved_state
for i in train_inputs:
  output, state = lstm_cell(i, output, state)
  outputs.append(output)
Defining the loss
Quoted from a blog:
A TensorFlow graph is not executed sequentially like an ordinary program; operations with no data dependency between them have no fixed execution order. control_dependencies imposes an order: it guarantees the listed operations run before the body below them does.
Here, that means saved_output and saved_state are assigned first, and only then are logits and loss computed. Since the computation below does not otherwise reference saved_output and saved_state, without control_dependencies those two assignments would never be triggered by the optimization step.
tf.concat(values, 0) concatenates values along dimension 0. outputs is a list of num_unrollings tensors, each of shape [batch_size, num_nodes]; concatenating them yields a single [num_unrollings * batch_size, num_nodes] tensor, so the classifier can be applied to all time steps at once.
# State saving across unrollings.
with tf.control_dependencies([saved_output.assign(output),
                              saved_state.assign(state)]):
  # Classifier.
  logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
  loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(
      labels=tf.concat(train_labels, 0), logits=logits))
Defining the optimizer
Concretely, clip_by_global_norm first computes global_norm, the 2-norm of all the gradients taken together. If this norm exceeds the threshold (1.25 here), every gradient is scaled by 1.25 / global_norm; otherwise the gradients are left unchanged.
Finally, apply_gradients. The global_step passed in is incremented by one on each update, so the next time learning_rate is computed it uses the new global_step value.
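A minimal numpy sketch of the clipping rule (illustrative only, not TensorFlow's implementation):

```python
import numpy as np

def clip_by_global_norm_sketch(grads, clip_norm=1.25):
    # global norm = 2-norm of all gradients concatenated into one vector
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > clip_norm:
        # rescale every gradient by the same factor, preserving direction
        grads = [g * clip_norm / global_norm for g in grads]
    return grads, global_norm
```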
# Optimizer.
global_step = tf.Variable(0)
learning_rate = tf.train.exponential_decay(
  10.0, global_step, 5000, 0.1, staircase=True)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
gradients, v = zip(*optimizer.compute_gradients(loss))
# clip to prevent exploding gradients
gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
optimizer = optimizer.apply_gradients(
  zip(gradients, v), global_step=global_step)
Defining predictions
# Predictions.
train_prediction = tf.nn.softmax(logits)
Sampling and validation evaluation
# Sampling and validation eval: batch 1, no unrolling.
sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
reset_sample_state = tf.group(
  saved_sample_output.assign(tf.zeros([1, num_nodes])),
  saved_sample_state.assign(tf.zeros([1, num_nodes])))
sample_output, sample_state = lstm_cell(
  sample_input, saved_sample_output, saved_sample_state)
with tf.control_dependencies([saved_sample_output.assign(sample_output),
                              saved_sample_state.assign(sample_state)]):
  sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
Training loop
The criterion used to judge training here is perplexity, which is based on cross entropy.
From information theory (see the Wikipedia definition of perplexity), perplexity and cross entropy are related by
$$\text{perplexity} = e^{H(p, q)}, \qquad H(p, q) = -\sum_x p(x) \log q(x)$$
(using the natural logarithm, which is why the code below wraps logprob in np.exp). For example, a model that guesses all 27 characters uniformly has $H = \ln 27 \approx 3.30$ and perplexity 27.
num_steps = 7001
summary_frequency = 100

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  mean_loss = 0
  for step in range(num_steps):
    batches = train_batches.next()  # fetch the next sequence of training batches
    feed_dict = dict()
    for i in range(num_unrollings + 1):
      feed_dict[train_data[i]] = batches[i]  # one placeholder per batch in the sequence
    _, l, predictions, lr = session.run(
      [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    mean_loss += l
    if step % summary_frequency == 0:
      if step > 0:
        mean_loss = mean_loss / summary_frequency
      # The mean loss is an estimate of the loss over the last summary_frequency batches.
      print(
        'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
      mean_loss = 0
      # note the helper functions used below (introduced after this listing)
      labels = np.concatenate(list(batches)[1:])
      print('Minibatch perplexity: %.2f' % float(
        np.exp(logprob(predictions, labels))))
      if step % (summary_frequency * 10) == 0:
        # Generate some samples to eyeball the model's progress.
        print('=' * 80)
        for _ in range(5):
          feed = sample(random_distribution())
          sentence = characters(feed)[0]
          reset_sample_state.run()
          for _ in range(79):
            prediction = sample_prediction.eval({sample_input: feed})
            feed = sample(prediction)
            sentence += characters(feed)[0]
          print(sentence)
        print('=' * 80)
      # Measure validation set perplexity.
      reset_sample_state.run()
      valid_logprob = 0
      for _ in range(valid_size):
        b = valid_batches.next()
        predictions = sample_prediction.eval({sample_input: b[0]})
        valid_logprob = valid_logprob + logprob(predictions, b[1])
      print('Validation set perplexity: %.2f' % float(np.exp(
        valid_logprob / valid_size)))
A few helper functions:
logprob: computes the cross entropy between the labels and the predictions.
Recall cross entropy: $H(p, q) = -\sum_x p(x) \log q(x)$.
Since the labels are one-hot, this reduces to the negative log probability assigned to the true character, which logprob sums and averages over the rows of the batch:
def logprob(predictions, labels):
  """Log-probability of the true labels in a predicted batch."""
  predictions[predictions < 1e-10] = 1e-10  # avoid log(0)
  return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]
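A quick numeric check of the logprob/perplexity relationship (values exact up to rounding):

```python
labels = np.array([[0., 1., 0.]])
confident = np.array([[0.05, 0.90, 0.05]])
uniform = np.ones((1, 3)) / 3.0

print(np.exp(logprob(confident, labels)))  # ~1.11, close to the ideal perplexity of 1
print(np.exp(logprob(uniform, labels)))    # 3.00, the perplexity of uniform guessing over 3 symbols
```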
random_distribution(): draws uniform random values on [0, 1) and normalizes them so they sum to 1, giving a random probability distribution.
def random_distribution():
  """Generate a random column of probabilities."""
  b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
  return b / np.sum(b, 1)[:, None]
sample_distribution(distribution): samples an index in [0, len(distribution)) according to the given probabilities: draw r uniformly in [0, 1) and return the first index where the cumulative probability reaches r.
def sample_distribution(distribution):
  """Sample one element from a distribution assumed to be an array of normalized
  probabilities.
  """
  r = random.uniform(0, 1)
  s = 0
  for i in range(len(distribution)):
    s += distribution[i]
    if s >= r:
      return i
  return len(distribution) - 1
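The loop is inverse-CDF sampling: indices with higher probability cover a larger slice of [0, 1) and are therefore drawn more often, which a quick frequency count confirms:

```python
import collections

dist = [0.2, 0.5, 0.3]
counts = collections.Counter(sample_distribution(dist) for _ in range(10000))
print(counts)  # roughly Counter({1: 5000, 2: 3000, 0: 2000})
```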
sample(prediction): turns a prediction into a random one-hot sample drawn from it.
def sample(prediction):
  """Turn a (column) prediction into 1-hot encoded samples."""
  p = np.zeros(shape=[1, vocabulary_size], dtype=np.float)
  p[0, sample_distribution(prediction[0])] = 1.0
  return p