LSTM

Author: 雪茸川 | Published: 2018-12-06 16:12

    2018-12-06
    Let's walk through the LSTM implementation from Udacity's deep learning course.

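    RNN and LSTM

    Suppose you have a sequence of events that unfolds over time, and you want to make a prediction at some time step while also taking earlier events into account. Passing the full state of every earlier time step to the current one is impractical, so an RNN instead carries a summary of the previous time steps forward at every step. In this way the current state can reflect information from the whole sequence seen so far.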


    RNN-rolled
    RNN-unrolled

    The two figures (rolled and unrolled) are equivalent.
    The sequence is fed into the RNN one element at a time, not all at once.

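    Problems

    Backpropagation in an RNN:
    Because an RNN shares the same weights across all time steps, updates are very unstable: gradients can explode or vanish.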
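
    A minimal sketch of why shared weights make backpropagation through time unstable, assuming a scalar linear recurrence h_t = w * h_{t-1}. This is a simplification, not the model built later in this post, and `gradient_through_time` is a hypothetical helper. The gradient of h_T with respect to h_0 is w multiplied by itself T times, so it grows or shrinks exponentially:

```python
def gradient_through_time(w, steps):
    """d h_T / d h_0 for the scalar recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # the same weight is reused at every time step
    return grad

exploding = gradient_through_time(1.5, 50)  # |w| > 1: grows exponentially
vanishing = gradient_through_time(0.5, 50)  # |w| < 1: shrinks towards zero
print(exploding, vanishing)
```

    Gradient clipping addresses the first case, while the LSTM's gated memory cell is aimed at the second.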

    Solutions
    • gradient clipping


      Gradient clipping
    • LSTM (long short-term memory)



      Memory cell

    Code

    Loading the data

    The data is again text8.zip.

    Create a small validation set

    valid_size = 1000
    valid_text = text[:valid_size]
    train_text = text[valid_size:]
    train_size = len(train_text)
    print(train_size, train_text[:64])
    print(valid_size, valid_text[:64])
    
    99999000 ons anarchists advocate social relations based upon voluntary as
    1000  anarchism originated as a term of abuse first used against earl
    

    Map characters to integer ids

    import string

    vocabulary_size = len(string.ascii_lowercase) + 1 # [a-z] + ' '
    first_letter = ord(string.ascii_lowercase[0])
    
    def char2id(char):
      if char in string.ascii_lowercase:
        return ord(char) - first_letter + 1
      elif char == ' ':
        return 0
      else:
        print('Unexpected character: %s' % char)
        return 0
      
    def id2char(dictid):
      if dictid > 0:
        return chr(dictid + first_letter - 1)
      else:
        return ' '
    
    print(char2id('a'), char2id('z'), char2id(' '), char2id('ï'))
    print(id2char(1), id2char(26), id2char(0))
    
    1 26 0 Unexpected character: ï
    0
    a z  
    

    Build the training batches for the model

    import numpy as np

    batch_size = 64
    num_unrollings = 10
    
    class BatchGenerator(object):
      def __init__(self, text, batch_size, num_unrollings):
        self._text = text
        self._text_size = len(text)
        self._batch_size = batch_size
        self._num_unrollings = num_unrollings
        segment = self._text_size // batch_size
        self._cursor = [ offset * segment for offset in range(batch_size)]
        self._last_batch = self._next_batch()
      
      def _next_batch(self):
        """Generate a single batch from the current cursor position in the data."""
        batch = np.zeros(shape=(self._batch_size, vocabulary_size), dtype=float)
        for b in range(self._batch_size):
          batch[b, char2id(self._text[self._cursor[b]])] = 1.0
          # Advance the cursor, wrapping around so the text is read cyclically.
          self._cursor[b] = (self._cursor[b] + 1) % self._text_size
        return batch
      
      def next(self):
        """Generate the next array of batches from the data. The array consists of
        the last batch of the previous array, followed by num_unrollings new ones.
        """
        # `batches` is really a sequence of time steps: the last step of the
        # previous call followed by num_unrollings new ones.
        batches = [self._last_batch]
        for step in range(self._num_unrollings):
          batches.append(self._next_batch())
        # Keep the last step so that the next call starts with it.
        self._last_batch = batches[-1]
        return batches
    
    train_batches = BatchGenerator(train_text, batch_size, num_unrollings)
    valid_batches = BatchGenerator(valid_text, 1, 1)
    

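    batch_size is the batch size and num_unrollings is the sequence length, i.e. the number of unrolled time steps.
    A cursor is kept for each row of the batch so that every row keeps reading consecutive characters from its own segment of the text across calls.

    For example, take 'abcdefghij', a string of length 10, with batch size 2 and sequence length 2.
    In the output below, each array is one batch (one time step), and the number of arrays is the number of time steps.

    To be precise: the batch size determines how many segments the text is split into. With batch size 2 the text is treated as two segments, 'abcde' and 'fghij', so the successive batches are 'a,f', 'b,g', and so on. Put differently, the batch size is the number of starting letters, and therefore the number of segments.

    Because each call also returns the last step of the previous call, each call returns three steps here.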

    test = BatchGenerator('abcdefghij',2, 2 )
    test.next()
    
    [array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,   # a
             0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
            [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,   # f
             0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]),
     array([[0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,   # b
             0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
            [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,   # g
             0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]),
     array([[0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,   # c
             0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
            [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,   # h
             0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])]
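
    The cursor logic can be traced without the one-hot encoding. This is a simplified sketch, not the BatchGenerator class itself: `trace_batches` is a hypothetical helper that returns characters directly. It mirrors how each batch row starts at its own segment and how each call repeats the last step of the previous call.

```python
def trace_batches(text, batch_size, num_unrollings, calls):
    """Return, for each call to next(), the string read by each batch row."""
    segment = len(text) // batch_size
    cursor = [offset * segment for offset in range(batch_size)]

    def next_step():
        step = [text[c] for c in cursor]
        for b in range(batch_size):
            cursor[b] = (cursor[b] + 1) % len(text)  # wrap around at the end
        return step

    last = next_step()  # the generator reads one step when it is created
    results = []
    for _ in range(calls):
        steps = [last] + [next_step() for _ in range(num_unrollings)]
        last = steps[-1]  # carried over to the next call
        # Join column-wise: row b of every step forms one string.
        results.append([''.join(col) for col in zip(*steps)])
    return results

rows = trace_batches('abcdefghij', batch_size=2, num_unrollings=2, calls=2)
print(rows)
```

    The first call reads 'abc' and 'fgh'; the second starts again from 'c' and 'h', which is the overlap that lets the labels be the inputs shifted by one step.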
    

    Utility functions

    • Show the most likely character
    def characters(probabilities):
      """Turn a 1-hot encoding or a probability distribution over the possible
      characters back into its (most likely) character representation."""
      return [id2char(c) for c in np.argmax(probabilities, 1)]
    
    • Convert a sequence of batches into strings
    def batches2string(batches):
      """Convert a sequence of batches back into their (most likely) string
      representation."""
      s = [''] * batches[0].shape[0]
      for b in batches:
        s = [''.join(x) for x in zip(s, characters(b))]
      return s
    

    A simple LSTM model

    import tensorflow as tf  # TensorFlow 1.x API (tf.placeholder, tf.Session)

    num_nodes = 64
    
    graph = tf.Graph()
    with graph.as_default():
    

    num_nodes is the number of hidden units in the LSTM cell, i.e. the size of the output and state vectors.

    Define the variables
      # Parameters:
      # Input gate: input, previous output, and bias.
      ix = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
      im = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
      ib = tf.Variable(tf.zeros([1, num_nodes]))
      # Forget gate: input, previous output, and bias.
      fx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
      fm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
      fb = tf.Variable(tf.zeros([1, num_nodes]))
      # Memory cell: input, state and bias.                             
      cx = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
      cm = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
      cb = tf.Variable(tf.zeros([1, num_nodes]))
      # Output gate: input, previous output, and bias.
      ox = tf.Variable(tf.truncated_normal([vocabulary_size, num_nodes], -0.1, 0.1))
      om = tf.Variable(tf.truncated_normal([num_nodes, num_nodes], -0.1, 0.1))
      ob = tf.Variable(tf.zeros([1, num_nodes]))
      # Variables saving state across unrollings.
      saved_output = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
      saved_state = tf.Variable(tf.zeros([batch_size, num_nodes]), trainable=False)
      # Classifier weights and biases.
      w = tf.Variable(tf.truncated_normal([num_nodes, vocabulary_size], -0.1, 0.1))
      b = tf.Variable(tf.zeros([vocabulary_size]))
    

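    As a reminder, here is the LSTM diagram again:


    LSTM cell

    The code above defines the following:

    • input gate: ix, im, ib
    • forget gate: fx, fm, fb
    • memory cell: cx, cm, cb
    • output gate: ox, om, ob
    • saved_output, saved_state: the h_t and c_t carried across unrollings, initialized to zero
    • classifier: w, b, the weights and bias of the final classifier
    Define the LSTM cell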
      # Definition of the cell computation.
      def lstm_cell(i, o, state):
        """Create a LSTM cell. See e.g.: http://arxiv.org/pdf/1402.1128v1.pdf
        Note that in this formulation, we omit the various connections between the
        previous state and the gates."""
        input_gate = tf.sigmoid(tf.matmul(i, ix) + tf.matmul(o, im) + ib)
        forget_gate = tf.sigmoid(tf.matmul(i, fx) + tf.matmul(o, fm) + fb)
        output_gate = tf.sigmoid(tf.matmul(i, ox) + tf.matmul(o, om) + ob)
        update = tf.matmul(i, cx) + tf.matmul(o, cm) + cb
        state = forget_gate * state + input_gate * tf.tanh(update)
        return output_gate * tf.tanh(state), state
    
    LSTM

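    Matching the code against the diagram:
    input_gate: i
    forget_gate: f
    output_gate: o
    update: g (pre-activation; tanh is applied when the state is updated)
    The three inputs:
    state: c_{t-1}
    o: h_{t-1}
    i: x_t
    The outputs are h_t and c_t.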
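
    A minimal pure-Python sketch of one cell step with a single hidden unit and a scalar input, following the same gate equations as lstm_cell. The weight values in `params` are made up for illustration and `lstm_step` is a hypothetical name; the actual model uses matrices and learned weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step for scalar input/state; params maps gate -> (wx, wh, bias)."""
    def pre(gate):
        wx, wh, bias = params[gate]
        return wx * x + wh * h_prev + bias

    i = sigmoid(pre('input'))     # how much of the candidate to write
    f = sigmoid(pre('forget'))    # how much of the old state to keep
    o = sigmoid(pre('output'))    # how much of the state to expose
    g = math.tanh(pre('update'))  # candidate value
    c = f * c_prev + i * g        # new cell state c_t
    h = o * math.tanh(c)          # new output h_t
    return h, c

# Illustrative weights only (not learned values).
params = {gate: (0.5, 0.1, 0.0) for gate in ('input', 'forget', 'output', 'update')}
h, c = 0.0, 0.0
for x in [1.0, 0.0, 1.0]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

    Because the output is a sigmoid times a tanh, h always stays strictly between -1 and 1, while the cell state c is not bounded in the same way.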

    Define the input placeholders
      # Input data.
      train_data = list()
      for _ in range(num_unrollings + 1):
        train_data.append(
          tf.placeholder(tf.float32, shape=[batch_size,vocabulary_size]))
      train_inputs = train_data[:num_unrollings]
      train_labels = train_data[1:]  # labels are inputs shifted by one time step.
    

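    The training labels are the input sequence shifted by one time step, so each label is the character that follows its input.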
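
    As a small illustration of that shift on plain characters rather than placeholders (`shift_by_one` is a hypothetical helper, not part of the course code):

```python
def shift_by_one(sequence, num_unrollings):
    """Split num_unrollings + 1 items into (inputs, labels), offset by one step."""
    assert len(sequence) == num_unrollings + 1
    inputs = sequence[:num_unrollings]
    labels = sequence[1:]
    return inputs, labels

inputs, labels = shift_by_one(list('hello'), num_unrollings=4)
for x, y in zip(inputs, labels):
    print(x, '->', y)  # each input character is paired with the next character
```

    This is why the generator returns num_unrollings + 1 steps per call: the extra step supplies the label for the last input.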

    Unrolled LSTM loop
      # Unrolled LSTM loop.
      outputs = list()
      output = saved_output
      state = saved_state
      for i in train_inputs:
        output, state = lstm_cell(i, output, state)
        outputs.append(output)
    
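    Define the loss

    (The explanation below is taken from a blog post.)
    A TensorFlow graph is not executed statement by statement: operations that do not depend on each other have no defined execution order. control_dependencies establishes an order, guaranteeing that the two assign operations run before the operations inside the block.

    Here that means saved_output and saved_state are saved first, and only then are logits and loss computed. Since the computation below does not otherwise depend on the two assignments, without control_dependencies they would never be triggered when the optimizer op runs.

    tf.concat(values, 0) concatenates values along dimension 0. outputs is a list with one [batch_size, num_nodes] tensor per time step; after concatenation and the classifier, each row of logits is a 27-dimensional vector of scores over the characters.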

      # State saving across unrollings.
      with tf.control_dependencies([saved_output.assign(output),
                                    saved_state.assign(state)]):
        # Classifier.
        logits = tf.nn.xw_plus_b(tf.concat(outputs, 0), w, b)
        loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits(
            labels=tf.concat(train_labels, 0), logits=logits))
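
    A sketch of what concatenation along dimension 0 does to a list of per-step batches, using nested Python lists in place of tensors. `concat_axis0` is a hypothetical stand-in, and the toy entries are tagged (step, row) so the resulting order is visible:

```python
def concat_axis0(matrices):
    """Stack a list of row-lists vertically, like concatenation along axis 0."""
    return [row for matrix in matrices for row in matrix]

# Three time steps, a batch of two rows, toy 1-element "vectors" tagged (step, row).
outputs = [[[(t, b)] for b in range(2)] for t in range(3)]
stacked = concat_axis0(outputs)
print(len(stacked))  # number of rows = steps * batch rows
print(stacked)
```

    The labels are concatenated in the same step-major order, so row k of the logits lines up with row k of the labels.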
    
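    Define the optimizer

    clip_by_global_norm first computes global_norm, the L2 norm over all gradient tensors taken together. If it exceeds the threshold (1.25 here), every gradient is scaled to gradients * 1.25 / global_norm; otherwise the gradients are left unchanged.

    Finally, apply_gradients. The global_step passed in is incremented by one on every call, so the next evaluation of learning_rate uses the new global_step value.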
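
    To see how global_step drives the learning rate, here is a plain-Python sketch of a staircase exponential decay with the same constants as the optimizer code (10.0, 5000, 0.1). `staircase_decay` is a hypothetical helper that follows the documented formula initial_rate * decay_rate ^ floor(global_step / decay_steps); it is not the TensorFlow function itself.

```python
def staircase_decay(initial_rate, global_step, decay_steps, decay_rate):
    """initial_rate * decay_rate ** floor(global_step / decay_steps)."""
    return initial_rate * decay_rate ** (global_step // decay_steps)

for step in (0, 4999, 5000, 7000):
    print(step, staircase_decay(10.0, step, 5000, 0.1))
```

    With 7001 training steps, the rate stays at 10.0 for the first 5000 steps and then drops by a factor of ten for the remainder.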

      # Optimizer.
      global_step = tf.Variable(0)
      learning_rate = tf.train.exponential_decay(
        10.0, global_step, 5000, 0.1, staircase=True)
      optimizer = tf.train.GradientDescentOptimizer(learning_rate)
      gradients, v = zip(*optimizer.compute_gradients(loss))
      # Clip the gradients to guard against exploding gradients.
      gradients, _ = tf.clip_by_global_norm(gradients, 1.25)
      optimizer = optimizer.apply_gradients(
        zip(gradients, v), global_step=global_step)
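
    The clipping rule can be checked by hand with a plain-Python sketch. `clip_by_global_norm_py` is a hypothetical re-implementation on lists of floats, written only to illustrate the rule described in this section; it is not TensorFlow's function.

```python
import math

def clip_by_global_norm_py(gradients, clip_norm):
    """Scale all gradient lists jointly so their combined L2 norm <= clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if global_norm <= clip_norm:
        return gradients, global_norm
    scale = clip_norm / global_norm
    return [[g * scale for g in grad] for grad in gradients], global_norm

grads = [[3.0, 0.0], [0.0, 4.0]]  # global norm = sqrt(9 + 16) = 5
clipped, norm = clip_by_global_norm_py(grads, 1.25)
print(norm, clipped)
```

    All gradients are scaled by the same factor, so the direction of the overall update is preserved and only its length is capped.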
    
    Define the predictions
    
      # Predictions.
      train_prediction = tf.nn.softmax(logits)
    
    Sampling and validation evaluation
      # Sampling and validation eval: batch 1, no unrolling.
      sample_input = tf.placeholder(tf.float32, shape=[1, vocabulary_size])
      saved_sample_output = tf.Variable(tf.zeros([1, num_nodes]))
      saved_sample_state = tf.Variable(tf.zeros([1, num_nodes]))
      reset_sample_state = tf.group(
        saved_sample_output.assign(tf.zeros([1, num_nodes])),
        saved_sample_state.assign(tf.zeros([1, num_nodes])))
      sample_output, sample_state = lstm_cell(
        sample_input, saved_sample_output, saved_sample_state)
      with tf.control_dependencies([saved_sample_output.assign(sample_output),
                                    saved_sample_state.assign(sample_state)]):
        sample_prediction = tf.nn.softmax(tf.nn.xw_plus_b(sample_output, w, b))
    

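    Training

    Training is evaluated with perplexity, which is derived from the cross-entropy.
    From information theory (see the Wikipedia definition of perplexity), perplexity and cross_entropy are related as follows: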
    perplexity = e^{cross\_entropy}
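
    A quick sanity check of this relation in plain Python, assuming a model that guesses uniformly over the 27 characters. `cross_entropy` here is a hypothetical helper operating on lists, not the helper used in the training loop. For a uniform guess the perplexity should come out close to the vocabulary size.

```python
import math

def cross_entropy(predictions, labels):
    """Mean over rows of -sum(label * log(prediction))."""
    total = 0.0
    for pred_row, label_row in zip(predictions, labels):
        total -= sum(l * math.log(p) for p, l in zip(pred_row, label_row))
    return total / len(labels)

vocabulary_size = 27
uniform = [[1.0 / vocabulary_size] * vocabulary_size]
one_hot = [[1.0] + [0.0] * (vocabulary_size - 1)]
perplexity = math.exp(cross_entropy(uniform, one_hot))
print(perplexity)  # close to the vocabulary size for a uniform guess
```

    A perplexity well below 27 therefore means the model is doing better than guessing characters at random.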

    num_steps = 7001
    summary_frequency = 100
    
    with tf.Session(graph=graph) as session:
      tf.global_variables_initializer().run()
      print('Initialized')
      mean_loss = 0
      for step in range(num_steps):
        batches = train_batches.next()  # fetch the next sequence of training batches
        feed_dict = dict()
        for i in range(num_unrollings + 1):
          feed_dict[train_data[i]] = batches[i]  # one batch per time-step placeholder
        _, l, predictions, lr = session.run(
          [optimizer, loss, train_prediction, learning_rate], feed_dict=feed_dict)
    
        mean_loss += l
        if step % summary_frequency == 0:
          if step > 0:
            mean_loss = mean_loss / summary_frequency
          # The mean loss is an estimate of the loss over the last few batches
          # (the average over the last summary_frequency steps).
          print(
            'Average loss at step %d: %f learning rate: %f' % (step, mean_loss, lr))
          mean_loss = 0
          # Note the helper functions used here; they are described after this listing.
          labels = np.concatenate(list(batches)[1:])
          print('Minibatch perplexity: %.2f' % float(
            np.exp(logprob(predictions, labels))))
          if step % (summary_frequency * 10) == 0:
            # Generate some samples.
            # Generate a few sample sentences for visual inspection.
            print('=' * 80)
            for _ in range(5):
              feed = sample(random_distribution())
              sentence = characters(feed)[0]
              reset_sample_state.run()
              for _ in range(79):
                prediction = sample_prediction.eval({sample_input: feed})
                feed = sample(prediction)
                sentence += characters(feed)[0]
              print(sentence)
            print('=' * 80)
          # Measure validation set perplexity.
          reset_sample_state.run()
          valid_logprob = 0
          for _ in range(valid_size):
            b = valid_batches.next()
            predictions = sample_prediction.eval({sample_input: b[0]})
            valid_logprob = valid_logprob + logprob(predictions, b[1])
          print('Validation set perplexity: %.2f' % float(np.exp(
            valid_logprob / valid_size)))
    

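    The helper functions:
    logprob: computes the cross-entropy between the labels and the predictions.

    First recall the definition of cross_entropy:

    Cross Entropy = - \sum_{i}^{N} labels_i \cdot \log(predictions_i)
    Then, with N the number of predictions (rows) in the batch,

    logprob = { Cross Entropy \over N }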

    def logprob(predictions, labels):
      """Log-probability of the true labels in a predicted batch."""
      predictions[predictions < 1e-10] = 1e-10
      return np.sum(np.multiply(labels, -np.log(predictions))) / labels.shape[0]
    

    random_distribution(): draws values uniformly from [0, 1] and normalizes them so that they sum to 1.

    def random_distribution():
      """Generate a random column of probabilities."""
      b = np.random.uniform(0.0, 1.0, size=[1, vocabulary_size])
      return b/np.sum(b, 1)[:,None]
    

    sample_distribution(distribution): samples an integer index in [0, len(distribution) - 1], where each index is chosen with the probability given by distribution.

    
    import random

    def sample_distribution(distribution):
      """Sample one element from a distribution assumed to be an array of normalized
      probabilities.
      """
      r = random.uniform(0, 1)
      s = 0
      for i in range(len(distribution)):
        s += distribution[i]
        if s >= r:
          return i
      return len(distribution) - 1
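
    The cumulative-sum walk in this function is inverse-CDF sampling. The following sketch checks empirically that indices come out with roughly the requested frequencies. `sample_index` is a hypothetical variant that takes an explicit seeded generator for repeatability, and the distribution [0.1, 0.7, 0.2] is made up for illustration.

```python
import random

def sample_index(distribution, rng):
    """Walk the cumulative sum until it passes a uniform draw in [0, 1)."""
    r = rng.random()
    cumulative = 0.0
    for index, probability in enumerate(distribution):
        cumulative += probability
        if cumulative >= r:
            return index
    return len(distribution) - 1  # guard against rounding error

rng = random.Random(0)  # fixed seed so the run is repeatable
distribution = [0.1, 0.7, 0.2]
counts = [0, 0, 0]
for _ in range(10000):
    counts[sample_index(distribution, rng)] += 1
frequencies = [c / 10000 for c in counts]
print(frequencies)
```

    Sampling this way, instead of always taking the most likely character, is what gives the generated sentences some variety.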
    

    sample(prediction): samples one character from the prediction and returns it as a one-hot vector.

    def sample(prediction):
      """Turn a (column) prediction into 1-hot encoded samples."""
      p = np.zeros(shape=[1, vocabulary_size], dtype=float)
      p[0, sample_distribution(prediction[0])] = 1.0
      return p
    


          Article link: https://www.haomeiwen.com/subject/kgkncqtx.html