美文网首页自然语言处理
LSTM原理、源码、Demo及习题

LSTM原理、源码、Demo及习题

作者: 潘旭 | 来源:发表于2020-02-04 10:53 被阅读0次

全面整理LSTM相关原理,源码,以及开发demo,设计习题。如转载请注明转载出处。

LSTM 框架

lstm.png

lstm 由3个门和一个当前细胞输出值也就是 \tilde{C}_t来控制输入

  1. 遗忘门 - 表示 h_{t-1} 有多少被遗忘. f_t = \sigma (W_{if}x_t + b_if + W_{hf}h_{(t-1)} + b_hf) = \sigma (W_{f}[x_t,h_{(t-1)}] + b_f)
  2. 输入门 - 表示当前时刻有多少被保存下来。i_t = \sigma (W_{ii}x_t + b_ii + W_{hi}h_{(t-1)} + b_hi) = \sigma (W_{i}[x_t,h_{(t-1)}] + b_i)
  3. 输出门 - 控制多少信息被输出到隐层。o_t = \sigma (W_{io}x_t + b_io + W_{ho}h_{(t-1)} + b_ho) = \sigma (W_{o}[x_t,h_{(t-1)}] + b_o)
  4. 当前Cell输出 \tilde{C}_tg_t 表示 - 当前t时刻的实际 cell结果, \tilde{C}_t = g_t = tanh(W_{ig}x_t + b_ig + W_{hg}h_{(t-1)} + b_hg) = tanh(W_{g}[x_t,h_{(t-1)}] + b_g)

通过3个门以及 \tilde{C}_t = g_t 计算如下:

  1. c_t - 当前细胞输出。 c_t = f_{t}*c_{t-1} + i_{t}*g_t
  2. h_t - 隐层输出。 h_t = o_{t}*tanh(c_t)

参数计算

一共3个门加一个g_t,所以一共 4组参数(W_f, b_f), (W_i, b_i), (W_o, b_o), (W_g, b_g), [x_t,h_{(t-1)}] size是 input_size+hidden_size, 因为 W * [x_t,h_{(t-1)}] 输出的维度是与 h_t一样的也就是 hidden_size, 所以 W的维度是 (input_size+hidden_size, hidden_size). b 的size就是 hdden_size. 所以总共的参数量就是:

 4 * ((input_size+hidden_size) * hidden_size + hidden_size)

GRU

gru.png

GRU由2个门以及一个 隐层输出值\tilde{h}_t 也叫做 n_t来控制最终的h_t

  1. r_t - 重置门. r_t = \sigma (W_{ir}x_t + b_ir + W_{hr}h_{(t-1)} + b_hr) = \sigma (W_r[x_t, h_{(t-1)}] + b_r)
  2. z_t - 更新门. z_t = \sigma (W_{iz}x_t + b_iz + W_{hz}h_{(t-1)} + b_hz) = \sigma (W_z[x_t, h_{(t-1)}] + b_z)
  3. \tilde{h}_t - 隐层细胞输出. \tilde{h}_t = n_t = tanh(W_{in}x_t + b_in + r_t*(W_{hn}h_(t-1) + b_hn)) = tanh(W_{n}[x_t, r_t*h_{(t-1)}] + b_n)
  4. h_t - 最后输出. h_t = (1-z_{t})*h_{t-1} + z_{t}*\tilde{h}_t = (1-z_{t})*h_{t-1} + z_{t}*n_t

从上面来看,GRU 实际上比LSTM 少了一组参数。在数据量较大的时候使用lstm,而数据量较少使用GRU. 同时GRU不像lstm,有c_t, h_t两个输出,GRU只有一个 h_t

pytorch lstm 解析

\begin{array}{ll} \\ i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\ f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{(t-1)} + b_{hf}) \\ g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{(t-1)} + b_{hg}) \\ o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{(t-1)} + b_{ho}) \\ c_t = f_t * c_{(t-1)} + i_t * g_t \\ h_t = o_t * \tanh(c_t) \\ \end{array}

Parameters

  • input_size – The number of expected features in the input x
  • hidden_size – The number of features in the hidden state h
  • num_layers – Number of recurrent layers. E.g., setting num_layers=2 would mean stacking two LSTMs together to form a stacked LSTM, with the second LSTM taking in outputs of the first LSTM and computing the final results. Default: 1
  • bias – If False, then the layer does not use bias weights b_ih and b_hh. Default: True
  • batch_first – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False
  • dropout – If non-zero, introduces a Dropout layer on the outputs of each LSTM layer except the last layer, with dropout probability equal to dropout. Default: 0
  • bidirectional – If True, becomes a bidirectional LSTM. Default: False

Inputs: input, (h_0, c_0)

  • input of shape (seq_len, batch, input_size): tensor containing the features of the input sequence. The input can also be a packed variable length sequence. See torch.nn.utils.rnn.pack_padded_sequence or torch.nn.utils.rnn.pack_sequence for details.
  • h_0 of shape (num_layers * num_directions, batch, hidden_size): tensor containing the initial hidden state for each element in the batch. If the LSTM is bidirectional, num_directions should be 2, else it should be 1.
  • c_0 of shape (num_layers * num_directions, batch, hidden_size): tensor containing the initial cell state for each element in the batch. If (h_0, c_0) is not provided, both h_0 and c_0 default to zero.

Outputs: output, (h_n, c_n)

  • output of shape (seq_len, batch, num_directions * hidden_size): tensor containing the output features (h_t) from the last layer of the LSTM, for each t. If a :class:torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence. For the unpacked case, the directions can be separated using output.view(seq_len, batch, num_directions, hidden_size), with forward and backward being direction 0 and 1 respectively. Similarly, the directions can be separated in the packed case. 从shape来看, 实际上这是所有 h_t 的输出. 所以对于 seq2seq 来说这个结果就够用了。
  • h_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t = seq_len. Like output, the layers can be separated using h_n.view(num_layers, num_directions, batch, hidden_size) and similarly for c_n. 从这个shaple num_layers * num_directions 来看,说明这是每一层的最后一个输出。对于最上面的一层来说,这个是包含最后一个输出的。也就是 -1, batch, hidden_size 就是最后一个输出。特别注意,即使设置了 batch _first=True, h_n 的维度依然是 num_layers * num_directions, batch, hidden_size 如果要转换需要 h_n.transpose(0, 1) 来转换到 batch, num_layers * num_directions, hidden_size
  • c_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the cell state for t = seq_len. 与 h_n 类似, 是每一层的 cell_state 输出。

变长序列处理

我们知道 lstm, 处理的是定长序列,那么,对于nlp来说,变长序列该如何处理?

对于变长序列处理,采用了:

  • pack_padded_sequence: 将变长序列打包
  • pad_packed_sequence: 将打包的结果解包

pack_padded_sequence

在理解这两个函数之前,我们看看lstm是如何进行运行的。

No. w_0 w_1 w_2 w_3 w_4
0 I love mom ' cooking
1 Yes 0 0 0 0
2 No way 0 0 0
3 I love you too !
4 This is the shit 0

一共有5个序列,长度各种各样。其中的 0 表示的是padding. 因为一个batch必须是有同样维度的才可以。

shape=BatchSize \times SequeceLength \times HiddenDim=1 \times 5 \times *

对于lstm的实际运行中会将长度一样的放在,这样能够批量运行同一批。所以会按照长度进行排序。重新排列的结果如下:

No. w_0 w_1 w_2 w_3 w_4
0 I love mom ' cooking
3 I love you too !
4 This is the shit
2 No way
1 Yes

这个过程就是pack的过程。pack之后,会重新将长度一样的放在一起,因为是长度一样的放在一起,那么,也就是将padding的0全部去掉后的排列结果。

示例图:

pack_pad.jpg

转载链接

  • batch_sizes: 看着其中不同的颜色。绿色:5, 橘色: 4 以此类推。 那么, pytorch在实际运算的时候是如何运算的呢?

批量运算,会一次性将上面所有数据进行运算。大体流程:

  1. 设置循环步数,为 max_length=batch_sizes[0], 这里就是5
  2. 开始循环 i=0:
    • 设置输入为 batch_size[i] 进入 lstm cell 批量运算. (i=0 时是 绿色的5个, i=1时是橘色的4个,一次类推)
    • i = i + 1

这样每一次处理实际上是运算递减的,同时,进行的也是 序列实际长度的lstm运算。(PS: 早期,因为看到padding, 所以以为会将padding一起运算,这样就不用进行pack了,但是, 这样会增加运算量,同时, 对于 最后一个输出 h_n, c_n 不是实际序列最后一个输出,而是 padding后的输出。而看了pytorch源码后,理解,在每一次的 lstm cell 运算,会重新取batch, 而这个batch是变化,与实际sequence长度一致. 从这个角度来看,我觉得之所以pack,对长度排序,是为了方便 每一次 lstm cell 取batch 方便运算; 如果不排序,每一次通过mask取会在lstm循环运算的时候效率较低)

pytorch c++ 源码 aten/src/ATen/native/RNN.cpp:

template<typename hidden_type, typename cell_params>

struct PackedLayer : Layer<PackedSequence, hidden_type, cell_params> {

using output_type = typename Layer<PackedSequence, hidden_type, cell_params>::output_type;

  PackedLayer(Cell<hidden_type, cell_params>& cell)
    : cell_(cell) {};

output_type operator()(

    const PackedSequence& input, 
    const hidden_type& input_hidden, 
    const cell_params& params) const override
{
    std::vector<at::Tensor> step_outputs;
    
    std::vector<hidden_type> hiddens;
    int64_t input_offset = 0;
    int64_t num_steps = input.batch_sizes.size(0);
    int64_t* batch_sizes = input.batch_sizes.data<int64_t>();
    int64_t last_batch_size = batch_sizes[0];

    // Batch sizes is a sequence of decreasing lengths, which are offsets
    // into a 1D list of inputs. At every step we slice out batch_size elements,
    // and possibly account for the decrease in the batch size since the last step,
    // which requires us to slice the hidden state (since some sequences
    // are completed now). The sliced parts are also saved, because we will need
    // to return a tensor of final hidden state.
    auto hidden = input_hidden;
    for (int64_t i = 0; i < num_steps; ++i) {
      int64_t batch_size = batch_sizes[i];
      auto step_input = input.data.narrow(0, input_offset, batch_size);
      input_offset += batch_size;

      int64_t dec = last_batch_size - batch_size;
      if (dec > 0) {
        hiddens.push_back(hidden_slice(hidden, last_batch_size - dec, last_batch_size));
        hidden = hidden_slice(hidden, 0, last_batch_size - dec);
      }

      last_batch_size = batch_size;
      hidden = cell_(step_input, hidden, params);
      step_outputs.push_back(hidden_as_output(hidden));
    }
    hiddens.push_back(hidden);
    std::reverse(hiddens.begin(), hiddens.end());

    return { PackedSequence{ at::cat(step_outputs, 0), input.batch_sizes }, hidden_concat(hiddens) };
  }

  Cell<hidden_type, cell_params>& cell_;
};

解释:

  • num_steps: 就是最长的batch 其实也就是最长的sequence length
  • input.data.narrow(0, input_offset, batch_size): 从 batch_size 中取每一步 lstm cell 要运算的 所有sequence x_t, 对应到代码就是 x_{input\_offset}
  • dec = last_batch_size - batch_size: 对于其他没有参与到 lstm cell 运算的,用0来补上,保证lstm运算后,所有的 sequence hidden layer 长度是一样的。

结论: 在pack后的变长序列,运算每一步都是有效运算。所以在来看 lstm 的输出

  • output: shape (seq_len, batch, num_directions * hidden_size) 是包含padding的序列长度(padding的为0)
  • h_n: shape (num_layers * num_directions, batch, hidden_size), 实际的sequence 长度计算的lstm 最后一个 隐层,与padding无关. 从shape看是所有layer的最后一个隐层.
  • c_n: shape (num_layers * num_directions, batch, hidden_size), 实际的sequence 长度计算的 lstm 最后一个 cell state 与padding无关. 从shape看是所有layer的最后一层

enforce_sorted 参数

这个参数特别说明一下, 默认是 True, 也就是说 输入的batch sequence 必须是按照长度降序排好序的。

如果这个参数是 False, 那么,这个排序的工作会由 pack_padded_sequence 来做。

lstm demo

下面的demo包含了前面说的参数设置。

import torch
from torch.nn.modules.rnn import LSTM
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

X_SORTED = torch.tensor([
    [
        [1, 2], [3, 4], [0, 0]
    ],
    [
        [5, 6], [7, 8], [0, 0]
    ],
    [
        [9, 10], [0, 0], [0, 0]
    ]


], dtype=torch.float)

X_UNSORTED = torch.tensor([
    [
        [1, 2], [3, 4], [0, 0]
    ],

    [
        [9, 10], [0, 0], [0, 0]
    ],
    [
        [5, 6], [7, 8], [0, 0]
    ]
], dtype=torch.float)

sequence_length_sorted = torch.tensor([2, 2, 1], dtype=torch.long)
sequence_length_unsorted = torch.tensor([2, 1, 2], dtype=torch.long)


def demo_pack_padded_sequence(x, sequence_length, is_sorted):
    print(f"x: {x.numpy()}")
    print(f"Batch size: {x.shape[0]}, "
          f"Sequence length: {x.shape[1]}, "
          f"hidden dim: {x.shape[2]}")

    print(f"sequence length: {sequence_length.numpy()}")

    pack = pack_padded_sequence(input=x,
                                lengths=sequence_length,
                                batch_first=True,
                                enforce_sorted=is_sorted)

    print(f"pack: {pack}")

    pad = pad_packed_sequence(sequence=pack, batch_first=True)
    print(f"pad: {pad}")

    lstm = LSTM(input_size=x.shape[-1],
                hidden_size=4,
                num_layers=1,
                batch_first=True,
                bidirectional=False)
    output, (hn, cn) = lstm(pack)

    print("output", "-" * 80)
    print(output)

    pad_output = pad_packed_sequence(output, batch_first=True, padding_value=0.0)
    print("+" * 80)
    print(f"output: {pad_output}")
    print(f"hn: {hn}")
    print(f"cn: {cn}")
    print(f"output[:-1:]: {output[:-1:]}")
demo_pack_padded_sequence(x=X_SORTED,
                              sequence_length=sequence_length_sorted,
                              is_sorted=True)
x: [[[ 1.  2.]
  [ 3.  4.]
  [ 0.  0.]]

 [[ 5.  6.]
  [ 7.  8.]
  [ 0.  0.]]

 [[ 9. 10.]
  [ 0.  0.]
  [ 0.  0.]]]
Batch size: 3, Sequence length: 3, hidden dim: 2
sequence length: [2 2 1]
pack: PackedSequence(data=tensor([[ 1.,  2.],
        [ 5.,  6.],
        [ 9., 10.],
        [ 3.,  4.],
        [ 7.,  8.]]), batch_sizes=tensor([3, 2]), sorted_indices=None, unsorted_indices=None)
pad: (tensor([[[ 1.,  2.],
         [ 3.,  4.]],

        [[ 5.,  6.],
         [ 7.,  8.]],

        [[ 9., 10.],
         [ 0.,  0.]]]), tensor([2, 2, 1]))
output --------------------------------------------------------------------------------
PackedSequence(data=tensor([[-0.0644,  0.1670,  0.1466,  0.0274],
        [-0.0926,  0.1344,  0.4401,  0.0731],
        [-0.0940,  0.1009,  0.5912,  0.0238],
        [-0.1254,  0.2073,  0.3820,  0.0958],
        [-0.1485,  0.1642,  0.5456,  0.0616]], grad_fn=<CatBackward>), batch_sizes=tensor([3, 2]), sorted_indices=None, unsorted_indices=None)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
output: (tensor([[[-0.0644,  0.1670,  0.1466,  0.0274],
         [-0.1254,  0.2073,  0.3820,  0.0958]],

        [[-0.0926,  0.1344,  0.4401,  0.0731],
         [-0.1485,  0.1642,  0.5456,  0.0616]],

        [[-0.0940,  0.1009,  0.5912,  0.0238],
         [ 0.0000,  0.0000,  0.0000,  0.0000]]], grad_fn=<TransposeBackward0>), tensor([2, 2, 1]))
hn: tensor([[[-0.1254,  0.2073,  0.3820,  0.0958],
         [-0.1485,  0.1642,  0.5456,  0.0616],
         [-0.0940,  0.1009,  0.5912,  0.0238]]], grad_fn=<StackBackward>)
cn: tensor([[[-0.1698,  0.4718,  0.7097,  0.4341],
         [-0.1641,  0.3882,  0.8470,  1.2613],
         [-0.0996,  0.2737,  0.8373,  0.9169]]], grad_fn=<StackBackward>)
output[:-1:]: (tensor([[-0.0644,  0.1670,  0.1466,  0.0274],
        [-0.0926,  0.1344,  0.4401,  0.0731],
        [-0.0940,  0.1009,  0.5912,  0.0238],
        [-0.1254,  0.2073,  0.3820,  0.0958],
        [-0.1485,  0.1642,  0.5456,  0.0616]], grad_fn=<CatBackward>), tensor([3, 2]), None)

特别提示: hn 的结果是包含在 output 中的, 如何从 output 中提取出 hn 参考习题4

lstm 应用

从前面阐述,明白了lstm实际的原理和输出。那么,在实际应用的时候,如果是 使用 每一个时间步的隐层进行运算, 那么,要注意将mask进入运算,因为输出的隐层是包含padding部分的。当然可以利用,padding的值全是0,是有些便利计算方法的,但是不推荐,要使用mask运算。

如果是使用lstm最后一个隐层的输出,那么,直接使用就可以了。

习题

  1. sequence 长度是 100, x_t embedding 维度是 200, 隐层输出维度是 300, 计算 lstm 参数是多少?

  2. lstm 对变长序列padding,在实际计算lstm cell的时候 padding 部分是否参与计算? 如果不参与计算,lstm是如何进行变长计算的?

  3. lstm 输出 output(也就是 每个时间步的hidden输出) 是否 包含 h_n 输出?如果不包含,请说明情况?

  4. 使用h_n, 如何提取 最后一个 最后的 最后一层的 hidden 输出 (最后的hidden输出常常作为整个句子的编码结果); 在不使用 h_n 的情况下, 使用 lstm 输出 output(也就是 每个时间步的hidden输出), 如何提取出最后一个 最后一层的 hidden 输出?

  5. 是否注意到了mask的使用?

相关文章

网友评论

    本文标题:LSTM原理、源码、Demo及习题

    本文链接:https://www.haomeiwen.com/subject/ftxrxhtx.html