Background:
1. A sequence of word indices is fed into an embedding layer and encoded into embedded representations.
2. The sequence of word embeddings is fed into an encoder built from RNNs.
So what exactly does the output of the RNN encoder look like? Many sequence models you find online use bidirectional RNNs, stacked into multi-layer bidirectional RNNs. But sometimes we also need the states of the intermediate layers, and the usual workaround is to build a separate model just to expose those outputs, which is clearly restrictive.
So this time we build a multi-layer bidirectional RNN ourselves and check exactly what it returns. The tests target TensorFlow 2.0; since 2.0 runs eagerly and builds graphs automatically, everything below is written in an object-oriented style.
1 Building the embedding layer
import tensorflow.keras as keras

class Embedding(keras.layers.Layer):
    def __init__(self, input_size,
                 output_size,
                 weights=None):
        super(Embedding, self).__init__()
        if weights is not None:
            self.embedding = keras.layers.Embedding(input_size, output_size, weights=weights, mask_zero=True)
        else:
            self.embedding = keras.layers.Embedding(input_size, output_size, mask_zero=True)

    def __call__(self, input):  # [batch, len]
        return self.embedding(input)  # [batch, len, output_size]
This embedding class mainly wraps keras.layers.Embedding(input_size, output_size, weights=weights, mask_zero=True). The parameters of this API:
- the first argument is the vocabulary size
- the second argument is the word-embedding dimension
- weights is the initial weight matrix; pretrained word embeddings can be passed in through it
- mask_zero masks the word with index 0. We usually put the pad symbol at index 0 of the vocabulary, so the layer generates a mask and propagates it downstream, which prevents the RNN from encoding the redundant pad symbols in a sentence. In TensorFlow 1.x the same effect is achieved by passing the sequence lengths (the sequence_length argument) to tf.nn.dynamic_rnn(); in PyTorch, torch.nn.utils.rnn.pack_padded_sequence serves the same purpose. Note that the mask only takes effect during RNN encoding; it does not turn the pad embeddings into all zeros, they stay whatever they were.
# Embedding layer test
import numpy as np
import tensorflow as tf

weights = [np.array([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]], dtype=np.float64)]
embedding = Embedding(3, 5, weights)  # (num_vocab, embedding_size)
word_id = tf.convert_to_tensor([[1, 2, 0], [1, 0, 0]], dtype=tf.int64)
word_embed = embedding(word_id)  # [batch, seq, embedding_size]
print(word_embed)
>>tf.Tensor(
[[[2. 3. 4. 5. 6.]
[3. 4. 5. 6. 7.]
[1. 2. 3. 4. 5.]]
[[2. 3. 4. 5. 6.]
[1. 2. 3. 4. 5.]
[1. 2. 3. 4. 5.]]], shape=(2, 3, 5), dtype=float32)
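In the output above, the pad positions (index 0) still contain the embedding row [1. 2. 3. 4. 5.] rather than zeros, confirming that mask_zero only produces a mask and does not alter the embeddings. As a minimal sketch (relying on the embedding object defined in the test above), the propagated mask itself can be inspected with compute_mask:

# True marks real tokens, False marks pad positions
mask = embedding.embedding.compute_mask(word_id)
print(mask)  # expected: [[True, True, False], [True, False, False]], shape=(2, 3), dtype=bool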
2 Building the encoder
import tensorflow.keras as keras

class Encoder(keras.layers.Layer):
    def __init__(self, rnn_type,  # RNN type: 'GRU' or 'LSTM'
                 input_size,
                 output_size,
                 num_layers,  # number of RNN layers
                 bidirectional=False):
        super(Encoder, self).__init__()
        assert rnn_type in ['GRU', 'LSTM']
        if bidirectional:
            assert output_size % 2 == 0
            self.num_directions = 2
        else:
            self.num_directions = 1
        units = int(output_size / self.num_directions)
        if rnn_type == 'GRU':
            rnnCell = [getattr(keras.layers, 'GRUCell')(units) for _ in range(num_layers)]
        else:
            rnnCell = [getattr(keras.layers, 'LSTMCell')(units) for _ in range(num_layers)]
        self.rnn = keras.layers.RNN(rnnCell, input_shape=(None, None, input_size),
                                    return_sequences=True, return_state=True)
        self.rnn_type = rnn_type
        self.num_layers = num_layers
        if bidirectional:
            self.rnn = keras.layers.Bidirectional(self.rnn, merge_mode='concat')
        self.bidirectional = bidirectional

    def __call__(self, input):  # [batch, timesteps, input_dim]
        outputs = self.rnn(input)
        output = outputs[0]
        states = outputs[1:]
        print(outputs)  # debug print used by the tests below
        print(len(outputs))  # debug print used by the tests below
        print(len(states))  # debug print used by the tests below
        return output, states
Construction:
- rnnCell = [getattr(keras.layers, 'GRUCell')(units) for _ in range(num_layers)] (or the LSTMCell version) builds a list of num_layers RNN cells.
- rnn = keras.layers.RNN(rnnCell, input_shape=(None, None, input_size), return_sequences=True, return_state=True) takes the cell list and builds a stacked multi-layer RNN. return_sequences=True means the output at every timestep is returned instead of only the last timestep's output; return_state=True means the RNN states are returned as well (with False no states are returned).
- rnn = keras.layers.Bidirectional(self.rnn, merge_mode='concat') makes the stack bidirectional; merge_mode='concat' means the forward and backward outputs are merged by concatenation.
3 Testing the outputs
After all this preamble, let's now test the outputs. If you are not interested in the details, skip straight to the conclusions in section 4.
3.0 Experiment settings
# num_vocab = 3
# embedding_size = 5
# batch_size = 2
# encoder_len = 3
# num_units = 10
# num_layers = 2
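The tests below all follow the same pattern; as a minimal sketch of how they are wired together (the harness here is my own reconstruction, only the printed tensors below come from the actual runs):

# Hook the embedding and encoder together under the settings above
embedding = Embedding(3, 5, weights)                      # num_vocab=3, embedding_size=5
word_embed = embedding(word_id)                           # [batch_size=2, encoder_len=3, 5]
encoder = Encoder('GRU', 5, 10, 2, bidirectional=False)   # num_units=10, num_layers=2; vary rnn_type/bidirectional per test
output, states = encoder(word_embed)                      # the debug prints inside Encoder produce the dumps below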
3.1 Unidirectional multi-layer GRU
Testing with 2 layers:
outputs = rnn(input)
>>[<tf.Tensor: id=553, shape=(2, 3, 10), dtype=float32, numpy=
array([[[-3.5420116e-02, -8.9026507e-05, 2.2907217e-01, 1.9754110e-01,
-3.2863699e-02, -2.4253847e-01, 1.2058940e-01, 6.2615253e-02,
-1.8428519e-01, -2.1019778e-01],
[-7.6624170e-02, 3.7288409e-02, 3.4195143e-01, 3.2474262e-01,
-7.6712951e-02, -3.0440533e-01, 1.9677658e-01, 1.2763622e-01,
-2.7749074e-01, -3.2409826e-01],
[-7.6624170e-02, 3.7288409e-02, 3.4195143e-01, 3.2474262e-01,
-7.6712951e-02, -3.0440533e-01, 1.9677658e-01, 1.2763622e-01,
-2.7749074e-01, -3.2409826e-01]],
[[-3.5420127e-02, -8.9021691e-05, 2.2907217e-01, 1.9754107e-01,
-3.2863699e-02, -2.4253847e-01, 1.2058940e-01, 6.2615216e-02,
-1.8428519e-01, -2.1019775e-01],
[-3.5420127e-02, -8.9021691e-05, 2.2907217e-01, 1.9754107e-01,
-3.2863699e-02, -2.4253847e-01, 1.2058940e-01, 6.2615216e-02,
-1.8428519e-01, -2.1019775e-01],
[-3.5420127e-02, -8.9021691e-05, 2.2907217e-01, 1.9754107e-01,
-3.2863699e-02, -2.4253847e-01, 1.2058940e-01, 6.2615216e-02,
-1.8428519e-01, -2.1019775e-01]]], dtype=float32)>, <tf.Tensor: id=542, shape=(2, 10), dtype=float32, numpy=
array([[ 0.10095029, -0.998891 , -0.48548818, -0.00963031, -0.97031355,
-0.12160255, 0.999949 , -0.10839747, -0.18006183, -0.17532544],
[ 0.03464954, -0.9603172 , -0.53084654, 0.00194323, -0.8031896 ,
-0.07652862, 0.9911491 , -0.06364062, -0.11014236, -0.14036107]],
dtype=float32)>, <tf.Tensor: id=543, shape=(2, 10), dtype=float32, numpy=
array([[-7.6624170e-02, 3.7288409e-02, 3.4195143e-01, 3.2474262e-01,
-7.6712951e-02, -3.0440533e-01, 1.9677658e-01, 1.2763622e-01,
-2.7749074e-01, -3.2409826e-01],
[-3.5420127e-02, -8.9021691e-05, 2.2907217e-01, 1.9754107e-01,
-3.2863699e-02, -2.4253847e-01, 1.2058940e-01, 6.2615216e-02,
-1.8428519e-01, -2.1019775e-01]], dtype=float32)>]
As you can see, outputs has three parts:
output[0]: [batch_size, encoder_len, num_units]
output[1]: [batch_size, num_units]
output[2]: [batch_size, num_units]
Comparing the numbers, output[2] equals the last valid timestep of output[0], so output[2] is the final h of the last layer of the two-layer RNN (and output[1] that of the first layer). Looking at output[0] we can also see the effect of the embedding mask: the encodings at the pad positions are identical to the state at the end of the sentence, simply copied forward.
So the output format is:
[output[0], output[1], output[2], ..., output[num_layers]]
output[0] is the per-timestep output, [batch_size, encoder_len, num_units]
output[N] is the hidden state h of layer N, [batch_size, num_units]
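If only the encoder's final state is needed (say, to initialize a decoder), the last element is the one to take; a minimal sketch following the format above (states = outputs[1:] as returned by the Encoder, names illustrative):

output, states = encoder(word_embed)  # unidirectional multi-layer GRU encoder
top_h = states[-1]                    # h of the top layer, [batch_size, num_units]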
3.2 Bidirectional multi-layer GRU
Testing with 2 layers:
outputs = rnn(input)
>>[<tf.Tensor: id=1096, shape=(2, 3, 10), dtype=float32, numpy=
array([[[-0.01417219, 0.13640611, 0.32041013, 0.00786568,
-0.03442783, 0.46687838, 0.14251477, -0.0060271 ,
-0.03813943, -0.4147334 ],
[-0.0154626 , 0.22333089, 0.49720186, -0.02729558,
-0.13843244, 0.30179217, 0.10419664, -0.0332097 ,
-0.06268977, -0.33545047],
[ 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ,
0. , 0. ]],
[[-0.01417219, 0.13640611, 0.32041013, 0.00786567,
-0.03442784, 0.29038957, 0.09369997, -0.00535166,
-0.02358363, -0.31432554],
[ 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ,
0. , 0. ]]], dtype=float32)>, <tf.Tensor: id=632, shape=(2, 5), dtype=float32, numpy=
array([[-0.01106511, 0.97525597, 0.38123077, 0.15792789, -0.8506844 ],
[-0.00910319, 0.8093642 , 0.2359951 , -0.14750779, -0.56568766]],
dtype=float32)>, <tf.Tensor: id=633, shape=(2, 5), dtype=float32, numpy=
array([[-0.0154626 , 0.22333089, 0.49720186, -0.02729558, -0.13843244],
[-0.01417219, 0.13640611, 0.32041013, 0.00786567, -0.03442784]],
dtype=float32)>, <tf.Tensor: id=1081, shape=(2, 5), dtype=float32, numpy=
array([[0.3142835 , 0.98540443, 0.26638144, 0.00319364, 0.98887223],
[0.36952233, 0.9663322 , 0.17328681, 0.00246616, 0.9730079 ]],
dtype=float32)>, <tf.Tensor: id=1082, shape=(2, 5), dtype=float32, numpy=
array([[ 0.46687838, 0.14251477, -0.0060271 , -0.03813943, -0.4147334 ],
[ 0.29038957, 0.09369997, -0.00535166, -0.02358363, -0.31432554]],
dtype=float32)>]
As you can see, outputs has five parts:
output[0]: [batch_size, encoder_len, num_units]
output[1]: [batch_size, num_units/2]
output[2]: [batch_size, num_units/2]
output[3]: [batch_size, num_units/2]
output[4]: [batch_size, num_units/2]
Comparing the numbers, output[2] is the final h of the forward direction of the last layer of the two-layer RNN. Looking at output[0] we can also see the effect of the embedding mask: the encodings at the pad positions are all zeros (slightly different from the unidirectional case).
So the output format is:
[output[0], output[1], output[2], ..., output[num_layers*2]]
output[0] is the per-timestep output, [batch_size, encoder_len, num_units]
output[1] is the forward h of layer 0, [batch_size, num_units/2]
output[2] is the forward h of layer 1, [batch_size, num_units/2]
output[3] is the backward h of layer 0, [batch_size, num_units/2]
output[4] is the backward h of layer 1, [batch_size, num_units/2]
In general:
output[0] is the per-timestep output, [batch_size, encoder_len, num_units]
output[1 : 1+num_layers] are the forward h of each layer, [batch_size, num_units/2]
output[1+num_layers :] are the backward h of each layer, [batch_size, num_units/2]
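To get a single final state of width num_units out of the bidirectional stack, the top layer's forward and backward h have to be concatenated by hand; a minimal sketch under the indexing derived above (states = outputs[1:] as returned by the Encoder):

num_layers = 2
output, states = encoder(word_embed)          # bidirectional multi-layer GRU encoder
h_fwd = states[num_layers - 1]                # forward h of the top layer, [batch_size, num_units/2]
h_bwd = states[2 * num_layers - 1]            # backward h of the top layer, [batch_size, num_units/2]
final_h = tf.concat([h_fwd, h_bwd], axis=-1)  # [batch_size, num_units]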
3.3 Unidirectional multi-layer LSTM
Testing with 2 layers:
outputs = rnn(input)
>>[<tf.Tensor: id=413, shape=(2, 3, 10), dtype=float32, numpy=
array([[[ 0.03599537, -0.01473989, 0.05308587, -0.00895863,
0.01214957, -0.03720263, 0.02418177, -0.01348425,
-0.01298695, -0.03001863],
[ 0.07842067, -0.03227948, 0.09026823, -0.02830549,
0.01443951, -0.07027332, 0.05110155, -0.02023602,
-0.01933629, -0.05507426],
[ 0.07842067, -0.03227948, 0.09026823, -0.02830549,
0.01443951, -0.07027332, 0.05110155, -0.02023602,
-0.01933629, -0.05507426]],
[[ 0.03599537, -0.01473989, 0.05308587, -0.00895863,
0.01214957, -0.03720263, 0.02418176, -0.01348425,
-0.01298695, -0.03001863],
[ 0.03599537, -0.01473989, 0.05308587, -0.00895863,
0.01214957, -0.03720263, 0.02418176, -0.01348425,
-0.01298695, -0.03001863],
[ 0.03599537, -0.01473989, 0.05308587, -0.00895863,
0.01214957, -0.03720263, 0.02418176, -0.01348425,
-0.01298695, -0.03001863]]], dtype=float32)>, [<tf.Tensor: id=400, shape=(2, 10), dtype=float32, numpy=
array([[ 0.03796372, -0.00646253, -0.10610048, 0.2621497 , 0.00817543,
0.08675741, 0.03996095, 0.16117425, 0.65429616, -0.07473923],
[ 0.03174995, -0.0089063 , -0.07151143, 0.1907991 , 0.01177687,
0.04312354, 0.02712633, 0.19289187, 0.51734495, -0.09216765]],
dtype=float32)>, <tf.Tensor: id=401, shape=(2, 10), dtype=float32, numpy=
array([[ 0.44051132, -0.2818818 , -0.11988518, 1.2482902 , 0.17308153,
0.69406235, 0.06025018, 1.0685071 , 0.797681 , -0.1052426 ],
[ 0.22792174, -0.23269363, -0.0844808 , 0.6085427 , 0.16032045,
0.3221852 , 0.04220397, 0.8066951 , 0.5936996 , -0.12931918]],
dtype=float32)>], [<tf.Tensor: id=402, shape=(2, 10), dtype=float32, numpy=
array([[ 0.07842067, -0.03227948, 0.09026823, -0.02830549, 0.01443951,
-0.07027332, 0.05110155, -0.02023602, -0.01933629, -0.05507426],
[ 0.03599537, -0.01473989, 0.05308587, -0.00895863, 0.01214957,
-0.03720263, 0.02418176, -0.01348425, -0.01298695, -0.03001863]],
dtype=float32)>, <tf.Tensor: id=403, shape=(2, 10), dtype=float32, numpy=
array([[ 0.15394947, -0.06263469, 0.19750515, -0.05156851, 0.02507691,
-0.14487514, 0.0979518 , -0.03745949, -0.04038396, -0.11667444],
[ 0.07117884, -0.02876606, 0.11274612, -0.01666791, 0.02163199,
-0.07605074, 0.0462449 , -0.0253415 , -0.02669653, -0.0623024 ]],
dtype=float32)>]]
As you can see, outputs has three parts:
output[0]: [batch_size, encoder_len, num_units]
output[1]: [[batch_size, num_units], [batch_size, num_units]]
output[2]: [[batch_size, num_units], [batch_size, num_units]]
Comparing the numbers, output[2][0] is the final h (it equals the last valid timestep of output[0]), so output[2] is the state of the last layer of the two-layer RNN.
So the output format is:
[output[0], output[1], output[2], ..., output[num_layers]]
output[0] is the per-timestep output, [batch_size, encoder_len, num_units]
output[N] is the state [h, c] of layer N, [[batch_size, num_units], [batch_size, num_units]]
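Unlike the GRU case, each per-layer state here is itself a list [h, c]; a minimal sketch of pulling out the top layer's state (states = outputs[1:] as returned by the Encoder):

output, states = encoder(word_embed)  # unidirectional multi-layer LSTM encoder
top_h, top_c = states[-1]             # hidden state and cell state of the top layer, each [batch_size, num_units]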
3.4 Bidirectional multi-layer LSTM
Testing with 2 layers:
outputs = rnn(input)
>>[<tf.Tensor: id=816, shape=(2, 3, 10), dtype=float32, numpy=
array([[[-0.06421194, -0.00754393, -0.04505453, 0.05208206,
-0.03166301, -0.0243494 , -0.00789784, 0.10367834,
0.09167746, 0.01394088],
[-0.1210794 , -0.01336129, -0.09259984, 0.08671384,
-0.06314958, -0.00972542, 0.00197651, 0.04819337,
0.05299319, -0.00179022],
[ 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ,
0. , 0. ]],
[[-0.06421195, -0.00754394, -0.04505453, 0.05208206,
-0.031663 , -0.00825483, 0.00164982, 0.0411781 ,
0.04471161, -0.00124086],
[ 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ,
0. , 0. ],
[ 0. , 0. , 0. , 0. ,
0. , 0. , 0. , 0. ,
0. , 0. ]]], dtype=float32)>, [<tf.Tensor: id=506, shape=(2, 5), dtype=float32, numpy=
array([[ 8.3107513e-01, -6.2514983e-02, -2.2869313e-01, 2.0354016e-02,
-2.1946893e-04],
[ 6.7014122e-01, -6.0981486e-02, -1.2038765e-01, 1.5553602e-02,
-9.7971398e-04]], dtype=float32)>, <tf.Tensor: id=507, shape=(2, 5), dtype=float32, numpy=
array([[ 1.234918 , -0.3281948 , -0.28206116, 0.06127462, -0.39995325],
[ 0.8689103 , -0.22541635, -0.15223289, 0.04101423, -0.34894544]],
dtype=float32)>], [<tf.Tensor: id=508, shape=(2, 5), dtype=float32, numpy=
array([[-0.1210794 , -0.01336129, -0.09259984, 0.08671384, -0.06314958],
[-0.06421195, -0.00754394, -0.04505453, 0.05208206, -0.031663 ]],
dtype=float32)>, <tf.Tensor: id=509, shape=(2, 5), dtype=float32, numpy=
array([[-0.23299618, -0.03142868, -0.214081 , 0.20567834, -0.14606045],
[-0.12106555, -0.01712375, -0.10137994, 0.11782492, -0.07105252]],
dtype=float32)>], [<tf.Tensor: id=799, shape=(2, 5), dtype=float32, numpy=
array([[ 0.00424142, 0.3668591 , -0.5833647 , -0.03675587, 0.0019763 ],
[ 0.00434441, 0.32393652, -0.36846292, -0.01977784, 0.0016813 ]],
dtype=float32)>, <tf.Tensor: id=800, shape=(2, 5), dtype=float32, numpy=
array([[ 0.00942778, 0.5103172 , -1.1896598 , -0.48518264, 0.3304861 ],
[ 0.00973888, 0.4245502 , -0.58340454, -0.2169237 , 0.32636905]],
dtype=float32)>], [<tf.Tensor: id=801, shape=(2, 5), dtype=float32, numpy=
array([[-0.0243494 , -0.00789784, 0.10367834, 0.09167746, 0.01394088],
[-0.00825483, 0.00164982, 0.0411781 , 0.04471161, -0.00124086]],
dtype=float32)>, <tf.Tensor: id=802, shape=(2, 5), dtype=float32, numpy=
array([[-0.04924963, -0.0166596 , 0.21717109, 0.15558058, 0.02793371],
[-0.01635381, 0.00336652, 0.08587213, 0.07893328, -0.00250007]],
dtype=float32)>]]
As you can see, outputs has five parts:
output[0]: [batch_size, encoder_len, num_units]
output[1]: [[batch_size, num_units/2], [batch_size, num_units/2]]
output[2]: [[batch_size, num_units/2], [batch_size, num_units/2]]
output[3]: [[batch_size, num_units/2], [batch_size, num_units/2]]
output[4]: [[batch_size, num_units/2], [batch_size, num_units/2]]
Comparing the numbers, output[2][0] is the final forward h, so output[2] is the forward state of the last layer of the two-layer RNN.
So the output format is:
[output[0], output[1], output[2], ..., output[num_layers*2]]
output[0] is the per-timestep output, [batch_size, encoder_len, num_units]
output[1] is the forward [h, c] of layer 0, [[batch_size, num_units/2], [batch_size, num_units/2]]
output[2] is the forward [h, c] of layer 1, [[batch_size, num_units/2], [batch_size, num_units/2]]
output[3] is the backward [h, c] of layer 0, [[batch_size, num_units/2], [batch_size, num_units/2]]
output[4] is the backward [h, c] of layer 1, [[batch_size, num_units/2], [batch_size, num_units/2]]
In general:
output[0] is the per-timestep output, [batch_size, encoder_len, num_units]
output[1 : 1+num_layers] are the forward [h, c] of each layer, [[batch_size, num_units/2], [batch_size, num_units/2]]
output[1+num_layers :] are the backward [h, c] of each layer, [[batch_size, num_units/2], [batch_size, num_units/2]]
4 Conclusions
Unidirectional multi-layer GRU
output[0] is the per-timestep output, [batch_size, encoder_len, num_units]
output[N] is the hidden state h of layer N, [batch_size, num_units]
Bidirectional multi-layer GRU
output[0] is the per-timestep output, [batch_size, encoder_len, num_units]
output[1 : 1+num_layers] are the forward h of each layer, [batch_size, num_units/2]
output[1+num_layers :] are the backward h of each layer, [batch_size, num_units/2]
Unidirectional multi-layer LSTM
output[0] is the per-timestep output, [batch_size, encoder_len, num_units]
output[N] is the state [h, c] of layer N, [[batch_size, num_units], [batch_size, num_units]]
Bidirectional multi-layer LSTM
output[0] is the per-timestep output, [batch_size, encoder_len, num_units]
output[1 : 1+num_layers] are the forward [h, c] of each layer, [[batch_size, num_units/2], [batch_size, num_units/2]]
output[1+num_layers :] are the backward [h, c] of each layer, [[batch_size, num_units/2], [batch_size, num_units/2]]
In addition, with a unidirectional RNN the pad positions simply repeat the output of the last valid timestep, while with a bidirectional RNN the pad positions are all zeros.
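As a practical wrap-up, here is a small helper that extracts the final top-layer state for all four configurations according to the formats above. It is only a sketch built on the indexing derived in this post (the function name and signature are my own), taking states = outputs[1:] as returned by the Encoder:

def final_state(states, num_layers, bidirectional):
    # GRU: each entry of states is a tensor h; LSTM: each entry is a list [h, c]
    if not bidirectional:
        return states[num_layers - 1]              # top layer: h, or [h, c]
    fwd = states[num_layers - 1]                   # top layer, forward direction
    bwd = states[2 * num_layers - 1]               # top layer, backward direction
    if isinstance(fwd, (list, tuple)):             # LSTM: concatenate h and c separately
        return [tf.concat([f, b], axis=-1) for f, b in zip(fwd, bwd)]
    return tf.concat([fwd, bwd], axis=-1)          # GRU: [batch_size, num_units]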