2018-NAACL- Bi-model based RNN S

Author: 君子有三畏 | Published: 2018-09-24 20:45

In this paper, new Bi-model based RNN semantic frame parsing network structures are designed to perform the intent detection and slot filling tasks jointly, by considering their cross-impact to each other using two correlated bidirectional LSTMs (BLSTM).

Abstract

Intent detection and slot filling are two main tasks for building a spoken language understanding (SLU) system.

Multiple deep learning based models have demonstrated good results on these tasks.

The most effective algorithms are based on sequence-to-sequence (or "encoder-decoder") models, and generate the intents and semantic tags using either separate models (Yao et al., 2014; Mesnil et al., 2015; Peng and Yao, 2015; Kurata et al., 2016; Hahn et al., 2011) or a joint model (Liu and Lane, 2016a; Hakkani-Tür et al., 2016; Guo et al., 2014).

Most of the previous studies, however, either treat the intent detection and slot filling as two separate parallel tasks, or use a sequence to sequence model to generate both semantic tags and intent.

Most of these approaches use one (joint) NN based model (including the encoder-decoder structure) to model the two tasks, and hence may not take full advantage of the cross-impact between them.

In this paper, new Bi-model based RNN semantic frame parsing network structures are designed to perform the intent detection and slot filling tasks jointly, by considering their cross-impact to each other using two correlated bidirectional LSTMs (BLSTM).

Our Bi-model structure with a decoder achieves state-of-the-art results on the benchmark ATIS data (Hemphill et al., 1990; Tur et al., 2010), with about 0.5% intent accuracy improvement and 0.9% slot filling improvement.

1 Introduction

The research on spoken language understanding (SLU) system has progressed extremely fast during the past decades.

Two important tasks in an SLU system are intent detection and slot filling.

These two tasks are normally considered as parallel tasks but may have cross-impact on each other.

Intent detection is treated as an utterance classification problem, which can be modeled using conventional classifiers including regression, support vector machines (SVMs) or even deep neural networks (Haffner et al., 2003; Sarikaya et al., 2011).

The slot filling task can be formulated as a sequence labeling problem; the most popular approaches with good performance use conditional random fields (CRFs) and, in more recent work, recurrent neural networks (RNNs) (Xu and Sarikaya, 2013).

Some works also suggest using one joint RNN model to generate the results of the two tasks together, by taking advantage of the sequence-to-sequence (Sutskever et al., 2014) (or encoder-decoder) model, which also gives decent results (Liu and Lane, 2016a).

In this paper, Bi-model based RNN structures are proposed to take the cross-impact between the two tasks into account, and hence further improve the performance of modeling an SLU system.

These models can generate the intent and semantic tags concurrently for each utterance.

In our Bi-model structures, two task-networks are built for the purpose of intent detection and slot filling.

Each task-network includes one BLSTM with or without an LSTM decoder (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005).

The paper is organized as follows: In section 2, a brief overview of existing deep learning approaches for intent detection and slot filling is given.

The newly proposed Bi-model based RNN approach is described in detail in section 3.

In section 4, two experiments on different datasets are presented.

One is performed on the ATIS benchmark dataset, in order to demonstrate a state-of-the-art result for both semantic parsing tasks.

The other is performed on our internal multi-domain dataset, comparing our new algorithm with the best-performing RNN based joint model in the literature for intent detection and slot filling.

2 Background

In this section, a brief background overview on using deep learning and RNN based approaches to perform intent detection and slot filling tasks is given.

The joint model algorithm is also discussed for later comparison.

2.1 Deep neural network for intent detection

Using deep neural networks for intent detection is similar to a standard classification problem; the only difference is that the classifier is trained under a specific domain.

For example, all data in the ATIS dataset is under the flight reservation domain with 18 different intent labels.

There are mainly two types of models that can be used: one is a feed-forward model that takes the average of all word vectors in an utterance as its input; the other is a recurrent neural network, which takes each word in an utterance as a vector one by one (Xu and Sarikaya, 2014).
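The feed-forward variant can be illustrated with a minimal sketch. It averages the word vectors of an utterance and applies a single linear layer followed by a softmax. The embeddings, weights, and function names below are hypothetical values chosen for illustration, not taken from the paper.

```python
import math


def average_vectors(vectors):
    """Element-wise mean of a list of equal-length word vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_intent(vectors, weights, biases):
    """Single linear layer plus softmax over the averaged utterance vector."""
    x = average_vectors(vectors)
    scores = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    return softmax(scores)


# Hypothetical 2-dimensional embeddings for a three-word utterance.
utterance = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
# Hypothetical parameters for two intent classes.
weights = [[1.0, 0.0], [0.0, -1.0]]
biases = [0.0, 0.0]

probs = classify_intent(utterance, weights, biases)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Because the input is an average, word order is lost; the recurrent variant keeps it by reading the words one at a time.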

2.2 Recurrent Neural network for slot filling

The slot filling task is a bit different from intent detection as there are multiple outputs for the task, hence only an RNN model is a feasible approach for this scenario.

The most straightforward way is to use a single RNN model that generates multiple semantic tags sequentially by reading in each word one by one (Liu and Lane, 2015; Mesnil et al., 2015; Peng and Yao, 2015).

This approach has the constraint that the number of slot tags generated must be the same as the number of words in the utterance.
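The one-tag-per-word constraint can be made concrete with a small sketch. The utterance and the IOB-style tags below are a hypothetical example written for illustration, and `align_tags` is a made-up helper name.

```python
def align_tags(words, tags):
    """Pair each word with its slot tag, enforcing one tag per word."""
    if len(words) != len(tags):
        raise ValueError(f"expected {len(words)} tags, got {len(tags)}")
    return list(zip(words, tags))


# Hypothetical utterance with illustrative IOB-style slot tags.
words = "show flights from boston to denver".split()
tags = ["O", "O", "O", "B-fromloc.city_name", "O", "B-toloc.city_name"]

aligned = align_tags(words, tags)
```

A tag sequence of any other length is rejected, which is exactly the restriction that an encoder-decoder model relaxes.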

One way to overcome this limitation is to use an encoder-decoder model containing two RNN models, an encoder for the input and a decoder for the output (Liu and Lane, 2016a).

The advantage of doing this is that it lets the system match an input utterance and output slot tags of different lengths without the need for alignment.

Besides using an RNN, it is also possible to use a convolutional neural network (CNN) together with a conditional random field (CRF) for the slot filling task (Xu and Sarikaya, 2013).

2.3 Joint model for two tasks

It is also possible to use one joint model for intent detection and slot filling (Guo et al., 2014; Liu and Lane, 2016a,b; Zhang and Wang, 2016; Hakkani-Tür et al., 2016). One way is to use one encoder with two decoders: the first decoder generates the sequential semantic tags and the second decoder generates the intent.

Another approach is to consolidate the hidden state information from an RNN slot filling model and then generate the intent using an attention model (Liu and Lane, 2016a).

Both approaches demonstrate very good results on the ATIS dataset.

3 Bi-model RNN structures for joint semantic frame parsing

Despite the success of RNN based sequence to sequence (or encoder-decoder) models on both tasks, most of the approaches in the literature still use one single RNN model for each task or for both tasks.

They treat intent detection and slot filling as two separate tasks.

In this section, two new Bi-model structures are proposed to take their cross-impact into account and hence further improve their performance.

One structure takes advantage of a decoder structure and the other does not.

An asynchronous training approach based on two models’ cost functions is designed to adapt to these new structures.

3.1 Bi-model RNN Structures

A graphical illustration of two Bi-model structures with and without a decoder is shown in Figure 1.

The two structures are quite similar to each other, except that Figure 1a contains an LSTM based decoder, hence there is an extra decoder state s_t to be cascaded besides the encoder state h_t.

Remarks:
The concept of using information from multiple models or multiple modalities to achieve better performance has been widely used in deep learning (Dean et al., 2012; Wang, 2017; Ngiam et al., 2011; Srivastava and Salakhutdinov, 2012), system identification (Murray-Smith and Johansen, 1997; Narendra et al., 2014, 2015) and, recently, also in reinforcement learning (Narendra et al., 2016; Wang and Jin, 2018).

Instead of using collective information, in this paper, our work introduces a totally new approach of training multiple neural networks asynchronously by sharing their internal state information.

3.1.1 Bi-model structure with a decoder

The Bi-model structure with a decoder is shown in Figure 1a.

There are two inter-connected bidirectional LSTMs (BLSTMs) in the structure, one is for intent detection and the other is for slot filling.

Each BLSTM reads in the input utterance sequence (x_1, x_2, · · ·, x_n) forward and backward, and generates two sequences of hidden states hf_t and hb_t.

A concatenation of hf_t and hb_t forms the final BLSTM state h_t = [hf_t, hb_t] at time step t.
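The forward and backward passes and their concatenation can be sketched as follows. To stay short and dependency-free, the sketch swaps the LSTM cell for a plain tanh recurrent cell; the weights, sizes, and function names are hypothetical.

```python
import math


def rnn_step(x, h, w_in, w_rec):
    """One step of a plain tanh recurrent cell (a stand-in for an LSTM cell)."""
    return [math.tanh(sum(wi * xi for wi, xi in zip(w_in[k], x)) +
                      sum(wr * hj for wr, hj in zip(w_rec[k], h)))
            for k in range(len(h))]


def run_direction(inputs, w_in, w_rec, hidden_size):
    """Run the cell over the inputs in the given order, collecting every state."""
    h = [0.0] * hidden_size
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_in, w_rec)
        states.append(h)
    return states


def bidirectional_states(inputs, fwd, bwd, hidden_size):
    """Return h_t = [hf_t, hb_t] for every time step t."""
    forward = run_direction(inputs, fwd[0], fwd[1], hidden_size)
    # Read the sequence right to left, then flip back so index t matches word t.
    backward = run_direction(inputs[::-1], bwd[0], bwd[1], hidden_size)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]


# Hypothetical toy parameters: 2-dimensional inputs, hidden size 2.
hidden_size = 2
fwd = ([[0.5, -0.5], [0.1, 0.2]], [[0.0, 0.1], [0.1, 0.0]])
bwd = ([[0.3, 0.3], [-0.2, 0.4]], [[0.1, 0.0], [0.0, 0.1]])
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

states = bidirectional_states(inputs, fwd, bwd, hidden_size)
```

Each entry of `states` has twice the hidden size, with the forward state in the first half and the backward state in the second half.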

Hence, our bidirectional LSTM fi(·) generates a sequence of hidden states (h^i_1, h^i_2, · · ·, h^i_n), where i = 1 corresponds to the network for the intent detection task and i = 2 to the network for the slot filling task.

To detect intent, the hidden state h^1_t is combined with h^2_t from the other bidirectional LSTM f2(·) in the slot filling task-network to generate s^1_t, the state of the decoder g1(·), at time step t:

s^1_t = \phi(s^1_{t-1}, h^1_{n-1}, h^2_{n-1})

y^1_{intent} = \arg\max_{\hat{y}^1_n}P(\hat{y}^1_n|s^1_{n-1}, h^1_{n-1}, h^2_{n-1}) \qquad \qquad (1)

where \hat{y}^1_n contains the predicted probabilities for all intent labels at the last time step n.

For the slot filling task, a similar network structure is constructed with a BLSTM f2(·) and an LSTM g2(·). f2(·) works the same way as f1(·), reading in a word sequence as its input.

The difference is that there will be an output y^2_t at each time step t for g2(·), as it is a sequence labeling problem.

At each step t:

s^2_t=\psi(h^2_{t-1}, h^1_{t-1}, s^2_{t-1}, y^2_{t-1})

y^2_t=\arg\max_{\hat{y}^2_t}P(\hat{y}^2_t|h^1_{t-1},h^2_{t-1},s^2_{t-1},y^2_{t-1}) \qquad\qquad (2)

where y^2_t is the predicted semantic tag at time step t.
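The information flow of the slot filling decoder can be sketched as a greedy decoding loop. This is an illustration under several assumptions: a single tanh layer stands in for the LSTM decoder, the encoder states are random placeholders, the encoder states used at each step are the ones aligned with the word being tagged, and all names and sizes are hypothetical.

```python
import math
import random


def linear(rows, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def decode_slots(h1, h2, w_state, w_out, state_size, num_tags):
    """Greedy tagging: each step reads both encoders, the previous decoder
    state and the previous tag, then emits the highest-scoring tag."""
    s = [0.0] * state_size        # initial decoder state
    prev_tag = [0.0] * num_tags   # one-hot of the previous tag, all zeros at the start
    tags = []
    for h1_t, h2_t in zip(h1, h2):
        features = h2_t + h1_t + s + prev_tag
        s = [math.tanh(v) for v in linear(w_state, features)]
        scores = linear(w_out, s)
        # The argmax of the scores equals the argmax of their softmax.
        best = max(range(num_tags), key=scores.__getitem__)
        tags.append(best)
        prev_tag = [1.0 if k == best else 0.0 for k in range(num_tags)]
    return tags


# Hypothetical sizes and random placeholder values.
rng = random.Random(0)
hidden_size, state_size, num_tags, length = 4, 3, 5, 6
feature_size = 2 * hidden_size + state_size + num_tags
h1 = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(length)]
h2 = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(length)]
w_state = [[rng.uniform(-1, 1) for _ in range(feature_size)] for _ in range(state_size)]
w_out = [[rng.uniform(-1, 1) for _ in range(state_size)] for _ in range(num_tags)]

slot_tags = decode_slots(h1, h2, w_state, w_out, state_size, num_tags)
```

The point of the sketch is the feature vector: the slot decoder sees the intent network's encoder state at every step, which is how the cross-impact between the two tasks enters the prediction.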

3.1.2 Bi-model structure without a decoder

The Bi-model structure without a decoder is shown in Figure 1b.

In this model, there is no LSTM decoder as in the previous model.

For the intent task, only one predicted output label y^1_{intent} is generated from BLSTM f1(·) at the last time step n, where n is the length of the utterance.

Similarly, the state value h^1_t and the output intent label are generated as:

h^1_t = \phi(h^1_{t-1}, h^2_{t-1})

y^1_{intent}=\arg\max_{\hat{y}^1_n}P(\hat{y}^1_n|h^1_{n-1},h^2_{n-1}) \qquad \qquad (3)

For the slot filling task, the basic structure of BLSTM f2(·) is similar to that of f1(·) for the intent detection task, except that one slot tag label y^2_t is generated at each time step t.

It also takes the hidden states from the two BLSTMs f1(·) and f2(·), i.e. h^1_{t-1} and h^2_{t-1}, together with the output tag y^2_{t-1}, to generate its next state value h^2_t and the slot tag y^2_t. Written as a function:

h^2_t= \psi(h^1_{t-1},h^2_{t-1},y^2_{t-1})

y^2_t=\arg\max_{\hat{y}^2_t}P(\hat{y}^2_t|h^1_{t-1},h^2_{t-1},y^2_{t-1}) \qquad\qquad (4)

3.1.3 Asynchronous training

One of the major differences in the Bi-model structure is its asynchronous training, which trains two task-networks based on their own cost functions in an asynchronous manner.

The loss function for the intent detection task-network is \mathcal{L}_1, and for slot filling it is \mathcal{L}_2. \mathcal{L}_1 and \mathcal{L}_2 are defined using cross entropy as:

\mathcal{L}_1 \triangleq -\sum_{i=1}^k \hat{y}^{1,i}_{intent} \log(y^{1,i}_{intent}) \qquad (5)
\mathcal{L}_2 \triangleq -\sum_{j=1}^n \sum_{i=1}^m \hat{y}^{2,i}_{j} \log(y^{2,i}_{j}) \qquad (6)

where k is the number of intent label types, m is the number of semantic tag types and n is the number of words in a word sequence.
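The two cross-entropy losses can be computed as in the sketch below. It assumes that the term inside the logarithm is the model's probability and that the other factor is the one-hot reference label; the hat notation in the formulas leaves this open, so treat that reading as an assumption. The function names and numbers are hypothetical.

```python
import math


def cross_entropy(reference, predicted):
    """Cross entropy between a one-hot reference and predicted probabilities.
    Terms with a zero reference weight are skipped, which avoids log(0)."""
    return -sum(r * math.log(p) for r, p in zip(reference, predicted) if r > 0.0)


def intent_loss(reference, predicted):
    """Loss of the intent network: one distribution over k intent labels."""
    return cross_entropy(reference, predicted)


def slot_loss(references, predictions):
    """Loss of the slot network: sum over the n words of the per-word
    cross entropy over m tag types."""
    return sum(cross_entropy(r, p) for r, p in zip(references, predictions))


# Hypothetical example: k = 3 intents, n = 2 words, m = 2 tag types.
l1 = intent_loss([0.0, 1.0, 0.0], [0.2, 0.7, 0.1])
l2 = slot_loss([[1.0, 0.0], [0.0, 1.0]],
               [[0.9, 0.1], [0.4, 0.6]])
```

The two values are kept separate on purpose: each task-network is trained on its own loss instead of on a weighted sum.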

In each training iteration, both the intent detection and slot filling networks generate a group of hidden states h^1 and h^2 from the models of the previous iteration.

The intent detection task-network reads in a batch of input data x_i and the hidden states h^2, and generates the estimated intent labels \hat{y}^1_{intent}.

The intent detection task-network computes its cost based on function \mathcal{L}_1 and is trained on it.

Then the same batch of data x_i is fed into the slot filling task-network together with the hidden state h^1 from the intent task-network, which generates a batch of outputs y^2_i for each time step.

Its cost value is then computed based on cost function \mathcal{L}_2, and the network is trained on it.
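The alternating update schedule can be sketched with two one-unit toy networks. This is a heavily simplified illustration, not the paper's implementation: each network is a scalar tanh unit with a sigmoid output, gradients are estimated by finite differences instead of backpropagation, and the data and all names are hypothetical. What it keeps is the schedule: each network is updated only on its own loss and treats the other network's hidden state as a fixed input.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden(params, x, other_h):
    """Hidden state of a one-unit network that also reads the other network's state."""
    w_x, w_h, _ = params
    return math.tanh(w_x * x + w_h * other_h)


def loss(params, x, other_h, target):
    """Binary cross entropy of this network's output for one example."""
    p = sigmoid(params[2] * hidden(params, x, other_h))
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def sgd_step(params, x, other_h, target, lr=0.5, eps=1e-5):
    """One gradient step on this network's own loss (central differences)."""
    updated = []
    for k, value in enumerate(params):
        up = list(params)
        up[k] = value + eps
        down = list(params)
        down[k] = value - eps
        grad = (loss(up, x, other_h, target) - loss(down, x, other_h, target)) / (2 * eps)
        updated.append(value - lr * grad)
    return updated


def train(data, iterations=50):
    intent_params = [0.1, 0.1, 0.1]
    slot_params = [0.1, 0.1, 0.1]
    h1 = [0.0] * len(data)  # intent hidden state per example, kept across iterations
    history = []
    for _ in range(iterations):
        epoch_loss = 0.0
        for i, (x, intent_target, slot_target) in enumerate(data):
            # Slot hidden state from the slot model as it currently stands.
            h2 = hidden(slot_params, x, h1[i])
            # Intent network: its own loss only, h2 is a constant input.
            epoch_loss += loss(intent_params, x, h2, intent_target)
            intent_params = sgd_step(intent_params, x, h2, intent_target)
            # Slot network: its own loss only, h1 is a constant input.
            h1[i] = hidden(intent_params, x, h2)
            epoch_loss += loss(slot_params, x, h1[i], slot_target)
            slot_params = sgd_step(slot_params, x, h1[i], slot_target)
        history.append(epoch_loss)
    return intent_params, slot_params, history


# Hypothetical toy data: (input, intent target, slot target).
data = [(1.0, 1.0, 1.0), (-1.0, 0.0, 0.0)]
intent_params, slot_params, history = train(data)
```

No gradient flows from one loss into the other network's parameters; the only coupling is through the shared hidden states.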

The asynchronous training approach is used because it is important to keep two separate cost functions for the different tasks. Doing this has two main advantages:

  1. It filters the negative impact between two tasks in comparison to using only one joint model, by capturing more useful information and overcoming the structural limitation of one model.

  2. The cross-impact between two tasks can only be learned by sharing hidden states of two models, which are trained using two cost functions separately.



