Pointer Networks

Author: MasterXiong | Published: 2019-07-12 15:06

    Pointer Networks

    Oriol Vinyals, Meire Fortunato, Navdeep Jaitly
    Google, Berkeley
    NIPS 2015

    Introduction

    The key motivation and contribution of this work is the pointer network, a framework that can solve discrete mapping problems whose output dictionary size varies across instances.

    Our model solves the problem of variable size output dictionaries using a recently proposed mechanism of neural attention.

    The method is to

    use attention as a pointer to select a member of the input sequence as the output.

    This work provides a new approach to solving discrete optimization problems with sequence models.

    Problem Setup

    This paper focuses on solving a specific type of sequence-to-sequence task in a supervised learning setting. In the training data, the inputs are planar point sets \mathcal{P} = \{ P_1, \cdots , P_n \} with n elements each, where P_i = (x_i, y_i) are the Cartesian coordinates of the points. The training instances are sampled from a uniform distribution over [0, 1] \times [0, 1]. The outputs \mathcal{C}^{\mathcal{P}} = \{ C_1, \cdots , C_{m(\mathcal{P})} \} are sequences of point indices representing the solution associated with the point set \mathcal{P}.
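
    A minimal sketch of how one such training instance could be sampled (illustrative, not the authors' data pipeline):

    ```python
    # Hypothetical sampling of a single training instance (not the paper's code):
    # n planar points drawn uniformly from the unit square [0, 1] x [0, 1].
    import numpy as np

    def sample_instance(n: int, rng: np.random.Generator) -> np.ndarray:
        return rng.uniform(0.0, 1.0, size=(n, 2))  # row i holds P_i = (x_i, y_i)

    rng = np.random.default_rng(0)
    P = sample_instance(5, rng)
    print(P.shape)  # (5, 2)
    ```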

    Models

    Sequence-to-Sequence Model

    This model learns the parameters \theta of an encoder-decoder to maximize the conditional probability of the output sequence over the training samples:
    \theta^* = \mathop{\arg\max}_{\theta} \sum_{\mathcal{P}, \mathcal{C}^{\mathcal{P}}} \log{p(\mathcal{C}^{\mathcal{P}} | \mathcal{P}; \theta)} ,
    where
    p(\mathcal{C}^{\mathcal{P}} | \mathcal{P}; \theta) = \prod_{i = 1}^{m(\mathcal{P})} p_{\theta} (C_i | C_1, \cdots , C_{i - 1}, \mathcal{P}) .
    Note that this model makes no statistical independence assumptions: each symbol is conditioned on the entire output history rather than only the previous symbol, so the output process is not a Markov chain.
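
    In practice this objective is just a sum of per-step cross-entropy terms over the target sequence. A minimal sketch, assuming the decoder exposes its per-step logits (names here are illustrative):

    ```python
    # Minimal sketch of the training loss for one instance (illustrative names;
    # assumes the decoder produces a logit vector over the dictionary per step).
    import torch
    import torch.nn.functional as F

    def sequence_nll(step_logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        # step_logits: (m, dict_size) decoder logits; targets: (m,) indices C_1..C_m.
        # Returns -log p(C^P | P; theta), the term minimized for this instance.
        return F.cross_entropy(step_logits, targets, reduction="sum")
    ```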

    During inference, as the number of possible sequences grows exponentially with the sequence length, beam search is used to find the best candidate sequence.
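
    A compact sketch of how such a beam search could look (illustrative; `step_fn` is an assumed callback returning per-step log-probabilities, not part of the paper):

    ```python
    # Illustrative beam search over per-step log-probabilities.
    # `step_fn(prefix)` is an assumed callback returning log p(. | prefix, P)
    # as a 1-D tensor over the dictionary; `eos` marks end of sequence.
    import torch

    def beam_search(step_fn, eos: int, beam_width: int, max_len: int):
        beams = [([], 0.0)]                      # (prefix, cumulative log-prob)
        for _ in range(max_len):
            candidates = []
            for prefix, score in beams:
                if prefix and prefix[-1] == eos:
                    candidates.append((prefix, score))   # finished beam
                    continue
                log_probs = step_fn(prefix)
                top = torch.topk(log_probs, beam_width)
                for lp, idx in zip(top.values.tolist(), top.indices.tolist()):
                    candidates.append((prefix + [idx], score + lp))
            beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
            if all(p and p[-1] == eos for p, _ in beams):
                break
        return beams[0][0]
    ```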

    Notice that in the standard sequence-to-sequence model, the output dictionary size for all symbols C_i is fixed and equal to n, so a model trained for a particular sequence length cannot generalize to sequences of a different length.

    Content Based Input Attention

    The vanilla sequence-to-sequence model only uses the final state of the encoder to represent the whole input sequence, which constrains the amount of information and computation that can flow through to the decoder. The attention model augments the decoder RNN with an attention module over the encoder states to provide further information:
    u_j^i = v^T \tanh(W_1 e_j + W_2 d_i) , \quad j \in (1, \cdots, n) ,
    a^i = \mathrm{softmax}(u^i) ,
    d'_i = \sum_{j = 1}^n a_j^i e_j ,
    where e_j and d_i are the encoder and decoder hidden states respectively. d'_i and d_i are concatenated and used as the hidden state for prediction and as input to the next time step.
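
    A minimal sketch of this attention step for a single decoder state (single instance, no batching; the module name and dimensions are illustrative):

    ```python
    # Minimal sketch of the content-based attention step for one decoder state.
    import torch
    import torch.nn as nn

    class Attention(nn.Module):
        def __init__(self, hidden: int):
            super().__init__()
            self.W1 = nn.Linear(hidden, hidden, bias=False)  # applied to encoder states e_j
            self.W2 = nn.Linear(hidden, hidden, bias=False)  # applied to decoder state d_i
            self.v = nn.Linear(hidden, 1, bias=False)

        def forward(self, e: torch.Tensor, d_i: torch.Tensor) -> torch.Tensor:
            # e: (n, hidden) encoder states; d_i: (hidden,) current decoder state
            u = self.v(torch.tanh(self.W1(e) + self.W2(d_i))).squeeze(-1)  # (n,) scores
            a = torch.softmax(u, dim=0)                                    # attention weights a^i
            d_prime = a @ e                                                # (hidden,) context d'_i
            return torch.cat([d_i, d_prime])  # concatenated state used for prediction
    ```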

    Pointer Network

    The pointer network simplifies the attention mechanism by directly normalizing the vector u^i into an output distribution over the input positions, which guarantees that the output dictionary size always matches the input length:
    p(C_i | C_1, \cdots , C_{i - 1}, \mathcal{P}) = \mathrm{softmax}(u^i) .
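
    The only change relative to the attention sketch above is that u^i is no longer used to blend the encoder states into d'_i; it is itself normalized into the output distribution over the n input positions. A minimal sketch, reusing the `Attention` module from the previous block:

    ```python
    # Minimal sketch of the pointer step: the scores u^i directly become the
    # output distribution over input positions (reuses W1, W2, v from `Attention`).
    import torch

    def pointer_distribution(attn, e: torch.Tensor, d_i: torch.Tensor) -> torch.Tensor:
        u = attn.v(torch.tanh(attn.W1(e) + attn.W2(d_i))).squeeze(-1)  # (n,)
        return torch.softmax(u, dim=0)  # p(C_i = j | C_1, ..., C_{i-1}, P) over j = 1..n
    ```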

    Experiments

    The paper experiments on three different problems, i.e. convex hull, Delaunay triangulation, and the Travelling Salesman Problem (TSP), all of which involve finding a combinatorial solution over a discrete input set.

    (The output is actually a cycle or a set in these problems, which means that any point in the solution could serve as the start of the decoder sequence. An RNN decoder cannot reflect this invariance, so the authors had to artificially fix a start point for the output sequence in the experimental setup.)

    Convex Hull

    • Instance representation: The elements C_i are indices between 1 and n corresponding to positions in the sequence P. To represent the output as a sequence, start from the point with the lowest index and go counter-clockwise (see the sketch below).
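
    A minimal sketch of how such training targets could be generated with scipy (hypothetical; the paper does not describe its tooling):

    ```python
    # Hypothetical target generation for the convex hull task (not the paper's code).
    # scipy's 2-D ConvexHull returns vertices in counter-clockwise order already;
    # we rotate the cycle so it starts at the point with the lowest index.
    import numpy as np
    from scipy.spatial import ConvexHull

    def convex_hull_target(points: np.ndarray) -> list:
        verts = ConvexHull(points).vertices        # counter-clockwise vertex indices
        start = int(np.argmin(verts))              # position of the lowest index
        verts = np.roll(verts, -start)
        return [int(v) + 1 for v in verts]         # 1-based indices, as in the paper

    points = np.random.uniform(0.0, 1.0, size=(10, 2))
    print(convex_hull_target(points))
    ```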

    Delaunay Triangulation

    • Instance representation: The outputs \mathcal{C}^{\mathcal{P}} = \{ C_1, \cdots , C_{m(\mathcal{P})} \} are the corresponding sequences representing the triangulation of the point set \mathcal{P}. Each C_i is a triple of integers from 1 to n corresponding to the positions of the triangle vertices in \mathcal{P} (a label-generation sketch follows below).
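
    Similarly, a hypothetical sketch of generating the triangulation targets with scipy:

    ```python
    # Hypothetical target generation for the triangulation task (not the paper's
    # code); each row of `simplices` is one triple of triangle vertex indices.
    import numpy as np
    from scipy.spatial import Delaunay

    points = np.random.uniform(0.0, 1.0, size=(10, 2))
    triangles = Delaunay(points).simplices + 1   # 1-based triples, as in the paper
    print(triangles.shape)                       # (num_triangles, 3)
    ```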

    TSP

    • Instance representation: For consistency, the output tour in the training dataset always starts from the first city, without loss of generality.

    Generally speaking, the pointer network can work across different sequence lengths and performs relatively well for small instance sizes. However, as it uses 1M training instances for each task, all uniformly sampled in the unit square, I doubt whether it is actually learning the underlying algorithm rather than merely fitting the training distribution.
