# Summary of papers in VLN

> Created by Binghui Xie

[TOC]

## Task


VLN requires an autonomous agent to follow a natural language navigation instruction and navigate to a target location in a previously unseen real-world building. Thus, the main difference between VLN and related tasks is that the agent must navigate in unseen environments given previously unseen natural-language navigation commands.

## Interpreting visually-grounded navigation instructions in real environments

The paper presents the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery. With this simulator, the authors also provide the first benchmark dataset for visually-grounded natural language navigation in real buildings, the Room-to-Room (R2R) dataset. In addition, the paper applies sequence-to-sequence neural networks to the R2R dataset and establishes several baselines.

### Related work

**Navigation and language**: Compared with VLN, work in this field involves less visual perception, meaning the agents in these tasks either do not need to understand images or only require a low degree of visual perception.

**Vision and language**: This field includes VQA, image captioning, and visual dialog. However, none of these tasks involves an agent that must follow natural language navigation instructions in order to move.

**Navigation-based simulators**: The Matterport3D Simulator proposed in the paper offers much more visual diversity and richness than earlier simulators.

**RL in navigation**: These tasks are visually and linguistically less complex than VLN.

### Matterport3D Simulator

The simulator is based on the Matterport3D dataset, which consists of 10,800 panoramic views constructed from 194,400 RGB-D images of 90 building-scale scenes. Agent poses are defined in terms of 3D position $v \in V$, heading $\psi \in [0, 2\pi)$, and camera elevation $\theta \in [-\pi/2, \pi/2]$, where $V$ is the set of 3D points associated with panoramic viewpoints in the scene. At each step $t$, the simulator outputs an RGB image observation $o_t$ corresponding to the agent's first-person camera view, together with a set of next-step reachable viewpoints $W_{t+1} \subseteq V$. Agents interact with the simulator by selecting a new viewpoint $v_{t+1} \in W_{t+1}$ and nominating camera heading ($\Delta\psi_{t+1}$) and elevation ($\Delta\theta_{t+1}$) adjustments.
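
To make the state and interaction model concrete, here is a minimal Python sketch of what such a simulator interface could look like; the class and method names are hypothetical and are not the actual Matterport3D Simulator API.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class AgentState:
    viewpoint_id: str   # 3D position v in V, identified by a panoramic viewpoint
    heading: float      # psi in [0, 2*pi)
    elevation: float    # theta in [-pi/2, pi/2]


@dataclass
class Observation:
    rgb: np.ndarray          # first-person RGB image o_t
    state: AgentState        # current pose s_t
    reachable: List[str]     # next-step reachable viewpoints W_{t+1}


class SimulatorSketch:
    """Hypothetical interface mirroring the description above (not the real API)."""

    def reset(self, scan_id: str, start: AgentState) -> Observation:
        """Load a building scan and place the agent at the start pose."""
        ...

    def step(self, next_viewpoint: str,
             delta_heading: float, delta_elevation: float) -> Observation:
        """Move to a reachable viewpoint and adjust camera heading/elevation."""
        ...
```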

### Task

Formally, at the beginning of each episode the agent is given as input a natural language instruction $\hat{x} = \langle x_1, x_2, \ldots, x_L \rangle$, where $L$ is the length of the instruction and $x_i$ is a single word token. The agent observes an initial RGB image $o_0$, determined by the agent's initial pose, a tuple of 3D position, heading, and elevation $s_0 = \langle v_0, \psi_0, \theta_0 \rangle$. The agent must execute a sequence of actions $\langle s_0, a_0, s_1, a_1, \ldots, s_T, a_T \rangle$, with each action $a_t$ leading to a new pose $s_{t+1} = \langle v_{t+1}, \psi_{t+1}, \theta_{t+1} \rangle$ and generating a new image observation $o_{t+1}$. The episode ends when the agent selects the special stop action.
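
Using the hypothetical interface sketched above, one episode could be driven roughly as follows; the `agent.act` method, its return fields, and the step limit are assumptions made for illustration.

```python
def run_episode(sim, agent, instruction, scan_id, start_state, max_steps=30):
    """Roll out one episode: the agent acts until it emits STOP or hits max_steps."""
    obs = sim.reset(scan_id, start_state)         # o_0, s_0
    trajectory = [obs.state]
    for t in range(max_steps):
        action = agent.act(instruction, obs)      # a_t (hypothetical agent interface)
        if action.is_stop:                        # the special stop action ends the episode
            break
        obs = sim.step(action.viewpoint, action.delta_heading, action.delta_elevation)
        trajectory.append(obs.state)              # s_{t+1}
    return trajectory
```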

### Evaluation Protocol

The paper defines navigation error as the shortest-path distance in the navigation graph $G$ between the agent's final position $v_T$ and the goal location $v^*$. An episode is considered a success if the navigation error is less than 3 m. The paper also treats stopping as a fundamental aspect of completing the task: it demonstrates understanding, and it frees the agent to potentially undertake further tasks at the goal.
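
A small sketch of how navigation error and success rate could be computed, assuming the navigation graph is stored as a networkx graph whose edges carry a `distance` attribute in metres (both are assumptions, not the paper's code):

```python
import networkx as nx

SUCCESS_THRESHOLD_M = 3.0   # an episode succeeds if the error is under 3 m


def navigation_error(graph: nx.Graph, final_viewpoint: str, goal_viewpoint: str) -> float:
    """Shortest-path distance (in metres) between the final position and the goal."""
    return nx.dijkstra_path_length(graph, final_viewpoint, goal_viewpoint, weight="distance")


def success_rate(graph: nx.Graph, episodes) -> float:
    """Fraction of (final, goal) pairs whose navigation error is below the threshold."""
    errors = [navigation_error(graph, final, goal) for final, goal in episodes]
    return sum(e < SUCCESS_THRESHOLD_M for e in errors) / len(errors)
```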

### Sequence-to-Sequence Model

The agent is modeled as a recurrent neural network policy using an LSTM-based sequence-to-sequence architecture with an attention mechanism. Recall that the agent begins with a natural language instruction $\hat{x} = \langle x_1, x_2, \ldots, x_L \rangle$ and an initial image observation $o_0$. The encoder computes a representation of $\hat{x}$. At each step $t$, the decoder takes representations of the current image $o_t$ and the previous action $a_{t-1}$ as input, applies an attention mechanism to the hidden states of the language encoder, and predicts a distribution over the next action $a_t$.
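
A rough PyTorch sketch of such an instruction encoder; the embedding and hidden sizes are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class InstructionEncoder(nn.Module):
    """Embed instruction tokens x_1..x_L and encode them with an LSTM."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, L) word indices
        embedded = self.embedding(tokens)            # (batch, L, embed_dim)
        outputs, (h_n, c_n) = self.lstm(embedded)    # per-token hidden states for attention
        return outputs, (h_n, c_n)
```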

For each image observation $o_t$, a ResNet-152 is used to extract a mean-pooled feature vector. Analogously to the embedding of instruction words, an embedding is learned for each action. The encoded image and previous action features are then concatenated to form a single vector $q_t$. The decoder LSTM operates as $h'_t = \text{LSTM}(q_t, h'_{t-1})$.
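
A sketch of one decoder step under these definitions, assuming a 2048-dimensional mean-pooled ResNet-152 feature and an illustrative action-embedding size:

```python
import torch
import torch.nn as nn


class DecoderStep(nn.Module):
    """One step of the action decoder: h'_t = LSTM(q_t, h'_{t-1})."""

    def __init__(self, num_actions, img_feat_dim=2048, act_embed_dim=32, hidden_dim=512):
        super().__init__()
        self.action_embedding = nn.Embedding(num_actions, act_embed_dim)
        self.lstm_cell = nn.LSTMCell(img_feat_dim + act_embed_dim, hidden_dim)

    def forward(self, img_feat, prev_action, state):
        # img_feat: (batch, 2048) mean-pooled ResNet-152 feature of o_t
        # prev_action: (batch,) index of a_{t-1};  state: (h'_{t-1}, c_{t-1})
        q_t = torch.cat([img_feat, self.action_embedding(prev_action)], dim=-1)
        h_t, c_t = self.lstm_cell(q_t, state)
        return h_t, c_t
```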

**Action prediction with attention mechanism**: To predict a distribution over actions at step $t$, an attention mechanism is first used to identify the most relevant parts of the navigation instruction. This is achieved by using the global, general alignment function to compute an instruction context $c_t = f(h'_t, \hat{h})$. An attentional hidden state is then computed as $\tilde{h}_t = \tanh(W_c [c_t; h'_t])$, and the predictive distribution over the next action is calculated as $a_t = \text{softmax}(\tilde{h}_t)$. The paper does not make use of visual attention.
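
A PyTorch sketch of this attention-based action prediction (Luong-style "general" alignment); the output projection `w_o` is an assumption added so that the softmax yields a distribution over the action vocabulary, and all layer sizes are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionActionPredictor(nn.Module):
    """c_t from general attention over encoder states, then h~_t = tanh(W_c [c_t; h'_t])."""

    def __init__(self, hidden_dim, num_actions):
        super().__init__()
        self.w_a = nn.Linear(hidden_dim, hidden_dim, bias=False)    # general alignment
        self.w_c = nn.Linear(2 * hidden_dim, hidden_dim, bias=False)
        self.w_o = nn.Linear(hidden_dim, num_actions)               # assumed output projection

    def forward(self, h_t, encoder_states):
        # h_t: (batch, hidden) decoder state;  encoder_states: (batch, L, hidden)
        scores = torch.bmm(encoder_states, self.w_a(h_t).unsqueeze(2)).squeeze(2)  # (batch, L)
        alpha = F.softmax(scores, dim=-1)
        c_t = torch.bmm(alpha.unsqueeze(1), encoder_states).squeeze(1)    # instruction context
        h_tilde = torch.tanh(self.w_c(torch.cat([c_t, h_t], dim=-1)))    # attentional hidden state
        return F.softmax(self.w_o(h_tilde), dim=-1)                       # distribution over a_t
```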

### Training

    with "student-forcing" approach , at each step the next action is sampled from the agent’s output probability distribution.

### Result

The sequence-to-sequence model achieves 38.6% success in seen environments vs. 21.8% success in unseen environments with the "student-forcing" approach.
