LEARNING TO REINFORCEMENT LEARN

Author: 初七123 | Published 2018-06-25 08:34

Introduction

However, there are at least two aspects of human performance that they starkly lack. First, deep RL typically requires a massive volume of training data, whereas human learners can attain reasonable performance on any of a wide range of tasks with comparatively little experience.
One shortcoming of deep RL is that it requires a massive volume of training samples.

Second, deep RL systems typically specialize on one restricted task domain, whereas human learners can flexibly adapt to changing task conditions.
Second, deep RL systems are typically confined to a single, narrow task domain.

Methods

META-LEARNING IN RECURRENT NEURAL NETWORKS

Meta-learning is then defined as an effect whereby the agent improves its performance in each new task more rapidly, on average, than in past tasks (Thrun and Pratt, 1998).
Meta-learning is defined as the effect whereby, on average, the agent learns each new task more rapidly than it learned past tasks.

A critical aspect of their setup is that the network receives, on each step within a task, an auxiliary input indicating the target output for the preceding step.
For example, in a regression task, on each step the network receives as input an x value for which it is desired to output the corresponding y, but the network also receives an input disclosing the target y value for the preceding step (see Hochreiter et al., 2001; Santoro et al., 2016).
The basic meta-learning setup of Hochreiter et al. (2001) feeds the target output of the preceding step as an auxiliary input at every step; in a regression task, for example, the previous sample's y value is supplied as part of the input for the next prediction.
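As a concrete illustration, here is a minimal sketch of how the per-step input can be assembled for such a meta-learner; the helper name and the linear task family are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def make_meta_inputs(xs, ys):
    """At step t the network sees x_t together with the previous step's
    target y_{t-1} (zero on the first step), as in Hochreiter et al. (2001)."""
    prev_y = np.concatenate([[0.0], ys[:-1]])   # targets shifted back by one step
    return np.stack([xs, prev_y], axis=-1)      # shape (T, 2), fed step by step to an RNN

# One regression task drawn from an assumed task family y = a*x + b
rng = np.random.default_rng(0)
a, b = rng.normal(size=2)
xs = rng.uniform(-1.0, 1.0, size=20)
ys = a * xs + b
inputs = make_meta_inputs(xs, ys)               # the RNN must infer (a, b) from this context
```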

Indeed, after an initial training period, the network can improve its performance on new tasks even if the weights are held constant (see also Cotter and Conwell, 1990; Prokhorov et al., 2002; Younger et al., 1999). A second important aspect of the approach is that the learning procedure implemented in the recurrent network is fit to the structure that spans the family of tasks on which the network is trained, embedding biases that allow it to learn efficiently when dealing with tasks from that family.
The learning procedure implemented by the recurrent network is fit to the structure shared across the family of training tasks, embedding biases that let it learn new tasks from that family more quickly.

DEEP META-RL: DEFINITION AND KEY FEATURES

The work described above addresses only the supervised-learning setting.

Critically, as in the supervised case, the learned RL procedure will be fit to the statistics spanning the multi-task environment, allowing it to adapt rapidly to new task instances.
Meta-RL fits the statistics of the multi-task environment and can adapt rapidly to new task instances.

FORMALISM

An appropriately structured agent, embedding a recurrent neural network, is trained by interacting with a sequence of MDP environments (also called tasks) through episodes.
An agent embedding a recurrent neural network is trained by interacting with a sequence of MDPs (tasks).

Since the policy learned by the agent is history-dependent (as it makes use of a recurrent network), when exposed to any new MDP environment, it is able to adapt and deploy a strategy that optimizes rewards for that task.
Because the learned policy is history-dependent (it relies on a recurrent network), the agent can adapt to a new MDP and deploy a reward-optimizing strategy for it.
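A minimal sketch of such an agent, assuming an LSTM core and a small discrete action space; the class name, layer sizes, and sampling scheme are illustrative rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class RecurrentAgent(nn.Module):
    """Recurrent policy whose per-step input combines the current
    observation with the previous reward and previous action."""

    def __init__(self, obs_dim, n_actions, hidden=48):
        super().__init__()
        self.n_actions = n_actions
        self.core = nn.LSTMCell(obs_dim + 1 + n_actions, hidden)
        self.policy = nn.Linear(hidden, n_actions)

    def act(self, obs, prev_reward, prev_action, state):
        prev_a = torch.zeros(1, self.n_actions)
        prev_a[0, prev_action] = 1.0
        x = torch.cat([obs.view(1, -1), torch.tensor([[prev_reward]]), prev_a], dim=-1)
        h, c = self.core(x, state)                       # hidden state carries the history
        action = torch.distributions.Categorical(logits=self.policy(h)).sample()
        return int(action), (h, c)

# The hidden state is initialized at the start of each episode (each newly sampled MDP)
# and carried across steps within it, which is what makes the policy history-dependent.
agent = RecurrentAgent(obs_dim=1, n_actions=2)
state = (torch.zeros(1, 48), torch.zeros(1, 48))
action, state = agent.act(torch.zeros(1), prev_reward=0.0, prev_action=0, state=state)
```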

Experiment

Notation used in the experiments:

  • x: observation (state)
  • r: reward
  • a: action
  • t: time step

BANDIT PROBLEMS

The multi-armed bandit problem: imagine several slot machines placed side by side in front of us. We number them, and in each round we choose one machine to pull and record the reward it pays out. Assuming the machines are not all identical, after many rounds we can estimate partial statistics for each machine and then favor the one that appears to pay the most. In the multi-armed bandit setting, each slot machine is called an arm.

Two questions arise here:

How are the rewards generated?

  • Stochastic bandit: each arm's reward follows some fixed probability distribution.
  • Adversarial bandit: the casino owner plays against you and adjusts the arms' rewards dynamically, e.g. lowering the reward of the arm you pick while raising the rewards of the arms you did not. Note that the owner cannot (and will not) drive every arm's reward to zero, because then we could earn nothing and every strategy would be indistinguishable.
  • Markovian bandit: each arm's reward is governed by a Markov chain.

How do we measure how good a policy is?
Judging a policy simply by its total reward is impractical, so we instead use regret as the measure of quality: the difference between the best total reward we could ideally have achieved and the total reward we actually obtained (see the formula below).
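In symbols (a standard formulation, not quoted from the paper): if $\mu^{*}$ is the expected reward of the best arm and $r_t$ the reward received on pull $t$, the regret after $T$ pulls is

$$ R_T = T\mu^{*} - \mathbb{E}\Big[\sum_{t=1}^{T} r_t\Big] $$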

BANDIT WITH INDEPENDENT ARMS
At the beginning of each episode, a new bandit task is sampled and held constant for 100 trials.
Training lasted for 20,000 episodes. The network is given as input the last reward, last action taken, and the trial number t, subsequently producing the action for the next trial t + 1 (Figure 1).
After training, we evaluated on 300 new episodes with the learning rate set to zero (the learned policy is fixed).
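A rough sketch of this protocol for a two-armed Bernoulli bandit; drawing the arm probabilities uniformly at random and the helper names are assumptions on my part, and a random policy stands in for the trained network:

```python
import numpy as np

rng = np.random.default_rng(1)
N_TRIALS = 100

def run_episode(select_action, n_arms=2):
    """select_action maps (prev_reward, prev_action_onehot, trial) -> arm index."""
    probs = rng.uniform(0.0, 1.0, size=n_arms)    # a new bandit task per episode
    prev_r, prev_a = 0.0, np.zeros(n_arms)
    total = 0.0
    for t in range(N_TRIALS):
        arm = select_action(prev_r, prev_a, t)    # network input as described above
        r = float(rng.random() < probs[arm])      # Bernoulli reward
        prev_r, prev_a = r, np.eye(n_arms)[arm]
        total += r
    return total

# Evaluation: the trained weights are frozen and the policy is run on 300 fresh episodes;
# a uniformly random placeholder policy is used here instead of the network.
random_policy = lambda prev_r, prev_a, t: int(rng.integers(2))
returns = [run_episode(random_policy) for _ in range(300)]
```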

BANDITS WITH DEPENDENT ARMS (I)
We consider Bernoulli distributions where the parameters (p1, p2) of the two arms are correlated in the sense that p1 = 1 − p2.
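For instance, one such correlated task could be sampled roughly as follows (a sketch; only the constraint p1 = 1 − p2 comes from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
p1 = rng.uniform(0.0, 1.0)
p2 = 1.0 - p1                # the two success probabilities are perfectly anti-correlated
probs = np.array([p1, p2])   # pulling one arm is therefore informative about the other
```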

BANDITS WITH DEPENDENT ARMS (II)
In this experiment, the agent was trained on 11-armed bandits with strong dependencies between arms.

Arm a11 was always “informative”, in that the target arm was indexed by 10 times a11’s reward (e.g. a reward of 0.2 on a11 indicated that a2 was the target arm, whose reward was 5).
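A sketch of that payout structure; the informative-arm encoding and the target reward of 5 follow the description above, while the payout of 1 for the remaining arms is an assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
target = int(rng.integers(1, 11))       # target arm index in 1..10, resampled each episode

def reward(arm):
    """Deterministic sketch of the dependent-arm payouts (arms numbered 1..11)."""
    if arm == 11:
        return target / 10.0            # "informative" arm: its reward encodes the target index
    if arm == target:
        return 5.0                      # target arm pays the large reward
    return 1.0                          # assumed payout for the other arms
```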

RESTLESS BANDITS
In previous experiments we considered stationary problems where the agent’s actions yielded information about task parameters that remained fixed throughout each episode.
Next, we consider a bandit problem in which reward probabilities change over the course of an episode, with different rates of change (volatilities) in different episodes.
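One simple way to realize this kind of non-stationarity is a bounded random walk on an arm's reward probability, with the step size playing the role of the per-episode volatility (a sketch; the exact drift process used in the paper may differ):

```python
import numpy as np

rng = np.random.default_rng(4)

def restless_probs(n_trials=100, volatility=0.1, p0=0.5):
    """Reward probability drifts over the episode; volatility differs between episodes."""
    p, path = p0, []
    for _ in range(n_trials):
        p = float(np.clip(p + volatility * rng.normal(), 0.0, 1.0))
        path.append(p)
    return np.array(path)

slow = restless_probs(volatility=0.02)   # low-volatility episode
fast = restless_probs(volatility=0.20)   # high-volatility episode
```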

MARKOV DECISION PROBLEMS

The foregoing experiments focused on bandit tasks in which actions do not affect the task’s underlying state. We turn now to MDPs where actions do influence state.

THE TWO-STEP TASK

LEARNING ABSTRACT TASK STRUCTURE
In the final experiment we conducted, we took a step towards examining the scalability of meta-RL by studying a task that involves rich visual inputs, longer time horizons, and sparse rewards.

ONE-SHOT NAVIGATION

RELATED WORK

During completion of the present research, closely related work was reported by Duan et al. (2016). Like us, Duan and colleagues use deep RL to train a recurrent network on a series of interrelated tasks, with the result that the network dynamics learn a second RL procedure which operates on a faster time-scale than the original algorithm.

CONCLUSION

A current challenge in artificial intelligence is to design agents that can adapt rapidly to new tasks by leveraging knowledge acquired through previous experience with related activities. In the present work we have reported initial explorations of what we believe is one promising avenue toward this goal. Deep meta-RL involves a combination of three ingredients:
(1) use of a deep RL algorithm to train a recurrent neural network,
(2) a training set that includes a series of interrelated tasks,
(3) network input that includes the action selected and reward received in the previous time interval.
