Reinforcement Learning: Importance Sampling and Prioritized Experience Replay

Author: syat_e6da | Published 2021-01-07 14:21

    I read the Prioritized Experience Replay paper in detail today; these notes record my takeaways.

    Introduction

    Two problems that online RL currently faces:

    1. strongly correlated updates that break the i.i.d. assumption of many popular stochastic gradient-based algorithms.
    2. the rapid forgetting of possibly rare experiences that would be useful later on.
      Experience replay was introduced to address these two problems. For example, DQN draws transitions uniformly at random from a sliding-window replay memory.
      The idea of this paper is to sample the replay memory more effectively, so that "an RL agent can learn more effectively from some transitions than from others".
      Specifically: "we propose to more frequently replay transitions with high expected learning progress, as measured by the magnitude of their temporal-difference (TD) error." However, this scheme reduces sample diversity and biases the sample distribution, so the paper uses stochastic prioritization to alleviate the loss of diversity and importance sampling to correct the distribution bias that is introduced.
      Note: I first came across importance sampling in off-policy Monte Carlo training.
      Definition of importance sampling: a general technique for estimating expected values under one distribution given samples drawn from another.
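      As a concrete illustration of that definition (my own toy example, not from the paper), the sketch below estimates E_p[f(x)] for f(x) = x^2 under a Gaussian target p using samples drawn from a wider proposal q, reweighting each sample by the ratio p(x)/q(x):

```python
# Toy importance-sampling sketch (illustration only, not from the paper):
# estimate E_p[f(x)] with samples drawn from a different distribution q,
# reweighting each sample by the likelihood ratio p(x)/q(x).
import numpy as np

rng = np.random.default_rng(0)

def p_pdf(x):   # target distribution p = N(1, 1)
    return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)

def q_pdf(x):   # proposal distribution q = N(0, 2^2), the one we actually sample from
    return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

f = lambda x: x ** 2                    # quantity whose expectation we want under p
x = rng.normal(0.0, 2.0, size=100_000)  # samples come from q, not p
w = p_pdf(x) / q_pdf(x)                 # importance weights
print(np.mean(w * f(x)))                # ~2.0, since E_p[x^2] = mu^2 + sigma^2 = 2
```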

    Background

    1. Sequences associated with rewards appear to be replayed more frequently;
    2. Experiences with a high-magnitude TD error also appear to be replayed more often;
    3. A recent paper introduced a form of re-sampling in the context of deep RL with experience replay (Narasimhan et al., 2015); the method separates experience into two buckets, one for positive and one for negative rewards, and then picks a fixed fraction from each to replay. This is only applicable in domains that (unlike ours) have a natural notion of ‘positive/negative’ experience. Furthermore, Hinton (2007) introduced a form of non-uniform sampling based on error, with an importance sampling correction, which led to a 3x speed-up on MNIST digit classification.

    Prioritized Replay

    With a replay memory there are two design choices:

    1. which experiences to store
    2. which experiences to replay (and how to do so)
      This paper's contribution is mainly on the second point.

    A motivating example

    The paper compares, in a toy environment (the 'Blind Cliffwalk' example), uniformly sampling transitions against an oracle that always picks the transition that maximally reduces the global loss; the large gap between the two shows that smarter sampling is a worthwhile direction.

    Prioritizing with TD-error

    The central component of prioritized replay is the criterion by which the importance of each transition is measured. One idealised criterion would be the amount the RL agent can learn from a transition in its current state (expected learning progress). While this measure is not directly accessible, a reasonable proxy is the magnitude of a transition’s TD error \delta, which indicates how ‘surprising’ or unexpected the transition is: specifically, how far the value is from its next-step bootstrap estimate.
    Implementation: the paper uses a binary heap data structure for the priority queue, for which finding the maximum-priority transition when sampling is O(1) and updating priorities (with the new TD error after a learning step) is O(log N).
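    The paper gives no code for this; a minimal sketch (my own illustration, not the authors' implementation) of a position-indexed binary max-heap keyed by |δ| shows how peeking at the maximum-priority transition is O(1) while updating a stored transition's priority is O(log N):

```python
# Position-indexed binary max-heap (illustration only, not the authors' code).
class MaxHeap:
    def __init__(self):
        self.heap = []   # list of (priority, transition_id)
        self.pos = {}    # transition_id -> index in self.heap

    def _swap(self, i, j):
        self.heap[i], self.heap[j] = self.heap[j], self.heap[i]
        self.pos[self.heap[i][1]] = i
        self.pos[self.heap[j][1]] = j

    def _sift_up(self, i):
        while i > 0 and self.heap[i][0] > self.heap[(i - 1) // 2][0]:
            self._swap(i, (i - 1) // 2)
            i = (i - 1) // 2

    def _sift_down(self, i):
        n = len(self.heap)
        while True:
            largest = i
            for c in (2 * i + 1, 2 * i + 2):
                if c < n and self.heap[c][0] > self.heap[largest][0]:
                    largest = c
            if largest == i:
                return
            self._swap(i, largest)
            i = largest

    def push(self, tid, priority):          # O(log N)
        self.heap.append((priority, tid))
        self.pos[tid] = len(self.heap) - 1
        self._sift_up(len(self.heap) - 1)

    def peek_max(self):                     # O(1): id of the max-priority transition
        return self.heap[0][1]

    def update(self, tid, priority):        # O(log N): new |TD error| after a learning step
        i = self.pos[tid]
        old = self.heap[i][0]
        self.heap[i] = (priority, tid)
        if priority > old:
            self._sift_up(i)
        else:
            self._sift_down(i)

# Usage: h = MaxHeap(); h.push("t1", 0.5); h.push("t2", 1.2)
# h.peek_max() -> "t2"; h.update("t1", 2.0); h.peek_max() -> "t1"
```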

    Stochastic Prioritization

    Greedy TD-error prioritization has the following drawbacks:

    1. to avoid expensive sweeps over the entire replay memory, TD errors are only updated for the transitions that are replayed. One consequence is that transitions that have a low TD error on first visit may not be replayed for a long time (which means effectively never with a sliding window replay memory)
    2. it is sensitive to noise spikes (e.g. when rewards are stochastic), which can be exacerbated by bootstrapping, where approximation errors appear as another source of noise;
    3. greedy prioritization focuses on a small subset of the experience: errors shrink slowly, especially when using function approximation, meaning that the initially high-error transitions get replayed frequently. This lack of diversity makes the system prone to over-fitting.
      The paper therefore uses stochastic prioritization, which interpolates between pure greedy prioritization and uniform random sampling: a transition's probability of being sampled is monotonic in its priority, which is derived from its TD error. The probability of sampling transition i is defined as

      P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}

      where p_i > 0 is the priority of transition i and the exponent \alpha determines how much prioritization is used, with \alpha = 0 corresponding to the uniform case.
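      A tiny sketch (my own illustration) of this sampling distribution, using the proportional priority p_i = |\delta_i| + \epsilon defined in the list below; \alpha = 0 recovers uniform sampling and larger \alpha prioritizes more aggressively:

```python
# Sketch of P(i) = p_i^alpha / sum_k p_k^alpha with proportional priorities
# p_i = |delta_i| + eps (alpha = 0.6 and eps = 1e-6 are illustrative values).
import numpy as np

def sampling_probs(td_errors, alpha=0.6, eps=1e-6):
    p = np.abs(td_errors) + eps          # proportional priority p_i
    p = p ** alpha
    return p / p.sum()                   # normalize into a distribution

probs = sampling_probs(np.array([0.5, 0.1, 2.0, 0.0]))
batch = np.random.choice(len(probs), size=2, p=probs, replace=False)  # minibatch indices
```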
    The priority p_i of transition i can be defined in two ways:

    1. the direct, proportional prioritization: p_i = |\delta_i| + \epsilon, where \epsilon is a small positive constant that prevents transitions from never being revisited once their error reaches zero;
    2. an indirect, rank-based prioritization: p_i = \frac{1}{rank(i)}, where rank(i) is the rank of transition i when the replay memory is sorted according to |\delta_i|.
      Efficient implementations of the two variants: (1) proportional prioritization can be based on a 'sum-tree' data structure (a sketch follows below); (2) for rank-based prioritization, the samples can be split into k segments of equal probability mass: "we can approximate the cumulative density function with a piecewise linear function with k segments of equal probability. The segment boundaries can be precomputed (they change only when N or α change). At runtime, we sample a segment, and then sample uniformly among the transitions within it. This works particularly well in conjunction with a minibatch-based learning algorithm: choose k to be the size of the minibatch, and sample exactly one transition from each segment – this is a form of stratified sampling that has the added advantage of balancing out the minibatch (there will always be exactly one transition with high magnitude δ, one with medium magnitude, etc.)"
      The full algorithm is given in the paper as Algorithm 1 (Double DQN with proportional prioritization).
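      The paper describes the sum-tree only in words; below is a minimal sketch (my own illustration, not the authors' code) in which leaves hold the (already \alpha-exponentiated) priorities, every internal node holds the sum of its children, and both updating a priority and sampling proportionally to priority cost O(log N):

```python
# Minimal sum-tree sketch for proportional prioritization (illustration only).
import random

class SumTree:
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)    # 1-based binary tree stored in an array

    def update(self, idx, priority):
        """Set the priority of leaf `idx` and refresh the sums up to the root."""
        i = idx + self.capacity
        self.tree[i] = priority
        i //= 2
        while i >= 1:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def sample(self):
        """Return a leaf index with probability proportional to its priority."""
        s = random.uniform(0.0, self.tree[1])  # tree[1] holds the total priority mass
        i = 1
        while i < self.capacity:               # descend until we reach a leaf
            if s <= self.tree[2 * i]:
                i = 2 * i
            else:
                s -= self.tree[2 * i]
                i = 2 * i + 1
        return i - self.capacity

# Usage: store p_i^alpha at leaf i with tree.update(i, p_alpha); tree.sample()
# then returns i with probability p_i^alpha / sum_k p_k^alpha.
```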

    Annealing the bias

    Because prioritized replay changes the distribution of the sampled transitions, it introduces bias into the updates; this is corrected with importance-sampling (IS) weights

    w_i = \left(\frac{1}{N}\cdot\frac{1}{P(i)}\right)^\beta

    which fully compensate for the non-uniform probabilities P(i) when \beta = 1. For stability, the weights are normalized by 1/\max_i w_i so that they only ever scale the update downwards.
    In typical reinforcement learning scenarios, the unbiased nature of the updates is most important near convergence at the end of training, as the process is highly non-stationary anyway, due to changing policies, state distributions and bootstrap targets; we hypothesize that a small bias can be ignored in this context. We therefore exploit the flexibility of annealing the amount of importance-sampling correction over time, by defining a schedule on the exponent β that reaches 1 only at the end of learning. In practice, we linearly anneal β from its initial value β0 to 1. Note that the choice of this hyperparameter interacts with the choice of prioritization exponent α; increasing both simultaneously prioritizes sampling more aggressively at the same time as correcting for it more strongly.
    Besides correcting the bias introduced by prioritized replay, importance sampling has another benefit: it tames large, potentially disruptive gradient steps. ("Importance sampling has another benefit when combined with prioritized replay in the context of non-linear function approximation (e.g. deep neural networks): here large steps can be very disruptive, because the first-order approximation of the gradient is only reliable locally, and have to be prevented with a smaller global step-size. In our approach instead, prioritization makes sure high-error transitions are seen many times, while the IS correction reduces the gradient magnitudes (and thus the effective step size in parameter space), allowing the algorithm to follow the curvature of highly non-linear optimization landscapes because the Taylor expansion is constantly re-approximated.")
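    A short sketch (my own illustration) of the IS weights and the linear β schedule described above; β0 = 0.4 here is only an illustrative starting value:

```python
# Sketch of the importance-sampling correction and the linear beta annealing.
import numpy as np

def is_weights(probs, N, beta):
    """probs: sampling probabilities P(i) of the minibatch; N: replay memory size."""
    w = (N * probs) ** (-beta)   # w_i = (1 / (N * P(i)))^beta
    return w / w.max()           # normalize by the max weight so updates are only scaled down

def beta_schedule(step, total_steps, beta0=0.4):
    # linearly anneal beta from beta0 towards 1 over the course of training
    return min(1.0, beta0 + (1.0 - beta0) * step / total_steps)

# Usage: weight each sample's TD-error term in the Q-learning loss, e.g.
#   w = is_weights(P_batch, N, beta_schedule(t, T))
#   loss = np.mean(w * td_errors ** 2)
```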

    Experiments

    Discussion

    1. The rank-based variant is expected to be more robust because:
    • it is not affected by outliers or error magnitudes;
    • its heavy-tail property also guarantees that samples will be diverse;
    • the stratified sampling from partitions of different errors keeps the total minibatch gradient at a stable magnitude throughout training;
    • on the other hand, the ranks make the algorithm blind to the relative error scales, which could incur a performance drop when there is structure in the distribution of errors to be exploited, such as in sparse reward scenarios.
    2. We hypothesize that deep neural networks interact with prioritized replay in another interesting way. When we distinguish learning the value given a representation (i.e., the top layers) from learning an improved representation (i.e., the bottom layers), then transitions for which the representation is good will quickly reduce their error and then be replayed much less, increasing the learning focus on others where the representation is poor, thus putting more resources into distinguishing aliased states – if the observations and network capacity allow for it.
