Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation

Author: 初七123 | Published 2019-01-09 13:12

    Author's video walkthrough: https://v.qq.com/x/page/k05026fl3wa.html

    Introduction

    Sample efficiency is a dominant concern in RL: experience from robotic interaction with the real world is typically far scarcer than computation time, and even in simulated environments the cost of simulation often dominates that of the algorithm itself.

    One effective way to reduce the number of samples required is to use more advanced optimization techniques for the gradient updates.

    The exact natural gradient is intractable for large models because it requires computing and inverting the Fisher information matrix.
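    Concretely, for a policy π_θ with a d-dimensional parameter vector θ, the natural gradient preconditions the ordinary gradient with the inverse Fisher matrix (this is the standard textbook form, not a quote from the paper):

    $$ \tilde{\nabla}_\theta J = F^{-1} \nabla_\theta J, \qquad F = \mathbb{E}\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{\top} \big] $$

    F is a d × d matrix, so storing it costs O(d²) memory and inverting it costs O(d³) time, which is prohibitive for networks with millions of parameters.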

    Trust-region policy optimization (TRPO) [22] avoids explicitly storing and inverting the Fisher matrix by using Fisher-vector products [21]. However, it typically requires many steps of conjugate gradient to obtain a single parameter update, and accurately estimating the curvature requires a large number of samples in each batch, which makes TRPO impractical for large models and sample-inefficient.
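    As a rough illustration of why each TRPO update is expensive, here is a minimal NumPy sketch of the conjugate-gradient loop used to solve F x = g when only a Fisher-vector product is available; the fvp callable, iteration count, and damping value below are placeholders of mine, not taken from the paper.

import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g given only a Fisher-vector product fvp(v) = F @ v."""
    x = np.zeros_like(g)
    r = g.copy()                 # residual g - F x (x starts at zero)
    p = r.copy()                 # current search direction
    r_dot = r @ r
    for _ in range(iters):
        Fp = fvp(p)              # one Fisher-vector product per CG iteration
        alpha = r_dot / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x

# Toy usage: a random damped SPD matrix stands in for the Fisher.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50))
F = A @ A.T + 1e-2 * np.eye(50)
g = rng.normal(size=50)
step_direction = conjugate_gradient(lambda v: F @ v, g, iters=20)

    Each CG iteration requires a fresh Fisher-vector product, roughly the cost of an extra backward pass over the batch, so ten or more iterations per parameter update add up quickly; this is the per-update cost that K-FAC-based methods avoid.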

    Kronecker-factored approximate curvature (K-FAC) [16, 7] is a scalable approximation to the natural gradient.

    Background

    Reinforcement learning and actor-critic methods

    Natural gradient using Kronecker-factored approximation

    The method of natural gradient constructs the norm using the Fisher information matrix F, a local quadratic approximation to the KL divergence. This norm is independent of the model parameterization θ on the class of probability distributions, providing a more stable and effective update.
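    The "local quadratic approximation to the KL divergence" is the standard second-order expansion (notation mine):

    $$ D_{\mathrm{KL}}\big( \pi_\theta \,\|\, \pi_{\theta+\delta} \big) \approx \tfrac{1}{2}\, \delta^{\top} F(\theta)\, \delta $$

    Because the left-hand side depends only on the distributions themselves, the induced norm on the update δ is invariant to how the policy is parameterized, which is the source of the stability claim above.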

    What is the natural gradient? (https://www.zhihu.com/question/266846405/answer/314354512)

    The pre-activation vector for the next layer is s = Wa, where a is the layer's input activation vector. Note that the weight gradient is given by ∇_W L = (∇_s L) a⊤.
    K-FAC uses this fact and further approximates the block F_ℓ corresponding to layer ℓ as F̂_ℓ = E[a a⊤] ⊗ E[(∇_s L)(∇_s L)⊤] = A ⊗ S.

    This approximation can be interpreted as making the assumption that the second-order statistics of the activations and of the backpropagated derivatives are uncorrelated.
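    A minimal NumPy sketch of this factorization for a single fully connected layer: the batch of activations a, the backpropagated gradients g, and the damping constant are placeholder values of mine, and a real K-FAC implementation maintains running averages of the factors rather than recomputing them from one batch.

import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 256, 64, 32

# Placeholder per-sample quantities for one layer with s = W a:
a = rng.normal(size=(batch, d_in))     # layer inputs (activations from below)
g = rng.normal(size=(batch, d_out))    # backpropagated gradients dL/ds
G = g.T @ a / batch                    # averaged weight gradient, shape (d_out, d_in)

# Kronecker factors of the layer's Fisher block: F_l ≈ A ⊗ S
A = a.T @ a / batch                    # second moment of activations, (d_in, d_in)
S = g.T @ g / batch                    # second moment of pre-activation grads, (d_out, d_out)

# Damped inverses (the factors are noisy estimates, so damping is needed in practice)
damping = 1e-3
A_inv = np.linalg.inv(A + damping * np.eye(d_in))
S_inv = np.linalg.inv(S + damping * np.eye(d_out))

# (A ⊗ S)^{-1} vec(G) = vec(S^{-1} G A^{-1}): two small inverses replace
# inverting one (d_in * d_out) x (d_in * d_out) Fisher block.
natural_grad_W = S_inv @ G @ A_inv

    The payoff is that the per-layer inverses involve matrices of size d_in and d_out instead of d_in·d_out, which is what makes the approximation scalable.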

    Methods

    Natural gradient in actor-critic

    For actor

    To define the Fisher metric for reinforcement learning objectives, one natural choice is to use the policy function which defines a distribution over the action given the current state, and take the expectation over the trajectory distribution:
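    This yields the Fisher metric below (as I recall the definition given in the ACKTR paper), where p(τ) denotes the distribution over trajectories induced by the policy and the environment dynamics:

    $$ F = \mathbb{E}_{p(\tau)}\big[ \nabla_\theta \log \pi(a_t \mid s_t)\, \nabla_\theta \log \pi(a_t \mid s_t)^{\top} \big] $$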

    For critic

    We assume the output of the critic v is defined to be a Gaussian distribution, p(v | s_t) ∼ N(v; V(s_t), σ²).
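    For this Gaussian output model, the Fisher matrix with respect to the critic parameters (call them φ; notation mine) works out to a scaled Gauss-Newton matrix, and setting σ = 1 recovers the ordinary Gauss-Newton method for least-squares regression:

    $$ F_{\text{critic}} = \mathbb{E}\Big[ \tfrac{1}{\sigma^{2}}\, \nabla_\phi V(s_t)\, \nabla_\phi V(s_t)^{\top} \Big] $$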

    Step-size Selection and trust-region optimization

    Experiments
