Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation

Author: 初七123 | Published 2019-01-09 13:12

Author's explanation (video): https://v.qq.com/x/page/k05026fl3wa.html

Introduction

Sample efficiency is a dominant concern in RL; robotic interaction with the real world is typically scarcer than computation time, and even in simulated environments the cost of simulation often dominates that of the algorithm itself.

One way to effectively reduce the number of samples required is to use more advanced optimization techniques for the gradient updates.

The exact natural gradient is intractable because it requires computing and inverting the Fisher information matrix.

Trust-region policy optimization (TRPO) [22] avoids explicitly storing and inverting the Fisher matrix by using Fisher-vector products [21]. However, it typically requires many steps of conjugate gradient to obtain a single parameter update, and accurately estimating the curvature requires a large number of samples in each batch.
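
To make this concrete, here is a minimal sketch (not from the paper) of the kind of conjugate-gradient solve TRPO performs: it approximately solves F x = g using only a Fisher-vector product routine. The `fvp` callable is a hypothetical stand-in for the autodiff-based product TRPO would actually use.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only Fisher-vector products fvp(v) = F v.

    This is the iterative solver TRPO relies on: it never forms or inverts F
    explicitly, but each of the `iters` steps costs one extra evaluation of fvp.
    """
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x (x = 0 initially)
    p = r.copy()          # search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy usage: pretend F is a small SPD matrix; in TRPO, fvp would be computed
# with automatic differentiation over a minibatch of trajectories.
if __name__ == "__main__":
    F = np.array([[2.0, 0.3], [0.3, 1.0]])
    g = np.array([1.0, -1.0])
    step = conjugate_gradient(lambda v: F @ v, g)
    print(step, np.linalg.solve(F, g))  # the two should roughly agree
```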

Kronecker-factored approximate curvature (K-FAC) [16, 7] is a scalable approximation to the natural gradient.

Background

Reinforcement learning and actor-critic methods

Natural gradient using Kronecker-factored approximation

The method of natural gradient constructs the norm using the Fisher information matrix F, a local quadratic approximation to the KL divergence. This norm is independent of the model parameterization θ on the class of probability distributions, providing a more stable and effective update.

What is the natural gradient? (https://www.zhihu.com/question/266846405/answer/314354512)
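
As a reminder (written in my notation, not the paper's typesetting), the natural gradient preconditions the ordinary gradient with the inverse Fisher of the model distribution:

```latex
% Natural gradient update; F is the Fisher information of the model p(x | theta)
\tilde{\nabla}_\theta J = F^{-1} \nabla_\theta J,
\qquad
F = \mathbb{E}_{x \sim p(x \mid \theta)}\!\left[
      \nabla_\theta \log p(x \mid \theta)\,
      \nabla_\theta \log p(x \mid \theta)^{\top}
    \right]
```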

Let a denote the input activation vector of a layer, so that the pre-activation vector for the next layer is s = Wa. Note that the weight gradient is given by ∇_W L = (∇_s L) a⊺.
K-FAC utilizes this fact and further approximates the Fisher block F_ℓ corresponding to layer ℓ as F̂_ℓ = E[a a⊺] ⊗ E[(∇_s L)(∇_s L)⊺].

This approximation can be interpreted as assuming that the second-order statistics of the activations and of the backpropagated derivatives are uncorrelated.
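
Below is a minimal NumPy sketch of the per-layer approximation described above; the batch shapes, the `damping` term, and the function name are my own illustrative choices, not the paper's implementation. The key point is that inverting the two small Kronecker factors replaces inverting the full Fisher block.

```python
import numpy as np

def kfac_natural_gradient(A_batch, G_batch, grad_W, damping=1e-3):
    """Apply the Kronecker-factored approximate inverse Fisher to a layer's weight gradient.

    A_batch: (N, d_in)   layer input activations a for N samples
    G_batch: (N, d_out)  gradients of the loss w.r.t. pre-activations s = W a
    grad_W:  (d_out, d_in) weight gradient, i.e. the mean over samples of (∇_s L) a⊺

    With A = E[a a⊺] and G = E[(∇_s L)(∇_s L)⊺], the Kronecker approximation
    of the Fisher block gives the per-layer update  G^{-1} grad_W A^{-1}.
    """
    N = A_batch.shape[0]
    A = A_batch.T @ A_batch / N          # (d_in, d_in) second moment of activations
    G = G_batch.T @ G_batch / N          # (d_out, d_out) second moment of backprops
    # Tikhonov damping keeps the factors invertible on small batches.
    A += damping * np.eye(A.shape[0])
    G += damping * np.eye(G.shape[0])
    return np.linalg.solve(G, grad_W) @ np.linalg.inv(A)

# Toy usage with random data.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=(256, 8))        # activations
    g = rng.normal(size=(256, 4))        # pre-activation gradients
    grad_W = g.T @ a / 256               # mean of (∇_s L) a⊺
    nat_grad_W = kfac_natural_gradient(a, g, grad_W)
    print(nat_grad_W.shape)              # (4, 8), same shape as grad_W
```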

Methods

Natural gradient in actor-critic

For actor

To define the Fisher metric for reinforcement learning objectives, one natural choice is to use the policy function, which defines a distribution over actions given the current state, and to take the expectation over the trajectory distribution:
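
Concretely, the Fisher referred to above can be written as follows (a reconstruction in my notation; the paper's equation may be typeset slightly differently):

```latex
% Actor: Fisher metric defined by the policy, averaged over the trajectory distribution p(tau)
F = \mathbb{E}_{p(\tau)}\!\left[
      \nabla_\theta \log \pi(a_t \mid s_t)\,
      \big(\nabla_\theta \log \pi(a_t \mid s_t)\big)^{\top}
    \right],
\qquad
p(\tau) = p(s_0) \prod_{t=0}^{T} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)
```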

For critic

We assume the output of the critic v is defined to be a Gaussian distribution p(v|s_t) ∼ N(v; V(s_t), σ²).
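
Under this Gaussian assumption the critic's Fisher reduces, up to the 1/σ² factor, to the Gauss–Newton matrix of the squared error (a standard identity, sketched here in my notation rather than quoted from the paper); setting σ = 1 recovers the Gauss–Newton matrix exactly.

```latex
% Critic: Gaussian output p(v | s_t) = N(v; V(s_t), sigma^2) with fixed sigma
-\log p(v \mid s_t) = \frac{\bigl(v - V(s_t)\bigr)^2}{2\sigma^2} + \text{const},
\qquad
F_{\text{critic}} = \frac{1}{\sigma^2}\,
  \mathbb{E}\!\left[\nabla_\theta V(s_t)\,\nabla_\theta V(s_t)^{\top}\right]
```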

Step-size selection and trust-region optimization

Experiments
