Author's explanation (video): https://v.qq.com/x/page/k05026fl3wa.html
Introduction
Sample efficiency is a dominant concern in RL; robotic interaction with the real world is typically scarcer than computation time, and even in simulated environments the cost of simulation often dominates that of the algorithm itself.
One way to effectively reduce the sample size is to use more advanced optimization techniques for gradient updates.
The exact natural gradient is intractable for large networks because it requires storing and inverting the Fisher information matrix.
Trust-region policy optimization (TRPO) [22] avoids explicitly storing and inverting the Fisher matrix by using Fisher-vector products [21]. However, it typically requires many conjugate-gradient steps to obtain a single parameter update, and accurately estimating the curvature requires a large number of samples in each batch.
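As a hedged illustration of the Fisher-vector-product approach mentioned above (not TRPO's actual implementation), the following NumPy sketch solves F x = g with conjugate gradient using only products F·v, never forming or inverting F; `fisher_vector_product` is a hypothetical callback that in practice would be computed via double backpropagation:

```python
# Minimal conjugate-gradient sketch: solve F x = g given only F @ v products.
import numpy as np

def conjugate_gradient(fisher_vector_product, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using a callback that returns F @ v."""
    x = np.zeros_like(g)
    r = g.copy()            # residual g - F x (x starts at 0)
    p = r.copy()            # search direction
    rs_old = r @ r
    for _ in range(iters):
        Fp = fisher_vector_product(p)
        alpha = rs_old / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy usage: a small SPD matrix stands in for the Fisher.
F = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -0.5])
x = conjugate_gradient(lambda v: F @ v, g)
print(np.allclose(F @ x, g, atol=1e-6))  # True
```

The point of the sketch is the cost profile: each update needs several Fisher-vector products, which is what the batch-size and per-update-cost criticism above refers to.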
Kronecker-factored approximated curvature (K-FAC) [16, 7] is a scalable approximation to the natural gradient.
Background
Reinforcement learning and actor-critic methods
Natural gradient using Kronecker-factored approximation
The method of natural gradient constructs the norm using the Fisher information matrix F, a local quadratic approximation to the KL divergence. This norm is independent of the model parameterization θ on the class of probability distributions, providing a more stable and effective update.
What is the natural gradient? (https://www.zhihu.com/question/266846405/answer/314354512)
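For reference, the standard definitions behind the description above (not specific to this paper) can be written as:

```latex
\[
F(\theta) \;=\; \mathbb{E}_{x \sim p(x;\theta)}\!\left[
    \nabla_\theta \log p(x;\theta)\,\nabla_\theta \log p(x;\theta)^{\top}\right],
\qquad
\Delta\theta_{\mathrm{nat}} \;=\; F(\theta)^{-1}\,\nabla_\theta J(\theta).
\]
```

The KL divergence between p_θ and p_{θ+Δθ} is then approximately ½ Δθ⊺ F(θ) Δθ, which is the parameterization-independent norm referred to above.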
The pre-activation vector for the next layer is s = W a. Note that the weight gradient is given by ∇_W L = (∇_s L) a⊺.
K-FAC utilizes this fact and further approximates the block F_ℓ corresponding to layer ℓ as F̂_ℓ:

F_ℓ = E[vec{∇_W L} vec{∇_W L}⊺] ≈ E[a a⊺] ⊗ E[∇_s L (∇_s L)⊺] =: A ⊗ S =: F̂_ℓ
This approximation can be interpreted as assuming that the second-order statistics of the activations and of the backpropagated derivatives are uncorrelated. It is also what makes the method scalable: since (A ⊗ S)⁻¹ = A⁻¹ ⊗ S⁻¹, only the two small factor matrices ever need to be inverted.
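A minimal NumPy sketch (an illustration under the stated assumptions, not the authors' K-FAC code) of estimating the two Kronecker factors for one fully connected layer and applying F̂_ℓ⁻¹ cheaply; variable names and the column-major vec convention are assumptions of this sketch:

```python
# One fully connected layer with pre-activations s = W a:
#   A = E[a a^T], S = E[grad_s grad_s^T], F_hat = A (kron) S.
# With column-major vec, (A kron S)^{-1} vec(G) corresponds to S^{-1} G A^{-1},
# so only an (n_in x n_in) and an (n_out x n_out) inverse are needed.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 5, 3, 64

a = rng.normal(size=(batch, n_in))      # layer inputs (activations)
g = rng.normal(size=(batch, n_out))     # backpropagated gradients grad_s L
grad_W = g.T @ a / batch                # grad_W L, shape (n_out, n_in)

A = a.T @ a / batch                     # E[a a^T]
S = g.T @ g / batch                     # E[grad_s grad_s^T]
damping = 1e-3                          # keeps the small factors invertible

A_inv = np.linalg.inv(A + damping * np.eye(n_in))
S_inv = np.linalg.inv(S + damping * np.eye(n_out))
delta_W = S_inv @ grad_W @ A_inv        # preconditioned update for this layer
print(delta_W.shape)                    # (3, 5)
```

The full Fisher block for this layer would be (n_in·n_out) × (n_in·n_out); the factored form never materializes it.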
Methods
Natural gradient in actor-critic
For actor
To define the Fisher metric for reinforcement-learning objectives, one natural choice is to use the policy function, which defines a distribution over actions given the current state, and take the expectation over the trajectory distribution:

F = E_{p(τ)}[∇_θ log π(a_t|s_t) (∇_θ log π(a_t|s_t))⊺],

where p(τ) is the distribution of trajectories.
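A toy NumPy sketch (illustrative only: a tabular softmax policy rather than the networks used in the paper) of estimating this Fisher from state-action samples drawn from the current policy and taking one natural-gradient step; `pg` is a placeholder policy-gradient estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
theta = rng.normal(size=(n_states, n_actions))   # policy parameters

def policy_probs(s):
    logits = theta[s]
    z = np.exp(logits - logits.max())
    return z / z.sum()

def logpi_grad(s, a):
    """grad_theta log pi(a|s) for the softmax policy, flattened."""
    g = np.zeros_like(theta)
    g[s] = -policy_probs(s)
    g[s, a] += 1.0
    return g.ravel()

# Sample (s, a) pairs from the current policy to estimate F = E[g g^T].
samples = [(s, rng.choice(n_actions, p=policy_probs(s)))
           for s in rng.integers(n_states, size=256)]
grads = np.stack([logpi_grad(s, a) for s, a in samples])
F = grads.T @ grads / len(grads)

# Natural-gradient step on a placeholder policy-gradient estimate.
pg = rng.normal(size=theta.size)
damping = 1e-3
nat_grad = np.linalg.solve(F + damping * np.eye(F.shape[0]), pg)
theta += 0.1 * nat_grad.reshape(theta.shape)
```

In ACKTR this Fisher is of course never formed densely; it is approximated layer-by-layer with the Kronecker factorization described in the Background section.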
For critic
We assume the output of the critic v is defined to be a Gaussian distribution p(v|s_t) ∼ N(v; V(s_t), σ²), and the Fisher metric for the critic is defined with respect to this output distribution; setting σ to 1 is equivalent to the vanilla Gauss-Newton method.
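A small sketch (assuming, purely for illustration, a linear value function V(s) = wᵀφ(s)) of why this Gaussian-output assumption turns the critic's Fisher into a Gauss-Newton matrix:

```python
# For p(v|s) = N(v; w^T phi(s), sigma^2):
#   grad_w log p = (v - w^T phi) * phi / sigma^2,
#   E[grad log p grad log p^T] = phi phi^T / sigma^2,
# i.e. the Gauss-Newton matrix of the squared-error objective, scaled by 1/sigma^2
# (sigma = 1 recovers it exactly).
import numpy as np

rng = np.random.default_rng(0)
n_features, batch = 4, 128
phi = rng.normal(size=(batch, n_features))     # features phi(s_t)
sigma = 1.0

gauss_newton = phi.T @ phi / batch             # J^T J for V(s) = w^T phi(s)
fisher = gauss_newton / sigma**2               # Fisher under the Gaussian output
print(np.allclose(fisher, gauss_newton))       # True when sigma = 1
```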
Step-size selection and trust-region optimization