Deterministic Policy Gradient Algorithms


Author: 初七123 | Published 2019-01-08 13:28

    Background

    Optimization objective
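    For reference, the performance objective from the DPG paper (Silver et al., 2014) is the expected reward under the discounted state distribution \rho^\pi induced by the policy:

        J(\pi_\theta) = \int_S \rho^\pi(s) \int_A \pi_\theta(s, a)\, r(s, a)\, da\, ds
                      = \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}[\, r(s, a) \,]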


    Stochastic policy gradient theorem
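    The stochastic policy gradient theorem (Sutton et al., 1999), as restated in the DPG paper, gives the gradient of this objective without requiring the gradient of the state distribution:

        \nabla_\theta J(\pi_\theta) = \int_S \rho^\pi(s) \int_A \nabla_\theta \pi_\theta(a|s)\, Q^\pi(s, a)\, da\, ds
                                    = \mathbb{E}_{s \sim \rho^\pi,\, a \sim \pi_\theta}[\, \nabla_\theta \log \pi_\theta(a|s)\, Q^\pi(s, a) \,]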


    This formula reduces the stochastic policy gradient to the simple computation of an expectation.

    Off-Policy Actor-Critic
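    In the off-policy setting of Degris et al. (2012b), trajectories are sampled from a behaviour policy \beta; the objective averages the target policy's value over the behaviour policy's state distribution, and the approximate gradient uses an importance-sampling ratio in the actor:

        J_\beta(\pi_\theta) = \int_S \rho^\beta(s)\, V^\pi(s)\, ds
        \nabla_\theta J_\beta(\pi_\theta) \approx \mathbb{E}_{s \sim \rho^\beta,\, a \sim \beta}\!\left[\, \frac{\pi_\theta(a|s)}{\beta(a|s)}\, \nabla_\theta \log \pi_\theta(a|s)\, Q^\pi(s, a) \,\right]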

    Gradients of Deterministic Policies

    Action-Value Gradients

    In the continuous case, for each visited state the policy parameters are moved in a direction proportional to the gradient of the action value with respect to the policy parameters.

    Applying the chain rule then decomposes this into the policy Jacobian and the action gradient of Q, as in the update sketched below.
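    Written out in LaTeX for reference (\mu^k denotes the current deterministic policy and \alpha a step size), the update from the DPG paper is:

        \theta^{k+1} = \theta^k + \alpha\, \mathbb{E}_{s \sim \rho^{\mu^k}}\!\left[\, \nabla_\theta Q^{\mu^k}(s, \mu_\theta(s)) \,\right]
                     = \theta^k + \alpha\, \mathbb{E}_{s \sim \rho^{\mu^k}}\!\left[\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu^k}(s, a)\big|_{a = \mu_\theta(s)} \,\right]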

    However, the theory below shows that, like the stochastic policy gradient theorem, there is no need to compute the gradient of the state distribution; and that the intuitive update outlined above is following precisely the gradient of the performance objective.

    The deterministic policy gradient can also be obtained as the limit of the stochastic policy gradient when the policy's variance shrinks to zero.
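    Concretely, Theorem 2 of the paper considers a stochastic policy \pi_{\mu_\theta, \sigma} built from the deterministic policy \mu_\theta and a variance parameter \sigma, and states:

        \lim_{\sigma \downarrow 0} \nabla_\theta J(\pi_{\mu_\theta, \sigma}) = \nabla_\theta J(\mu_\theta)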

    Deterministic Policy Gradient Theorem
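    For reference, the deterministic policy gradient theorem (Theorem 1 in the paper) reads:

        \nabla_\theta J(\mu_\theta) = \int_S \rho^\mu(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)}\, ds
                                    = \mathbb{E}_{s \sim \rho^\mu}\!\left[\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)} \,\right]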

    Deterministic Actor-Critic Algorithms

    On-Policy Deterministic Actor-Critic
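    A sketch of the on-policy updates: the critic learns Q^w by SARSA-style temporal-difference learning while the actor ascends the deterministic policy gradient (\alpha_w and \alpha_\theta are step sizes):

        \delta_t = r_t + \gamma\, Q^w(s_{t+1}, a_{t+1}) - Q^w(s_t, a_t)
        w_{t+1} = w_t + \alpha_w\, \delta_t\, \nabla_w Q^w(s_t, a_t)
        \theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q^w(s_t, a_t)\big|_{a = \mu_\theta(s_t)}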

    Off-Policy Deterministic Actor-Critic

    The performance objective now becomes the value of the target policy, averaged over the state distribution of the behaviour policy (see the expressions below).
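    In symbols, with behaviour policy \beta and deterministic target policy \mu_\theta:

        J_\beta(\mu_\theta) = \int_S \rho^\beta(s)\, Q^\mu(s, \mu_\theta(s))\, ds
        \nabla_\theta J_\beta(\mu_\theta) \approx \mathbb{E}_{s \sim \rho^\beta}\!\left[\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a = \mu_\theta(s)} \,\right]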

    We note that stochastic off-policy actor-critic algorithms typically use importance sampling for both actor and critic (Degris et al., 2012b). However, because the deterministic policy gradient removes the integral over actions, we can avoid importance sampling in the actor; and by using Q-learning, we can avoid importance sampling in the critic.
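    A minimal numerical sketch of one such off-policy update step, assuming linear function approximators and a concatenated feature map for the critic; these choices, and all names below, are illustrative assumptions rather than the paper's exact algorithm. The critic bootstraps with the target policy's action (a Q-learning-style update, so no importance sampling), and the actor follows the deterministic policy gradient:

import numpy as np

# Illustrative dimensions and initial parameters (assumptions, not from the paper).
state_dim, action_dim = 4, 2
rng = np.random.default_rng(0)
theta = rng.normal(size=(action_dim, state_dim))  # deterministic policy: mu_theta(s) = theta @ s
w = rng.normal(size=state_dim + action_dim)       # linear critic: Q_w(s, a) = w . phi(s, a)
alpha_theta, alpha_w, gamma = 1e-3, 1e-2, 0.99

def mu(s):
    return theta @ s

def phi(s, a):
    # Simple concatenated features for the critic (an assumption for this sketch).
    return np.concatenate([s, a])

def Q(s, a):
    return w @ phi(s, a)

def grad_a_Q(s, a):
    # For the linear critic, dQ/da is just the action block of w.
    return w[state_dim:]

def off_policy_dac_step(s, a, r, s_next):
    """One update from a transition (s, a, r, s_next) generated by any behaviour policy."""
    global theta, w
    # Critic: Q-learning-style TD error, bootstrapping with the target policy's action.
    delta = r + gamma * Q(s_next, mu(s_next)) - Q(s, a)
    w = w + alpha_w * delta * phi(s, a)  # grad_w Q_w(s, a) = phi(s, a) for a linear critic
    # Actor: deterministic policy gradient; for mu(s) = theta @ s the chain rule
    # grad_theta mu(s) * grad_a Q reduces to an outer product.
    theta = theta + alpha_theta * np.outer(grad_a_Q(s, mu(s)), s)

# Example: one update on a random transition.
s, a, r, s_next = rng.normal(size=state_dim), rng.normal(size=action_dim), 1.0, rng.normal(size=state_dim)
off_policy_dac_step(s, a, r, s_next)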

    Compatible Function Approximation
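    Theorem 3 of the paper states that a critic Q^w is compatible with the deterministic policy \mu_\theta if its action gradient at a = \mu_\theta(s) is linear in the "features" \nabla_\theta \mu_\theta(s)^{\mathsf T} w and w minimises the mean-squared error of that gradient. One family of approximators satisfying the first condition is:

        \nabla_a Q^w(s, a)\big|_{a = \mu_\theta(s)} = \nabla_\theta \mu_\theta(s)^{\mathsf T} w
        Q^w(s, a) = (a - \mu_\theta(s))^{\mathsf T}\, \nabla_\theta \mu_\theta(s)^{\mathsf T} w + V^v(s)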
