Deterministic Policy Gradient Algorithms


Author: 初七123 | Published 2019-01-08 13:28

Background

Optimization objective:
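The equation that followed here appears to have been lost in extraction (it was likely an image). In the notation of the paper this note summarizes (Silver et al., 2014), the performance objective for a stochastic policy π_θ is:

```latex
J(\pi_\theta)
  = \int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \pi_\theta(s, a)\, r(s, a)\, \mathrm{d}a\, \mathrm{d}s
  = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}\!\left[ r(s, a) \right]
```

where ρ^π is the discounted state distribution induced by the policy π.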


Stochastic policy gradient theorem:
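The theorem's statement was likely an image that did not survive extraction; the standard form from the paper is:

```latex
\nabla_\theta J(\pi_\theta)
  = \int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\, \mathrm{d}a\, \mathrm{d}s
  = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a) \right]
```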


This formula reduces the stochastic policy gradient to the simple computation of an expectation, which can be estimated by sampling.

Off-Policy Actor-Critic
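The equations for this section appear to be missing. In the paper's treatment of off-policy actor-critic (following Degris et al.), the objective is the target policy's value averaged over the behaviour policy's state distribution, and its gradient is approximated with a single importance-sampling ratio:

```latex
J_\beta(\pi_\theta) = \int_{\mathcal{S}} \rho^{\beta}(s)\, V^{\pi}(s)\, \mathrm{d}s,
\qquad
\nabla_\theta J_\beta(\pi_\theta) \approx
\mathbb{E}_{s \sim \rho^{\beta},\, a \sim \beta}\!\left[
  \frac{\pi_\theta(a \mid s)}{\beta(a \mid s)}\,
  \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a)
\right]
```

where β is the behaviour policy used to generate the data.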

Gradients of Deterministic Policies

Action-Value Gradients

For the continuous case, a natural idea is to move the policy parameters in a direction proportional to the gradient of the action-value function:
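The missing update rule, in the paper's notation (μ_θ is the deterministic policy, μ^k the policy at iteration k), is:

```latex
\theta^{k+1} = \theta^{k} + \alpha\, \mathbb{E}_{s \sim \rho^{\mu^{k}}}\!\left[ \nabla_\theta Q^{\mu^{k}}\!\left(s, \mu_\theta(s)\right) \right]
```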

Therefore, applying the chain rule:
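The chain-rule decomposition that followed, reconstructed from the paper, splits the gradient into the policy's Jacobian and the action-gradient of Q evaluated at the policy's action:

```latex
\theta^{k+1} = \theta^{k} + \alpha\, \mathbb{E}_{s \sim \rho^{\mu^{k}}}\!\left[
  \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu^{k}}(s, a)\big|_{a = \mu_\theta(s)}
\right]
```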

However, the theory below shows that, like the stochastic policy gradient theorem, there is no need to compute the gradient of the state distribution; and that the intuitive update outlined above is following precisely the gradient of the performance objective.

Taking the limit of a stochastic policy: the deterministic policy gradient can also be obtained as the limit of the stochastic policy gradient as the policy's variance tends to zero.

Deterministic Policy Gradient Theorem
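The theorem's equation appears to have been dropped; its standard statement is:

```latex
\nabla_\theta J(\mu_\theta)
  = \int_{\mathcal{S}} \rho^{\mu}(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)}\, \mathrm{d}s
  = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)} \right]
```

Note that the integral over actions has disappeared: the expectation is over states only.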

Deterministic Actor-Critic Algorithms

On-Policy Deterministic Actor-Critic
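The update equations for this section are missing; in the paper, the on-policy variant uses a SARSA-style critic Q^w and the deterministic policy gradient for the actor:

```latex
\delta_t = r_t + \gamma\, Q^{w}(s_{t+1}, a_{t+1}) - Q^{w}(s_t, a_t) \\
w_{t+1} = w_t + \alpha_w\, \delta_t\, \nabla_w Q^{w}(s_t, a_t) \\
\theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q^{w}(s_t, a_t)\big|_{a = \mu_\theta(s_t)}
```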

Off-Policy Deterministic Actor-Critic

The objective function becomes the value of the target policy, averaged over the state distribution of the behaviour policy.
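The off-policy updates that followed appear to be missing; in the paper, the critic uses Q-learning (bootstrapping with the target policy's action rather than the behaviour policy's next action):

```latex
\delta_t = r_t + \gamma\, Q^{w}\!\left(s_{t+1}, \mu_\theta(s_{t+1})\right) - Q^{w}(s_t, a_t) \\
w_{t+1} = w_t + \alpha_w\, \delta_t\, \nabla_w Q^{w}(s_t, a_t) \\
\theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q^{w}(s_t, a_t)\big|_{a = \mu_\theta(s_t)}
```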

We note that stochastic off-policy actor-critic algorithms typically use importance sampling for both actor and critic (Degris et al., 2012b). However, because the deterministic policy gradient removes the integral over actions, we can avoid importance sampling in the actor; and by using Q-learning, we can avoid importance sampling in the critic.
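As a minimal sketch of why no importance sampling is needed, the following hypothetical single-step update implements the off-policy deterministic actor-critic with linear function approximators. The feature maps `phi`, `psi`, the scalar action, and all hyperparameters are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Illustrative linear approximators (assumptions, not from the paper):
#   actor:  mu_theta(s) = theta . phi(s)      (scalar action)
#   critic: Q_w(s, a)   = w . psi(s, a)

def phi(s):
    return np.array([s, 1.0])            # actor features

def psi(s, a):
    return np.array([s * a, a, 1.0])     # critic features

def opdac_step(theta, w, s, a, r, s_next,
               gamma=0.99, alpha_w=0.1, alpha_theta=0.01):
    """One update from a transition (s, a, r, s_next) generated by an
    arbitrary behaviour policy -- note no importance weights appear."""
    # Q-learning critic: bootstrap with the *target* policy's action.
    a_next = theta @ phi(s_next)
    delta = r + gamma * (w @ psi(s_next, a_next)) - w @ psi(s, a)
    w = w + alpha_w * delta * psi(s, a)
    # Deterministic policy gradient:
    #   grad_theta mu(s) * grad_a Q(s, a) | a = mu(s)
    grad_a_Q = w[0] * s + w[1]           # d/da of w . psi(s, a)
    theta = theta + alpha_theta * phi(s) * grad_a_Q
    return theta, w

theta = np.zeros(2)
w = np.zeros(3)
theta, w = opdac_step(theta, w, s=1.0, a=0.5, r=1.0, s_next=0.5)
```

The behaviour policy that produced `a` never enters the update: the critic bootstraps off the target policy's own action, and the actor follows the action-gradient of the critic.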

Compatible Function Approximation
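The conditions for this section appear to be missing; in the paper, a critic Q^w is compatible with the deterministic policy μ_θ (i.e., substituting it for the true Q does not bias the policy gradient) when its action-gradient at the policy's action is linear in the policy's features:

```latex
\nabla_a Q^{w}(s, a)\big|_{a = \mu_\theta(s)} = \nabla_\theta \mu_\theta(s)^{\top} w,
\qquad
Q^{w}(s, a) = \left(a - \mu_\theta(s)\right)^{\top} \nabla_\theta \mu_\theta(s)^{\top} w + V^{v}(s)
```

where V^v(s) is any baseline independent of the action.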




Original link: https://www.haomeiwen.com/subject/ojomrqtx.html