Background
Optimization Objective
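For reference, the performance objective in the DPG paper (Silver et al., 2014) is the expected reward under the discounted state distribution $\rho^\pi$ induced by the policy:

$$J(\pi_\theta) = \int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \pi_\theta(a \mid s)\, r(s, a)\, \mathrm{d}a\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}\big[ r(s, a) \big]$$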
Stochastic Policy Gradient Theorem
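The key background result is the stochastic policy gradient theorem (Sutton et al., 2000):

$$\nabla_\theta J(\pi_\theta) = \int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi}(s, a)\, \mathrm{d}a\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a) \big]$$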
This formula reduces the stochastic policy gradient to a simple expectation, which can be estimated by sampling states and actions from the current policy.
Off-Policy Actor-Critic
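In the off-policy actor-critic (OffPAC) setting of Degris et al. (2012), trajectories are generated by a separate behaviour policy $\beta(a \mid s)$, the objective is the value of $\pi_\theta$ averaged over the behaviour policy's state distribution $\rho^\beta$, and the (approximate) off-policy policy gradient requires an importance-sampling ratio in the actor:

$$\nabla_\theta J_\beta(\pi_\theta) \approx \mathbb{E}_{s \sim \rho^{\beta},\, a \sim \beta}\!\left[ \frac{\pi_\theta(a \mid s)}{\beta(a \mid s)}\, \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s, a) \right]$$

Keeping this ratio in mind helps explain the simplification noted in the off-policy deterministic actor-critic below.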
Gradients of Deterministic Policies
Action-Value Gradients
For the continuous-action case, globally maximising $Q$ at every step is impractical; instead, for each visited state, the policy parameters are moved in a direction proportional to the action-value gradient $\nabla_\theta Q^{\mu^k}(s, \mu_\theta(s))$, averaged over the state distribution. Applying the chain rule then factors this update into the gradient of the policy and the gradient of the action value with respect to actions.
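Written out, the resulting updates are (with step size $\alpha$ and the state distribution $\rho^{\mu^k}$ of the current policy $\mu^k$):

$$\theta^{k+1} = \theta^{k} + \alpha\, \mathbb{E}_{s \sim \rho^{\mu^k}}\Big[ \nabla_\theta Q^{\mu^k}\big(s, \mu_\theta(s)\big) \Big]$$

$$\theta^{k+1} = \theta^{k} + \alpha\, \mathbb{E}_{s \sim \rho^{\mu^k}}\Big[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu^k}(s, a)\big|_{a = \mu_\theta(s)} \Big]$$

where $\nabla_\theta \mu_\theta(s)$ is the Jacobian of the policy with respect to its parameters. As stated, this update also changes the state distribution $\rho^{\mu}$, so it is not obvious that it improves the objective without accounting for that change.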
However, the theory below shows that, like the stochastic policy gradient theorem, there is no need to compute the gradient of the state distribution; and that the intuitive update outlined above is following precisely the gradient of the performance objective.
The same result can also be reached by taking the limit of a stochastic policy: as the exploration variance of a stochastic policy built around $\mu_\theta$ shrinks to zero, its policy gradient converges to the deterministic policy gradient.
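Theorem 2 of the paper makes this precise. For a stochastic policy $\pi_{\mu_\theta, \sigma}$ obtained by perturbing $\mu_\theta$ with a variance parameter $\sigma$,

$$\lim_{\sigma \downarrow 0} \nabla_\theta J(\pi_{\mu_\theta, \sigma}) = \nabla_\theta J(\mu_\theta)$$

so the familiar machinery of policy gradients (actor-critic, compatible function approximation, natural gradients) also applies to deterministic policies.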
Deterministic Policy Gradient Theorem
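Theorem 1 (the deterministic policy gradient theorem) states that, under mild regularity conditions,

$$\nabla_\theta J(\mu_\theta) = \int_{\mathcal{S}} \rho^{\mu}(s)\, \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)}\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^{\mu}}\Big[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)} \Big]$$

There is no integral over the action space and no gradient of the state distribution, so the intuitive update above is exactly the gradient of the performance objective.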
Deterministic Actor-Critic Algorithms
On-Policy Deterministic Actor-Critic
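A sketch of the on-policy updates from the paper: the critic $Q^w$ is learned with a Sarsa-style temporal-difference error while the actor ascends the deterministic policy gradient,

$$\delta_t = r_t + \gamma\, Q^{w}(s_{t+1}, a_{t+1}) - Q^{w}(s_t, a_t)$$

$$w_{t+1} = w_t + \alpha_w\, \delta_t\, \nabla_w Q^{w}(s_t, a_t)$$

$$\theta_{t+1} = \theta_t + \alpha_\theta\, \nabla_\theta \mu_\theta(s_t)\, \nabla_a Q^{w}(s_t, a_t)\big|_{a = \mu_\theta(s_t)}$$

In practice a purely deterministic on-policy learner does not explore, which is the main motivation for the off-policy variant below.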
Off-Policy Deterministic Actor-Critic
For the off-policy case, the performance objective is changed to the value of the target policy $\mu_\theta$, averaged over the state distribution of the behaviour policy $\beta$.
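Concretely,

$$J_\beta(\mu_\theta) = \int_{\mathcal{S}} \rho^{\beta}(s)\, Q^{\mu}\big(s, \mu_\theta(s)\big)\, \mathrm{d}s, \qquad \nabla_\theta J_\beta(\mu_\theta) \approx \mathbb{E}_{s \sim \rho^{\beta}}\Big[ \nabla_\theta \mu_\theta(s)\, \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)} \Big]$$

which, as in OffPAC, drops a term that depends on the gradient of the action value with respect to the policy parameters.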
We note that stochastic off-policy actor-critic algorithms typically use importance sampling for both actor and critic (Degris et al., 2012b). However, because the deterministic policy gradient removes the integral over actions, we can avoid importance sampling in the actor; and by using Q-learning, we can avoid importance sampling in the critic.
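To make this concrete, here is a minimal sketch of an off-policy deterministic actor-critic loop in the spirit of the paper's COPDAC-Q algorithm. It assumes a one-dimensional action, a linear policy $\mu_\theta(s) = \theta^\top \phi(s)$, and the compatible linear critic described in the next section; the environment interface (`env.reset()` / `env.step()` returning next state, reward, done) and the feature map `phi` are placeholder assumptions, not from the paper. Note that no importance-sampling ratio appears in either the actor or the critic update.

```python
import numpy as np

def copdac_q_sketch(env, phi, n_features, n_steps=10_000,
                    gamma=0.99, alpha_theta=1e-4, alpha_w=1e-3, alpha_v=1e-3,
                    noise_std=0.3, seed=0):
    """Minimal sketch of off-policy deterministic actor-critic (COPDAC-Q style).

    Assumes a 1-D continuous action, a linear policy mu(s) = theta . phi(s),
    and the compatible critic Q(s, a) = (a - mu(s)) * (phi(s) . w) + phi(s) . v.
    `env` is a placeholder with reset()/step(a) -> (next_state, reward, done),
    and `phi` maps a raw state to an n_features vector.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)   # actor parameters
    w = np.zeros(n_features)       # advantage (action-gradient) weights
    v = np.zeros(n_features)       # state-value weights

    s = env.reset()
    for _ in range(n_steps):
        x = phi(s)
        mu = theta @ x                              # deterministic target action
        a = mu + noise_std * rng.standard_normal()  # behaviour policy: exploration noise
        s_next, r, done = env.step(a)
        x_next = phi(s_next)

        # Critic: Q-learning target bootstraps with the *target* policy's action
        # mu(s'), so Q(s', mu(s')) = phi(s') . v (the advantage term vanishes there).
        q_sa = (a - mu) * (x @ w) + x @ v
        q_next = 0.0 if done else x_next @ v
        delta = r + gamma * q_next - q_sa

        # Critic updates -- no importance-sampling ratio anywhere.
        w += alpha_w * delta * (a - mu) * x
        v += alpha_v * delta * x

        # Actor update: grad_theta mu(s) * grad_a Q(s, a)|_{a=mu(s)} = x * (x . w).
        theta += alpha_theta * x * (x @ w)

        s = env.reset() if done else s_next

    return theta, w, v
```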
Compatible Function Approximation
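For reference, the compatibility conditions (Theorem 3 of the paper): a critic $Q^w$ is compatible with the deterministic policy $\mu_\theta$ when

$$\nabla_a Q^{w}(s, a)\big|_{a = \mu_\theta(s)} = \nabla_\theta \mu_\theta(s)^{\top} w$$

and $w$ minimises the mean-squared error between this quantity and the true gradient $\nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)}$; substituting such a $Q^w$ for $Q^\mu$ leaves the deterministic policy gradient unchanged. A simple family satisfying the first condition is

$$Q^{w}(s, a) = \big(a - \mu_\theta(s)\big)^{\top} \nabla_\theta \mu_\theta(s)^{\top} w + V^{v}(s)$$

where $V^v(s)$ is any action-independent baseline, e.g. a linear value function $V^v(s) = v^{\top} \phi(s)$; this is the critic used in the sketch above.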