
Introduction to Reinforcement Learning: Policy Gradient Methods

Author: 初七123 | Published 2019-01-07 15:15

    In this chapter we discuss policy gradient methods.

    Policy Approximation and its Advantages

    1. The approximate policy can approach a deterministic policy, whereas with ε-greedy action selection over action values there is always an ε probability of selecting a random action.
    2. In problems with significant function approximation, the best approximate policy may be stochastic.
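
    A standard parameterization that realizes these advantages for discrete action spaces is the softmax over action preferences $h(s,a,\boldsymbol\theta)$, which may themselves be linear in features or computed by a neural network:

    $$\pi(a\mid s,\boldsymbol\theta) \doteq \frac{e^{h(s,a,\boldsymbol\theta)}}{\sum_b e^{h(s,b,\boldsymbol\theta)}}$$

    If one preference grows much larger than the others, the policy approaches a deterministic one, something ε-greedy action selection over action values can never do.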

    The Policy Gradient Theorem

    There is also an important theoretical advantage: with continuous policy parameterization the action probabilities change smoothly as a function of the learned parameter, whereas in ε-greedy selection the action probabilities may change dramatically for an arbitrarily small change in the estimated action values.

    The objective function is the value itself:
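
    $$J(\boldsymbol\theta) \doteq v_{\pi_{\boldsymbol\theta}}(s_0)$$

    where $v_{\pi_{\boldsymbol\theta}}$ is the true value function of the policy $\pi_{\boldsymbol\theta}$ and $s_0$ is the start state of an episode.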

    Proof of the Policy Gradient Theorem
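
    The theorem states that the gradient of the objective can be written without any derivative of the state distribution:

    $$\nabla J(\boldsymbol\theta) \propto \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla\pi(a\mid s,\boldsymbol\theta)$$

    where μ is the on-policy distribution under π; in the episodic case the constant of proportionality is the average episode length, and in the continuing case it is 1.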

    REINFORCE: Monte Carlo Policy Gradient
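
    Because $\mathbb{E}_\pi[q_\pi(S_t,A_t)\mid S_t,A_t] = \mathbb{E}_\pi[G_t\mid S_t,A_t]$, the sampled return can replace the action value, giving the REINFORCE update applied at each step of an episode:

    $$\boldsymbol\theta_{t+1} \doteq \boldsymbol\theta_t + \alpha\, G_t\, \nabla\ln\pi(A_t\mid S_t,\boldsymbol\theta_t)$$

    The following is a minimal sketch of the episodic loop in Python, assuming a tabular softmax policy on a toy corridor MDP; the environment, step sizes, and episode count below are illustrative and not part of the original article. With γ = 1 the γ^t factor in the book's update is simply 1.

        import numpy as np

        # Toy corridor: states 0..2 are non-terminal, state 3 is terminal.
        # Actions: 0 = left, 1 = right. Reward -1 per step, so shorter episodes are better.
        N_STATES, N_ACTIONS, GAMMA, ALPHA = 4, 2, 1.0, 0.05
        rng = np.random.default_rng(0)

        def step(s, a):
            s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
            return s_next, -1.0, s_next == N_STATES - 1      # next state, reward, done

        def policy(theta, s):
            prefs = theta[s] - theta[s].max()                # action preferences h(s, ., theta)
            p = np.exp(prefs)                                # softmax in action preferences
            return p / p.sum()

        theta = np.zeros((N_STATES, N_ACTIONS))              # tabular preferences
        for episode in range(500):
            s, trajectory, done = 0, [], False
            while not done:                                  # generate an episode following pi_theta
                a = rng.choice(N_ACTIONS, p=policy(theta, s))
                s_next, r, done = step(s, a)
                trajectory.append((s, a, r))
                s = s_next
            G = 0.0
            for s, a, r in reversed(trajectory):             # returns accumulated backwards
                G = r + GAMMA * G
                grad_ln_pi = -policy(theta, s)               # d ln pi / d h(s,b) = -pi(b|s) for b != a
                grad_ln_pi[a] += 1.0                         # ... and 1 - pi(a|s) for b == a
                theta[s] += ALPHA * G * grad_ln_pi           # REINFORCE update

    After training, policy(theta, 0) should put most of its probability on the "right" action.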

    REINFORCE with Baseline

    The baseline can be any function, even a random variable, as long as it does not vary with a; the equation remains valid because the subtracted quantity is zero: $\sum_a b(s)\,\nabla\pi(a\mid s,\boldsymbol\theta) = b(s)\,\nabla\sum_a \pi(a\mid s,\boldsymbol\theta) = b(s)\,\nabla 1 = 0$.

    One natural choice for the baseline is an estimate of the state value, $\hat v(S_t,\mathbf w)$.
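
    With a learned baseline, each step of an episode updates both parameter vectors: a Monte Carlo update for the value weights and the policy gradient update scaled by $G_t - \hat v(S_t,\mathbf w)$:

    $$\delta_t \doteq G_t - \hat v(S_t,\mathbf w), \qquad \mathbf w \leftarrow \mathbf w + \alpha^{\mathbf w}\,\delta_t\,\nabla\hat v(S_t,\mathbf w), \qquad \boldsymbol\theta \leftarrow \boldsymbol\theta + \alpha^{\boldsymbol\theta}\,\delta_t\,\nabla\ln\pi(A_t\mid S_t,\boldsymbol\theta)$$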

    Actor–Critic Methods

    Although the REINFORCE-with-baseline method learns both a policy and a state-value function, we do not consider it to be an actor–critic method because its state-value function is used only as a baseline, not as a critic.

    REINFORCE with baseline is unbiased and will converge asymptotically to a local optimum, but like all Monte Carlo methods it tends to learn slowly (produce estimates of high variance) and to be inconvenient to implement online or for continuing problems.

    First consider one-step actor–critic methods, the analog of the TD methods introduced in Chapter 6, such as TD(0), Sarsa(0), and Q-learning.
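
    One-step actor–critic replaces the full return with the one-step return and uses the resulting TD error to update both the critic's weights and the actor's policy parameters:

    $$\delta_t = R_{t+1} + \gamma\,\hat v(S_{t+1},\mathbf w) - \hat v(S_t,\mathbf w)$$

    $$\mathbf w \leftarrow \mathbf w + \alpha^{\mathbf w}\,\delta_t\,\nabla\hat v(S_t,\mathbf w), \qquad \boldsymbol\theta \leftarrow \boldsymbol\theta + \alpha^{\boldsymbol\theta}\,\delta_t\,\nabla\ln\pi(A_t\mid S_t,\boldsymbol\theta)$$

    Because it bootstraps, this update is fully online and incremental, at the cost of introducing bias.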

    Adding eligibility traces
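
    With eligibility traces, the critic and the actor each maintain their own trace vector, decayed by $\gamma\lambda$ and incremented by the corresponding gradient; the TD error then drives updates along the traces:

    $$\mathbf z^{\mathbf w} \leftarrow \gamma\lambda^{\mathbf w}\mathbf z^{\mathbf w} + \nabla\hat v(S_t,\mathbf w), \qquad \mathbf z^{\boldsymbol\theta} \leftarrow \gamma\lambda^{\boldsymbol\theta}\mathbf z^{\boldsymbol\theta} + \nabla\ln\pi(A_t\mid S_t,\boldsymbol\theta)$$

    $$\mathbf w \leftarrow \mathbf w + \alpha^{\mathbf w}\,\delta_t\,\mathbf z^{\mathbf w}, \qquad \boldsymbol\theta \leftarrow \boldsymbol\theta + \alpha^{\boldsymbol\theta}\,\delta_t\,\mathbf z^{\boldsymbol\theta}$$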

    Policy Gradient for Continuing Problems

    μ is the steady-state distribution under π
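
    In the continuing setting the objective is the average rate of reward per time step:

    $$J(\boldsymbol\theta) \doteq r(\pi) \doteq \lim_{h\to\infty}\frac{1}{h}\sum_{t=1}^{h}\mathbb{E}[R_t\mid S_0,\,A_{0:t-1}\sim\pi] = \sum_s \mu(s)\sum_a \pi(a\mid s,\boldsymbol\theta)\sum_{s',r} p(s',r\mid s,a)\,r$$

    and values are defined with respect to the differential return $G_t \doteq R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + \cdots$.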

    Proof of the Policy Gradient Theorem for the continuing case

    Policy Parameterization for Continuous Actions
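
    For continuous action spaces the policy can be parameterized as a probability density, for example a Gaussian whose mean and standard deviation are state-dependent functions:

    $$\pi(a\mid s,\boldsymbol\theta) \doteq \frac{1}{\sigma(s,\boldsymbol\theta)\sqrt{2\pi}}\exp\!\left(-\frac{(a-\mu(s,\boldsymbol\theta))^2}{2\sigma(s,\boldsymbol\theta)^2}\right)$$

    with, for example, a linear mean $\mu(s,\boldsymbol\theta) = \boldsymbol\theta_\mu^\top \mathbf x_\mu(s)$ and $\sigma(s,\boldsymbol\theta) = \exp\!\big(\boldsymbol\theta_\sigma^\top \mathbf x_\sigma(s)\big)$ so that the standard deviation stays positive.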
