In this chapter we discuss policy gradient methods.
Policy Approximation and its Advantages
- the approximate policy can approach a deterministic policy, whereas with ε-greedy action selection over action values there is always an ε probability of selecting a random action (a minimal soft-max sketch follows this list)
- In problems with significant function approximation, the best approximate policy may be stochastic
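A minimal sketch of the soft-max-in-action-preferences parameterization with linear preferences $h(s,a,\boldsymbol{\theta}) = \boldsymbol{\theta}^\top \mathbf{x}(s,a)$; the indicator features and the particular preference values below are illustrative assumptions, not from the book.

```python
import numpy as np

def softmax_policy(theta, x):
    """Soft-max in action preferences: pi(a|s,theta) ∝ exp(h(s,a,theta)),
    with linear preferences h(s,a,theta) = theta @ x[a]."""
    h = x @ theta          # action preferences for this state
    h = h - h.max()        # subtract max for numerical stability
    e = np.exp(h)
    return e / e.sum()

# Two actions with one indicator feature each (illustrative feature choice).
x = np.eye(2)
for pref in [0.0, 1.0, 3.0, 6.0]:
    theta = np.array([pref, 0.0])
    print(pref, softmax_policy(theta, x))
# As the preference for action 0 grows, pi(a0|s) -> 1, so the soft-max policy
# can approach a deterministic policy; with epsilon-greedy there is always an
# epsilon probability of selecting a random action.
```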
The Policy Gradient Theorem
There is also an important theoretical advantage:
With continuous policy parameterization the action probabilities change smoothly as a function of the learned parameter, whereas in ε-greedy selection the action probabilities may change dramatically for an arbitrarily small change in the estimated action values.
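A small numeric illustration of this contrast; ε = 0.1, the two actions, and the tiny value change below are my own illustrative numbers.

```python
import numpy as np

def eps_greedy_probs(q, eps=0.1):
    """Action probabilities implied by epsilon-greedy selection over estimates q."""
    probs = np.full(len(q), eps / len(q))
    probs[np.argmax(q)] += 1.0 - eps
    return probs

def softmax_probs(h):
    """Action probabilities of a soft-max policy with preferences h."""
    e = np.exp(h - h.max())
    return e / e.sum()

q_before = np.array([0.500, 0.499])   # action 0 is barely better
q_after  = np.array([0.499, 0.500])   # an arbitrarily small change flips the argmax

print(eps_greedy_probs(q_before), eps_greedy_probs(q_after))   # jumps between 0.95 and 0.05
print(softmax_probs(q_before), softmax_probs(q_after))         # moves only slightly
```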
The objective function is simply a value: in the episodic case, performance is the value of the start state under the parameterized policy.
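In symbols, with $s_0$ the (nonrandom) start state of each episode:

$$J(\boldsymbol{\theta}) \doteq v_{\pi_{\boldsymbol{\theta}}}(s_0).$$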
Proof of the Policy Gradient Theorem
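The statement being proved (episodic case): the gradient of performance is proportional to a sum of action-value-weighted policy gradients, weighted by the on-policy state distribution $\mu$, with no term involving the gradient of the state distribution:

$$\nabla J(\boldsymbol{\theta}) \propto \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla\pi(a|s,\boldsymbol{\theta}).$$

The proof applies the product rule to $\nabla v_\pi(s) = \nabla\big[\sum_a \pi(a|s,\boldsymbol{\theta})\, q_\pi(s,a)\big]$ and unrolls the resulting recursion over future states until only the sum above remains, up to a constant of proportionality.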
REINFORCE: Monte Carlo Policy Gradient
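A minimal sketch of the REINFORCE update $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha\,\gamma^t\,G_t\,\nabla\ln\pi(A_t|S_t,\boldsymbol{\theta})$ for a linear soft-max policy; the episode representation, feature matrices, and step-size values below are assumptions for illustration, not the book's pseudocode.

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

def grad_log_pi(theta, x, a):
    """∇ ln pi(a|s,theta) for a linear soft-max policy.

    x: (num_actions, num_features) feature matrix for the current state.
    For this parameterization the gradient is x[a] - sum_b pi(b|s) x[b].
    """
    pi = softmax(x @ theta)
    return x[a] - pi @ x

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """One REINFORCE update per time step of a completed episode.

    episode: list of (x, a, r) with x the state's action-feature matrix,
             a the action taken, r the reward received on that step.
    """
    rewards = [r for (_, _, r) in episode]
    for t, (x, a, _) in enumerate(episode):
        # Monte Carlo return G_t from time t to the end of the episode
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
        theta = theta + alpha * (gamma ** t) * G * grad_log_pi(theta, x, a)
    return theta

# Toy usage: two actions, identity features, a hand-made three-step episode.
theta = np.zeros(2)
episode = [(np.eye(2), 0, 1.0), (np.eye(2), 1, 0.0), (np.eye(2), 0, 1.0)]
theta = reinforce_update(theta, episode)
print(theta)
```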
REINFORCE with Baseline
The baseline can be any function, even a random variable, as long as it does not vary with a; the equation remains valid because the subtracted quantity is zero
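Why the subtraction changes nothing: summed over actions, the baseline term vanishes because the action probabilities sum to one,

$$\sum_a b(s)\,\nabla\pi(a|s,\boldsymbol{\theta}) \;=\; b(s)\,\nabla\sum_a \pi(a|s,\boldsymbol{\theta}) \;=\; b(s)\,\nabla 1 \;=\; 0.$$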
One natural choice for the baseline is an estimate of the state value, $\hat v(S_t,\mathbf{w})$, with the weight vector $\mathbf{w}$ learned by one of the value-estimation methods of the previous chapters.
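With this learned baseline, the per-time-step updates of REINFORCE with baseline are (with separate step sizes $\alpha^{\mathbf{w}}$ and $\alpha^{\boldsymbol{\theta}}$):

$$\delta \leftarrow G_t - \hat v(S_t,\mathbf{w}), \qquad \mathbf{w} \leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\,\delta\,\nabla\hat v(S_t,\mathbf{w}), \qquad \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}}\,\gamma^t\,\delta\,\nabla\ln\pi(A_t|S_t,\boldsymbol{\theta}).$$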
Actor–Critic Methods
Although the REINFORCE-with-baseline method learns both a policy and a state-value function, we do not consider it to be an actor–critic method because its state-value function is used only as a baseline, not as a critic.
REINFORCE with baseline is unbiased and will converge asymptotically to a local optimum, but like all Monte Carlo methods it tends to learn slowly (produce estimates of high variance) and to be inconvenient to implement online or for continuing problems.
First consider one-step actor–critic methods, the analog of the TD methods introduced in Chapter 6, such as TD(0), Sarsa(0), and Q-learning.
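A minimal sketch of the one-step (episodic) actor–critic update, with a linear critic $\hat v(s,\mathbf{w}) = \mathbf{w}^\top \boldsymbol{\phi}(s)$ and a linear soft-max actor; the feature representations and step sizes are illustrative assumptions.

```python
import numpy as np

def softmax(h):
    e = np.exp(h - h.max())
    return e / e.sum()

def one_step_actor_critic_update(theta, w, phi_s, x_s, a, r, phi_next, I,
                                 alpha_theta=0.01, alpha_w=0.1, gamma=0.99,
                                 terminal=False):
    """One step of the episodic one-step actor-critic.

    phi_s, phi_next : state feature vectors for the linear critic v_hat = w @ phi
    x_s             : (num_actions, num_features) feature matrix for the actor
    I               : accumulated discount gamma^t carried across the episode
    """
    # Critic's one-step TD error; the value of a terminal state is 0 by definition.
    v_next = 0.0 if terminal else w @ phi_next
    delta = r + gamma * v_next - w @ phi_s

    # Critic: semi-gradient TD(0) update of the state-value weights.
    w = w + alpha_w * delta * phi_s

    # Actor: policy-gradient step along grad ln pi(a|s,theta), scaled by I and delta.
    pi = softmax(x_s @ theta)
    grad_log_pi = x_s[a] - pi @ x_s
    theta = theta + alpha_theta * I * delta * grad_log_pi

    return theta, w, gamma * I                # I <- gamma * I for the next step
```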
Adding eligibility traces
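In the actor–critic with eligibility traces, the single gradient steps above are replaced by accumulating traces $\mathbf{z}^{\mathbf{w}}$ and $\mathbf{z}^{\boldsymbol{\theta}}$ (with separate decay rates $\lambda^{\mathbf{w}}$ and $\lambda^{\boldsymbol{\theta}}$), and the same TD error $\delta$ updates both parameter vectors:

$$\begin{aligned}
\mathbf{z}^{\mathbf{w}} &\leftarrow \gamma\lambda^{\mathbf{w}}\,\mathbf{z}^{\mathbf{w}} + \nabla\hat v(S_t,\mathbf{w}), & \mathbf{w} &\leftarrow \mathbf{w} + \alpha^{\mathbf{w}}\,\delta\,\mathbf{z}^{\mathbf{w}},\\
\mathbf{z}^{\boldsymbol{\theta}} &\leftarrow \gamma\lambda^{\boldsymbol{\theta}}\,\mathbf{z}^{\boldsymbol{\theta}} + I\,\nabla\ln\pi(A_t|S_t,\boldsymbol{\theta}), & \boldsymbol{\theta} &\leftarrow \boldsymbol{\theta} + \alpha^{\boldsymbol{\theta}}\,\delta\,\mathbf{z}^{\boldsymbol{\theta}}.
\end{aligned}$$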
Policy Gradient for Continuing Problems
μ is the steady-state distribution under π, assumed to exist and to be independent of the starting state
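In the continuing case the performance measure is the average reward rate, and μ is defined by the stationarity condition that selecting actions according to π leaves the state distribution unchanged:

$$J(\boldsymbol{\theta}) \doteq r(\pi) = \sum_s \mu(s) \sum_a \pi(a|s,\boldsymbol{\theta}) \sum_{s',r} p(s',r|s,a)\, r, \qquad \sum_s \mu(s) \sum_a \pi(a|s,\boldsymbol{\theta})\, p(s'|s,a) = \mu(s') \;\; \text{for all } s'.$$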
Proof of the Policy Gradient Theorem (continuing case)
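The statement proved here has the same form as the episodic case, but with exact equality and with $q_\pi$ the differential action-value function:

$$\nabla J(\boldsymbol{\theta}) = \sum_s \mu(s) \sum_a q_\pi(s,a)\, \nabla\pi(a|s,\boldsymbol{\theta}).$$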