Author: David Silver
Outline
- Introduction
- Monte-Carlo Learning
- Temporal-Difference Learning
- TD(λ)
Model-Free Reinforcement Learning
- Last lecture: Planning by dynamic programming
  - Solve a known MDP
- This lecture: Model-free prediction
  - Estimate the value function of an unknown MDP
- Next lecture: Model-free control
  - Optimise the value function of an unknown MDP
Monte-Carlo Reinforcement Learning
- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions / rewards
- MC learns from complete episodes: no bootstrapping
- MC uses the simplest possible idea: value = mean return
- Caveat: can only apply MC to episodic MDPs
- All episodes must terminate
Monte-Carlo Policy Evaluation
- Goal: learn $v_\pi$ from episodes of experience under policy $\pi$: $S_1, A_1, R_2, \ldots, S_T \sim \pi$
- Recall that the return is the total discounted reward: $G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$
- Recall that the value function is the expected return: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
- **Monte-Carlo** policy evaluation uses the **empirical mean return** instead of the expected return
First-Visit Monte-Carlo Policy Evaluation
- To evaluate state $s$
- The **first** time-step $t$ that state $s$ is visited in an episode:
  - Increment counter $N(s) \leftarrow N(s) + 1$
  - Increment total return $S(s) \leftarrow S(s) + G_t$
- Value is estimated by mean return $V(s) = S(s) / N(s)$
- By the law of large numbers, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$
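A minimal Python sketch of first-visit MC prediction as defined above. The episode format (a list of `(state, reward)` pairs, with `reward` playing the role of $R_{t+1}$) and the `gamma` parameter are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """First-visit MC: average the return G_t from the first visit to each state.

    `episodes` is assumed to be a list of episodes, each a list of
    (state, reward) pairs, where `reward` is the reward received on
    leaving that state (i.e. R_{t+1} for S_t).
    """
    N = defaultdict(int)      # visit counters N(s)
    S = defaultdict(float)    # total returns S(s)
    V = defaultdict(float)    # value estimates V(s) = S(s) / N(s)

    for episode in episodes:
        # Compute the return G_t for every time-step by working backwards.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns[t] = G

        # Only the first visit to each state contributes in this episode.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state in seen:
                continue
            seen.add(state)
            N[state] += 1
            S[state] += returns[t]
            V[state] = S[state] / N[state]

    return V
```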
Every-Visit Monte-Carlo Policy Evaluation
- To evaluate state $s$
- **Every** time-step $t$ that state $s$ is visited in an episode:
  - Increment counter $N(s) \leftarrow N(s) + 1$
  - Increment total return $S(s) \leftarrow S(s) + G_t$
- Value is estimated by mean return $V(s) = S(s) / N(s)$
- Again, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$
Blackjack Example
Blackjack Value Function after Monte-Carlo Learning
Incremental Mean
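The derivation under this heading was lost in extraction; as a reminder of the standard identity the slide derives, the mean $\mu_1, \mu_2, \ldots$ of a sequence $x_1, x_2, \ldots$ can be computed incrementally:

$$
\mu_k = \frac{1}{k}\sum_{j=1}^{k} x_j
      = \frac{1}{k}\left(x_k + (k-1)\,\mu_{k-1}\right)
      = \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)
$$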
Incremental Monte-Carlo Updates
- Update $V(s)$ incrementally after episode $S_1, A_1, R_2, \ldots, S_T$
- For each state $S_t$ with return $G_t$:
  - $N(S_t) \leftarrow N(S_t) + 1$
  - $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$
- In **non-stationary** problems, it can be useful to track a running mean, i.e. forget old episodes:
  - $V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$
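A minimal sketch of the constant-$\alpha$ (running mean) update above. The function name, dict-based value table, and default `alpha` are illustrative assumptions.

```python
def incremental_mc_update(V, episode_states, episode_returns, alpha=0.1):
    """Constant-alpha every-visit MC update: V(S_t) += alpha * (G_t - V(S_t)).

    `episode_states[t]` is S_t and `episode_returns[t]` is the return G_t
    already computed for that time-step; `V` is a dict of value estimates.
    """
    for state, G in zip(episode_states, episode_returns):
        v = V.get(state, 0.0)
        V[state] = v + alpha * (G - v)   # running mean: gradually forgets old episodes
    return V
```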
Temporal-Difference Learning
- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions / rewards
- TD learns from **incomplete** episodes, by bootstrapping
- TD updates a guess towards a guess
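A minimal sketch of the TD(0) step these bullets describe: update the current value estimate toward the TD target $R_{t+1} + \gamma V(S_{t+1})$, i.e. a guess toward a guess. The signature, dict-based value table, and default step size are illustrative assumptions.

```python
def td0_update(V, state, reward, next_state, done, alpha=0.1, gamma=1.0):
    """One TD(0) update: V(S_t) <- V(S_t) + alpha * (TD target - V(S_t)).

    The TD target is R_{t+1} + gamma * V(S_{t+1}); the value of a
    terminal state is taken to be zero.
    """
    td_target = reward + (0.0 if done else gamma * V.get(next_state, 0.0))
    td_error = td_target - V.get(state, 0.0)          # delta_t
    V[state] = V.get(state, 0.0) + alpha * td_error
    return V
```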
MC and TD
Driving Home Example
Driving Home Example: MC vs. TD
Advantages and Disadvantages of MC vs. TD
- TD can learn before knowing the final outcome
  - TD can learn online after every step
  - MC must wait until the end of the episode before the return is known
- TD can learn without the final outcome
  - TD can learn from incomplete sequences
  - MC can only learn from complete sequences
  - TD works in continuing (non-terminating) environments
  - MC only works for episodic (terminating) environments
Bias/Variance Trade-Off
- Return $G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$ is an **unbiased** estimate of $v_\pi(S_t)$
- True TD target $R_{t+1} + \gamma v_\pi(S_{t+1})$ is an **unbiased** estimate of $v_\pi(S_t)$
- TD target $R_{t+1} + \gamma V(S_{t+1})$ is a **biased** estimate of $v_\pi(S_t)$
- TD target is much **lower variance** than the return:
  - Return depends on **many** random actions, transitions, rewards
  - TD target depends on **one** random action, transition, reward
Advantages and Disadvantages of MC vs. TD (2)
- MC has high variance, zero bias
  - Good convergence properties
  - (even with function approximation)
  - Not very sensitive to initial value
  - Very simple to understand and use
- TD has low variance, some bias
  - Usually more efficient than MC
  - TD(0) converges to $v_\pi(s)$
  - (but not always with function approximation)
  - More sensitive to initial value
Random Walk Example
Random Walk: MC vs. TD
Batch MC and TD
AB Example
Certainty Equivalence
Advantages and Disadvantages of MC vs. TD (3)
- TD exploits the Markov property
  - Usually more efficient in Markov environments
- MC does not exploit the Markov property
  - Usually more effective in non-Markov environments
Monte-Carlo Backup
Temporal-Difference Backup
Dynamic Programming Backup
Bootstrapping and Sampling
Unified View of Reinforcement Learning
n-Step Prediction
n-Step Return
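The slide's formulas are missing from the extracted text; as a reminder of the standard definitions used here, the n-step return bootstraps from the value estimate after n steps, and n-step TD moves $V(S_t)$ toward it:

$$
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n}),
\qquad
V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{(n)} - V(S_t) \right)
$$

With $n = 1$ this is TD(0); as $n$ reaches the end of the episode it becomes the Monte-Carlo return.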
Large Random Walk Example
Averaging n-Step Returns
λ-return
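Likewise, the λ-return averages all n-step returns with geometrically decaying weights $(1-\lambda)\lambda^{n-1}$ (the weighting illustrated by the next slide), and forward-view TD(λ) updates toward it:

$$
G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1}\, G_t^{(n)},
\qquad
V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{\lambda} - V(S_t) \right)
$$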
TD(λ) Weighting Function
Forward-view TD(λ)
Forward-View TD(λ) on Large Random Walk
Backward View TD(λ)
- Forward view provides theory
- Backward view provides mechanism
- Update online, every step, from incomplete sequences
Eligibility Traces
Backward View TD(λ)
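A minimal sketch of the mechanism this slide describes: accumulating eligibility traces $E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$ combined with the online update $V(s) \leftarrow V(s) + \alpha\,\delta_t\,E_t(s)$ for every state. The episode format and parameter defaults are illustrative assumptions.

```python
from collections import defaultdict

def backward_td_lambda(episodes, alpha=0.1, gamma=1.0, lam=0.9):
    """Backward-view TD(lambda) with accumulating eligibility traces.

    `episodes` is assumed to be a list of episodes, each a list of
    (state, reward, next_state, done) transitions.
    """
    V = defaultdict(float)

    for episode in episodes:
        E = defaultdict(float)                     # eligibility traces, E_0(s) = 0
        for state, reward, next_state, done in episode:
            # TD error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
            target = reward + (0.0 if done else gamma * V[next_state])
            delta = target - V[state]

            # Decay all traces, then bump the trace of the visited state.
            for s in list(E):
                E[s] *= gamma * lam
            E[state] += 1.0

            # Update every state in proportion to its eligibility trace.
            for s, e in E.items():
                V[s] += alpha * delta * e

    return V
```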
TD(λ) and TD(0)
TD(λ) and MC
MC and TD(1)
Telescoping in TD(1)
TD(λ) and TD(1)
- TD(1) is roughly equivalent to every-visit Monte-Carlo
- Error is accumulated online, step-by-step
- If value function is only updated offline at end of episode
- Then total update is exactly the same as MC
Telescoping in TD(λ)
Forwards and Backwards TD(λ)
Offline Equivalence of Forward and Backward TD
- Offline updates:
  - Updates are accumulated within the episode
  - but applied in batch at the end of the episode
Online Equivalence of Forward and Backward TD
- Online updates:
  - TD(λ) updates are applied online at each step within the episode
  - Forward-view and backward-view TD(λ) are slightly different
- NEW: Exact online TD(λ) achieves perfect equivalence
  - By using a slightly different form of eligibility trace
  - Sutton and van Seijen, ICML 2014
Summary of Forward and Backward TD(λ)
Reference: UCL Course on RL (David Silver)