Author: David Silver
Outline
- Introduction
- Monte-Carlo Learning
- Temporal-Difference Learning
- TD(λ)
Model-Free Reinforcement Learning
- Last lecture: Planning by dynamic programming
  - Solve a known MDP
- This lecture: Model-free prediction
  - Estimate the value function of an unknown MDP
- Next lecture: Model-free control
  - Optimise the value function of an unknown MDP
Monte-Carlo Reinforcement Learning
- MC methods learn directly from episodes of experience
- MC is model-free: no knowledge of MDP transitions / rewards
- MC learns from complete episodes: no bootstrapping
- MC uses the simplest possible idea: value = mean return
- Caveat: can only apply MC to episodic MDPs
- All episodes must terminate
Monte-Carlo Policy Evaluation
- Goal: learn $v_\pi$ from episodes of experience under policy $\pi$: $S_1, A_1, R_2, \ldots, S_T \sim \pi$
- Recall that the return is the total discounted reward: $G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$
- Recall that the value function is the expected return: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
- **Monte-Carlo** policy evaluation uses the **empirical mean return** instead of the expected return
First-Visit Monte-Carlo Policy Evaluation
- To evaluate state $s$
- The **first** time-step $t$ that state $s$ is visited in an episode:
  - Increment counter $N(s) \leftarrow N(s) + 1$
  - Increment total return $S(s) \leftarrow S(s) + G_t$
- Value is estimated by mean return $V(s) = S(s) / N(s)$
- By the law of large numbers, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$
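A minimal Python sketch of first-visit MC prediction as defined above. The episode format (a list of `(state, reward)` pairs, with `reward` playing the role of $R_{t+1}$) and the `gamma` parameter are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

def first_visit_mc_prediction(episodes, gamma=1.0):
    """First-visit MC: average the return G_t from the first visit to each state.

    `episodes` is assumed to be a list of episodes, each a list of
    (state, reward) pairs, where `reward` is the reward received on
    leaving that state (i.e. R_{t+1} for S_t).
    """
    N = defaultdict(int)      # visit counters N(s)
    S = defaultdict(float)    # total returns S(s)
    V = defaultdict(float)    # value estimates V(s) = S(s) / N(s)

    for episode in episodes:
        # Compute the return G_t for every time-step by working backwards.
        G = 0.0
        returns = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, reward = episode[t]
            G = reward + gamma * G
            returns[t] = G

        # Only the first visit to each state contributes in this episode.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state in seen:
                continue
            seen.add(state)
            N[state] += 1
            S[state] += returns[t]
            V[state] = S[state] / N[state]

    return V
```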
Every-Visit Monte-Carlo Policy Evaluation
- To evaluate state $s$
- **Every** time-step $t$ that state $s$ is visited in an episode:
  - Increment counter $N(s) \leftarrow N(s) + 1$
  - Increment total return $S(s) \leftarrow S(s) + G_t$
- Value is estimated by mean return $V(s) = S(s) / N(s)$
- Again, $V(s) \rightarrow v_\pi(s)$ as $N(s) \rightarrow \infty$
Blackjack Example
Blackjack Value Function after Monte-Carlo Learning
Incremental Mean
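The derivation under this heading was lost in extraction; as a reminder of the standard identity the slide derives, the mean $\mu_1, \mu_2, \ldots$ of a sequence $x_1, x_2, \ldots$ can be computed incrementally:

$$
\mu_k = \frac{1}{k}\sum_{j=1}^{k} x_j
      = \frac{1}{k}\left(x_k + (k-1)\,\mu_{k-1}\right)
      = \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)
$$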
Incremental Monte-Carlo Updates
- Update $V(s)$ incrementally after episode $S_1, A_1, R_2, \ldots, S_T$
- For each state $S_t$ with return $G_t$:
  - $N(S_t) \leftarrow N(S_t) + 1$
  - $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$
- In **non-stationary** problems, it can be useful to track a running mean, i.e. forget old episodes:
  - $V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$
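A minimal sketch of the constant-$\alpha$ (running mean) update above. The function name, dict-based value table, and default `alpha` are illustrative assumptions.

```python
def incremental_mc_update(V, episode_states, episode_returns, alpha=0.1):
    """Constant-alpha every-visit MC update: V(S_t) += alpha * (G_t - V(S_t)).

    `episode_states[t]` is S_t and `episode_returns[t]` is the return G_t
    already computed for that time-step; `V` is a dict of value estimates.
    """
    for state, G in zip(episode_states, episode_returns):
        v = V.get(state, 0.0)
        V[state] = v + alpha * (G - v)   # running mean: gradually forgets old episodes
    return V
```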
Temporal-Difference Learning
- TD methods learn directly from episodes of experience
- TD is model-free: no knowledge of MDP transitions / rewards
- TD learns from **incomplete** episodes, by bootstrapping
- TD updates a guess towards a guess
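A minimal sketch of the TD(0) step these bullets describe: update the current value estimate toward the TD target $R_{t+1} + \gamma V(S_{t+1})$, i.e. a guess toward a guess. The signature, dict-based value table, and default step size are illustrative assumptions.

```python
def td0_update(V, state, reward, next_state, done, alpha=0.1, gamma=1.0):
    """One TD(0) update: V(S_t) <- V(S_t) + alpha * (TD target - V(S_t)).

    The TD target is R_{t+1} + gamma * V(S_{t+1}); the value of a
    terminal state is taken to be zero.
    """
    td_target = reward + (0.0 if done else gamma * V.get(next_state, 0.0))
    td_error = td_target - V.get(state, 0.0)          # delta_t
    V[state] = V.get(state, 0.0) + alpha * td_error
    return V
```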
MC and TD
Driving Home Example
Driving Home Example: MC vs. TD
Advantages and Disadvantages of MC vs. TD
- TD can learn before knowing the final outcome
  - TD can learn online after every step
  - MC must wait until the end of the episode before the return is known
- TD can learn without the final outcome
  - TD can learn from incomplete sequences
  - MC can only learn from complete sequences
  - TD works in continuing (non-terminating) environments
  - MC only works for episodic (terminating) environments
Bias/Variance Trade-Off
- Return $G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T$ is an **unbiased** estimate of $v_\pi(S_t)$
- True TD target $R_{t+1} + \gamma v_\pi(S_{t+1})$ is an **unbiased** estimate of $v_\pi(S_t)$
- TD target $R_{t+1} + \gamma V(S_{t+1})$ is a **biased** estimate of $v_\pi(S_t)$
- TD target is much **lower variance** than the return:
  - Return depends on **many** random actions, transitions, rewards
  - TD target depends on **one** random action, transition, reward
Advantages and Disadvantages of MC vs. TD (2)
- MC has high variance, zero bias
  - Good convergence properties
  - (even with function approximation)
  - Not very sensitive to initial value
  - Very simple to understand and use
- TD has low variance, some bias
  - Usually more efficient than MC
  - TD(0) converges to $v_\pi(s)$
  - (but not always with function approximation)
  - More sensitive to initial value
Random Walk Example
Random Walk: MC vs. TD
Batch MC and TD
AB Example
Certainty Equivalence
Advantages and Disadvantages of MC vs. TD (3)
- TD exploits the Markov property
  - Usually more efficient in Markov environments
- MC does not exploit the Markov property
  - Usually more effective in non-Markov environments
Monte-Carlo Backup
Temporal-Difference Backup
Dynamic Programming Backup
Bootstrapping and Sampling
Unified View of Reinforcement Learning
n-Step Prediction
n-Step Return
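The slide's formulas are missing from the extracted text; as a reminder of the standard definitions used here, the n-step return bootstraps from the value estimate after n steps, and n-step TD moves $V(S_t)$ toward it:

$$
G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n}),
\qquad
V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{(n)} - V(S_t) \right)
$$

With $n = 1$ this is TD(0); as $n$ reaches the end of the episode it becomes the Monte-Carlo return.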
Large Random Walk Example
Averaging n-Step Returns
λ-return
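Likewise, the λ-return averages all n-step returns with geometrically decaying weights $(1-\lambda)\lambda^{n-1}$ (the weighting illustrated by the next slide), and forward-view TD(λ) updates toward it:

$$
G_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1}\, G_t^{(n)},
\qquad
V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{\lambda} - V(S_t) \right)
$$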
TD(λ) Weighting Function
Forward-view TD(λ)
Forward-View TD(λ) on Large Random Walk
Backward View TD(λ)
- Forward view provides theory
- Backward view provides mechanism
- Update online, every step, from incomplete sequences
Eligibility Traces
Backward View TD(λ)
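A minimal sketch of the mechanism this slide describes: accumulating eligibility traces $E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$ combined with the online update $V(s) \leftarrow V(s) + \alpha\,\delta_t\,E_t(s)$ for every state. The episode format and parameter defaults are illustrative assumptions.

```python
from collections import defaultdict

def backward_td_lambda(episodes, alpha=0.1, gamma=1.0, lam=0.9):
    """Backward-view TD(lambda) with accumulating eligibility traces.

    `episodes` is assumed to be a list of episodes, each a list of
    (state, reward, next_state, done) transitions.
    """
    V = defaultdict(float)

    for episode in episodes:
        E = defaultdict(float)                     # eligibility traces, E_0(s) = 0
        for state, reward, next_state, done in episode:
            # TD error: delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
            target = reward + (0.0 if done else gamma * V[next_state])
            delta = target - V[state]

            # Decay all traces, then bump the trace of the visited state.
            for s in list(E):
                E[s] *= gamma * lam
            E[state] += 1.0

            # Update every state in proportion to its eligibility trace.
            for s, e in E.items():
                V[s] += alpha * delta * e

    return V
```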
TD(λ) and TD(0)
TD(λ) and MC
MC and TD(1)
Telescoping in TD(1)
TD(λ) and TD(1)
- TD(1) is roughly equivalent to every-visit Monte-Carlo
- Error is accumulated online, step-by-step
- If value function is only updated offline at end of episode
- Then total update is exactly the same as MC
Telescoping in TD(λ)
Forwards and Backwards TD(λ)
Offline Equivalence of Forward and Backward TD
- Offline updates:
  - Updates are accumulated within the episode
  - but applied in batch at the end of the episode
Online Equivalence of Forward and Backward TD
- Online updates:
  - TD(λ) updates are applied online at each step within the episode
  - Forward-view and backward-view TD(λ) are slightly different
- NEW: Exact online TD(λ) achieves perfect equivalence
  - By using a slightly different form of eligibility trace
  - Sutton and van Seijen, ICML 2014
Summary of Forward and Backward TD(λ)
Reference: UCL Course on RL (David Silver)