Lecture 4: Model-Free Prediction

Author: 魏鹏飞 | Published 2020-04-13 06:42

Author: David Silver

Outline

  1. Introduction
  2. Monte-Carlo Learning
  3. Temporal-Difference Learning
  4. TD(λ)

Model-Free Reinforcement Learning

  • Last lecture:
    • Planning by dynamic programming
    • Solve a known MDP
  • This lecture:
    • Model-free prediction
    • Estimate the value function of an unknown MDP
  • Next lecture:
    • Model-free control
    • Optimise the value function of an unknown MDP

Monte-Carlo Reinforcement Learning

  • MC methods learn directly from episodes of experience
  • MC is model-free: no knowledge of MDP transitions / rewards
  • MC learns from complete episodes: no bootstrapping
  • MC uses the simplest possible idea: value = mean return
  • Caveat: can only apply MC to episodic MDPs
    • All episodes must terminate

Monte-Carlo Policy Evaluation

  • Goal: learn v_{\pi} from episodes of experience under policy \pi
    S_1,A_1,R_2,...,S_k\sim\pi
  • Recall that the return is the total discounted reward:
    G_t=R_{t+1}+\gamma R_{t+2}+...+\gamma^{T-1}R_T
  • Recall that the value function is the expected return:
    v_{\pi}(s)=E_{\pi}[G_t|S_t=s]
  • Monte-Carlo policy evaluation uses empirical mean return instead of expected return
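
A small worked illustration of computing returns from a finished episode (a minimal sketch; the reward sequence and discount factor below are made up for illustration):

```python
# Compute G_t for every time step of a finished episode, working backwards
# through the rewards and using G_t = R_{t+1} + gamma * G_{t+1}.
def returns_from_rewards(rewards, gamma):
    G = 0.0
    returns = []
    for r in reversed(rewards):      # rewards[t] is the reward received after step t
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    return returns                   # returns[t] is G_t

# Example: rewards 1, 0, 2 with gamma = 0.9 give returns of roughly [2.62, 1.8, 2.0]
print(returns_from_rewards([1.0, 0.0, 2.0], gamma=0.9))
```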

First-Visit Monte-Carlo Policy Evaluation

  • To evaluate state s
  • The first time-step t that state s is visited in an episode,
  • Increment counter N(s) ← N(s) + 1
  • Increment total return S(s) ← S(s) + G_t
  • Value is estimated by mean return V(s) = S(s)/N(s)
  • By the law of large numbers, V(s) → v_π(s) as N(s) → \infty

Every-Visit Monte-Carlo Policy Evaluation

  • To evaluate state s
  • Every time-step t that state s is visited in an episode,
  • Increment counter N(s) ← N(s) + 1
  • Increment total return S(s) ← S(s) + G_t
  • Value is estimated by mean return V(s) = S(s)/N(s)
  • Again, V(s) → v_π(s) as N(s) → \infty
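
Putting the first-visit and every-visit variants together, a minimal tabular sketch (sample_episode is an assumed placeholder that runs one episode under the policy π; it is not from the lecture):

```python
from collections import defaultdict

def mc_policy_evaluation(sample_episode, num_episodes, gamma, first_visit=True):
    """Tabular Monte-Carlo policy evaluation.

    sample_episode() is an assumed placeholder: it runs one episode under pi and
    returns a list of (state, reward) pairs, where reward is the reward received
    after leaving that state.
    """
    N = defaultdict(int)    # visit counter N(s)
    S = defaultdict(float)  # total return  S(s)
    V = defaultdict(float)  # value estimate V(s) = S(s) / N(s)

    for _ in range(num_episodes):
        episode = sample_episode()
        # Compute G_t for every step, working backwards through the rewards.
        G, returns = 0.0, []
        for _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()

        seen = set()
        for t, (state, _) in enumerate(episode):
            if first_visit and state in seen:
                continue            # first-visit: only the first occurrence of s counts
            seen.add(state)
            N[state] += 1
            S[state] += returns[t]
            V[state] = S[state] / N[state]
    return V
```

Calling it with first_visit=False gives the every-visit variant; in both cases V(s) converges to v_π(s) as N(s) grows.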

Blackjack Example

Blackjack Value Function after Monte-Carlo Learning

Incremental Mean
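
For completeness, the standard identity behind this slide: the mean \mu_k of a sequence x_1, x_2, ..., x_k can be computed incrementally, since
  \mu_k = \frac{1}{k}\sum_{j=1}^{k}x_j = \frac{1}{k}\left(x_k + (k-1)\mu_{k-1}\right) = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1})
The Monte-Carlo updates below apply exactly this, with the return G_t in place of x_k and the visit count N(S_t) in place of k.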

Incremental Monte-Carlo Updates

  • Update V(s) incrementally after episode S_1, A_1, R_2, ..., S_T
  • For each state S_t with return G_t
    N(S_t)\gets N(S_t)+1
    V(S_t)\gets V(S_t)+\frac{1}{N(S_t)}(G_t-V(S_t))
  • In non-stationary problems, it can be useful to track a running
    mean, i.e. forget old episodes.
    V(S_t)\gets V(S_t)+\alpha(G_t-V(S_t))
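
A sketch of the two update rules above (the function names and step-size value are illustrative, not from the lecture; V and N are dict-like tables such as collections.defaultdict):

```python
def mc_update_running_mean(V, N, state, G):
    # Exact-mean update: V(S_t) <- V(S_t) + (1 / N(S_t)) * (G_t - V(S_t))
    N[state] += 1
    V[state] += (G - V[state]) / N[state]

def mc_update_constant_alpha(V, state, G, alpha=0.1):
    # Constant step-size update for non-stationary problems:
    # V(S_t) <- V(S_t) + alpha * (G_t - V(S_t)), so old episodes are gradually forgotten.
    V[state] += alpha * (G - V[state])
```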

Temporal-Difference Learning

  • TD methods learn directly from episodes of experience
  • TD is model-free: no knowledge of MDP transitions / rewards
  • TD learns from incomplete episodes, by bootstrapping
  • TD updates a guess towards a guess
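
For contrast with the Monte-Carlo code above, a minimal TD(0) prediction sketch; env_step and the fixed start_state are assumed placeholders, not part of the lecture:

```python
from collections import defaultdict

def td0_evaluation(env_step, start_state, num_steps, gamma, alpha=0.1):
    """One-step TD (TD(0)) prediction.

    env_step(state) is an assumed placeholder that follows the policy pi for one
    step and returns (reward, next_state, done). Updates happen online, after
    every step, so episodes never have to finish before learning starts.
    """
    V = defaultdict(float)                       # value estimates, initialised to 0
    state = start_state
    for _ in range(num_steps):
        reward, next_state, done = env_step(state)
        # The TD target bootstraps from the current guess V(S_{t+1});
        # a terminal next state contributes no future value.
        td_target = reward + (0.0 if done else gamma * V[next_state])
        V[state] += alpha * (td_target - V[state])   # V(S_t) <- V(S_t) + alpha * delta_t
        state = start_state if done else next_state
    return V
```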

MC and TD

Driving Home Example

Driving Home Example: MC vs. TD

Advantages and Disadvantages of MC vs. TD

  • TD can learn before knowing the final outcome
    • TD can learn online after every step
    • MC must wait until end of episode before return is known
  • TD can learn without the final outcome
    • TD can learn from incomplete sequences
    • MC can only learn from complete sequences
    • TD works in continuing (non-terminating) environments
    • MC only works for episodic (terminating) environments

Bias/Variance Trade-Off

  • The return G_t = R_{t+1} + γR_{t+2} + ... + γ^{T−1}R_T is an unbiased estimate of v_π(S_t)
  • The true TD target R_{t+1} + γv_π(S_{t+1}) is an unbiased estimate of v_π(S_t)
  • The TD target R_{t+1} + γV(S_{t+1}) is a biased estimate of v_π(S_t)
  • TD target is much lower variance than the return:
    • Return depends on many random actions, transitions, rewards
    • TD target depends on one random action, transition, reward

Advantages and Disadvantages of MC vs. TD (2)

  • MC has high variance, zero bias
    • Good convergence properties
    • (even with function approximation)
    • Not very sensitive to initial value
    • Very simple to understand and use
  • TD has low variance, some bias
    • Usually more efficient than MC
    • TD(0) converges to vπ(s)
    • (but not always with function approximation)
    • More sensitive to initial value

Random Walk Example

Random Walk: MC vs. TD

Batch MC and TD

AB Example

AB Example

Certainty Equivalence

Advantages and Disadvantages of MC vs. TD (3)

  • TD exploits Markov property
    • Usually more efficient in Markov environments
  • MC does not exploit Markov property
    • Usually more effective in non-Markov environments

Monte-Carlo Backup

Temporal-Difference Backup

Dynamic Programming Backup

Bootstrapping and Sampling

Unified View of Reinforcement Learning

n-Step Prediction

n-Step Return
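
For reference, the n-step return used in the slides that follow truncates the real return after n rewards and bootstraps from the current value estimate (standard definition from the course):
  G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1}R_{t+n} + \gamma^{n}V(S_{t+n})
n = 1 recovers the TD(0) target and n → ∞ recovers the Monte-Carlo return; n-step TD then updates V(S_t)\gets V(S_t)+\alpha\left(G_t^{(n)}-V(S_t)\right).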

Large Random Walk Example

Averaging n-Step Returns

λ-return
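
For reference, the λ-return averages all n-step returns with geometrically decaying weights (standard definition from the course):
  G_t^{\lambda} = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_t^{(n)}
λ = 0 puts all weight on the one-step return (TD(0)), while λ → 1 approaches the full Monte-Carlo return.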

TD(λ) Weighting Function

Forward-view TD(λ)

Forward-View TD(λ) on Large Random Walk

Backward View TD(λ)

  • Forward view provides theory
  • Backward view provides mechanism
  • Update online, every step, from incomplete sequences

Eligibility Traces

Backward View TD(λ)
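
A tabular sketch of the backward-view mechanism, using the standard accumulating trace E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s) and the TD error \delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t); env_step is the same assumed placeholder as in the TD(0) sketch above:

```python
from collections import defaultdict

def td_lambda_backward(env_step, start_state, num_episodes, gamma, lam, alpha=0.1):
    """Backward-view TD(lambda) with accumulating eligibility traces.

    env_step(state) is an assumed placeholder that follows the policy pi for one
    step and returns (reward, next_state, done).
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        E = defaultdict(float)              # eligibility traces, reset every episode
        state, done = start_state, False
        while not done:
            reward, next_state, done = env_step(state)
            # TD error for this step (a terminal next state contributes no value).
            delta = reward + (0.0 if done else gamma * V[next_state]) - V[state]
            E[state] += 1.0                 # accumulating trace: bump the visited state
            for s in list(E):
                V[s] += alpha * delta * E[s]   # every state updated in proportion to its trace
                E[s] *= gamma * lam            # traces decay by gamma * lambda each step
            state = next_state
    return V
```

With λ = 0 the trace only ever covers the current state, so this reduces to the TD(0) update; with λ = 1 and offline updates the accumulated updates match every-visit Monte-Carlo, as discussed below.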

TD(λ) and TD(0)

TD(λ) and MC

MC and TD(1)

Telescoping in TD(1)

TD(λ) and TD(1)

  • TD(1) is roughly equivalent to every-visit Monte-Carlo
  • Error is accumulated online, step-by-step
  • If value function is only updated offline at end of episode
  • Then total update is exactly the same as MC

Telescoping in TD(λ)

Forwards and Backwards TD(λ)

Offline Equivalence of Forward and Backward TD

Offline updates

  • Updates are accumulated within episode
  • but applied in batch at the end of episode

Online Equivalence of Forward and Backward TD

Online updates

  • TD(λ) updates are applied online at each step within episode
  • Forward and backward-view TD(λ) are slightly different
  • NEW: Exact online TD(λ) achieves perfect equivalence
  • By using a slightly different form of eligibility trace
  • Sutton and van Seijen, ICML 2014

Summary of Forward and Backward TD(λ)

Reference: UCL Course on RL, David Silver
