Lecture 4: Model-Free Prediction

Notes by 魏鹏飞 | Published 2020-04-13 06:42

    Author: David Silver

    Outline

    1. Introduction
    2. Monte-Carlo Learning
    3. Temporal-Difference Learning
    4. TD(λ)

    Model-Free Reinforcement Learning

    • Last lecture:
      • Planning by dynamic programming
      • Solve a known MDP
    • This lecture:
      • Model-free prediction
      • Estimate the value function of an unknown MDP
    • Next lecture:
      • Model-free control
      • Optimise the value function of an unknown MDP

    Monte-Carlo Reinforcement Learning

    • MC methods learn directly from episodes of experience
    • MC is model-free: no knowledge of MDP transitions / rewards
    • MC learns from complete episodes: no bootstrapping
    • MC uses the simplest possible idea: value = mean return
    • Caveat: can only apply MC to episodic MDPs
      • All episodes must terminate

    Monte-Carlo Policy Evaluation

    • Goal: learn v_{\pi} from episodes of experience under policy \pi
      S_1,A_1,R_2,...,S_k\sim\pi
    • Recall that the return is the total discounted reward:
      G_t=R_{t+1}+\gamma R_{t+2}+...+\gamma^{T-1}R_T
    • Recall that the value function is the expected return:
      v_{\pi}(s)=E_{\pi}[G_t|S_t=s]
    • Monte-Carlo policy evaluation uses **empirical mean return** instead of expected return

    First-Visit Monte-Carlo Policy Evaluation

    • To evaluate state s
    • The first time-step t that state s is visited in an episode,
    • Increment counter N(s) ← N(s) + 1
    • Increment total return S(s) ← S(s) + G_t
    • Value is estimated by mean return V(s) = S(s)/N(s)
    • By law of large numbers, V(s) → v_π(s) as N(s) → \infty

    Every-Visit Monte-Carlo Policy Evaluation

    • To evaluate state s
    • Every time-step t that state s is visited in an episode,
    • Increment counter N(s) ← N(s) + 1
    • Increment total return S(s) ← S(s) + G_t
    • Value is estimated by mean return V(s) = S(s)/N(s)
    • Again, V(s) → v_π(s) as N(s) → \infty
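
    The two variants differ only in which visits to s within an episode contribute to the average. Below is a minimal tabular sketch of both, assuming each episode is supplied as a list of (state, reward) pairs where reward is the R_{t+1} received after leaving that state (episode generation under \pi is omitted; all names are illustrative):

    ```python
    from collections import defaultdict

    def mc_policy_evaluation(episodes, gamma=1.0, first_visit=True):
        """Tabular Monte-Carlo policy evaluation: V(s) = S(s) / N(s)."""
        N = defaultdict(int)    # visit counts N(s)
        S = defaultdict(float)  # accumulated returns S(s)

        for episode in episodes:
            # Compute the return G_t for every time step by scanning backwards.
            G, returns = 0.0, []
            for state, reward in reversed(episode):
                G = reward + gamma * G
                returns.append((state, G))
            returns.reverse()

            seen = set()
            for state, G in returns:
                if first_visit and state in seen:
                    continue  # first-visit MC: only the first occurrence of s counts
                seen.add(state)
                N[state] += 1
                S[state] += G

        return {s: S[s] / N[s] for s in N}
    ```

    With first_visit=False this becomes every-visit MC; by the law of large numbers both estimates converge to v_{\pi}(s) as N(s) grows.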

    Blackjack Example

    Blackjack Value Function after Monte-Carlo Learning

    Incremental Mean
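
    The incremental updates below rely on the standard identity that the mean \mu_k of x_1, ..., x_k can be obtained from the previous mean without storing all samples:
      \mu_k=\frac{1}{k}\sum_{j=1}^{k}x_j=\frac{1}{k}\Big(x_k+(k-1)\mu_{k-1}\Big)=\mu_{k-1}+\frac{1}{k}(x_k-\mu_{k-1})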

    Incremental Monte-Carlo Updates

    • Update V(s) incrementally after episode S_1, A_1, R_2, ..., S_T
    • For each state S_t with return G_t
      N(S_t)\gets N(S_t)+1
      V(S_t)\gets V(S_t)+\frac{1}{N(S_t)}(G_t-V(S_t))
    • In non-stationary problems, it can be useful to track a running
      mean, i.e. forget old episodes.
      V(S_t)\gets V(S_t)+\alpha(G_t-V(S_t))
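
    A minimal sketch of these two update rules, assuming the returns G_t for one episode's visited states have already been computed (names are illustrative):

    ```python
    from collections import defaultdict

    def incremental_mc_update(V, N, visited, alpha=None):
        """Update V from one episode's (state, return) pairs.

        alpha=None uses the exact running mean (step size 1/N(s));
        a constant alpha forgets old episodes, which suits
        non-stationary problems.
        """
        for state, G in visited:
            N[state] += 1
            step = alpha if alpha is not None else 1.0 / N[state]
            V[state] += step * (G - V[state])

    # Illustrative usage with hypothetical states "A" and "B":
    V, N = defaultdict(float), defaultdict(int)
    incremental_mc_update(V, N, [("A", 1.0), ("B", 0.0), ("A", 2.0)])
    ```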

    Temporal-Difference Learning

    • TD methods learn directly from episodes of experience
    • TD is model-free: no knowledge of MDP transitions / rewards
    • TD learns from incomplete episodes, by bootstrapping
    • TD updates a guess towards a guess
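
    In its simplest form, TD(0), the full return G_t is replaced by the TD target R_{t+1} + \gamma V(S_{t+1}), so the estimate can be updated after every single step. A minimal sketch (names are illustrative):

    ```python
    from collections import defaultdict

    def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, done=False):
        """One TD(0) step: move V(s) towards the TD target r + gamma * V(s_next)."""
        target = r if done else r + gamma * V[s_next]  # bootstrap: a guess towards a guess
        V[s] += alpha * (target - V[s])

    # Illustrative usage with hypothetical states:
    V = defaultdict(float)
    td0_update(V, s="A", r=1.0, s_next="B")
    ```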

    MC and TD

    Driving Home Example

    Driving Home Example: MC vs. TD

    Advantages and Disadvantages of MC vs. TD

    • TD can learn before knowing the final outcome
      • TD can learn online after every step
      • MC must wait until end of episode before return is known
    • TD can learn without the final outcome
      • TD can learn from incomplete sequences
      • MC can only learn from complete sequences
      • TD works in continuing (non-terminating) environments
      • MC only works for episodic (terminating) environments

    Bias/Variance Trade-Off

    • Return G_t = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{T-1}R_T is an unbiased estimate of v_{\pi}(S_t)
    • True TD target R_{t+1} + \gamma v_{\pi}(S_{t+1}) is an unbiased estimate of v_{\pi}(S_t)
    • TD target R_{t+1} + \gamma V(S_{t+1}) is a biased estimate of v_{\pi}(S_t)
    • TD target is much lower variance than the return:
      • Return depends on many random actions, transitions, rewards
      • TD target depends on one random action, transition, reward

    Advantages and Disadvantages of MC vs. TD (2)

    • MC has high variance, zero bias
      • Good convergence properties
      • (even with function approximation)
      • Not very sensitive to initial value
      • Very simple to understand and use
    • TD has low variance, some bias
      • Usually more efficient than MC
      • TD(0) converges to v_{\pi}(s)
      • (but not always with function approximation)
      • More sensitive to initial value

    Random Walk Example

    Random Walk: MC vs. TD

    Batch MC and TD

    AB Example

    AB Example

    Certainty Equivalence

    Advantages and Disadvantages of MC vs. TD (3)

    • TD exploits Markov property
      • Usually more efficient in Markov environments
    • MC does not exploit Markov property
      • Usually more effective in non-Markov environments

    Monte-Carlo Backup

    Temporal-Difference Backup

    Dynamic Programming Backup

    Bootstrapping and Sampling

    Unified View of Reinforcement Learning

    n-Step Prediction

    n-Step Return
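
    The n-step return looks ahead n real rewards and then bootstraps from the value of the state reached, so n = 1 gives the TD(0) target and n = \infty gives the Monte-Carlo return:
      G_t^{(n)}=R_{t+1}+\gamma R_{t+2}+...+\gamma^{n-1}R_{t+n}+\gamma^n V(S_{t+n})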

    Large Random Walk Example

    Averaging n-Step Returns

    λ-return
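
    The λ-return averages all n-step returns with geometrically decaying weights (1-\lambda)\lambda^{n-1}; forward-view TD(λ) updates the value estimate towards it:
      G_t^{\lambda}=(1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_t^{(n)}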

    TD(λ) Weighting Function

    Forward-view TD(λ)

    Forward-View TD(λ) on Large Random Walk

    Backward View TD(λ)

    • Forward view provides theory
    • Backward view provides mechanism
    • Update online, every step, from incomplete sequences

    Eligibility Traces
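
    An accumulating eligibility trace combines a frequency heuristic and a recency heuristic: it decays by \gamma\lambda on every step and is incremented whenever the state is visited:
      E_0(s)=0
      E_t(s)=\gamma\lambda E_{t-1}(s)+\mathbf{1}(S_t=s)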

    Backward View TD(λ)
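
    In the backward view, each step produces a single TD error \delta_t, which is applied to every state in proportion to its eligibility trace. A minimal tabular sketch of one online step, assuming accumulating traces that are reset at the start of each episode (names are illustrative):

    ```python
    from collections import defaultdict

    def td_lambda_step(V, E, s, r, s_next, alpha=0.1, gamma=1.0, lam=0.9, done=False):
        """One backward-view TD(lambda) step with accumulating eligibility traces."""
        # Decay all traces, then bump the trace of the current state.
        for state in E:
            E[state] *= gamma * lam
        E[s] += 1.0

        # One-step TD error, exactly as in TD(0).
        target = r if done else r + gamma * V[s_next]
        delta = target - V[s]

        # Broadcast the TD error to every state in proportion to its trace.
        for state, trace in E.items():
            V[state] += alpha * delta * trace

    # Illustrative usage with hypothetical states (E should be cleared between episodes):
    V, E = defaultdict(float), defaultdict(float)
    td_lambda_step(V, E, s="A", r=0.0, s_next="B")
    ```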

    TD(λ) and TD(0)

    TD(λ) and MC

    MC and TD(1)

    Telescoping in TD(1)

    TD(λ) and TD(1)

    • TD(1) is roughly equivalent to every-visit Monte-Carlo
    • Error is accumulated online, step-by-step
    • If value function is only updated offline at end of episode
    • Then total update is exactly the same as MC

    Telescoping in TD(λ)

    Forwards and Backwards TD(λ)

    Offline Equivalence of Forward and Backward TD

    Offline updates

    • Updates are accumulated within episode
    • but applied in batch at the end of episode

    Online Equivalence of Forward and Backward TD

    Online updates

    • TD(λ) updates are applied online at each step within episode
    • Forward and backward-view TD(λ) are slightly different
    • NEW: Exact online TD(λ) achieves perfect equivalence
    • By using a slightly different form of eligibility trace
    • Sutton and van Seijen, ICML 2014

    Summary of Forward and Backward TD(λ)

    Reference: UCL Course on RL (David Silver)
