Lecture 3: Planning by Dynamic Programming

Posted by 魏鹏飞, 2020-04-04 08:27

    Author: David Silver

    Outline

    1. Introduction
    2. Policy Evaluation
    3. Policy Iteration
    4. Value Iteration
    5. Extensions to Dynamic Programming
    6. Contraction Mapping

    What is Dynamic Programming?

    Dynamic: sequential or temporal component to the problem
    Programming: optimising a “program”, i.e. a policy

    • c.f. linear programming

    • A method for solving complex problems

    • By breaking them down into subproblems

      • Solve the subproblems
      • Combine solutions to subproblems

    Requirements for Dynamic Programming

    Dynamic Programming is a very general solution method for problems which have two properties:

    • Optimal substructure
      • Principle of optimality applies
      • Optimal solution can be decomposed into subproblems
    • Overlapping subproblems
      • Subproblems recur many times
      • Solutions can be cached and reused
    • Markov decision processes satisfy both properties
      • Bellman equation gives recursive decomposition (restated below)
      • Value function stores and reuses solutions
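
    For reference, the recursive decomposition referred to above is the Bellman expectation equation. This restatement is mine (the slide itself shows no formula here), written in the lecture's usual notation:

    ```latex
    % Bellman expectation equation: the value of state s decomposes into
    % immediate reward plus the discounted value of the successor state.
    \[
      v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)
        \Bigl( \mathcal{R}_{s}^{a}
             + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a} \, v_{\pi}(s') \Bigr)
    \]
    ```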

    Planning by Dynamic Programming

    • Dynamic programming assumes full knowledge of the MDP
    • It is used for planning in an MDP
    • For prediction:
      • Input: MDP \langle S, A, P, R, \gamma \rangle and policy \pi
      • or: MRP \langle S, P^{\pi}, R^{\pi}, \gamma \rangle
      • Output: value function v_{\pi}
    • Or for control:
      • Input: MDP \langle S, A, P, R, \gamma \rangle
      • Output: optimal value function v_*
      • and: optimal policy \pi_*

    Other Applications of Dynamic Programming

    Dynamic programming is used to solve many other problems, e.g.

    • Scheduling algorithms
    • String algorithms (e.g. sequence alignment)
    • Graph algorithms (e.g. shortest path algorithms)
    • Graphical models (e.g. Viterbi algorithm)
    • Bioinformatics (e.g. lattice models)

    Iterative Policy Evaluation

    • Problem: evaluate a given policy π
    • Solution: iterative application of Bellman expectation backup (a code sketch follows this list)
    • v_1 → v_2 → ... → v_{\pi}
    • Using synchronous backups,
      • At each iteration k + 1
      • For all states s ∈ S
      • Update v_{k+1}(s) from v_k(s′)
      • where s′ is a successor state of s
    • We will discuss asynchronous backups later
    • Convergence to v_{\pi} will be proven at the end of the lecture
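
    A minimal Python sketch of the procedure above, assuming a small tabular MDP. The data layout (`transitions[s][a]` as a list of `(prob, next_state, reward)` triples, `policy[s][a]` as π(a|s)) and all function names are illustrative, not from the lecture:

    ```python
    # Minimal sketch of synchronous iterative policy evaluation.
    # Assumed (illustrative) format: transitions[s][a] is a list of
    # (prob, next_state, reward) triples; policy[s][a] is pi(a|s).

    def policy_evaluation(transitions, policy, gamma=0.9, theta=1e-8):
        V = {s: 0.0 for s in transitions}          # v_0 = 0 in every state
        while True:
            V_new = {}                             # synchronous: v_{k+1} built from v_k
            for s, actions in transitions.items():
                v = 0.0
                for a, pi_sa in policy[s].items(): # Bellman expectation backup
                    for prob, s2, r in actions[a]:
                        v += pi_sa * prob * (r + gamma * V[s2])
                V_new[s] = v
            delta = max(abs(V_new[s] - V[s]) for s in V)
            V = V_new
            if delta < theta:                      # stop once a sweep barely changes v
                return V
    ```

    Terminal states fit the same format as zero-reward self-transitions, which is one way to reproduce the small-gridworld example.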

    Iterative Policy Evaluation (2)

    Evaluating a Random Policy in the Small Gridworld

    Iterative Policy Evaluation in Small Gridworld

    Iterative Policy Evaluation in Small Gridworld (2)

    How to Improve a Policy

    Policy Iteration

    Jack’s Car Rental

    Policy Iteration in Jack’s Car Rental

    Policy Improvement

    Policy Improvement (2)

    Modified Policy Iteration

    • Does policy evaluation need to converge to v_π?
    • Or should we introduce a stopping condition
      • e.g. ε-convergence of value function
    • Or simply stop after k iterations of iterative policy evaluation?
    • For example, in the small gridworld k = 3 was sufficient to achieve optimal policy
    • Why not update policy every iteration? i.e. stop after k = 1
      • This is equivalent to value iteration (next section); a sketch of this truncated scheme follows below
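
    A minimal Python sketch of the truncated scheme, reusing the illustrative MDP format from the earlier sketch (`transitions[s][a]` as `(prob, next_state, reward)` triples); the names and the stopping rule are mine:

    ```python
    # Minimal sketch of modified policy iteration: k sweeps of policy
    # evaluation, then greedy improvement, until the policy is stable.

    def modified_policy_iteration(transitions, gamma=0.9, k=3):
        states = list(transitions)
        policy = {s: next(iter(transitions[s])) for s in states}  # arbitrary start
        V = {s: 0.0 for s in states}

        def q(s, a, V):                            # one-step lookahead value of (s, a)
            return sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[s][a])

        while True:
            for _ in range(k):                     # truncated iterative policy evaluation
                V = {s: q(s, policy[s], V) for s in states}
            stable = True
            for s in states:                       # greedy policy improvement
                best = max(transitions[s], key=lambda a: q(s, a, V))
                if q(s, best, V) > q(s, policy[s], V):
                    policy[s], stable = best, False
            if stable:
                return policy, V
    ```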

    Generalised Policy Iteration

    Principle of Optimality

    Deterministic Value Iteration

    Example: Shortest Path

    Value Iteration

    • Problem: find optimal policy π
    • Solution: iterative application of Bellman optimality backup (a code sketch follows this list)
    • v_1 → v_2 → ... → v_∗
    • Using synchronous backups
      • At each iteration k + 1
      • For all states s ∈ S
      • Update v_{k+1}(s) from v_k(s′)
    • Convergence to v_∗ will be proven later
    • Unlike policy iteration, there is no explicit policy
    • Intermediate value functions may not correspond to any policy
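
    A minimal Python sketch, same illustrative MDP format as the earlier sketches. Note that no policy appears inside the loop; a greedy policy is only extracted once the values have converged:

    ```python
    # Minimal sketch of value iteration: repeated Bellman optimality
    # backups with no explicit intermediate policy.

    def value_iteration(transitions, gamma=0.9, theta=1e-8):
        V = {s: 0.0 for s in transitions}
        while True:
            V_new = {                              # Bellman optimality backup
                s: max(
                    sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                    for outcomes in actions.values()
                )
                for s, actions in transitions.items()
            }
            delta = max(abs(V_new[s] - V[s]) for s in V)
            V = V_new
            if delta < theta:
                break
        greedy = {                                 # extract a policy only at the end
            s: max(actions, key=lambda a: sum(
                p * (r + gamma * V[s2]) for p, s2, r in actions[a]))
            for s, actions in transitions.items()
        }
        return V, greedy
    ```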

    Value Iteration (2)

    Example of Value Iteration in Practice

    Synchronous Dynamic Programming Algorithms

    Asynchronous Dynamic Programming

    • DP methods described so far used synchronous backups
    • i.e. all states are backed up in parallel
    • Asynchronous DP backs up states individually, in any order
    • For each selected state, apply the appropriate backup
    • Can significantly reduce computation
    • Guaranteed to converge if all states continue to be selected

    Asynchronous Dynamic Programming (2)

    Three simple ideas for asynchronous dynamic programming:

    1. In-place dynamic programming
    2. Prioritised sweeping
    3. Real-time dynamic programming

    In-Place Dynamic Programming
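
    The body of this slide was lost in extraction. As a hedged illustration of the idea, here is the value-iteration sketch from above rewritten to keep a single value table and overwrite it immediately, so later backups within the same sweep already see fresh values:

    ```python
    # Minimal sketch of in-place value iteration: one value table,
    # updated immediately during the sweep.

    def in_place_value_iteration(transitions, gamma=0.9, theta=1e-8):
        V = {s: 0.0 for s in transitions}
        while True:
            delta = 0.0
            for s, actions in transitions.items(): # any state order is allowed
                v_new = max(
                    sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                    for outcomes in actions.values()
                )
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new                       # overwrite in place
            if delta < theta:
                return V
    ```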

    Prioritised Sweeping
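
    This slide's body is also missing. A sketch of the idea it names: back up the state with the largest remaining Bellman error first, using a priority queue, and re-prioritise the predecessors of each updated state (knowing predecessors amounts to knowing the reverse dynamics). The format and names are illustrative; states are assumed hashable, orderable for heap tie-breaks, and every successor is assumed to be a key of `transitions`:

    ```python
    # Minimal sketch of prioritised sweeping with Bellman optimality backups.
    import heapq

    def prioritised_sweeping(transitions, gamma=0.9, theta=1e-8):
        V = {s: 0.0 for s in transitions}

        def backup(s):                             # Bellman optimality backup value
            return max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in transitions[s].values()
            )

        predecessors = {}                          # reverse dynamics: s2 -> {s}
        for s, actions in transitions.items():
            for outcomes in actions.values():
                for _, s2, _ in outcomes:
                    predecessors.setdefault(s2, set()).add(s)

        heap = [(-abs(backup(s) - V[s]), s) for s in transitions]
        heapq.heapify(heap)
        while heap:
            _, s = heapq.heappop(heap)
            err = abs(backup(s) - V[s])            # queued priority may be stale
            if err < theta:
                continue
            V[s] = backup(s)
            for p in predecessors.get(s, ()):      # their Bellman error may have grown
                p_err = abs(backup(p) - V[p])
                if p_err >= theta:
                    heapq.heappush(heap, (-p_err, p))
        return V
    ```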

    Real-Time Dynamic Programming

    Full-Width Backups

    Sample Backups

    Approximate Dynamic Programming

    Some Technical Questions

    • How do we know that value iteration converges to v_∗?
    • Or that iterative policy evaluation converges to v_π?
    • And therefore that policy iteration converges to v_∗?
    • Is the solution unique?
    • How fast do these algorithms converge?
    • These questions are resolved by the contraction mapping theorem

    Value Function Space

    • Consider the vector space V over value functions
    • There are |S| dimensions
    • Each point in this space fully specifies a value function v(s)
    • What does a Bellman backup do to points in this space?
    • We will show that it brings value functions closer
    • And therefore the backups must converge on a unique solution

    Value Function \infty-Norm
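
    The formula on this slide did not survive extraction; the standard definition it uses, restated:

    ```latex
    % Distance between value functions u and v, measured by the infinity-norm,
    % i.e. the largest difference between state values:
    \[
      \| u - v \|_{\infty} = \max_{s \in \mathcal{S}} \lvert u(s) - v(s) \rvert
    \]
    ```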

    Bellman Expectation Backup is a Contraction
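
    Restating the key fact in the lecture's notation (the slide's derivation was lost in extraction): the Bellman expectation operator is a γ-contraction in the ∞-norm,

    ```latex
    % Bellman expectation operator in matrix form, and its contraction property.
    \[
      T^{\pi}(v) = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v
    \]
    % For any value functions u and v (the rows of P^pi are probabilities):
    \[
      \| T^{\pi}(u) - T^{\pi}(v) \|_{\infty}
        = \gamma \, \| \mathcal{P}^{\pi} (u - v) \|_{\infty}
        \le \gamma \, \| u - v \|_{\infty}
    \]
    ```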

    Contraction Mapping Theorem
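
    The theorem, paraphrased as it is used here:

    ```latex
    % Contraction mapping (Banach fixed-point) theorem: a gamma-contraction T
    % on a complete metric space has a unique fixed point v*, and iterating T
    % converges to it at a linear rate gamma:
    \[
      \| T^{n}(v) - v^{\ast} \|_{\infty} \le \gamma^{n} \, \| v - v^{\ast} \|_{\infty}
    \]
    ```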

    Convergence of Iter. Policy Evaluation and Policy Iteration

    • The Bellman expectation operator T^π has a unique fixed point
    • v_π is a fixed point of T^π (by Bellman expectation equation)
    • By contraction mapping theorem
    • Iterative policy evaluation converges on v_π
    • Policy iteration converges on v_∗

    Bellman Optimality Backup is a Contraction
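
    The analogous fact for control, restated in the lecture's notation:

    ```latex
    % Bellman optimality operator and its contraction property, which gives
    % value iteration its convergence guarantee.
    \[
      T^{\ast}(v) = \max_{a \in \mathcal{A}} \bigl( \mathcal{R}^{a} + \gamma \mathcal{P}^{a} v \bigr)
    \]
    \[
      \| T^{\ast}(u) - T^{\ast}(v) \|_{\infty} \le \gamma \, \| u - v \|_{\infty}
    \]
    ```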

    Convergence of Value Iteration

    • The Bellman optimality operator T^∗ has a unique fixed point
    • v_∗ is a fixed point of T^∗ (by Bellman optimality equation)
    • By contraction mapping theorem
    • Value iteration converges on v_∗

    Reference: David Silver, UCL Course on RL, Lecture 3: Planning by Dynamic Programming
