Lecture 3: Planning by Dynamic Programming

Posted by 魏鹏飞 on 2020-04-04 08:27

Author: David Silver

Outline

  1. Introduction
  2. Policy Evaluation
  3. Policy Iteration
  4. Value Iteration
  5. Extensions to Dynamic Programming
  6. Contraction Mapping

What is Dynamic Programming?

Dynamic: sequential or temporal component to the problem
Programming: optimising a “program”, i.e. a policy

  • c.f. linear programming

  • A method for solving complex problems

  • By breaking them down into subproblems

    • Solve the subproblems
    • Combine solutions to subproblems

Requirements for Dynamic Programming

Dynamic Programming is a very general solution method for problems which have two properties:

  • Optimal substructure
    • Principle of optimality applies
    • Optimal solution can be decomposed into subproblems
  • Overlapping subproblems
    • Subproblems recur many times
    • Solutions can be cached and reused
  • Markov decision processes satisfy both properties
    • Bellman equation gives recursive decomposition
    • Value function stores and reuses solutions
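
As a generic illustration of these two properties (not from the lecture), a memoised recursion caches each subproblem's solution the first time it is computed and reuses it thereafter:

```python
from functools import lru_cache

# Generic illustration of overlapping subproblems: fib(n) decomposes into
# fib(n-1) and fib(n-2), and the cache stores each subproblem's solution
# so it is computed only once and then reused.
@lru_cache(maxsize=None)
def fib(n: int) -> int:
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(50))  # fast, because cached subproblem solutions are reused
```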

Planning by Dynamic Programming

  • Dynamic programming assumes full knowledge of the MDP
  • It is used for planning in an MDP
  • For prediction:
    • Input: MDP \langle S, A, P, R, \gamma \rangle and policy \pi
    • or: MRP \langle S, P^{\pi}, R^{\pi}, \gamma \rangle
    • Output: value function v_{\pi}
  • Or for control:
    • Input: MDP \langle S, A, P, R, \gamma \rangle
    • Output: optimal value function v_*
    • and: optimal policy \pi_*
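
Since dynamic programming assumes full knowledge of the MDP, the transition model and reward function must be available explicitly. A minimal sketch of one possible tabular representation in Python (the two-state example and the (probability, next_state, reward) format are illustrative assumptions, not from the lecture):

```python
# A minimal tabular MDP with full knowledge of transitions and rewards.
# Format: P[s][a] is a list of (probability, next_state, reward) triples.
# The two-state example below is purely illustrative.
GAMMA = 0.9

P = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 0.0)],
           "go":   [(1.0, "s0", 0.0)]},
}

STATES = list(P.keys())
ACTIONS = {s: list(P[s].keys()) for s in STATES}
```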

Other Applications of Dynamic Programming

Dynamic programming is used to solve many other problems, e.g.

  • Scheduling algorithms
  • String algorithms (e.g. sequence alignment)
  • Graph algorithms (e.g. shortest path algorithms)
  • Graphical models (e.g. Viterbi algorithm)
  • Bioinformatics (e.g. lattice models)

Iterative Policy Evaluation

  • Problem: evaluate a given policy π
  • Solution: iterative application of Bellman expectation backup
  • v_1 → v_2 → ... → v_{\pi}
  • Using synchronous backups,
    • At each iteration k + 1
    • For all states s ∈ S
    • Update v_{k+1}(s) from v_k(s′)
    • where s′ is a successor state of s
  • We will discuss asynchronous backups later
  • Convergence to v_{\pi} will be proven at the end of the lecture
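
A rough sketch of this synchronous backup in Python, assuming the illustrative dict-based MDP format above (mdp[s][a] is a list of (probability, next_state, reward) triples) and a stochastic policy given as policy[s][a] = \pi(a|s); the threshold theta is a practical stand-in for the convergence that the lecture proves in the limit:

```python
# Iterative policy evaluation with synchronous backups (sketch).
# mdp: dict mapping s -> a -> list of (prob, next_state, reward) triples
# policy: dict mapping s -> a -> probability of taking a in s
def policy_evaluation(mdp, policy, gamma=0.9, theta=1e-6):
    v = {s: 0.0 for s in mdp}                    # v_0 initialised to zero
    while True:
        v_new = {}
        for s in mdp:                            # for all states s in S
            total = 0.0
            for a, pi_sa in policy[s].items():   # expectation over pi(a|s)
                for prob, s_next, r in mdp[s][a]:
                    total += pi_sa * prob * (r + gamma * v[s_next])
            v_new[s] = total                     # v_{k+1}(s) from v_k(s')
        delta = max(abs(v_new[s] - v[s]) for s in mdp)
        v = v_new
        if delta < theta:                        # stop once updates are tiny
            return v
```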

Iterative Policy Evaluation (2)

Evaluating a Random Policy in the Small Gridworld

Iterative Policy Evaluation in Small Gridworld

Iterative Policy Evaluation in Small Gridworld (2)

How to Improve a Policy

Policy Iteration

Jack’s Car Rental

Policy Iteration in Jack’s Car Rental

Policy Improvement

Policy Improvement (2)

Modified Policy Iteration

  • Does policy evaluation need to converge to v_π?
  • Or should we introduce a stopping condition
    • e.g. ε-convergence of value function
  • Or simply stop after k iterations of iterative policy evaluation?
  • For example, in the small gridworld k = 3 was sufficient to achieve the optimal policy
  • Why not update policy every iteration? i.e. stop after k = 1
    • This is equivalent to value iteration (next section)
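
A sketch of these stopping rules, under the same illustrative MDP format as above: the evaluation loop below is truncated after max_sweeps sweeps or on ε-convergence, whichever comes first. With max_sweeps = 1, alternating this with greedy policy improvement is the k = 1 scheme that the last bullet identifies with value iteration.

```python
# Truncated ("modified") policy evaluation: stop after max_sweeps sweeps
# of the Bellman expectation backup, or earlier if the largest update
# falls below epsilon. mdp and policy use the same dict format as above.
def truncated_policy_evaluation(mdp, policy, v, gamma=0.9,
                                max_sweeps=3, epsilon=1e-6):
    for _ in range(max_sweeps):
        delta, v_new = 0.0, {}
        for s in mdp:
            total = 0.0
            for a, pi_sa in policy[s].items():
                for prob, s_next, r in mdp[s][a]:
                    total += pi_sa * prob * (r + gamma * v[s_next])
            v_new[s] = total
            delta = max(delta, abs(total - v[s]))
        v = v_new
        if delta < epsilon:          # epsilon-convergence reached early
            break
    return v
```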

Generalised Policy Iteration

Principle of Optimality

Deterministic Value Iteration

Example: Shortest Path

Value Iteration

  • Problem: find optimal policy π
  • Solution: iterative application of Bellman optimality backup
  • v_1 → v_2 → ... → v_∗
  • Using synchronous backups
    • At each iteration k + 1
    • For all states s ∈ S
    • Update v_{k+1}(s) from v_k(s′)
  • Convergence to v_∗ will be proven later
  • Unlike policy iteration, there is no explicit policy
  • Intermediate value functions may not correspond to any policy
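
A rough sketch of value iteration with the Bellman optimality backup, again assuming the illustrative dict-based MDP format (mdp[s][a] as (probability, next_state, reward) triples); a greedy policy is only read off at the end, matching the point that no explicit policy is maintained during the iterations:

```python
# Value iteration with synchronous Bellman optimality backups (sketch).
# mdp: dict mapping s -> a -> list of (prob, next_state, reward) triples
def value_iteration(mdp, gamma=0.9, theta=1e-6):
    v = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        v_new = {}
        for s in mdp:
            # Bellman optimality backup: max over actions of expected return
            v_new[s] = max(
                sum(prob * (r + gamma * v[s_next])
                    for prob, s_next, r in mdp[s][a])
                for a in mdp[s]
            )
            delta = max(delta, abs(v_new[s] - v[s]))
        v = v_new
        if delta < theta:
            break
    # No policy is kept during the iterations; a greedy policy can be
    # extracted from the converged value function afterwards.
    greedy = {}
    for s in mdp:
        q = {a: sum(prob * (r + gamma * v[s_next])
                    for prob, s_next, r in mdp[s][a]) for a in mdp[s]}
        greedy[s] = max(q, key=q.get)
    return v, greedy
```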

Value Iteration (2)

Example of Value Iteration in Practice

Synchronous Dynamic Programming Algorithms

Asynchronous Dynamic Programming

  • DP methods described so far used synchronous backups
  • i.e. all states are backed up in parallel
  • Asynchronous DP backs up states individually, in any order
  • For each selected state, apply the appropriate backup
  • Can significantly reduce computation
  • Guaranteed to converge if all states continue to be selected

Asynchronous Dynamic Programming

Three simple ideas for asynchronous dynamic programming:

  1. In-place dynamic programming
  2. Prioritised sweeping
  3. Real-time dynamic programming

In-Place Dynamic Programming
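
In-place value iteration keeps a single copy of the value function and overwrites it as it sweeps, so later backups within a sweep already see earlier updates. A rough sketch, assuming the illustrative dict-based MDP format used earlier (the fixed sweep count is an arbitrary choice for the sketch):

```python
# In-place value iteration (sketch): one value table is both read and
# written during each sweep, and states may be visited in any order.
def in_place_value_iteration(mdp, gamma=0.9, sweeps=100):
    v = {s: 0.0 for s in mdp}
    for _ in range(sweeps):
        for s in mdp:                  # any ordering of states is allowed
            v[s] = max(
                sum(prob * (r + gamma * v[s_next])
                    for prob, s_next, r in mdp[s][a])
                for a in mdp[s]
            )
    return v
```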

Prioritised Sweeping

Real-Time Dynamic Programming

Full-Width Backups

Sample Backups

Approximate Dynamic Programming

Some Technical Questions

  • How do we know that value iteration converges to v_∗?
  • Or that iterative policy evaluation converges to v_π?
  • And therefore that policy iteration converges to v_∗?
  • Is the solution unique?
  • How fast do these algorithms converge?
  • These questions are resolved by the contraction mapping theorem

Value Function Space

  • Consider the vector space V over value functions
  • There are |S| dimensions
  • Each point in this space fully specifies a value function v(s)
  • What does a Bellman backup do to points in this space?
  • We will show that it brings value functions closer
  • And therefore the backups must converge on a unique solution

Value Function ∞-Norm
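
The distance between two value functions is measured by the ∞-norm, i.e. the largest difference between state values (the standard definition, consistent with the notation above):

```latex
% Infinity-norm distance between value functions u and v
\| u - v \|_{\infty} = \max_{s \in \mathcal{S}} | u(s) - v(s) |
```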

Bellman Expectation Backup is a Contraction
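
The key step is that the Bellman expectation backup operator, applied to the whole value-function vector, is a γ-contraction under this norm; a sketch of the statement, with the matrix form following the MRP notation \langle S, P^{\pi}, R^{\pi}, \gamma \rangle used earlier:

```latex
% Bellman expectation backup operator for a fixed policy \pi
T^{\pi}(v) = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v

% gamma-contraction in the infinity-norm: applying T^{\pi} brings any
% two value functions closer by at least a factor of \gamma
\| T^{\pi}(u) - T^{\pi}(v) \|_{\infty} \le \gamma \, \| u - v \|_{\infty}
```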

Contraction Mapping Theorem

Convergence of Iter. Policy Evaluation and Policy Iteration

  • The Bellman expectation operator T^π has a unique fixed point
  • v_π is a fixed point of T^π (by Bellman expectation equation)
  • By contraction mapping theorem
  • Iterative policy evaluation converges on v_π
  • Policy iteration converges on v_∗

Bellman Optimality Backup is a Contraction
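
The optimality case is analogous: the Bellman optimality backup operator, defined state-wise with a max over actions, is also a γ-contraction in the ∞-norm (sketched below in the same notation):

```latex
% Bellman optimality backup operator
\left( T^{*}(v) \right)(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^{a}_{s}
    + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'} \, v(s') \right)

% gamma-contraction in the infinity-norm
\| T^{*}(u) - T^{*}(v) \|_{\infty} \le \gamma \, \| u - v \|_{\infty}
```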

Convergence of Value Iteration

  • The Bellman optimality operator T^∗ has a unique fixed point
  • v_∗ is a fixed point of T^∗ (by Bellman optimality equation)
  • By contraction mapping theorem
  • Value iteration converges on v_∗

Reference: David Silver, UCL Course on RL, Lecture 3: Planning by Dynamic Programming
