Author: David Silver
Outline
- Introduction
- Policy Evaluation
- Policy Iteration
- Value Iteration
- Extensions to Dynamic Programming
- Contraction Mapping
What is Dynamic Programming?
- Dynamic: sequential or temporal component to the problem
- Programming: optimising a “program”, i.e. a policy
  - c.f. linear programming
- A method for solving complex problems
- By breaking them down into subproblems
  - Solve the subproblems
  - Combine solutions to subproblems
Requirements for Dynamic Programming
Dynamic Programming is a very general solution method for problems which have two properties:
- Optimal substructure
  - Principle of optimality applies
  - Optimal solution can be decomposed into subproblems
- Overlapping subproblems
  - Subproblems recur many times
  - Solutions can be cached and reused
- Markov decision processes satisfy both properties
  - Bellman equation gives recursive decomposition
  - Value function stores and reuses solutions
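For reference, the recursive decomposition referred to here is the Bellman expectation equation. A sketch in one common notation (R_s^a for the expected reward and P_{ss'}^a for the transition probability of taking action a in state s):

```latex
% Value of s under policy \pi: immediate expected reward plus the
% discounted value of the successor states (recursive decomposition).
v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)
           \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_\pi(s') \right)
```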
Planning by Dynamic Programming
- Dynamic programming assumes full knowledge of the MDP
- It is used for planning in an MDP
- For prediction:
  - Input: MDP ⟨S, A, P, R, γ⟩ and policy π
  - or: MRP ⟨S, P^π, R^π, γ⟩
  - Output: value function v_π
- Or for control:
  - Input: MDP ⟨S, A, P, R, γ⟩
  - Output: optimal value function v_*
  - and: optimal policy π_*
Other Applications of Dynamic Programming
Dynamic programming is used to solve many other problems, e.g.
- Scheduling algorithms
- String algorithms (e.g. sequence alignment)
- Graph algorithms (e.g. shortest path algorithms)
- Graphical models (e.g. Viterbi algorithm)
- Bioinformatics (e.g. lattice models)
Iterative Policy Evaluation
- Problem: evaluate a given policy π
- Solution: iterative application of Bellman expectation backup
- Using synchronous backups,
  - At each iteration k + 1
  - For all states s ∈ S
  - Update v_{k+1}(s) from v_k(s′)
  - where s′ is a successor state of s
- We will discuss asynchronous backups later
- Convergence to v_π will be proven at the end of the lecture
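A minimal sketch of this procedure in Python, assuming a tabular MDP where P[s][a] is a list of (probability, next_state, reward) triples and policy[s][a] is the probability of taking action a in state s (both representations are illustrative, not from the lecture):

```python
import numpy as np

def iterative_policy_evaluation(P, policy, gamma=1.0, theta=1e-6):
    """Synchronous iterative policy evaluation.

    P[s][a]     : list of (probability, next_state, reward) triples (assumed format)
    policy[s][a]: probability of taking action a in state s
    """
    n_states = len(P)
    v = np.zeros(n_states)                       # v_0(s) = 0 for all s
    while True:
        v_new = np.zeros(n_states)               # synchronous: every state backed up from v_k
        for s in range(n_states):
            for a, pi_sa in enumerate(policy[s]):
                for prob, s_next, reward in P[s][a]:
                    # Bellman expectation backup for v_{k+1}(s)
                    v_new[s] += pi_sa * prob * (reward + gamma * v[s_next])
        if np.max(np.abs(v_new - v)) < theta:    # ε-convergence stopping condition
            return v_new
        v = v_new
```

The separate v_new array is what makes the backups synchronous; the in-place variant discussed later in the lecture drops it.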
Iterative Policy Evaluation (2)

Evaluating a Random Policy in the Small Gridworld

Iterative Policy Evaluation in Small Gridworld

Iterative Policy Evaluation in Small Gridworld (2)

How to Improve a Policy

Policy Iteration

Jack’s Car Rental

Policy Iteration in Jack’s Car Rental

Policy Improvement

Policy Improvement (2)

Modified Policy Iteration
- Does policy evaluation need to converge to v_π?
- Or should we introduce a stopping condition
  - e.g. ε-convergence of value function
- Or simply stop after k iterations of iterative policy evaluation?
  - For example, in the small gridworld k = 3 was sufficient to achieve the optimal policy
- Why not update policy every iteration? i.e. stop after k = 1
  - This is equivalent to value iteration (next section)
- This is Generalised Policy Iteration
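A minimal sketch of this generalised scheme, truncating policy evaluation at k sweeps before each greedy improvement; the P[s][a] model format and the value-change stopping rule are assumptions for illustration:

```python
import numpy as np

def modified_policy_iteration(P, n_actions, gamma=0.9, k=3, theta=1e-6):
    """Generalised policy iteration: k evaluation sweeps, then greedy improvement.

    P[s][a] is assumed to be a list of (probability, next_state, reward) triples.
    """
    n_states = len(P)
    v = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)          # arbitrary initial deterministic policy

    def q(s, a, v):
        # action-value of (s, a) under the current value estimate
        return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

    while True:
        v_old = v.copy()
        # Truncated policy evaluation: k synchronous Bellman expectation backups
        for _ in range(k):
            v = np.array([q(s, policy[s], v) for s in range(n_states)])
        # Greedy policy improvement with respect to the current value estimate
        policy = np.array([max(range(n_actions), key=lambda a: q(s, a, v))
                           for s in range(n_states)])
        if np.max(np.abs(v - v_old)) < theta:       # stop once the values have settled
            return policy, v
```

With k = 1 each improvement follows a single backup, which is the value-iteration case described in the next section; as k grows the scheme approaches full policy iteration.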

Principle of Optimality

Deterministic Value Iteration

Example: Shortest Path

Value Iteration
- Problem: find optimal policy π
- Solution: iterative application of Bellman optimality backup
- Using synchronous backups
  - At each iteration k + 1
  - For all states s ∈ S
  - Update v_{k+1}(s) from v_k(s′)
- Convergence to v_* will be proven later
- Unlike policy iteration, there is no explicit policy
- Intermediate value functions may not correspond to any policy
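A corresponding sketch of synchronous value iteration, with a greedy policy extracted only at the end (same assumed model format as above):

```python
import numpy as np

def value_iteration(P, n_actions, gamma=0.9, theta=1e-6):
    """Synchronous value iteration via the Bellman optimality backup.

    P[s][a] is assumed to be a list of (probability, next_state, reward) triples.
    """
    n_states = len(P)
    v = np.zeros(n_states)
    while True:
        # v_{k+1}(s) = max_a sum_{s'} P(s'|s,a) * (r + gamma * v_k(s'))
        v_new = np.array([
            max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions))
            for s in range(n_states)
        ])
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    # Extract a deterministic policy by acting greedily on the final values
    policy = np.array([
        max(range(n_actions),
            key=lambda a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]))
        for s in range(n_states)
    ])
    return v, policy
```

Note that a policy is only read off at the end; the intermediate v_k need not correspond to any policy, as noted above.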
Value Iteration (2)

Example of Value Iteration in Practice

Synchronous Dynamic Programming Algorithms

Asynchronous Dynamic Programming
- DP methods described so far used synchronous backups
- i.e. all states are backed up in parallel
- Asynchronous DP backs up states individually, in any order
- For each selected state, apply the appropriate backup
- Can significantly reduce computation
- Guaranteed to converge if all states continue to be selected
Asynchronous Dynamic Programming
Three simple ideas for asynchronous dynamic programming:
1. In-place dynamic programming
2. Prioritised sweeping
3. Real-time dynamic programming
In-Place Dynamic Programming
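In-place value iteration keeps a single value array and lets each backup immediately use the freshest estimates of the other states, instead of maintaining separate copies of v_k and v_{k+1}. A minimal sketch, assuming the same P[s][a] model format as the earlier sketches:

```python
import numpy as np

def in_place_value_iteration(P, n_actions, gamma=0.9, theta=1e-6):
    """Asynchronous (in-place) value iteration.

    Only one value array is kept; each backup immediately reuses the newest
    values of the other states. P[s][a] is assumed to be a list of
    (probability, next_state, reward) triples.
    """
    n_states = len(P)
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):                       # states backed up one at a time
            backup = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                         for a in range(n_actions))
            delta = max(delta, abs(backup - v[s]))
            v[s] = backup                               # overwrite in place
        if delta < theta:
            return v
```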

Prioritised Sweeping
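Prioritised sweeping selects which state to back up next by the magnitude of its remaining Bellman error, which can be maintained with a priority queue over states; the bookkeeping requires knowledge of the reverse dynamics (which states lead into which). A rough sketch, again under the assumed P[s][a] model:

```python
import heapq
import numpy as np

def prioritised_sweeping_value_iteration(P, n_actions, gamma=0.9, theta=1e-6,
                                         max_backups=100_000):
    """Asynchronous value iteration that always backs up the state with the
    largest remaining Bellman error. P[s][a] is assumed to be a list of
    (probability, next_state, reward) triples."""
    n_states = len(P)
    v = np.zeros(n_states)

    # Reverse dynamics: predecessors[s2] = states that can transition into s2
    predecessors = [set() for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            for prob, s2, _ in P[s][a]:
                if prob > 0:
                    predecessors[s2].add(s)

    def bellman_error(s):
        # Magnitude of the Bellman optimality error at s, plus the backed-up value
        best = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                   for a in range(n_actions))
        return abs(best - v[s]), best

    # Max-priority queue keyed on Bellman error (heapq is a min-heap, so negate)
    queue = [(-bellman_error(s)[0], s) for s in range(n_states)]
    heapq.heapify(queue)

    for _ in range(max_backups):
        if not queue:
            break
        neg_err, s = heapq.heappop(queue)
        if -neg_err < theta:                     # largest remaining error is tiny: done
            break
        _, v[s] = bellman_error(s)               # back up the selected state
        for pred in predecessors[s]:             # its predecessors' errors have changed
            err, _ = bellman_error(pred)
            heapq.heappush(queue, (-err, pred))
    return v
```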

Real-Time Dynamic Programming

Full-Width Backups

Sample Backups

Approximate Dynamic Programming

Some Technical Questions
- How do we know that value iteration converges to v_*?
- Or that iterative policy evaluation converges to v_π?
- And therefore that policy iteration converges to v_*?
- Is the solution unique?
- How fast do these algorithms converge?
- These questions are resolved by the contraction mapping theorem
Value Function Space
- Consider the vector space V over value functions
- There are |S| dimensions
- Each point in this space fully specifies a value function v(s)
- What does a Bellman backup do to points in this space?
- We will show that it brings value functions closer
- And therefore the backups must converge on a unique solution
Value Function ∞-Norm
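The ∞-norm measures the distance between two value functions u and v by the largest difference between their state values:

```latex
\lVert u - v \rVert_\infty = \max_{s \in \mathcal{S}} \, \lvert u(s) - v(s) \rvert
```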

Bellman Expectation Backup is a Contraction
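A sketch of the argument, writing the Bellman expectation backup as an operator T^π (notation assumed here), with R^π and P^π the reward vector and transition matrix induced by π; the contraction follows because the rows of P^π sum to one:

```latex
T^{\pi}(v) = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v
\qquad
\lVert T^{\pi}(u) - T^{\pi}(v) \rVert_\infty
  = \gamma \, \lVert \mathcal{P}^{\pi}(u - v) \rVert_\infty
  \le \gamma \, \lVert u - v \rVert_\infty
```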

Contraction Mapping Theorem
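For completeness, the contraction mapping (Banach fixed-point) theorem in the form needed here: if an operator T on the complete space of value functions is a γ-contraction with γ < 1, as below, then T has a unique fixed point, and repeated application of T converges to that fixed point at a linear rate of γ.

```latex
\lVert T(u) - T(v) \rVert_\infty \;\le\; \gamma \, \lVert u - v \rVert_\infty
```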

Convergence of Iter. Policy Evaluation and Policy Iteration
- The Bellman expectation operator T^π has a unique fixed point
- v_π is a fixed point of T^π (by the Bellman expectation equation)
- By the contraction mapping theorem:
  - Iterative policy evaluation converges on v_π
  - Policy iteration converges on v_*
Bellman Optimality Backup is a Contraction
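Similarly, writing the Bellman optimality backup as an operator T^* (notation assumed), the max over actions is non-expansive, so T^* is also a γ-contraction in the ∞-norm:

```latex
T^{*}(v) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^{a} + \gamma \mathcal{P}^{a} v \right)
\qquad
\lVert T^{*}(u) - T^{*}(v) \rVert_\infty \le \gamma \, \lVert u - v \rVert_\infty
```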

Convergence of Value Iteration
- The Bellman optimality operator T^* has a unique fixed point
- v_* is a fixed point of T^* (by the Bellman optimality equation)
- By the contraction mapping theorem:
  - Value iteration converges on v_*
Reference: UCL Course on RL (David Silver)