Author: David Silver
Outline
- Introduction
- Policy Evaluation
- Policy Iteration
- Value Iteration
- Extensions to Dynamic Programming
- Contraction Mapping
What is Dynamic Programming?
- Dynamic: sequential or temporal component to the problem
- Programming: optimising a “program”, i.e. a policy
  - c.f. linear programming
- A method for solving complex problems
- By breaking them down into subproblems
  - Solve the subproblems
  - Combine solutions to subproblems
Requirements for Dynamic Programming
Dynamic Programming is a very general solution method for problems which have two properties:
- Optimal substructure
  - Principle of optimality applies
  - Optimal solution can be decomposed into subproblems
- Overlapping subproblems
  - Subproblems recur many times
  - Solutions can be cached and reused (a generic caching sketch follows this list)
- Markov decision processes satisfy both properties
  - Bellman equation gives recursive decomposition
  - Value function stores and reuses solutions
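As a generic illustration of these two properties (not from the lecture), the sketch below computes shortest-path costs on a small directed acyclic graph: optimal substructure lets each node's cost be defined through its successors' costs, and a cache stores each subproblem's solution so it is reused rather than recomputed. The `EDGES` table and `cost_to_goal` name are hypothetical.

```python
from functools import lru_cache

# Hypothetical DAG: node -> list of (successor, edge_cost); "G" is the goal.
EDGES = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 2.0), ("G", 6.0)],
    "C": [("G", 1.0)],
    "G": [],
}

@lru_cache(maxsize=None)          # overlapping subproblems: cache and reuse
def cost_to_goal(node: str) -> float:
    """Minimal cost from `node` to the goal, via optimal substructure."""
    if node == "G":
        return 0.0
    # Optimal substructure: the best route from `node` is one edge plus
    # the best route from the chosen successor.
    return min(cost + cost_to_goal(nxt) for nxt, cost in EDGES[node])

print(cost_to_goal("A"))  # 4.0, via A -> B -> C -> G
```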
Planning by Dynamic Programming
- Dynamic programming assumes full knowledge of the MDP
- It is used for planning in an MDP
- For prediction:
  - Input: MDP and policy π
  - or: MRP
  - Output: value function v_π
- Or for control:
  - Input: MDP
  - Output: optimal value function v_*
  - and: optimal policy π_*
Other Applications of Dynamic Programming
Dynamic programming is used to solve many other problems, e.g.
- Scheduling algorithms
- String algorithms (e.g. sequence alignment)
- Graph algorithms (e.g. shortest path algorithms)
- Graphical models (e.g. Viterbi algorithm)
- Bioinformatics (e.g. lattice models)
Iterative Policy Evaluation
- Problem: evaluate a given policy π
- Solution: iterative application of the Bellman expectation backup (a code sketch follows this list)
- Using synchronous backups,
  - At each iteration k + 1
  - For all states s ∈ S
  - Update v_{k+1}(s) from v_k(s′)
  - where s′ is a successor state of s
- We will discuss asynchronous backups later
- Convergence to v_π will be proven at the end of the lecture
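A minimal sketch of synchronous iterative policy evaluation for a small tabular MDP stored as arrays. The layout `P[a, s, s']`, `R[s, a]`, `pi[s, a]` and the function name are illustrative assumptions, not the lecture's notation.

```python
import numpy as np

def iterative_policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Synchronous iterative policy evaluation (Bellman expectation backups).

    Assumed (illustrative) layout:
      P[a, s, s'] -- transition probabilities
      R[s, a]     -- expected immediate rewards
      pi[s, a]    -- action probabilities of the policy being evaluated
    """
    v = np.zeros(R.shape[0])
    while True:
        # q(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) * v_k(s')
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        # v_{k+1}(s) = sum_a pi(a | s) * q(s, a), backed up for all states at once
        v_new = np.sum(pi * q, axis=1)
        if np.max(np.abs(v_new - v)) < theta:   # stop when a full sweep changes little
            return v_new
        v = v_new
```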
Iterative Policy Evaluation (2)
Evaluating a Random Policy in the Small Gridworld
Iterative Policy Evaluation in Small Gridworld
Iterative Policy Evaluation in Small Gridworld (2)
How to Improve a Policy
Policy Iteration
Jack’s Car Rental
Policy Iteration in Jack’s Car Rental
Policy Improvement
Policy Improvement (2)
Modified Policy Iteration
- Does policy evaluation need to converge to v_π?
- Or should we introduce a stopping condition
  - e.g. ε-convergence of value function
- Or simply stop after k iterations of iterative policy evaluation?
  - For example, in the small gridworld k = 3 was sufficient to achieve the optimal policy
- Why not update the policy every iteration? i.e. stop after k = 1
  - This is equivalent to value iteration (next section); a sketch with a truncated evaluation phase follows this list
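A sketch of policy iteration with a truncated (k-step) evaluation phase, in the spirit of the bullets above; `eval_steps=1` gives the value-iteration-like variant. The array layout matches the earlier illustrative sketch and is an assumption, not the lecture's notation.

```python
import numpy as np

def modified_policy_iteration(P, R, gamma=0.9, eval_steps=3, iters=1000):
    """Greedy policy improvement with only `eval_steps` evaluation backups.

    P[a, s, s'] and R[s, a] follow the illustrative layout used earlier.
    """
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)          # deterministic policy: s -> a
    for _ in range(iters):
        # Truncated policy evaluation: k Bellman expectation backups.
        for _ in range(eval_steps):
            q = R + gamma * np.einsum("ast,t->sa", P, v)
            v = q[np.arange(n_states), policy]
        # Policy improvement: act greedily with respect to the current v.
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        new_policy = np.argmax(q, axis=1)
        if np.array_equal(new_policy, policy):      # policy stable -> stop
            return v, policy
        policy = new_policy
    return v, policy
```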
Generalised Policy Iteration
Principle of Optimality
Deterministic Value Iteration
Example: Shortest Path
Value Iteration
- Problem: find the optimal policy π
- Solution: iterative application of the Bellman optimality backup (a sketch follows this list)
- Using synchronous backups
  - At each iteration k + 1
  - For all states s ∈ S
  - Update v_{k+1}(s) from v_k(s′)
- Convergence to v_* will be proven later
- Unlike policy iteration, there is no explicit policy
- Intermediate value functions may not correspond to any policy
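A minimal sketch of synchronous value iteration under the same assumed array layout as the earlier sketches. A greedy policy is extracted only once the values have converged, matching the point that intermediate value functions need not correspond to any policy.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Synchronous value iteration (Bellman optimality backups).

    P[a, s, s'] and R[s, a] follow the illustrative layout used earlier.
    """
    v = np.zeros(R.shape[0])
    while True:
        # v_{k+1}(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s' | s, a) * v_k(s') ]
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        v_new = np.max(q, axis=1)
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    # Extract a greedy policy only at the end.
    q = R + gamma * np.einsum("ast,t->sa", P, v_new)
    return v_new, np.argmax(q, axis=1)
```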
Value Iteration (2)
Example of Value Iteration in Practice
Synchronous Dynamic Programming Algorithms
Asynchronous Dynamic Programming
- DP methods described so far used synchronous backups
- i.e. all states are backed up in parallel
- Asynchronous DP backs up states individually, in any order
- For each selected state, apply the appropriate backup
- Can significantly reduce computation
- Guaranteed to converge if all states continue to be selected
Asynchronous Dynamic Programming
Three simple ideas for asynchronous dynamic programming:
1. In-place dynamic programming (see the sketch below)
2. Prioritised sweeping
3. Real-time dynamic programming
In-Place Dynamic Programming
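A sketch of the first idea, in-place value iteration: a single value array is updated state by state, so later backups within the same sweep already see the newly updated values. Layout and names are again illustrative assumptions.

```python
import numpy as np

def in_place_value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Asynchronous (in-place) value iteration: one array, updated per state.

    P[a, s, s'] and R[s, a] follow the illustrative layout used earlier.
    """
    n_states = R.shape[0]
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):                 # states may be visited in any order
            # Bellman optimality backup for this single state, applied in place.
            backup = np.max(R[s] + gamma * P[:, s, :] @ v)
            delta = max(delta, abs(backup - v[s]))
            v[s] = backup                         # later states in the sweep see this
        if delta < theta:
            return v
```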
Prioritised Sweeping
Real-Time Dynamic Programming
Full-Width Backups
Sample Backups
Approximate Dynamic Programming
Some Technical Questions
- How do we know that value iteration converges to v_*?
- Or that iterative policy evaluation converges to v_π?
- And therefore that policy iteration converges to v_*?
- Is the solution unique?
- How fast do these algorithms converge?
- These questions are resolved by the contraction mapping theorem
Value Function Space
- Consider the vector space V over value functions
- There are |S| dimensions
- Each point in this space fully specifies a value function v(s)
- What does a Bellman backup do to points in this space?
- We will show that it brings value functions closer
- And therefore the backups must converge on a unique solution
Value Function ∞-Norm
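For reference, the distance measure used in the remaining slides is the ∞-norm between value functions, i.e. the largest difference over states:

$$ \lVert u - v \rVert_{\infty} = \max_{s \in \mathcal{S}} \lvert u(s) - v(s) \rvert $$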
Bellman Expectation Backup is a Contraction
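The slide's full derivation is not reproduced here; a standard way to state the result is to write the Bellman expectation backup as an operator (using the course's MDP notation R^π, P^π for the policy's reward vector and transition matrix) and bound it in the ∞-norm:

$$ T^{\pi}(v) = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v, \qquad \lVert T^{\pi}(u) - T^{\pi}(v) \rVert_{\infty} = \gamma \lVert \mathcal{P}^{\pi}(u - v) \rVert_{\infty} \le \gamma \lVert u - v \rVert_{\infty} $$

so T^π is a γ-contraction.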
Contraction Mapping Theorem
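For reference, a compact statement of the theorem invoked in the next two slides, where d is the metric on the space of value functions and v* denotes the fixed point (notation introduced here for brevity):

$$ T \text{ is a } \gamma\text{-contraction on a complete metric space } (\mathcal{V}, d) \;\Longrightarrow\; \exists!\, v^{\ast} \in \mathcal{V}:\; T(v^{\ast}) = v^{\ast}, \qquad d\!\left(T^{n}(v),\, v^{\ast}\right) \le \gamma^{n}\, d(v, v^{\ast}) \;\; \forall v $$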
Convergence of Iter. Policy Evaluation and Policy Iteration
- The Bellman expectation operator T^π has a unique fixed point
- v_π is a fixed point of T^π (by the Bellman expectation equation)
- By the contraction mapping theorem:
  - Iterative policy evaluation converges on v_π
  - Policy iteration converges on v_*
Bellman Optimality Backup is a Contraction
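Similarly, writing the Bellman optimality backup as an operator (the max is taken per state over actions), the same kind of bound, not reproduced in detail here, shows it is also a γ-contraction:

$$ T^{\ast}(v) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^{a} + \gamma \mathcal{P}^{a} v \right), \qquad \lVert T^{\ast}(u) - T^{\ast}(v) \rVert_{\infty} \le \gamma \lVert u - v \rVert_{\infty} $$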
Convergence of Value Iteration
- The Bellman optimality operator T^* has a unique fixed point
- v_* is a fixed point of T^* (by the Bellman optimality equation)
- By the contraction mapping theorem:
  - Value iteration converges on v_*
Reference: UCL Course on RL (David Silver)