Author: David Silver
Outline
- Introduction
- Policy Evaluation
- Policy Iteration
- Value Iteration
- Extensions to Dynamic Programming
- Contraction Mapping
What is Dynamic Programming?
- Dynamic: sequential or temporal component to the problem
- Programming: optimising a “program”, i.e. a policy
  - c.f. linear programming
- A method for solving complex problems
- By breaking them down into subproblems
  - Solve the subproblems
  - Combine solutions to subproblems
Requirements for Dynamic Programming
Dynamic Programming is a very general solution method for problems which have two properties:
- Optimal substructure
  - Principle of optimality applies
  - Optimal solution can be decomposed into subproblems
- Overlapping subproblems
  - Subproblems recur many times
  - Solutions can be cached and reused
- Markov decision processes satisfy both properties
  - Bellman equation gives recursive decomposition
  - Value function stores and reuses solutions
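For reference, the recursive decomposition referred to here is the Bellman expectation equation. A sketch in one common notation (R_s^a for the expected reward and P_{ss'}^a for the transition probability of taking action a in state s):

```latex
% Value of s under policy \pi: immediate expected reward plus the
% discounted value of the successor states (recursive decomposition).
v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)
           \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_\pi(s') \right)
```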
Planning by Dynamic Programming
- Dynamic programming assumes full knowledge of the MDP
- It is used for planning in an MDP
- For prediction:
  - Input: MDP ⟨S, A, P, R, γ⟩ and policy π
  - or: MRP ⟨S, P^π, R^π, γ⟩
  - Output: value function v_π
- Or for control:
  - Input: MDP ⟨S, A, P, R, γ⟩
  - Output: optimal value function v_*
  - and: optimal policy π_*
Other Applications of Dynamic Programming
Dynamic programming is used to solve many other problems, e.g.
- Scheduling algorithms
- String algorithms (e.g. sequence alignment)
- Graph algorithms (e.g. shortest path algorithms)
- Graphical models (e.g. Viterbi algorithm)
- Bioinformatics (e.g. lattice models)
Iterative Policy Evaluation
- Problem: evaluate a given policy π
- Solution: iterative application of Bellman expectation backup
- Using synchronous backups,
  - At each iteration k + 1
  - For all states s ∈ S
  - Update v_{k+1}(s) from v_k(s′)
  - where s′ is a successor state of s
- We will discuss asynchronous backups later
- Convergence to v_π will be proven at the end of the lecture
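A minimal sketch of this procedure in Python, assuming a tabular MDP where P[s][a] is a list of (probability, next_state, reward) triples and policy[s][a] is the probability of taking action a in state s (both representations are illustrative, not from the lecture):

```python
import numpy as np

def iterative_policy_evaluation(P, policy, gamma=1.0, theta=1e-6):
    """Synchronous iterative policy evaluation.

    P[s][a]     : list of (probability, next_state, reward) triples (assumed format)
    policy[s][a]: probability of taking action a in state s
    """
    n_states = len(P)
    v = np.zeros(n_states)                       # v_0(s) = 0 for all s
    while True:
        v_new = np.zeros(n_states)               # synchronous: every state backed up from v_k
        for s in range(n_states):
            for a, pi_sa in enumerate(policy[s]):
                for prob, s_next, reward in P[s][a]:
                    # Bellman expectation backup for v_{k+1}(s)
                    v_new[s] += pi_sa * prob * (reward + gamma * v[s_next])
        if np.max(np.abs(v_new - v)) < theta:    # ε-convergence stopping condition
            return v_new
        v = v_new
```

The separate v_new array is what makes the backups synchronous; the in-place variant discussed later in the lecture drops it.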
Iterative Policy Evaluation (2)

Evaluating a Random Policy in the Small Gridworld

Iterative Policy Evaluation in Small Gridworld

Iterative Policy Evaluation in Small Gridworld (2)

How to Improve a Policy

Policy Iteration

Jack’s Car Rental

Policy Iteration in Jack’s Car Rental

Policy Improvement

Policy Improvement (2)

Modified Policy Iteration
- Does policy evaluation need to converge to v_π?
- Or should we introduce a stopping condition
  - e.g. ε-convergence of value function
- Or simply stop after k iterations of iterative policy evaluation?
  - For example, in the small gridworld k = 3 was sufficient to achieve the optimal policy
- Why not update policy every iteration? i.e. stop after k = 1
  - This is equivalent to value iteration (next section)
- This is Generalised Policy Iteration
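A minimal sketch of this generalised scheme, truncating policy evaluation at k sweeps before each greedy improvement; the P[s][a] model format and the value-change stopping rule are assumptions for illustration:

```python
import numpy as np

def modified_policy_iteration(P, n_actions, gamma=0.9, k=3, theta=1e-6):
    """Generalised policy iteration: k evaluation sweeps, then greedy improvement.

    P[s][a] is assumed to be a list of (probability, next_state, reward) triples.
    """
    n_states = len(P)
    v = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)          # arbitrary initial deterministic policy

    def q(s, a, v):
        # action-value of (s, a) under the current value estimate
        return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

    while True:
        v_old = v.copy()
        # Truncated policy evaluation: k synchronous Bellman expectation backups
        for _ in range(k):
            v = np.array([q(s, policy[s], v) for s in range(n_states)])
        # Greedy policy improvement with respect to the current value estimate
        policy = np.array([max(range(n_actions), key=lambda a: q(s, a, v))
                           for s in range(n_states)])
        if np.max(np.abs(v - v_old)) < theta:       # stop once the values have settled
            return policy, v
```

With k = 1 each improvement follows a single backup, which is the value-iteration case described in the next section; as k grows the scheme approaches full policy iteration.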

Principle of Optimality

Deterministic Value Iteration

Example: Shortest Path

Value Iteration
- Problem: find optimal policy π
- Solution: iterative application of Bellman optimality backup
- Using synchronous backups
  - At each iteration k + 1
  - For all states s ∈ S
  - Update v_{k+1}(s) from v_k(s′)
- Convergence to v_* will be proven later
- Unlike policy iteration, there is no explicit policy
- Intermediate value functions may not correspond to any policy
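A corresponding sketch of synchronous value iteration, with a greedy policy extracted only at the end (same assumed model format as above):

```python
import numpy as np

def value_iteration(P, n_actions, gamma=0.9, theta=1e-6):
    """Synchronous value iteration via the Bellman optimality backup.

    P[s][a] is assumed to be a list of (probability, next_state, reward) triples.
    """
    n_states = len(P)
    v = np.zeros(n_states)
    while True:
        # v_{k+1}(s) = max_a sum_{s'} P(s'|s,a) * (r + gamma * v_k(s'))
        v_new = np.array([
            max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions))
            for s in range(n_states)
        ])
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    # Extract a deterministic policy by acting greedily on the final values
    policy = np.array([
        max(range(n_actions),
            key=lambda a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]))
        for s in range(n_states)
    ])
    return v, policy
```

Note that a policy is only read off at the end; the intermediate v_k need not correspond to any policy, as noted above.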
Value Iteration (2)

Example of Value Iteration in Practice

Synchronous Dynamic Programming Algorithms

Asynchronous Dynamic Programming
- DP methods described so far used synchronous backups
- i.e. all states are backed up in parallel
- Asynchronous DP backs up states individually, in any order
- For each selected state, apply the appropriate backup
- Can significantly reduce computation
- Guaranteed to converge if all states continue to be selected
Asynchronous Dynamic Programming
Three simple ideas for asynchronous dynamic programming:
1. In-place dynamic programming
2. Prioritised sweeping
3. Real-time dynamic programming
In-Place Dynamic Programming
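In-place value iteration keeps a single value array and lets each backup immediately use the freshest estimates of the other states, instead of maintaining separate copies of v_k and v_{k+1}. A minimal sketch, assuming the same P[s][a] model format as the earlier sketches:

```python
import numpy as np

def in_place_value_iteration(P, n_actions, gamma=0.9, theta=1e-6):
    """Asynchronous (in-place) value iteration.

    Only one value array is kept; each backup immediately reuses the newest
    values of the other states. P[s][a] is assumed to be a list of
    (probability, next_state, reward) triples.
    """
    n_states = len(P)
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):                       # states backed up one at a time
            backup = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                         for a in range(n_actions))
            delta = max(delta, abs(backup - v[s]))
            v[s] = backup                               # overwrite in place
        if delta < theta:
            return v
```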

Prioritised Sweeping
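Prioritised sweeping selects which state to back up next by the magnitude of its remaining Bellman error, which can be maintained with a priority queue over states; the bookkeeping requires knowledge of the reverse dynamics (which states lead into which). A rough sketch, again under the assumed P[s][a] model:

```python
import heapq
import numpy as np

def prioritised_sweeping_value_iteration(P, n_actions, gamma=0.9, theta=1e-6,
                                         max_backups=100_000):
    """Asynchronous value iteration that always backs up the state with the
    largest remaining Bellman error. P[s][a] is assumed to be a list of
    (probability, next_state, reward) triples."""
    n_states = len(P)
    v = np.zeros(n_states)

    # Reverse dynamics: predecessors[s2] = states that can transition into s2
    predecessors = [set() for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            for prob, s2, _ in P[s][a]:
                if prob > 0:
                    predecessors[s2].add(s)

    def bellman_error(s):
        # Magnitude of the Bellman optimality error at s, plus the backed-up value
        best = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                   for a in range(n_actions))
        return abs(best - v[s]), best

    # Max-priority queue keyed on Bellman error (heapq is a min-heap, so negate)
    queue = [(-bellman_error(s)[0], s) for s in range(n_states)]
    heapq.heapify(queue)

    for _ in range(max_backups):
        if not queue:
            break
        neg_err, s = heapq.heappop(queue)
        if -neg_err < theta:                     # largest remaining error is tiny: done
            break
        _, v[s] = bellman_error(s)               # back up the selected state
        for pred in predecessors[s]:             # its predecessors' errors have changed
            err, _ = bellman_error(pred)
            heapq.heappush(queue, (-err, pred))
    return v
```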

Real-Time Dynamic Programming

Full-Width Backups

Sample Backups

Approximate Dynamic Programming

Some Technical Questions
- How do we know that value iteration converges to v_*?
- Or that iterative policy evaluation converges to v_π?
- And therefore that policy iteration converges to v_*?
- Is the solution unique?
- How fast do these algorithms converge?
- These questions are resolved by the contraction mapping theorem
Value Function Space
- Consider the vector space V over value functions
- There are |S| dimensions
- Each point in this space fully specifies a value function v(s)
- What does a Bellman backup do to points in this space?
- We will show that it brings value functions closer
- And therefore the backups must converge on a unique solution
Value Function ∞-Norm
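The ∞-norm measures the distance between two value functions u and v by the largest difference between their state values:

```latex
\lVert u - v \rVert_\infty = \max_{s \in \mathcal{S}} \, \lvert u(s) - v(s) \rvert
```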

Bellman Expectation Backup is a Contraction
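A sketch of the argument, writing the Bellman expectation backup as an operator T^π (notation assumed here), with R^π and P^π the reward vector and transition matrix induced by π; the contraction follows because the rows of P^π sum to one:

```latex
T^{\pi}(v) = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v
\qquad
\lVert T^{\pi}(u) - T^{\pi}(v) \rVert_\infty
  = \gamma \, \lVert \mathcal{P}^{\pi}(u - v) \rVert_\infty
  \le \gamma \, \lVert u - v \rVert_\infty
```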

Contraction Mapping Theorem
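For completeness, the contraction mapping (Banach fixed-point) theorem in the form needed here: if an operator T on the complete space of value functions is a γ-contraction with γ < 1, as below, then T has a unique fixed point, and repeated application of T converges to that fixed point at a linear rate of γ.

```latex
\lVert T(u) - T(v) \rVert_\infty \;\le\; \gamma \, \lVert u - v \rVert_\infty
```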

Convergence of Iter. Policy Evaluation and Policy Iteration
- The Bellman expectation operator T^π has a unique fixed point
- v_π is a fixed point of T^π (by the Bellman expectation equation)
- By the contraction mapping theorem:
  - Iterative policy evaluation converges on v_π
  - Policy iteration converges on v_*
Bellman Optimality Backup is a Contraction
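Similarly, writing the Bellman optimality backup as an operator T^* (notation assumed), the max over actions is non-expansive, so T^* is also a γ-contraction in the ∞-norm:

```latex
T^{*}(v) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^{a} + \gamma \mathcal{P}^{a} v \right)
\qquad
\lVert T^{*}(u) - T^{*}(v) \rVert_\infty \le \gamma \, \lVert u - v \rVert_\infty
```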

Convergence of Value Iteration
- The Bellman optimality operator T^* has a unique fixed point
- v_* is a fixed point of T^* (by the Bellman optimality equation)
- By the contraction mapping theorem:
  - Value iteration converges on v_*
Reference: UCL Course on RL (David Silver)