Author: David Silver
Outline
- Introduction
- Policy Evaluation
- Policy Iteration
- Value Iteration
- Extensions to Dynamic Programming
- Contraction Mapping
What is Dynamic Programming?
- Dynamic: sequential or temporal component to the problem
- Programming: optimising a “program”, i.e. a policy
  - c.f. linear programming
- A method for solving complex problems
- By breaking them down into subproblems
  - Solve the subproblems
  - Combine solutions to subproblems
Requirements for Dynamic Programming
Dynamic Programming is a very general solution method for problems which have two properties:
- Optimal substructure
  - Principle of optimality applies
  - Optimal solution can be decomposed into subproblems
- Overlapping subproblems
  - Subproblems recur many times
  - Solutions can be cached and reused
- Markov decision processes satisfy both properties
  - Bellman equation gives recursive decomposition
  - Value function stores and reuses solutions
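For reference, the recursive decomposition referred to here is the Bellman expectation equation. A sketch in one common notation (R_s^a for the expected reward and P_{ss'}^a for the transition probability of taking action a in state s):

```latex
% Value of s under policy \pi: immediate expected reward plus the
% discounted value of the successor states (recursive decomposition).
v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)
           \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a \, v_\pi(s') \right)
```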
Planning by Dynamic Programming
- Dynamic programming assumes full knowledge of the MDP
- It is used for planning in an MDP
- For prediction:
  - Input: MDP ⟨S, A, P, R, γ⟩ and policy π
  - or: MRP ⟨S, P^π, R^π, γ⟩
  - Output: value function v_π
- Or for control:
  - Input: MDP ⟨S, A, P, R, γ⟩
  - Output: optimal value function v_*
  - and: optimal policy π_*
Other Applications of Dynamic Programming
Dynamic programming is used to solve many other problems, e.g.
- Scheduling algorithms
- String algorithms (e.g. sequence alignment)
- Graph algorithms (e.g. shortest path algorithms)
- Graphical models (e.g. Viterbi algorithm)
- Bioinformatics (e.g. lattice models)
Iterative Policy Evaluation
- Problem: evaluate a given policy π
- Solution: iterative application of Bellman expectation backup
- Using synchronous backups,
  - At each iteration k + 1
  - For all states s ∈ S
  - Update v_{k+1}(s) from v_k(s′)
  - where s′ is a successor state of s
- We will discuss asynchronous backups later
- Convergence to v_π will be proven at the end of the lecture
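A minimal sketch of this procedure in Python, assuming a tabular MDP where P[s][a] is a list of (probability, next_state, reward) triples and policy[s][a] is the probability of taking action a in state s (both representations are illustrative, not from the lecture):

```python
import numpy as np

def iterative_policy_evaluation(P, policy, gamma=1.0, theta=1e-6):
    """Synchronous iterative policy evaluation.

    P[s][a]     : list of (probability, next_state, reward) triples (assumed format)
    policy[s][a]: probability of taking action a in state s
    """
    n_states = len(P)
    v = np.zeros(n_states)                       # v_0(s) = 0 for all s
    while True:
        v_new = np.zeros(n_states)               # synchronous: every state backed up from v_k
        for s in range(n_states):
            for a, pi_sa in enumerate(policy[s]):
                for prob, s_next, reward in P[s][a]:
                    # Bellman expectation backup for v_{k+1}(s)
                    v_new[s] += pi_sa * prob * (reward + gamma * v[s_next])
        if np.max(np.abs(v_new - v)) < theta:    # ε-convergence stopping condition
            return v_new
        v = v_new
```

The separate v_new array is what makes the backups synchronous; the in-place variant discussed later in the lecture drops it.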
Iterative Policy Evaluation (2)

Evaluating a Random Policy in the Small Gridworld

Iterative Policy Evaluation in Small Gridworld

Iterative Policy Evaluation in Small Gridworld (2)

How to Improve a Policy

Policy Iteration

Jack’s Car Rental

Policy Iteration in Jack’s Car Rental

Policy Improvement

Policy Improvement (2)

Modified Policy Iteration
- Does policy evaluation need to converge to v_π?
- Or should we introduce a stopping condition
  - e.g. ε-convergence of value function
- Or simply stop after k iterations of iterative policy evaluation?
  - For example, in the small gridworld k = 3 was sufficient to achieve the optimal policy
- Why not update policy every iteration? i.e. stop after k = 1
  - This is equivalent to value iteration (next section)
- This is Generalised Policy Iteration
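A minimal sketch of this generalised scheme, truncating policy evaluation at k sweeps before each greedy improvement; the P[s][a] model format and the value-change stopping rule are assumptions for illustration:

```python
import numpy as np

def modified_policy_iteration(P, n_actions, gamma=0.9, k=3, theta=1e-6):
    """Generalised policy iteration: k evaluation sweeps, then greedy improvement.

    P[s][a] is assumed to be a list of (probability, next_state, reward) triples.
    """
    n_states = len(P)
    v = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)          # arbitrary initial deterministic policy

    def q(s, a, v):
        # action-value of (s, a) under the current value estimate
        return sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])

    while True:
        v_old = v.copy()
        # Truncated policy evaluation: k synchronous Bellman expectation backups
        for _ in range(k):
            v = np.array([q(s, policy[s], v) for s in range(n_states)])
        # Greedy policy improvement with respect to the current value estimate
        policy = np.array([max(range(n_actions), key=lambda a: q(s, a, v))
                           for s in range(n_states)])
        if np.max(np.abs(v - v_old)) < theta:       # stop once the values have settled
            return policy, v
```

With k = 1 each improvement follows a single backup, which is the value-iteration case described in the next section; as k grows the scheme approaches full policy iteration.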

Principle of Optimality

Deterministic Value Iteration

Example: Shortest Path

Value Iteration
- Problem: find optimal policy π
- Solution: iterative application of Bellman optimality backup
- Using synchronous backups
  - At each iteration k + 1
  - For all states s ∈ S
  - Update v_{k+1}(s) from v_k(s′)
- Convergence to v_* will be proven later
- Unlike policy iteration, there is no explicit policy
- Intermediate value functions may not correspond to any policy
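A corresponding sketch of synchronous value iteration, with a greedy policy extracted only at the end (same assumed model format as above):

```python
import numpy as np

def value_iteration(P, n_actions, gamma=0.9, theta=1e-6):
    """Synchronous value iteration via the Bellman optimality backup.

    P[s][a] is assumed to be a list of (probability, next_state, reward) triples.
    """
    n_states = len(P)
    v = np.zeros(n_states)
    while True:
        # v_{k+1}(s) = max_a sum_{s'} P(s'|s,a) * (r + gamma * v_k(s'))
        v_new = np.array([
            max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                for a in range(n_actions))
            for s in range(n_states)
        ])
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    # Extract a deterministic policy by acting greedily on the final values
    policy = np.array([
        max(range(n_actions),
            key=lambda a: sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a]))
        for s in range(n_states)
    ])
    return v, policy
```

Note that a policy is only read off at the end; the intermediate v_k need not correspond to any policy, as noted above.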
Value Iteration (2)

Example of Value Iteration in Practice

Synchronous Dynamic Programming Algorithms

Asynchronous Dynamic Programming
- DP methods described so far used synchronous backups
- i.e. all states are backed up in parallel
- Asynchronous DP backs up states individually, in any order
- For each selected state, apply the appropriate backup
- Can significantly reduce computation
- Guaranteed to converge if all states continue to be selected
Asynchronous Dynamic Programming
Three simple ideas for asynchronous dynamic programming:
1. In-place dynamic programming
2. Prioritised sweeping
3. Real-time dynamic programming
In-Place Dynamic Programming
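In-place value iteration keeps a single value array and lets each backup immediately use the freshest estimates of the other states, instead of maintaining separate copies of v_k and v_{k+1}. A minimal sketch, assuming the same P[s][a] model format as the earlier sketches:

```python
import numpy as np

def in_place_value_iteration(P, n_actions, gamma=0.9, theta=1e-6):
    """Asynchronous (in-place) value iteration.

    Only one value array is kept; each backup immediately reuses the newest
    values of the other states. P[s][a] is assumed to be a list of
    (probability, next_state, reward) triples.
    """
    n_states = len(P)
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):                       # states backed up one at a time
            backup = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                         for a in range(n_actions))
            delta = max(delta, abs(backup - v[s]))
            v[s] = backup                               # overwrite in place
        if delta < theta:
            return v
```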

Prioritised Sweeping
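Prioritised sweeping selects which state to back up next by the magnitude of its remaining Bellman error, which can be maintained with a priority queue over states; the bookkeeping requires knowledge of the reverse dynamics (which states lead into which). A rough sketch, again under the assumed P[s][a] model:

```python
import heapq
import numpy as np

def prioritised_sweeping_value_iteration(P, n_actions, gamma=0.9, theta=1e-6,
                                         max_backups=100_000):
    """Asynchronous value iteration that always backs up the state with the
    largest remaining Bellman error. P[s][a] is assumed to be a list of
    (probability, next_state, reward) triples."""
    n_states = len(P)
    v = np.zeros(n_states)

    # Reverse dynamics: predecessors[s2] = states that can transition into s2
    predecessors = [set() for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            for prob, s2, _ in P[s][a]:
                if prob > 0:
                    predecessors[s2].add(s)

    def bellman_error(s):
        # Magnitude of the Bellman optimality error at s, plus the backed-up value
        best = max(sum(p * (r + gamma * v[s2]) for p, s2, r in P[s][a])
                   for a in range(n_actions))
        return abs(best - v[s]), best

    # Max-priority queue keyed on Bellman error (heapq is a min-heap, so negate)
    queue = [(-bellman_error(s)[0], s) for s in range(n_states)]
    heapq.heapify(queue)

    for _ in range(max_backups):
        if not queue:
            break
        neg_err, s = heapq.heappop(queue)
        if -neg_err < theta:                     # largest remaining error is tiny: done
            break
        _, v[s] = bellman_error(s)               # back up the selected state
        for pred in predecessors[s]:             # its predecessors' errors have changed
            err, _ = bellman_error(pred)
            heapq.heappush(queue, (-err, pred))
    return v
```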

Real-Time Dynamic Programming

Full-Width Backups

Sample Backups

Approximate Dynamic Programming

Some Technical Questions
- How do we know that value iteration converges to v_*?
- Or that iterative policy evaluation converges to v_π?
- And therefore that policy iteration converges to v_*?
- Is the solution unique?
- How fast do these algorithms converge?
- These questions are resolved by the contraction mapping theorem
Value Function Space
- Consider the vector space V over value functions
- There are |S| dimensions
- Each point in this space fully specifies a value function v(s)
- What does a Bellman backup do to points in this space?
- We will show that it brings value functions closer
- And therefore the backups must converge on a unique solution
Value Function ∞-Norm
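The ∞-norm measures the distance between two value functions u and v by the largest difference between their state values:

```latex
\lVert u - v \rVert_\infty = \max_{s \in \mathcal{S}} \, \lvert u(s) - v(s) \rvert
```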

Bellman Expectation Backup is a Contraction
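A sketch of the argument, writing the Bellman expectation backup as an operator T^π (notation assumed here), with R^π and P^π the reward vector and transition matrix induced by π; the contraction follows because the rows of P^π sum to one:

```latex
T^{\pi}(v) = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v
\qquad
\lVert T^{\pi}(u) - T^{\pi}(v) \rVert_\infty
  = \gamma \, \lVert \mathcal{P}^{\pi}(u - v) \rVert_\infty
  \le \gamma \, \lVert u - v \rVert_\infty
```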

Contraction Mapping Theorem
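For completeness, the contraction mapping (Banach fixed-point) theorem in the form needed here: if an operator T on the complete space of value functions is a γ-contraction with γ < 1, as below, then T has a unique fixed point, and repeated application of T converges to that fixed point at a linear rate of γ.

```latex
\lVert T(u) - T(v) \rVert_\infty \;\le\; \gamma \, \lVert u - v \rVert_\infty
```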

Convergence of Iter. Policy Evaluation and Policy Iteration
- The Bellman expectation operator T^π has a unique fixed point
- v_π is a fixed point of T^π (by the Bellman expectation equation)
- By the contraction mapping theorem:
  - Iterative policy evaluation converges on v_π
  - Policy iteration converges on v_*
Bellman Optimality Backup is a Contraction
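Similarly, writing the Bellman optimality backup as an operator T^* (notation assumed), the max over actions is non-expansive, so T^* is also a γ-contraction in the ∞-norm:

```latex
T^{*}(v) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^{a} + \gamma \mathcal{P}^{a} v \right)
\qquad
\lVert T^{*}(u) - T^{*}(v) \rVert_\infty \le \gamma \, \lVert u - v \rVert_\infty
```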

Convergence of Value Iteration
- The Bellman optimality operator T^* has a unique fixed point
- v_* is a fixed point of T^* (by the Bellman optimality equation)
- By the contraction mapping theorem:
  - Value iteration converges on v_*
Reference: UCL Course on RL (David Silver)