Lecture 5 Model Free Control

Author: BoringFantasy | Published: 2019-10-20 23:17

Model Free Control

Target: an agent is placed in a completely unknown environment. How can it maximize its reward?

Quick review

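on-policy: learn from samples generated by the policy itself, and update that same policy.
off-policy: learn from samples (experience) generated by a different policy, and use them to update the policy being learned.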

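This mirrors generalized policy iteration (GPI): the upward step evaluates the policy and the downward step produces a new policy. Both the evaluation algorithm and the improvement algorithm can be swapped out, which is what makes the scheme usable in the model-free setting.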

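For example, replace the evaluation step with the Monte-Carlo (MC) method, that is, use the mean return over the sampled trajectories in place of the expected value, and improve the policy greedily.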

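However, greedy policy improvement over the state-value function requires the MDP model. In the model-free setting we therefore use the action-value function Q instead, which makes greedy improvement possible. Q tells us how good each action is in a given state, so all we have to do is pick the action that maximizes $Q(s, a)$.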
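
To make the difference concrete, the two greedy improvement rules can be written side by side. This is a sketch in standard MDP notation, which is an assumption here since the original slide image is missing: $\mathcal{R}_s^a$ is the expected reward, $\mathcal{P}_{ss'}^a$ the transition probability, and $\gamma$ the discount factor.

$$
\pi'(s) = \arg\max_{a \in \mathcal{A}} \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a V(s') \Big)
\qquad \text{(needs the model } \mathcal{R}, \mathcal{P}\text{)}
$$

$$
\pi'(s) = \arg\max_{a \in \mathcal{A}} Q(s, a)
\qquad \text{(model-free)}
$$

The second rule only reads values the agent has already estimated from experience, so no knowledge of rewards or transitions is needed.
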
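With MC evaluation, $Q \approx q_\pi$. But acting greedily does not guarantee that we visit every state, so the estimates cannot be accurate. We have to make sure we see the whole environment.
If GPI is run with MC evaluation and greedy improvement, the agent ends up opening only the right-hand door forever and collecting its +2 reward. Because the model is unknown, it gets stuck in a local optimum that it believes to be correct.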

To make sure every possible state keeps being visited, we use an $\epsilon$-greedy policy.
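
As a minimal sketch of what $\epsilon$-greedy means in code: with probability $\epsilon$ the agent picks a uniformly random action, otherwise it picks the action with the highest current estimate. The function names and the dictionary layout below are illustrative assumptions, not something defined in the lecture.

```python
import random

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Sample an action. q_values maps action -> estimated Q(s, a) for one state."""
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)           # explore: uniform over all actions
    return max(actions, key=q_values.get)    # exploit: current best estimate

def epsilon_greedy_probs(q_values, epsilon):
    """Return the action probabilities of the epsilon-greedy policy for one state."""
    m = len(q_values)
    greedy = max(q_values, key=q_values.get)
    probs = {}
    for action in q_values:
        probs[action] = epsilon / m          # every action gets epsilon / m
        if action == greedy:
            probs[action] += 1.0 - epsilon   # the greedy action gets the rest
    return probs
```

For the two-door situation described above, `epsilon_greedy_probs({"left": 0.0, "right": 2.0}, 0.1)` gives the left door a probability of 0.05, so it keeps being tried occasionally instead of being abandoned forever.
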
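Does improving the policy with $\epsilon$-greedy actually give a better policy?
In the second line of the proof on the lecture slide, the first term is the non-greedy part (a uniform average over all actions) and the second term is the greedy part of the $\epsilon$-greedy policy.
The chain ends at the value of the original policy, so although $\epsilon$-greedy is simple, it really does give a policy that is at least as good as the previous one.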
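
Since the slide image is missing, here is a reconstruction of the standard $\epsilon$-greedy improvement argument. It assumes $m$ actions, an $\epsilon$-greedy policy $\pi$, and a new policy $\pi'$ that is $\epsilon$-greedy with respect to $q_\pi$; these symbols are assumptions chosen to match the usual presentation.

$$
\begin{aligned}
q_\pi(s, \pi'(s)) &= \sum_{a} \pi'(a \mid s)\, q_\pi(s, a) \\
&= \frac{\epsilon}{m} \sum_{a} q_\pi(s, a) + (1 - \epsilon) \max_{a} q_\pi(s, a) \\
&\ge \frac{\epsilon}{m} \sum_{a} q_\pi(s, a) + (1 - \epsilon) \sum_{a} \frac{\pi(a \mid s) - \epsilon / m}{1 - \epsilon}\, q_\pi(s, a) \\
&= \sum_{a} \pi(a \mid s)\, q_\pi(s, a) = v_\pi(s)
\end{aligned}
$$

The inequality holds because the weights $\frac{\pi(a \mid s) - \epsilon/m}{1 - \epsilon}$ are non-negative and sum to 1, and a maximum is never smaller than a weighted average. By the policy improvement theorem it follows that $v_{\pi'}(s) \ge v_\pi(s)$.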

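In practice there is no need to finish exploring the whole environment before every update. Improving the policy from a partial evaluation is more efficient.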

GLIE

How do we make sure that we end up with the optimal value function and the optimal policy?
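
GLIE stands for "Greedy in the Limit with Infinite Exploration": every state-action pair keeps being explored, while the policy gradually becomes greedy. A common schedule that satisfies this is $\epsilon_k = 1/k$ for episode $k$. The following is a minimal sketch of GLIE Monte-Carlo control under that schedule. The two-door environment `door_step`, its rewards, and all function names are hypothetical and exist only for illustration.

```python
import random
from collections import defaultdict

def door_step(state, action, rng):
    """Hypothetical one-step task: choose a door, get a fixed reward, episode ends."""
    reward = 3.0 if action == "left" else 2.0
    return "end", reward, True

def glie_mc_control(env_step, start_state, actions, num_episodes=200, gamma=1.0, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)   # action-value estimates Q(s, a)
    N = defaultdict(int)     # visit counts N(s, a)
    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k    # GLIE schedule: exploration decays to zero
        state, done, episode = start_state, False, []
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env_step(state, action, rng)
            episode.append((state, action, reward))
            state = next_state
        # Every-visit MC update: move Q toward the observed return G.
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q
```

Because $\epsilon_k = 1/k$ decays quickly, a short run may not try every action often; the guarantee is about the limit of infinitely many episodes, not about a fixed budget.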

Sarsa

Use the TD method to estimate $Q(s,a)$ at every step in order to evaluate the policy, then improve the policy with $\epsilon$-greedy. This method is called Sarsa.
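
The name comes from the tuple $(S, A, R, S', A')$ used in each update: $Q(S,A) \leftarrow Q(S,A) + \alpha\,(R + \gamma Q(S',A') - Q(S,A))$. Below is a minimal sketch of tabular Sarsa. The corridor environment (states 0 to 4, reward -1 per step, terminal state 4) and the hyperparameter values are hypothetical choices made for illustration only.

```python
import random
from collections import defaultdict

# Hypothetical toy environment: a corridor of states 0..4.
# The agent starts at 0, state 4 is terminal, and every step costs -1.
ACTIONS = (-1, +1)
GOAL = 4

def step(state, action):
    """Return (next_state, reward, done) for the toy corridor."""
    next_state = min(max(state + action, 0), GOAL)
    return next_state, -1.0, next_state == GOAL

def epsilon_greedy(Q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def sarsa(num_episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q(s, a), initialised to 0
    for _ in range(num_episodes):
        state = 0
        action = epsilon_greedy(Q, state, epsilon, rng)
        done = False
        while not done:
            next_state, reward, done = step(state, action)
            # On-policy: A' comes from the same epsilon-greedy policy being evaluated.
            next_action = epsilon_greedy(Q, next_state, epsilon, rng)
            target = reward if done else reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```

Calling `sarsa()` returns the learned table. The entry `Q[(3, 1)]` settles at -1, since stepping right from state 3 always ends the episode at a cost of one step.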

Let's talk about off-policy learning

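Uses of off-policy learning:
1.1 Learn from a large store of experience that has already been collected (one that already reflects optimal behaviour), and thereby learn the optimal policy as well.
1.2 Learn about multiple policies while following a single policy.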

Importance sampling

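Monte-Carlo with importance sampling updates over complete episodes, and an importance-sampling ratio is applied at every step of the episode. The corrected return $G_t^{\pi/\mu}$ is therefore multiplied by a long product of ratios, which can make it vanishingly small and gives it very high variance. Monte-Carlo is really not suitable for off-policy learning.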
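
Written out, the corrected return and the corresponding update look as follows. This is the standard form, reconstructed because the slide image is missing; $\pi$ is the target policy, $\mu$ the behaviour policy, and $T$ the final time step of the episode.

$$
G_t^{\pi/\mu} = \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)} \, \frac{\pi(A_{t+1} \mid S_{t+1})}{\mu(A_{t+1} \mid S_{t+1})} \cdots \frac{\pi(A_T \mid S_T)}{\mu(A_T \mid S_T)} \, G_t
$$

$$
V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{\pi/\mu} - V(S_t) \right)
$$

Note that the ratio is undefined whenever $\mu(A \mid S) = 0$ while $\pi(A \mid S) > 0$, so the behaviour policy must cover every action the target policy might take.
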
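So you really need temporal-difference (TD) methods. The TD target is reweighted by a single importance-sampling ratio, and that weight is the ratio between the target policy and the behaviour policy for the action taken at that step.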
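
In the same notation as before (a reconstruction under the assumption that $\pi$ is the target policy and $\mu$ the behaviour policy), the TD update needs only one ratio per step:

$$
V(S_t) \leftarrow V(S_t) + \alpha \left( \frac{\pi(A_t \mid S_t)}{\mu(A_t \mid S_t)} \big( R_{t+1} + \gamma V(S_{t+1}) \big) - V(S_t) \right)
$$

With a single correction instead of a product over the whole episode, the variance is much lower than in the Monte-Carlo case.
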

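$A_{t+1}$ is the action actually taken in the real world (chosen by the behaviour policy), while $A'$ is the action that would be taken by following the target policy.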
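
In symbols, with $A_{t+1} \sim \mu(\cdot \mid S_{t+1})$ and $A' \sim \pi(\cdot \mid S_{t+1})$ (a reconstruction of the missing slide, using the same $\pi$/$\mu$ notation as above), the update bootstraps from the alternative action and needs no importance-sampling ratio:

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t) \right)
$$
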
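This is the famous Q-learning algorithm.
What makes it special is that both the target policy $\pi$ and the behaviour policy $\mu$ are allowed to improve. The aim is to move the estimates, little by little, in the direction of the largest Q value.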
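
A minimal sketch of tabular Q-learning follows, assuming the usual choice where the target policy is greedy with respect to Q and the behaviour policy is $\epsilon$-greedy with respect to Q, so that the update becomes $Q(S,A) \leftarrow Q(S,A) + \alpha\,(R + \gamma \max_{a'} Q(S',a') - Q(S,A))$. The corridor environment and the hyperparameters are hypothetical and chosen only to keep the example small.

```python
import random
from collections import defaultdict

# Hypothetical toy environment: a corridor of states 0..4.
# The agent starts at 0, state 4 is terminal, and every step costs -1.
ACTIONS = (-1, +1)
GOAL = 4

def step(state, action):
    """Return (next_state, reward, done) for the toy corridor."""
    next_state = min(max(state + action, 0), GOAL)
    return next_state, -1.0, next_state == GOAL

def q_learning(num_episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q(s, a), initialised to 0
    for _ in range(num_episodes):
        state, done = 0, False
        while not done:
            # Behaviour policy mu: epsilon-greedy with respect to Q.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            next_state, reward, done = step(state, action)
            # Target policy pi: greedy with respect to Q, hence the max.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

The only difference from Sarsa is the bootstrap term: the update uses the best next action according to Q, not the action the agent actually takes next.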
