Model-Free Prediction in Reinforcement Learning

Author: 小小何先生 | Published 2020-03-14 10:50


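In most reinforcement learning (RL) problems the model of the environment is unknown, so dynamic programming cannot be applied directly. One option is to learn the MDP first; the article on policy iteration and value iteration earlier in this series describes that approach. It still has problems, though, mainly in sampling: with random sampling some states are hard to reach, while sampling under a particular policy makes it hard to recover the true transition probabilities. Once the learned model is wrong, both value iteration and policy iteration built on it go wrong as well.

This motivates model-free reinforcement learning, which interacts with the environment directly and learns the value function or policy straight from data, without learning a model first.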

Model-free Reinforcement Learning

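Model-free reinforcement learning first estimates values (of states or state-action pairs) from data, i.e. the expected cumulative reward. With these estimates in hand, it then performs model-free control: optimizing the current policy so that the value function is maximized.

So how do we predict the value function in the model-free setting?

In model-free RL we have no access to the state transition or the reward function; all we have is a set of episodes. Previously we used these episodes to learn a model. In model-free methods we use them to learn the value function or the policy directly, without learning the MDP. There are two key steps: 1. estimate the value function; 2. optimize the policy.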

Value Function Estimate

In model-based RL (MDP), the value function is calculated by dynamic programming:

$v_{\pi}(s)=\mathbb{E}_{\pi}[R_{t+1}+\gamma v_{\pi}(S_{t+1}) \mid S_{t}=s]$

In the model-free setting we do not know the state transition, so the expectation in the equation above cannot be computed.

Monte-Carlo Methods

Monte-Carlo methods are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. For example, they can be used to estimate the area of a circle, as shown in the following figure:

Figure: Monte-Carlo Methods

Scatter points uniformly at random over the square that encloses the circle, then compute:

$\text{Circle Area} = \text{Square Area} \times \frac{\text{points in circle}}{\text{points in total}}$
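The estimate above can be sketched in a few lines of Python. This is only an illustration: the function name `estimate_circle_area`, the default radius, the number of points and the seed are all assumptions, not part of the original article.

```python
import random


def estimate_circle_area(radius=1.0, num_points=100_000, seed=0):
    """Estimate a circle's area by sampling points in its bounding square."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    inside = 0
    for _ in range(num_points):
        # Draw a point uniformly from the square [-radius, radius]^2.
        x = rng.uniform(-radius, radius)
        y = rng.uniform(-radius, radius)
        if x * x + y * y <= radius * radius:
            inside += 1
    square_area = (2.0 * radius) ** 2
    # Circle area = square area * fraction of points that fell in the circle.
    return square_area * inside / num_points


area = estimate_circle_area()
```

For a unit circle the result should be close to $\pi$, and the error shrinks as `num_points` grows.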

Monte-Carlo Value Estimation

Given many episodes, we compute the total discounted reward (the return) for each of them:

$G_{t}=R_{t+1}+\gamma R_{t+2}+\ldots=\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}$

The value function is the expected return, which can be written as:

$V^{\pi}(s) = \mathbb{E} [G_{t} \mid S_{t}=s,\pi] \approx \frac{1}{N} \sum_{i=1}^{N} G_{t}^{(i)}$

This method boils down to two steps: 1. sample $N$ episodes starting from state $s$ using policy $\pi$; 2. compute the average cumulative reward. Because it is sampling-based, it estimates the value in one shot, without ever computing the state transitions of the MDP.

In more detail, the procedure is:

Sample episodes under policy $\pi$.
At every time-step $t$ at which state $s$ is visited in an episode:
- Increment counter $N(s) \leftarrow N(s) +1$
- Increment total return $S(s) \leftarrow S(s) +G_{t}$
- Value is estimated by mean return $V(s)=S(s)/N(s)$
- By the law of large numbers, $V(s) \rightarrow V^{\pi}(s)$ as $N(s) \rightarrow \infty$.
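A minimal sketch of this every-visit procedure, assuming each episode is stored as a list of `(state, reward)` pairs where `reward` is the $R_{t+1}$ received after leaving `state`. That data layout and the name `mc_every_visit` are assumptions made for illustration.

```python
from collections import defaultdict


def mc_every_visit(episodes, gamma=1.0):
    """Every-visit Monte-Carlo estimate of V(s) from complete episodes."""
    counts = defaultdict(int)    # N(s): number of visits to s
    totals = defaultdict(float)  # S(s): sum of returns observed from s
    for episode in episodes:
        g = 0.0
        # Walk backwards so that G_t = R_{t+1} + gamma * G_{t+1}.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            counts[state] += 1
            totals[state] += g
    # V(s) = S(s) / N(s)
    return {s: totals[s] / counts[s] for s in counts}


# Three tiny hand-made episodes (hypothetical data).
episodes = [[("A", 0.0), ("B", 1.0)], [("A", 1.0)], [("B", 0.0)]]
values = mc_every_visit(episodes)
```

With these episodes and `gamma=1.0`, state `A` sees returns 1 and 1, and state `B` sees returns 1 and 0, so the estimates are 1.0 and 0.5.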

Incremental Monte-Carlo Updates

Update $V(s)$ incrementally after each episode
For each state $S_{t}$ with cumulative return $G_{t}$

$\begin{array}{l} {N\left(S_{t}\right) \leftarrow N\left(S_{t}\right)+1} \\ {V\left(S_{t}\right) \leftarrow V\left(S_{t}\right)+\frac{1}{N\left(S_{t}\right)}\left(G_{t}-V\left(S_{t}\right)\right)} \end{array}$
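To see why this update reproduces the sample mean, write $V_{N}$ for the mean of the first $N$ returns $G^{(1)},\ldots,G^{(N)}$ observed for a state (this indexing is introduced here only for illustration). A short derivation in that assumed notation:

$V_{N} = \frac{1}{N}\sum_{i=1}^{N} G^{(i)} = \frac{1}{N}\left(G^{(N)} + (N-1)V_{N-1}\right) = V_{N-1} + \frac{1}{N}\left(G^{(N)} - V_{N-1}\right)$

So only the current estimate and the counter need to be stored, not the full history of returns.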

For non-stationary problems (i.e. the environment could be varying over time), it can be useful to track a running mean, i.e. forget old episodes

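If the environment's state transition and reward function keep changing over time, the environment is called non-stationary. This is different from being stochastic: an environment can be stochastic and still stationary, as long as its distributions stay fixed. When the distributions themselves drift, we need to forget old episodes. A plain average weights old and new episodes equally, so it never forgets; a constant step size $\alpha$ does: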

$V(S_{t}) \leftarrow V(S_{t}) + \alpha (G_{t} - V(S_{t}))$
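The difference between the $1/N$ update and the constant-$\alpha$ update shows up clearly on a toy sequence of returns whose mean shifts halfway through. The sketch below is hypothetical: the function name `running_estimates`, the step size 0.1 and the synthetic returns are assumptions chosen only to illustrate forgetting.

```python
def running_estimates(returns, alpha=0.1):
    """Track one state's value with a sample mean and with a constant step size."""
    v_mean, v_alpha, n = 0.0, 0.0, 0
    for g in returns:
        n += 1
        v_mean += (g - v_mean) / n        # 1/N update: weights all returns equally
        v_alpha += alpha * (g - v_alpha)  # constant alpha: old returns decay away
    return v_mean, v_alpha


# The returns jump from 0 to 1 halfway through (a non-stationary target).
returns = [0.0] * 100 + [1.0] * 100
v_mean, v_alpha = running_estimates(returns)
```

Here `v_mean` ends at 0.5, stuck between the old and new regimes, while `v_alpha` ends very close to 1, the current value.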

Some properties of Monte-Carlo value estimation:

MC methods learn directly from episodes of experience
MC is model-free: no knowledge of MDP transitions / rewards
MC learns from complete episodes: no bootstrapping (discussed later)
MC uses the simplest possible idea: value = mean return
Caveat: can only apply MC to episodic MDPs i.e., all episodes must terminate

Temporal-Difference Learning

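TD methods bring in an estimate of the future value function, replacing the remainder of the return with the current estimate $V(S_{t+1})$:

$G_{t}=R_{t+1}+\gamma R_{t+2}+\ldots=R_{t+1}+\gamma G_{t+1} \approx R_{t+1}+\gamma V(S_{t+1})$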

$V(S_{t}) \leftarrow V(S_{t}) + \alpha(R_{t+1}+\gamma V(S_{t+1}) - V(S_{t}))$
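A minimal sketch of this update applied one transition at a time. The function name `td0_update`, the `done` flag (which treats the value of a terminal state as 0) and the two-state chain are assumptions added for illustration; they do not come from the original text.

```python
from collections import defaultdict


def td0_update(V, state, reward, next_state, done, alpha=0.1, gamma=1.0):
    """Apply one TD update to V in place and return the TD error."""
    # Bootstrapped target: R_{t+1} + gamma * V(S_{t+1}); terminal value is 0.
    target = reward + (0.0 if done else gamma * V[next_state])
    td_error = target - V[state]
    V[state] += alpha * td_error
    return td_error


# Hypothetical chain A -> B -> terminal, with reward 1 on the final step.
V = defaultdict(float)
for _ in range(1000):
    td0_update(V, "A", 0.0, "B", done=False)
    td0_update(V, "B", 1.0, None, done=True)
```

Each call uses a single transition, so learning can start before the episode ends. On this deterministic chain both `V["A"]` and `V["B"]` approach 1.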

TD learning has four main characteristics:

TD methods learn directly from episodes of experience
TD is model-free: no knowledge of MDP transitions / rewards
TD learns from incomplete episodes, by bootstrapping
TD updates a guess towards a guess

Monte Carlo vs. Temporal Difference

The two methods compare as follows:

The same goal: learn $V^{\pi}$ from episodes of experience under policy $\pi$.
Incremental every-visit Monte-Carlo
- Update value $V(S_{t})$ toward actual return $G_{t}$.

$V(S_{t}) \leftarrow V(S_{t}) + \alpha(G_{t}-V(S_{t}))$

Simplest temporal-difference learning algorithm: TD(0)
- Update value $V(S_{t})$ toward estimated return $R_{t+1} + \gamma V(S_{t+1})$.
- TD target: $R_{t+1} + \gamma V(S_{t+1})$
- TD error: $\delta = R_{t+1} + \gamma V(S_{t+1}) - V(S_{t})$

Advantages and Disadvantages of MC vs. TD

TD can learn before knowing the final outcome
- TD can learn online after every step
- MC must wait until end of episode before return is known
TD can learn without the final outcome
- TD can learn from incomplete sequences
- MC can only learn from complete sequences
- TD works in continuing (non-terminating) environments
- MC only works for episodic (terminating) environments

Bias/Variance Trade-Off

Return $G_{t}$ is an unbiased estimate of $V^{\pi}(S_{t})$.

Sampling under the current policy and averaging the returns gives an unbiased estimate.

TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $V^{\pi}(S_{t})$.

The TD target contains an estimate of the future, $V(S_{t+1})$. If that estimate were exact, i.e. equal to $V^{\pi}(S_{t+1})$, the TD target would also be unbiased; but since $V(S_{t+1})$ is rarely exact, the TD target is biased.

TD target is of much lower variance than the return

The variance of the TD target is generally lower than that of the return $G_{t}$: the return depends on many random actions, transitions and rewards, whereas the TD target depends on only one random action, transition and reward.
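The variance gap can be checked empirically on a toy problem. The sketch below assumes a hypothetical chain in which every step yields a reward of -1 or +1 with equal probability, so the true value of every state is 0. The function name `sample_targets` and all parameter values are assumptions. The successor value is set to its true value here, which isolates the variance effect and leaves bias out of the picture.

```python
import random
import statistics


def sample_targets(num_episodes=5000, horizon=20, gamma=1.0, seed=0):
    """Return (variance of MC returns, variance of TD targets) for the start state."""
    rng = random.Random(seed)
    v_next = 0.0  # true value of the successor state in this toy chain
    mc_returns, td_targets = [], []
    for _ in range(num_episodes):
        rewards = [rng.choice((-1.0, 1.0)) for _ in range(horizon)]
        # The MC return sums every random reward in the episode.
        mc_returns.append(sum(gamma ** k * r for k, r in enumerate(rewards)))
        # The TD target uses only the first random reward plus a fixed estimate.
        td_targets.append(rewards[0] + gamma * v_next)
    return statistics.pvariance(mc_returns), statistics.pvariance(td_targets)


var_mc, var_td = sample_targets()
```

With a horizon of 20 and no discounting, the variance of the return is about 20 (one unit per random reward), while the variance of the TD target is about 1.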

Advantages and Disadvantages of MC vs. TD (2)

MC has high variance, zero bias

MC has good convergence properties (even with function approximation), is not very sensitive to initial values, and is very simple to understand and use. However, it needs many samples to reduce its variance.

TD has low variance, some bias

TD is usually more efficient than MC. TD converges to $V^{\pi}(S_{t})$, but not always with function approximation. It is also more sensitive to initial values than MC.

n-step model-free prediction

Due to time constraints, the n-step prediction section may be skipped in favor of heading directly to model-free control.

Define the n-step return

$G_{t}^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots +\gamma^{n-1}R_{t+n} + \gamma^{n} V(S_{t+n})$


n-step temporal-difference learning

$V(S_{t}) \leftarrow V(S_{t}) + \alpha(G_{t}^{(n)} - V(S_{t}))$
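A minimal sketch of the n-step return used as the target in this update. The data layout is an assumption: `rewards[k]` holds $R_{k+1}$, `values[k]` holds $V(S_{k})$, and `values` has one extra final entry for the terminal state, which is 0. The function name `n_step_return` is likewise hypothetical.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """Compute G_t^(n) from one recorded episode."""
    end = min(t + n, len(rewards))  # stop early if the episode terminates
    # Discounted sum of the next n rewards: R_{t+1} ... R_{t+n}.
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    # Bootstrap from the value estimate at the state reached after those steps.
    g += gamma ** (end - t) * values[end]
    return g


rewards = [1.0, 2.0, 3.0]       # R_1, R_2, R_3 (hypothetical episode)
values = [0.5, 1.0, 1.5, 0.0]   # V(S_0) ... V(S_3), with S_3 terminal
g1 = n_step_return(rewards, values, t=0, n=1, gamma=0.5)
g2 = n_step_return(rewards, values, t=0, n=2, gamma=0.5)
g3 = n_step_return(rewards, values, t=0, n=3, gamma=0.5)
```

Under these assumptions `n=1` gives the TD target, and once `n` reaches the end of the episode the result is the Monte-Carlo return. Here `g1`, `g2` and `g3` are 1.5, 2.375 and 2.75.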

Once we have the value function, the next step is policy improvement.

