Model-based RL

Author: 海街diary | Published 2018-09-19 19:12

Note: the following is based on CS598.

1. Estimate Model

Given a dataset D=\{(s_1, a_1, r_1, s_1^\prime), (s_2, a_2, r_2, s_2^\prime), \cdots, (s_n, a_n, r_n, s_n^\prime) \}, we estimate the model by maximum likelihood. Let |D_{s,a}| denote the number of samples with s_i = s and a_i = a.

\hat P(s^\prime|s, a) = \frac{1}{|D_{s,a}|} \sum_{(r, \tilde{s}) \in D_{s,a}} \mathbb{I}[\tilde{s} = s^\prime]

\hat R(s,a) = \frac{1}{|D_{s,a}|} \sum_{(r, s^\prime) \in D_{s, a}} r
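
The following is a minimal Python sketch of this estimator; the function name and the assumption that the dataset is a list of (s, a, r, s') tuples with integer indices are illustrative, not from the notes.

```python
import numpy as np

def estimate_model(dataset, n_states, n_actions):
    """Tabular maximum-likelihood model from (s, a, r, s') samples."""
    counts = np.zeros((n_states, n_actions, n_states))   # transition counts
    reward_sums = np.zeros((n_states, n_actions))
    for s, a, r, s_next in dataset:
        counts[s, a, s_next] += 1
        reward_sums[s, a] += r

    n_sa = counts.sum(axis=2)                            # |D_{s,a}| for each (s, a)
    visited = n_sa > 0
    # P_hat(s'|s,a) = empirical frequency; unvisited (s, a) left uniform.
    P_hat = np.full((n_states, n_actions, n_states), 1.0 / n_states)
    P_hat[visited] = counts[visited] / n_sa[visited, None]
    # R_hat(s,a) = empirical mean reward; unvisited (s, a) left at zero.
    R_hat = np.where(visited, reward_sums / np.maximum(n_sa, 1), 0.0)
    return P_hat, R_hat
```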

2. Analysis of Certainty-Equivalence RL

2.1 Naive analysis

By Hoeffding's inequality, for an i.i.d. sum S_n of n random variables bounded in [a, b]: with probability at least 1-\delta,

\left| \frac{S_n}{n} - \frac{\mathbb{E}[S_n]}{n} \right| \leq (b-a) \sqrt{\frac{1}{2n}\ln{\frac{2}{\delta}}}

Splitting the failure probability \delta evenly over the |S \times A| reward estimates and the |S \times A \times S| transition estimates (here n denotes the number of samples for each (s,a) pair), we get:

\max_{s,a} |\hat R(s,a) - R(s,a) | \leq R_{max} \sqrt{\frac{1}{2n}\ln{\frac{4|S| \times |A|}{\delta}}}

\max_{s,a, s^\prime} |\hat P(s^\prime|s,a) - P(s^\prime|s, a) | \leq \sqrt{\frac{1}{2n}\ln{\frac{4|S| \times |A| \times |S|}{\delta}}}

Therefore, defining P(s,a) = P(\cdot|s,a) as an |S|-dimensional vector, we have:

\max_{s, a} ||\hat P(s, a) - P(s, a)||_1 \leq \max_{s,a} |S| \cdot ||\hat P(s, a) - P(s, a)||_{\infty} \leq |S| \cdot \sqrt{\frac{1}{2n}\ln{\frac{4|S| \times |A| \times |S|}{\delta}}}
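
To get a feel for these quantities, here is a small helper that evaluates the naive bounds; the parameter values are hypothetical, with n interpreted as samples per (s,a) pair as above.

```python
import numpy as np

def naive_bounds(n, n_states, n_actions, delta, r_max=1.0):
    """Naive Hoeffding + union-bound guarantees from the displays above."""
    eps_R = r_max * np.sqrt(np.log(4 * n_states * n_actions / delta) / (2 * n))
    # L1 bound: |S| * sqrt(ln(4 |S|^2 |A| / delta) / (2n))
    eps_P = n_states * np.sqrt(
        np.log(4 * n_states**2 * n_actions / delta) / (2 * n))
    return eps_R, eps_P

# e.g. |S| = 50, |A| = 5, n = 10^4 samples per (s, a), delta = 0.05
print(naive_bounds(10_000, 50, 5, 0.05))
```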

  • Lemma 1 (Simulation Lemma)

If \max_{s,a}|\hat R(s,a) - R(s,a)| \leq \epsilon_R and \max_{s,a}||\hat P(s,a) - P(s,a)||_1 \leq \epsilon_P, then for any policy \pi: S \to A,
|| V^{\pi}_{\hat M} - V^{\pi}_{M} ||_{\infty} \leq \frac{\epsilon_R}{1 - \gamma} + \frac{\gamma \epsilon_P R_{max}}{2(1-\gamma)^2}

Proof: Fix any s \in S and write a = \pi(s). Then
\begin{align*}
| V^{\pi}_{\hat M}(s) - V^{\pi}_{M}(s) | &= | \hat R(s,a) + \gamma \hat{P}(s,a) \cdot V^\pi_{\hat M} - R(s,a) - \gamma P(s,a) \cdot V^\pi_M | \\
&\leq |\hat R(s,a)-R(s,a)| + \gamma | \hat P(s,a) \cdot V^\pi_{\hat M} - P(s,a) \cdot V^\pi_M | \\
&\leq \epsilon_R + \gamma |\hat P(s,a) \cdot V^\pi_{\hat M} - P(s,a) \cdot V^\pi_{\hat M} + P(s,a) \cdot V^\pi_{\hat M} - P(s,a) \cdot V^\pi_M | \\
&\leq \epsilon_R + \gamma |(\hat P(s,a) - P(s,a)) \cdot V^\pi_{\hat M}| + \gamma | P(s,a) \cdot (V^\pi_{\hat M} - V^\pi_M) | \\
&= \epsilon_R + \gamma |(\hat P(s,a) - P(s,a)) \cdot (V^\pi_{\hat M} - \tfrac{R_{max}}{2(1-\gamma)} \mathbf{1})| + \gamma | P(s,a) \cdot (V^\pi_{\hat M} - V^\pi_M) | \\
&\leq \epsilon_R + \gamma ||\hat P(s,a) - P(s,a)||_{1} \cdot ||V^\pi_{\hat M} - \tfrac{R_{max}}{2(1-\gamma)} \mathbf{1}||_{\infty} + \gamma ||V^\pi_{\hat M} - V^\pi_M||_{\infty} \\
&\leq \epsilon_R + \frac{\gamma \epsilon_P R_{max}}{2(1-\gamma)} + \gamma ||V^\pi_{\hat M} - V^\pi_M||_{\infty}
\end{align*}
The shift by \frac{R_{max}}{2(1-\gamma)} \mathbf{1} is free because (\hat P(s,a) - P(s,a)) \cdot \mathbf{1} = 0; and since every entry of V^\pi_{\hat M} lies in [0, \frac{R_{max}}{1-\gamma}], the centered vector satisfies ||V^\pi_{\hat M} - \frac{R_{max}}{2(1-\gamma)} \mathbf{1}||_{\infty} \leq \frac{R_{max}}{2(1-\gamma)}. Taking the maximum over s and rearranging,
\begin{align*}
(1-\gamma) ||V^\pi_{\hat M} - V^\pi_M||_{\infty} &\leq \epsilon_R + \frac{\gamma \epsilon_P R_{max}}{2(1-\gamma)} \\
||V^\pi_{\hat M} - V^\pi_M||_{\infty} &\leq \frac{\epsilon_R}{1 - \gamma} + \frac{\gamma \epsilon_P R_{max}}{2(1-\gamma)^2}
\end{align*}
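
As an empirical sanity check of Lemma 1, the following sketch builds a small random MDP M, perturbs it into a hypothetical estimate \hat M, evaluates a fixed policy exactly in both, and compares the gap to the bound; policy evaluation solves V = R_\pi + \gamma P_\pi V directly.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, r_max = 10, 3, 0.9, 1.0

# Hypothetical true MDP M and a nearby estimate M_hat.
P = rng.dirichlet(np.ones(S), size=(S, A))        # P[s, a] = P(.|s, a)
R = rng.uniform(0.0, r_max, size=(S, A))
P_hat = 0.95 * P + 0.05 * rng.dirichlet(np.ones(S), size=(S, A))
R_hat = np.clip(R + rng.normal(0.0, 0.01, size=(S, A)), 0.0, r_max)

pi = rng.integers(A, size=S)                      # an arbitrary deterministic policy

def policy_value(P, R, pi, gamma):
    """Exact policy evaluation: V = (I - gamma * P_pi)^{-1} R_pi."""
    idx = np.arange(len(pi))
    return np.linalg.solve(np.eye(len(pi)) - gamma * P[idx, pi], R[idx, pi])

eps_R = np.abs(R_hat - R).max()
eps_P = np.abs(P_hat - P).sum(axis=2).max()       # max_{s,a} ||P_hat(s,a) - P(s,a)||_1
gap = np.abs(policy_value(P_hat, R_hat, pi, gamma)
             - policy_value(P, R, pi, gamma)).max()
bound = eps_R / (1 - gamma) + gamma * eps_P * r_max / (2 * (1 - gamma) ** 2)
print(f"{gap:.4f} <= {bound:.4f}")                # the lemma's guarantee holds
```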

  • Lemma 2 (Evaluation error to decision loss)

\forall s \in S, V^*_M(s) - V^{\pi^*_{\hat M}}_M(s) \leq 2 \sup_{\pi: S \to A}||V^\pi_{\hat M} - V^\pi_{M}||_{\infty}.

Proof: \forall s \in S,
\begin{align*}
V^*_M(s) - V^{\pi^*_{\hat M}}_M(s) &= V^*_M(s) - V^{\pi^*_M}_{\hat M}(s) + V^{\pi^*_M}_{\hat M}(s) - V^{\pi^*_{\hat M}}_M(s) \\
&\leq V^*_M(s) - V^{\pi^*_M}_{\hat M}(s) + V^{\pi^*_{\hat M}}_{\hat M}(s) - V^{\pi^*_{\hat M}}_M(s) \qquad (\pi^*_{\hat M} \text{ maximizes } V_{\hat M}) \\
&\leq || V^{\pi^*_M}_M - V^{\pi^*_M}_{\hat M} ||_{\infty} + || V^{\pi^*_{\hat M}}_{\hat M} - V^{\pi^*_{\hat M}}_M ||_{\infty} \\
&\leq 2 \sup_{\pi: S \to A} ||V^\pi_{\hat M} - V^\pi_{M}||_{\infty} \leq 2\left[\frac{\epsilon_R}{1 - \gamma} + \frac{\gamma \epsilon_P R_{max}}{2(1-\gamma)^2}\right] \qquad (Lemma \ 1)
\end{align*}
Plugging in the concentration bounds from Section 2.1,
V^*_M(s) - V^{\pi^*_{\hat M}}_M(s) = \tilde{O} \left(\frac{|S|}{\sqrt{n}(1 - \gamma)^2}\right) \qquad \forall s \in S.
Here \tilde{O}(\cdot) suppresses poly-logarithmic dependence on |S|, |A|, and 1/\delta (and treats R_{max} as a constant).
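
Putting the two lemmas together, certainty-equivalence RL amounts to: estimate \hat M, plan optimally in \hat M, and pay at most the decision loss above in M. A self-contained toy sketch of this pipeline (hypothetical MDP; value iteration is one standard planner):

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 10, 3, 0.9

# Hypothetical true MDP M and a nearby estimate M_hat.
P = rng.dirichlet(np.ones(S), size=(S, A))            # P[s, a] = P(.|s, a)
R = rng.uniform(size=(S, A))
P_hat = 0.95 * P + 0.05 * rng.dirichlet(np.ones(S), size=(S, A))
R_hat = np.clip(R + rng.normal(0.0, 0.01, size=(S, A)), 0.0, 1.0)

def value_iteration(P, R, gamma, iters=1000):
    """Plan in a tabular MDP; returns optimal values and a greedy policy."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        Q = R + gamma * P @ V                         # Q[s, a] = R(s,a) + gamma * P(.|s,a) . V
        V = Q.max(axis=1)
    return V, Q.argmax(axis=1)

def policy_value(P, R, pi, gamma):
    """Exact evaluation of a deterministic policy in the true MDP."""
    idx = np.arange(len(pi))
    return np.linalg.solve(np.eye(len(pi)) - gamma * P[idx, pi], R[idx, pi])

V_star, _ = value_iteration(P, R, gamma)              # optimal values in M
_, pi_hat = value_iteration(P_hat, R_hat, gamma)      # plan in M_hat
print("decision loss:", (V_star - policy_value(P, R, pi_hat, gamma)).max())
```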

2.2 Improving |S| to \sqrt{|S|}

For any vector v \in \mathbb{R}^{|S|}, we have
||v||_1 = \sup_{u \in \{-1, 1\}^{|S|}} u^T v.
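
This holds because the maximizing sign vector is u = \text{sign}(v). A brute-force check on a small example vector:

```python
import numpy as np
from itertools import product

v = np.random.default_rng(0).normal(size=6)
# Enumerate every u in {-1, +1}^6; the best u^T v recovers ||v||_1.
best = max(np.dot(u, v) for u in product([-1, 1], repeat=6))
print(best, np.abs(v).sum())   # equal up to floating-point error
```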

Hence, for any fixed (s,a) and any fixed u \in \{-1, 1\}^{|S|}, u^T \hat P(s,a) is an average of i.i.d. random variables bounded in [-1, 1], so with probability at least 1 - \delta/(2|S\times A| \cdot 2^{|S|}),
u^T (\hat P(s,a) - P(s, a)) \leq 2 \sqrt{\frac{1}{2n} \ln \frac{2|S \times A| \cdot 2^{|S|}}{\delta}}

Hence, by a union bound over all (s,a) pairs and all sign vectors u, with probability at least 1 - \delta/2,
\max_{s, a} ||\hat P(s,a) - P(s, a)||_1 = \max_{s,a} \max_{u \in \{-1, 1\}^{|S|}} u^T (\hat P(s,a) - P(s, a)) \leq 2 \sqrt{\frac{1}{2n} \ln \frac{2|S \times A| \cdot 2^{|S|}}{\delta}}

Therefore,
V^*_M(s) - V^{\pi^*_{\hat M}}_M(s) = \tilde{O} (\frac{ \sqrt{|S|}}{\sqrt{n}(1 - \gamma)^2}) \qquad \forall s \in S
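
Numerically, the improvement is substantial for large state spaces; the helper below (hypothetical parameters) compares the naive |S|-scaling bound with the \sqrt{|S|}-scaling one, using \ln(2|S \times A| \cdot 2^{|S|}/\delta) = \ln(2|S||A|/\delta) + |S|\ln 2:

```python
import numpy as np

def l1_bounds(n, n_states, n_actions, delta):
    """max_{s,a} ||P_hat(s,a) - P(s,a)||_1: naive vs. improved bound."""
    naive = n_states * np.sqrt(
        np.log(4 * n_states**2 * n_actions / delta) / (2 * n))
    improved = 2 * np.sqrt(
        (np.log(2 * n_states * n_actions / delta) + n_states * np.log(2)) / (2 * n))
    return naive, improved

print(l1_bounds(10_000, 50, 5, 0.05))   # the improved bound grows like sqrt(|S|)
```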
