CS598 Homework 1

Author: 海街diary | Published 2018-08-31 00:46

    Question 1
    • Solution

    \begin{align*} V^\pi_{M^{\prime}}(s) &= E[\sum_{t=1}^\infty\gamma^{t-1}R_t^{\prime}(s,a)] \\ &= E[\sum_{t=1}^{\infty}\gamma^{t-1}(R_t(s,a) - c)] \\ &= E[\sum_{t=1}^{\infty}\gamma^{t-1}R_t(s,a)] - \sum_{t=1}^{\infty}\gamma^{t-1}c \\ &= V^\pi_M(s) - \frac{c}{1 - \gamma} \quad\quad\quad\quad \forall s \in S \end{align*}
    \begin{align*} V^\star_{M^\prime}(s) &= \max_{a\in A} Q^\star_{M^\prime}(s,a) \\ &= \max_{a\in A}[ Q^\star_{M}(s,a) - \frac{c}{1 - \gamma}] \\ &= Q^\star_M(s, a^{\star}) - \frac{c}{1 - \gamma} = V^\star_M(s) - \frac{c}{1 - \gamma} \end{align*}
    Thus, although a constant c is subtracted from every reward, it does not affect the optimal policy: the value of every state shifts by the same constant \frac{c}{1-\gamma}, so the maximizing action at each state is unchanged. Hence, in an infinite-horizon MDP we may assume without loss of generality that R \in [0, R_{max}].
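    As a quick numerical sanity check (not part of the original derivation), the sketch below runs value iteration on a small randomly generated MDP and compares the original rewards R with the shifted rewards R - c. The sizes, seed, constant c, and tolerance are arbitrary assumptions made for illustration.

```python
import numpy as np

# Toy MDP: 3 states, 2 actions; all numbers are illustrative assumptions.
np.random.seed(0)
nS, nA, gamma, c = 3, 2, 0.9, 0.5
P = np.random.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] is a distribution over next states
R = np.random.uniform(0.0, 1.0, size=(nS, nA))       # rewards in [0, R_max] with R_max = 1

def value_iteration(rewards, tol=1e-10):
    """Return the optimal values and a greedy optimal policy for the given reward table."""
    V = np.zeros(nS)
    while True:
        Q = rewards + gamma * P @ V      # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

V_M,  pi_M  = value_iteration(R)      # original model M
V_Mp, pi_Mp = value_iteration(R - c)  # shifted model M' with R' = R - c

print(np.allclose(V_M - V_Mp, c / (1 - gamma)))  # optimal values differ by c / (1 - gamma)
print(np.array_equal(pi_M, pi_Mp))               # greedy optimal policies coincide
```

    Both checks print True (up to the value-iteration tolerance): the values shift by exactly \frac{c}{1-\gamma}, while the optimal policy is unchanged.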

    Question 2
    • Solution

    We use the subscript to denote the step within the horizon, e.g. s_1 is the state at h = 1.
    \begin{align*} V^\pi_{M^{\prime}}(s_1) &= E[\sum_{t=1}^H\gamma^{t-1}R_t^{\prime}(s,a)] \\ &= E[\sum_{t=1}^{H}\gamma^{t-1}(R_t(s,a) - c)] \\ &= E[\sum_{t=1}^{H}\gamma^{t-1}R_t(s,a)] - \sum_{t=1}^{H}\gamma^{t-1}c \\ &= V^\pi_M(s_1) - \frac{c(1 - \gamma^H)}{1 - \gamma} \quad\quad\quad\quad \forall s_1\in S \end{align*}
    \begin{align*} V^\pi_{M^{\prime}}(s_2) &= E[\sum_{t=1}^{H-1}\gamma^{t-1}R_{t+1}^{\prime}(s,a)] \\ &= E[\sum_{t=1}^{H-1}\gamma^{t-1}(R_{t+1}(s,a) - c)] \\ &= E[\sum_{t=1}^{H-1}\gamma^{t-1}R_{t+1}(s,a)] - \sum_{t=1}^{H-1}\gamma^{t-1}c \\ &= V^\pi_M(s_2) - \frac{c(1 - \gamma^{H-1})}{1 - \gamma} \quad\quad\quad\quad \forall s_2\in S \end{align*}
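    More generally, for the state at an arbitrary step h, summing the remaining H - h + 1 discounted copies of c gives
    \begin{align*} V^\pi_{M^\prime}(s_h) &= V^\pi_M(s_h) - \sum_{t=1}^{H-h+1}\gamma^{t-1}c = V^\pi_M(s_h) - \frac{c(1 - \gamma^{H-h+1})}{1 - \gamma} \quad\quad\quad\quad \forall s_h\in S \end{align*}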
    The offset \frac{c(1 - \gamma^{H-h+1})}{1 - \gamma} depends on the step h but not on the policy or the action taken, so the optimal policy is the same under both models.

    Question 3.1
    • Solution

    If the reward is -1 per step, the optimal policy chooses the shortest path to the goal.
    If the reward is 0 per step, every policy has value 0, so the case is trivial.
    If the reward is +1 per step, the optimal policy chooses the longest path, and will avoid reaching the goal if it can.
    To sum up, in an indefinite-horizon MDP, adding a constant to all rewards can change the optimal policy.
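    To make the dependence on episode length explicit: for an episode that terminates after T steps, adding a constant k to every reward changes the undiscounted return by kT,
    \begin{align*} \sum_{t=1}^{T}(R_t + k) = \sum_{t=1}^{T}R_t + kT \end{align*}
    Since the policy controls T in an indefinite-horizon MDP, the extra term kT can change which policy is optimal: k < 0 favors short episodes and k > 0 favors long ones.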

    Question 3.2
    • Solution
    1. Convert the indefinite-horizon MDP into a finite-horizon MDP.

    Assume the maximum trajectory length is H_0. Any trajectory whose length is strictly smaller than H_0 can be padded with zero-reward absorbing states until its length is exactly H_0 (see the sketch after this list). The added absorbing states do not change the original optimal policy because they do not change the value of any trajectory.

    2. Add a constant reward such as +1 or +2.

    If this constant is not also added at the absorbing states, the result is the same as in Q3.1: the total bonus is proportional to the number of real (non-absorbing) steps, so the shift can still change the optimal policy.
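    Below is a minimal sketch of the padding construction from step 1, assuming trajectories are stored as lists of (state, action, reward) tuples; the labels ABSORB and NOOP are hypothetical placeholders, not part of the original problem.

```python
# Pad an indefinite-horizon trajectory up to a fixed horizon H0 with a
# zero-reward absorbing state; the labels below are illustrative assumptions.
ABSORB, NOOP = "absorb", "noop"

def pad_trajectory(traj, H0):
    """Pad a list of (state, action, reward) tuples to length H0 without changing its return."""
    assert len(traj) <= H0
    padded = list(traj)
    while len(padded) < H0:
        padded.append((ABSORB, NOOP, 0.0))  # zero reward, so the trajectory's value is unchanged
    return padded

# Example: a length-2 episode padded to the fixed horizon H0 = 4.
traj = [("s0", "a1", -1.0), ("s1", "a0", -1.0)]
print(pad_trajectory(traj, H0=4))
```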

    Question 4
    • Solution
    1. A stationary MDP can be viewed as a non-stationary MDP in which P_h and R_h are held fixed across the horizon.
    2. We can augment the state representation with the step index h, i.e. S^\prime = S \times \{1, \dots, H\}. The new transition probability is P^\prime((s, h), a) = P_h(s, a) (with the step index advancing to h + 1), and the new reward function is R^\prime((s, h), a) = R_h(s, a). In this way, M^\prime = (S^\prime, A, P^\prime, R^\prime, H, \mu) is stationary, and the size of the new state space is |S| \times H (see the sketch after this list).
    3. At first glance, if we use the same construction to convert non-stationary dynamics into stationary ones over an infinite horizon, the size of the new state space becomes infinite, so the reduction is no longer useful.
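    Below is a minimal sketch of the augmentation from step 2, assuming the non-stationary dynamics and rewards are given as arrays indexed by the step h; all sizes and values are illustrative stand-ins.

```python
import numpy as np

# Random stand-ins for the non-stationary dynamics P_h and rewards R_h.
np.random.seed(1)
nS, nA, H = 4, 2, 3
P = np.random.dirichlet(np.ones(nS), size=(H, nS, nA))  # P[h-1, s, a] is a distribution over next states
R = np.random.uniform(size=(H, nS, nA))                  # R[h-1, s, a] is the reward at step h

def augmented_step(s_aug, a):
    """One step of the stationary MDP M' on the augmented state (s, h)."""
    s, h = s_aug
    r = R[h - 1, s, a]                               # reward depends only on the augmented state and the action
    s_next = np.random.choice(nS, p=P[h - 1, s, a])
    return (s_next, h + 1), r                        # the step index advances deterministically

state, ret = (0, 1), 0.0
for _ in range(H):
    state, r = augmented_step(state, a=0)
    ret += r
print(state, ret)  # after H steps the augmented state carries h = H + 1; |S'| = |S| * H
```

    The augmented state (s, h) makes the one-step dynamics time-independent, at the cost of multiplying the state space by H.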
