CS598 Homework 1

Author: 海街diary | Published 2018-08-31 00:46

    Question 1
    • Solution

    \begin{align*} V^\pi_{M^{\prime}}(s) &= E[\sum_{t=1}^\infty\gamma^{t-1}R_t^{\prime}(s,a)] \\ &= E[\sum_{t=1}^{\infty}\gamma^{t-1}(R_t(s,a) - c)] \\ &= E[\sum_{t=1}^{\infty}\gamma^{t-1}R_t(s,a)] - \sum_{t=1}^{\infty}\gamma^{t-1}c \\ &= V^\pi_M(s) - \frac{c}{1 - \gamma} \quad\quad\quad\quad \forall s \in S \end{align*}
    \begin{align*} V^\star_{M^\prime}(s) &= \max_{a\in A} Q^\star_{M^\prime}(s,a) \\ &= \max_{a\in A}[ Q^\star_{M}(s,a) - \frac{c}{1 - \gamma}] \\ &= Q^\star_M(s, a^{\star}) - \frac{c}{1 - \gamma} = V^\star_M(s) - \frac{c}{1 - \gamma} \end{align*}
    Thus, although a constant c is subtracted from every reward, it does not affect the optimal policy: the value of every state shifts by the same constant \frac{c}{1-\gamma}, so the maximizing action at each state is unchanged. Hence, in an infinite-horizon MDP we may assume without loss of generality that R \in [0, R_{max}].
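    As a quick numerical sanity check (not part of the original derivation), the sketch below runs value iteration on a small randomly generated MDP and compares the original rewards R with the shifted rewards R - c. The sizes, seed, constant c, and tolerance are arbitrary assumptions made for illustration.

```python
import numpy as np

# Toy MDP: 3 states, 2 actions; all numbers are illustrative assumptions.
np.random.seed(0)
nS, nA, gamma, c = 3, 2, 0.9, 0.5
P = np.random.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] is a distribution over next states
R = np.random.uniform(0.0, 1.0, size=(nS, nA))       # rewards in [0, R_max] with R_max = 1

def value_iteration(rewards, tol=1e-10):
    """Return the optimal values and a greedy optimal policy for the given reward table."""
    V = np.zeros(nS)
    while True:
        Q = rewards + gamma * P @ V      # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * V[s']
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

V_M,  pi_M  = value_iteration(R)      # original model M
V_Mp, pi_Mp = value_iteration(R - c)  # shifted model M' with R' = R - c

print(np.allclose(V_M - V_Mp, c / (1 - gamma)))  # optimal values differ by c / (1 - gamma)
print(np.array_equal(pi_M, pi_Mp))               # greedy optimal policies coincide
```

    Both checks print True (up to the value-iteration tolerance): the values shift by exactly \frac{c}{1-\gamma}, while the optimal policy is unchanged.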

    Question 2
    • Solution

    We use the subscript to denote the step within the horizon, e.g. s_1 is the state at h = 1.
    \begin{align*} V^\pi_{M^{\prime}}(s_1) &= E[\sum_{t=1}^H\gamma^{t-1}R_t^{\prime}(s,a)] \\ &= E[\sum_{t=1}^{H}\gamma^{t-1}(R_t(s,a) - c)] \\ &= E[\sum_{t=1}^{H}\gamma^{t-1}R_t(s,a)] - \sum_{t=1}^{H}\gamma^{t-1}c \\ &= V^\pi_M(s_1) - \frac{c(1 - \gamma^H)}{1 - \gamma} \quad\quad\quad\quad \forall s_1\in S \end{align*}
    \begin{align*} V^\pi_{M^{\prime}}(s_2) &= E[\sum_{t=1}^{H-1}\gamma^{t-1}R_{t+1}^{\prime}(s,a)] \\ &= E[\sum_{t=1}^{H-1}\gamma^{t-1}(R_{t+1}(s,a) - c)] \\ &= E[\sum_{t=1}^{H-1}\gamma^{t-1}R_{t+1}(s,a)] - \sum_{t=1}^{H-1}\gamma^{t-1}c \\ &= V^\pi_M(s_2) - \frac{c(1 - \gamma^{H-1})}{1 - \gamma} \quad\quad\quad\quad \forall s_2\in S \end{align*}
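    More generally, for the state at an arbitrary step h, summing the remaining H - h + 1 discounted copies of c gives
    \begin{align*} V^\pi_{M^\prime}(s_h) &= V^\pi_M(s_h) - \sum_{t=1}^{H-h+1}\gamma^{t-1}c = V^\pi_M(s_h) - \frac{c(1 - \gamma^{H-h+1})}{1 - \gamma} \quad\quad\quad\quad \forall s_h\in S \end{align*}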
    The offset \frac{c(1 - \gamma^{H-h+1})}{1 - \gamma} depends on the step h but not on the policy or the action taken, so the optimal policy is the same under both models.

    Question 3.1
    • Solution

    If the reward is -1 per step, the optimal policy chooses the shortest path to the goal.
    If the reward is 0 per step, every policy has value 0, so the case is trivial.
    If the reward is +1 per step, the optimal policy chooses the longest path, and will avoid reaching the goal if it can.
    To sum up, in an indefinite-horizon MDP, adding a constant to all rewards can change the optimal policy.
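    To make the dependence on episode length explicit: for an episode that terminates after T steps, adding a constant k to every reward changes the undiscounted return by kT,
    \begin{align*} \sum_{t=1}^{T}(R_t + k) = \sum_{t=1}^{T}R_t + kT \end{align*}
    Since the policy controls T in an indefinite-horizon MDP, the extra term kT can change which policy is optimal: k < 0 favors short episodes and k > 0 favors long ones.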

    Question 3.2
    • Solution
    1. Convert the indefinite-horizon MDP into a finite-horizon MDP.

    Assume the maximum trajectory length is H_0. Any trajectory whose length is strictly smaller than H_0 can be padded with zero-reward absorbing states until its length is exactly H_0 (see the sketch after this list). The added absorbing states do not change the original optimal policy because they do not change the value of any trajectory.

    2. Add a constant reward such as +1 or +2.

    If this constant is not also added at the absorbing states, the result is the same as in Q3.1: the total bonus is proportional to the number of real (non-absorbing) steps, so the shift can still change the optimal policy.
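    Below is a minimal sketch of the padding construction from step 1, assuming trajectories are stored as lists of (state, action, reward) tuples; the labels ABSORB and NOOP are hypothetical placeholders, not part of the original problem.

```python
# Pad an indefinite-horizon trajectory up to a fixed horizon H0 with a
# zero-reward absorbing state; the labels below are illustrative assumptions.
ABSORB, NOOP = "absorb", "noop"

def pad_trajectory(traj, H0):
    """Pad a list of (state, action, reward) tuples to length H0 without changing its return."""
    assert len(traj) <= H0
    padded = list(traj)
    while len(padded) < H0:
        padded.append((ABSORB, NOOP, 0.0))  # zero reward, so the trajectory's value is unchanged
    return padded

# Example: a length-2 episode padded to the fixed horizon H0 = 4.
traj = [("s0", "a1", -1.0), ("s1", "a0", -1.0)]
print(pad_trajectory(traj, H0=4))
```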

    Question 4
    • Solution
    1. A stationary MDP can be viewed as a non-stationary MDP in which P_h and R_h are held fixed across the horizon.
    2. We can augment the state representation with the step index h, i.e. S^\prime = S \times \{1, \dots, H\}. The new transition probability is P^\prime((s, h), a) = P_h(s, a) (with the step index advancing to h + 1), and the new reward function is R^\prime((s, h), a) = R_h(s, a). In this way, M^\prime = (S^\prime, A, P^\prime, R^\prime, H, \mu) is stationary, and the size of the new state space is |S| \times H (see the sketch after this list).
    3. At first glance, if we use the same construction to convert non-stationary dynamics into stationary ones over an infinite horizon, the size of the new state space becomes infinite, so the reduction is no longer useful.
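    Below is a minimal sketch of the augmentation from step 2, assuming the non-stationary dynamics and rewards are given as arrays indexed by the step h; all sizes and values are illustrative stand-ins.

```python
import numpy as np

# Random stand-ins for the non-stationary dynamics P_h and rewards R_h.
np.random.seed(1)
nS, nA, H = 4, 2, 3
P = np.random.dirichlet(np.ones(nS), size=(H, nS, nA))  # P[h-1, s, a] is a distribution over next states
R = np.random.uniform(size=(H, nS, nA))                  # R[h-1, s, a] is the reward at step h

def augmented_step(s_aug, a):
    """One step of the stationary MDP M' on the augmented state (s, h)."""
    s, h = s_aug
    r = R[h - 1, s, a]                               # reward depends only on the augmented state and the action
    s_next = np.random.choice(nS, p=P[h - 1, s, a])
    return (s_next, h + 1), r                        # the step index advances deterministically

state, ret = (0, 1), 0.0
for _ in range(H):
    state, r = augmented_step(state, a=0)
    ret += r
print(state, ret)  # after H steps the augmented state carries h = H + 1; |S'| = |S| * H
```

    The augmented state (s, h) makes the one-step dynamics time-independent, at the cost of multiplying the state space by H.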
