Author: MasterXiong | Published 2021-01-10 14:27

Chapter 7: n-step Bootstrapping

n-step TD methods span a spectrum with MC methods at one end and one-step TD methods at the other.

n-step TD Prediction

In n-step TD, the target for the value estimate combines the sample rewards of the first n steps with a bootstrapped value estimate of the state reached after n steps, i.e., G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}). Correspondingly, the update rule is V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha [G_{t:t+n} - V_{t+n-1}(S_t)]. Note that only the value of S_t changes in this update, while the values of all other states remain unchanged, i.e., V_{t+n}(s) = V_{t+n-1}(s) for all s \neq S_t. One issue here is that the estimate V_{t+n-1}(S_{t+n}) used for bootstrapping may have been updated long ago, so it may no longer reflect the true value of that state under the current policy. This is not covered in the RL book and I'm not sure whether it causes any problem in practice.
One-step TD (introduced in the last chapter) is the special case of n-step TD with n=1, i.e., G_{t:t+1} = R_{t+1} + \gamma V_t(S_{t+1}). MC is the extreme at the opposite end, where n equals the episode length, i.e., G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T.
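To make the indexing concrete, here is a minimal tabular sketch of n-step TD prediction. The `env.reset()`/`env.step()` interface and the `policy(state)` sampler are illustrative assumptions, not any specific library's API, and states are assumed to be integer indices.

```python
import numpy as np

def n_step_td_prediction(env, policy, n, alpha, gamma, num_episodes, num_states):
    # Tabular n-step TD prediction sketch. `env` and `policy` are hypothetical:
    # env.reset() -> state, env.step(action) -> (next_state, reward, done),
    # policy(state) -> action; states are integer indices into V.
    V = np.zeros(num_states)
    for _ in range(num_episodes):
        states = [env.reset()]
        rewards = [0.0]                  # dummy R_0 so that rewards[t] == R_t
        T, t = float('inf'), 0
        while True:
            if t < T:
                s, r, done = env.step(policy(states[t]))
                states.append(s)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1              # the time whose state estimate is updated
            if tau >= 0:
                # G_{tau:tau+n}: n sampled rewards plus a bootstrapped value
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```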

n-step Sarsa

n-step Sarsa is a natural generalization of one-step Sarsa, with the target G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q_{t+n-1}(S_{t+n}, A_{t+n}) and the corresponding update rule Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha [G_{t:t+n} - Q_{t+n-1}(S_t, A_t)]. The RL book gives a gridworld example, shown below, to illustrate the advantage of n-step Sarsa over one-step Sarsa: when the reward is sparse, n-step Sarsa propagates the reward to earlier states much faster.

[Figure: gridworld example comparing one-step and n-step Sarsa]
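A minimal sketch of on-policy n-step Sarsa follows, under the same hypothetical tabular interface as above, with an ε-greedy policy derived from Q (all names are illustrative assumptions).

```python
import numpy as np

def n_step_sarsa(env, n, alpha, gamma, epsilon, num_episodes, num_states, num_actions):
    # On-policy n-step Sarsa sketch with an epsilon-greedy policy derived from Q.
    # The env interface is the same hypothetical one as in the prediction sketch.
    Q = np.zeros((num_states, num_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        s0 = env.reset()
        states, actions, rewards = [s0], [eps_greedy(s0)], [0.0]
        T, t = float('inf'), 0
        while True:
            if t < T:
                s, r, done = env.step(actions[t])
                states.append(s)
                rewards.append(r)
                if done:
                    T = t + 1
                else:
                    actions.append(eps_greedy(s))
            tau = t - n + 1
            if tau >= 0:
                # n-step return bootstrapped with Q(S_{tau+n}, A_{tau+n})
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[states[tau + n], actions[tau + n]]
                Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
            if tau == T - 1:
                break
            t += 1
    return Q
```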

n-step Off-policy Control

For n-step off-policy control, we need to take importance sampling into account, which leads to the following update rule: Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n} [G_{t:t+n} - Q_{t+n-1}(S_t, A_t)], where \rho_{t:h} = \prod_{k=t}^{\min(h,T-1)} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}. Note that the ratio starts at step t+1 because we do not have to care how likely we were to select the action A_t; now that we have selected it, we want to learn fully from what happens, with importance sampling only for subsequent actions. The expected-update variants (Expected Sarsa and Q-learning) need one less factor, \rho_{t+1:t+n-1}, because the final bootstrapping term already takes an expectation (or maximum) over the target policy's actions rather than using a sampled action; for n=1 this ratio is an empty product, which is why one-step Q-learning has no importance sampling term at all.
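As a sketch of how the ratio enters the update, the function below performs a single off-policy n-step Sarsa update at time tau on one stored trajectory segment. Here `pi(a, s)` and `b(a, s)` are assumed to return the target and behavior action probabilities, and the indexing conventions (rewards[0] is a dummy) follow the earlier sketches; all names are illustrative.

```python
def off_policy_n_step_update(Q, states, actions, rewards, tau, n, T, alpha, gamma, pi, b):
    # One off-policy n-step Sarsa update with importance sampling (sketch).
    # Q is a |S| x |A| table (e.g., a numpy array); states/actions/rewards store
    # S_t, A_t, R_t along one trajectory, with rewards[0] a dummy entry.
    # rho_{tau+1 : tau+n}: ratios only for the actions *after* A_tau
    rho = 1.0
    for k in range(tau + 1, min(tau + n, T - 1) + 1):
        rho *= pi(actions[k], states[k]) / b(actions[k], states[k])
    # n-step return with bootstrapping, identical to the on-policy case
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    if tau + n < T:
        G += gamma ** n * Q[states[tau + n], actions[tau + n]]
    Q[states[tau], actions[tau]] += alpha * rho * (G - Q[states[tau], actions[tau]])
    return Q
```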

Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

Importance sampling is required in the off-policy n-step method above because it performs sample updates: the actions in the trajectory are drawn from the behavior policy, so we must multiply by the importance sampling ratio to make the update correct in expectation w.r.t. the target policy. A natural way to avoid importance sampling is therefore to perform an expected update over actions under the target policy \pi, which is what the n-step tree backup algorithm does.
In its simplest case with n=1, tree backup is exactly expected Sarsa, i.e., G_{t:t+1} = R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) Q_t(S_{t+1}, a).
For a = A_{t+1}, we can further expand the corresponding Q_t(S_{t+1}, a) term in the equation above to get a two-step target. Applying this expansion recursively gives the tree-backup target: G_{t:t+n} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a|S_{t+1}) Q_{t+n-1}(S_{t+1},a) + \gamma \pi(A_{t+1}|S_{t+1}) G_{t+1:t+n}. The update rule, which needs no importance sampling, is Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha [G_{t:t+n} - Q_{t+n-1}(S_t, A_t)].
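The recursion translates directly into code. Below is a sketch of the tree-backup target under the same illustrative conventions as before: `pi(a, s)` is the target policy's action probability, Q is a |S| x |A| table (e.g., a numpy array), and rewards[0] is a dummy entry.

```python
def tree_backup_return(t, n, states, actions, rewards, Q, pi, gamma, T):
    # Tree-backup target G_{t:t+n}, computed recursively (sketch).
    if t + 1 >= T:                       # terminal case: G_{T-1:T} = R_T
        return rewards[T]
    num_actions = Q.shape[1]
    if n == 1:                           # base case: one-step Expected Sarsa target
        return rewards[t + 1] + gamma * sum(
            pi(a, states[t + 1]) * Q[states[t + 1], a] for a in range(num_actions))
    # leaf nodes: actions not taken at t+1, weighted by the target policy
    leaves = sum(pi(a, states[t + 1]) * Q[states[t + 1], a]
                 for a in range(num_actions) if a != actions[t + 1])
    # taken action A_{t+1}: recurse one step deeper into G_{t+1:t+n}
    return (rewards[t + 1]
            + gamma * leaves
            + gamma * pi(actions[t + 1], states[t + 1])
            * tree_backup_return(t + 1, n - 1, states, actions, rewards, Q, pi, gamma, T))
```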
