Author: MasterXiong | Published 2021-01-10 14:27

Chapter 7: n-step Bootstrapping

n-step TD methods span a spectrum with MC methods at one end and one-step TD methods at the other.

n-step TD Prediction

In n-step TD, the target for the value estimate combines the sample rewards of the first n steps with a bootstrapped value estimate of the state reached after n steps, i.e., G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n}). Correspondingly, the update rule is V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha [G_{t:t+n} - V_{t+n-1}(S_t)]. Note that only the value of S_t changes in this update, while the values of all other states remain unchanged, i.e., V_{t+n}(s) = V_{t+n-1}(s) for all s \neq S_t. One issue here is that the estimate V_{t+n-1}(S_{t+n}) used for bootstrapping may have been updated long ago, so it may no longer reflect the true value of that state under the current policy. This is not covered in the RL book and I'm not sure whether it causes any problem in practice.
One-step TD (introduced in the last chapter) is the special case of n-step TD with n=1, i.e., G_{t:t+1} = R_{t+1} + \gamma V_t(S_{t+1}). MC is the extreme at the opposite end, where n equals the episode length, i.e., G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T.
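To make the indexing concrete, here is a minimal tabular sketch of n-step TD prediction. The `env.reset()`/`env.step()` interface and the `policy(state)` sampler are illustrative assumptions, not any specific library's API, and states are assumed to be integer indices.

```python
import numpy as np

def n_step_td_prediction(env, policy, n, alpha, gamma, num_episodes, num_states):
    # Tabular n-step TD prediction sketch. `env` and `policy` are hypothetical:
    # env.reset() -> state, env.step(action) -> (next_state, reward, done),
    # policy(state) -> action; states are integer indices into V.
    V = np.zeros(num_states)
    for _ in range(num_episodes):
        states = [env.reset()]
        rewards = [0.0]                  # dummy R_0 so that rewards[t] == R_t
        T, t = float('inf'), 0
        while True:
            if t < T:
                s, r, done = env.step(policy(states[t]))
                states.append(s)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1              # the time whose state estimate is updated
            if tau >= 0:
                # G_{tau:tau+n}: n sampled rewards plus a bootstrapped value
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```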

n-step Sarsa

n-step Sarsa is a natural generalization of one-step Sarsa, with the target G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} Q_{t+n-1}(S_{t+n}, A_{t+n}) and the corresponding update rule Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha [G_{t:t+n} - Q_{t+n-1}(S_t, A_t)]. The RL book gives a gridworld example, shown below, to illustrate the advantage of n-step Sarsa over one-step Sarsa: when the reward is sparse, n-step Sarsa propagates the reward to earlier states much faster.

[Figure: gridworld example comparing one-step and n-step Sarsa]
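A minimal sketch of on-policy n-step Sarsa follows, under the same hypothetical tabular interface as above, with an ε-greedy policy derived from Q (all names are illustrative assumptions).

```python
import numpy as np

def n_step_sarsa(env, n, alpha, gamma, epsilon, num_episodes, num_states, num_actions):
    # On-policy n-step Sarsa sketch with an epsilon-greedy policy derived from Q.
    # The env interface is the same hypothetical one as in the prediction sketch.
    Q = np.zeros((num_states, num_actions))

    def eps_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        s0 = env.reset()
        states, actions, rewards = [s0], [eps_greedy(s0)], [0.0]
        T, t = float('inf'), 0
        while True:
            if t < T:
                s, r, done = env.step(actions[t])
                states.append(s)
                rewards.append(r)
                if done:
                    T = t + 1
                else:
                    actions.append(eps_greedy(s))
            tau = t - n + 1
            if tau >= 0:
                # n-step return bootstrapped with Q(S_{tau+n}, A_{tau+n})
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[states[tau + n], actions[tau + n]]
                Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
            if tau == T - 1:
                break
            t += 1
    return Q
```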

n-step Off-policy Control

For n-step off-policy control, we need to take importance sampling into account, which leads to the following update rule: Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \rho_{t+1:t+n} [G_{t:t+n} - Q_{t+n-1}(S_t, A_t)], where \rho_{t:h} = \prod_{k=t}^{\min(h,T-1)} \frac{\pi(A_k|S_k)}{b(A_k|S_k)}. Note that the ratio starts at step t+1 because we do not have to care how likely we were to select the action A_t; now that we have selected it, we want to learn fully from what happens, with importance sampling only for subsequent actions. The expected-update variants (Expected Sarsa and Q-learning) need one less factor, \rho_{t+1:t+n-1}, because the final bootstrapping term already takes an expectation (or maximum) over the target policy's actions rather than using a sampled action; for n=1 this ratio is an empty product, which is why one-step Q-learning has no importance sampling term at all.
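As a sketch of how the ratio enters the update, the function below performs a single off-policy n-step Sarsa update at time tau on one stored trajectory segment. Here `pi(a, s)` and `b(a, s)` are assumed to return the target and behavior action probabilities, and the indexing conventions (rewards[0] is a dummy) follow the earlier sketches; all names are illustrative.

```python
def off_policy_n_step_update(Q, states, actions, rewards, tau, n, T, alpha, gamma, pi, b):
    # One off-policy n-step Sarsa update with importance sampling (sketch).
    # Q is a |S| x |A| table (e.g., a numpy array); states/actions/rewards store
    # S_t, A_t, R_t along one trajectory, with rewards[0] a dummy entry.
    # rho_{tau+1 : tau+n}: ratios only for the actions *after* A_tau
    rho = 1.0
    for k in range(tau + 1, min(tau + n, T - 1) + 1):
        rho *= pi(actions[k], states[k]) / b(actions[k], states[k])
    # n-step return with bootstrapping, identical to the on-policy case
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    if tau + n < T:
        G += gamma ** n * Q[states[tau + n], actions[tau + n]]
    Q[states[tau], actions[tau]] += alpha * rho * (G - Q[states[tau], actions[tau]])
    return Q
```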

Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm

Importance sampling is required in the off-policy n-step method above because it performs sample updates: the actions in the trajectory are drawn from the behavior policy, so we must multiply by the importance sampling ratio to make the update correct in expectation w.r.t. the target policy. A natural way to avoid importance sampling is therefore to perform an expected update over actions under the target policy \pi, which is what the n-step tree backup algorithm does.
In its simplest case with n=1, tree backup is exactly expected Sarsa, i.e., G_{t:t+1} = R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) Q_t(S_{t+1}, a).
For a = A_{t+1}, we can further expand the corresponding Q_t(S_{t+1}, a) term in the equation above to get a two-step target. Applying this expansion recursively gives the tree-backup target: G_{t:t+n} = R_{t+1} + \gamma \sum_{a \neq A_{t+1}} \pi(a|S_{t+1}) Q_{t+n-1}(S_{t+1},a) + \gamma \pi(A_{t+1}|S_{t+1}) G_{t+1:t+n}. The update rule, which needs no importance sampling, is Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha [G_{t:t+n} - Q_{t+n-1}(S_t, A_t)].
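The recursion translates directly into code. Below is a sketch of the tree-backup target under the same illustrative conventions as before: `pi(a, s)` is the target policy's action probability, Q is a |S| x |A| table (e.g., a numpy array), and rewards[0] is a dummy entry.

```python
def tree_backup_return(t, n, states, actions, rewards, Q, pi, gamma, T):
    # Tree-backup target G_{t:t+n}, computed recursively (sketch).
    if t + 1 >= T:                       # terminal case: G_{T-1:T} = R_T
        return rewards[T]
    num_actions = Q.shape[1]
    if n == 1:                           # base case: one-step Expected Sarsa target
        return rewards[t + 1] + gamma * sum(
            pi(a, states[t + 1]) * Q[states[t + 1], a] for a in range(num_actions))
    # leaf nodes: actions not taken at t+1, weighted by the target policy
    leaves = sum(pi(a, states[t + 1]) * Q[states[t + 1], a]
                 for a in range(num_actions) if a != actions[t + 1])
    # taken action A_{t+1}: recurse one step deeper into G_{t+1:t+n}
    return (rewards[t + 1]
            + gamma * leaves
            + gamma * pi(actions[t + 1], states[t + 1])
            * tree_backup_return(t + 1, n - 1, states, actions, rewards, Q, pi, gamma, T))
```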
