Original article: https://oneraynyday.github.io/ml/2018/05/03/Reinforcement-Learning-Bandit/
The bandit problem is a subset of tabular solution methods, called tabular because every state can be looked up in a table.
K-armed Bandit Problem:
One-armed bandit: a slot machine operated by pulling a long handle at the side.
There are K different actions; each action pays out a reward sampled from a distribution conditioned on that action. Over a horizon of T time steps, how do we maximize the total reward collected?
At is the action taken at time t and Rt is the reward received. The value of an action is defined as q*(a) = E[Rt | At = a], i.e. "the value of an action a is the expected value of the reward of the action (at any time)."
Qt(a) is the estimate of q*(a) at time t.
How do we compute Qt(a)?
Value (q*): in this case it is different from the concept of reward. Value is the long-run metric, while reward is the immediate metric.
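As a minimal sketch of the setup, assuming Gaussian reward distributions (the true q*(a) are drawn once, and each pull returns a noisy sample conditioned on the chosen action; the class and variable names are my own, not from the original post):

```python
import numpy as np

class GaussianBandit:
    """K-armed bandit: each action a has a fixed mean q*(a); pulls return noisy samples."""
    def __init__(self, k=10, seed=0):
        self.rng = np.random.default_rng(seed)
        self.q_star = self.rng.normal(0.0, 1.0, size=k)  # true action values q*(a)

    def pull(self, action):
        # Reward sampled from a distribution conditioned on the chosen action.
        return self.rng.normal(self.q_star[action], 1.0)

bandit = GaussianBandit(k=10)
print(bandit.pull(3))  # one noisy reward for arm 3
```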
Action-value methods proceed in two steps: first estimate each action's value, then select an action based on those estimates.
1. Estimating Action values
Approximate Qt(a) by the sample average: the mean of the rewards received on the time steps when action a was chosen.
By the law of large numbers, this entails that Qt(a) converges almost surely to q*(a).
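A sketch of the sample-average estimate, maintained incrementally as Qn+1 = Qn + (Rn − Qn)/n so past rewards need not be stored (variable names are mine):

```python
import numpy as np

k = 10
Q = np.zeros(k)  # Qt(a): current estimate for each action
N = np.zeros(k)  # number of times each action has been taken

def update_estimate(action, reward):
    """Incremental sample average: Q_{n+1} = Q_n + (R_n - Q_n) / n."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]
```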
2. Selecting actions
(1) Action selection rule: Greedy. Pick the action with the largest estimated value.
(2) Action selection rule: ε-Greedy. With probability ε, choose uniformly from all actions; otherwise act greedily.
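A sketch of both rules, assuming the Q array above (np.argmax breaks ties by lowest index, so ties are broken at random here):

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy(Q):
    """Pick an action with the largest estimated value (ties broken at random)."""
    best = np.flatnonzero(Q == Q.max())
    return rng.choice(best)

def epsilon_greedy(Q, eps=0.1):
    """With probability eps explore uniformly over all actions, else act greedily."""
    if rng.random() < eps:
        return rng.integers(len(Q))
    return greedy(Q)
```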
Exponential recency-weighted average: Qn+1 = Qn + α(Rn − Qn), where α is the step-size coefficient. α can be replaced by a function αn(a), the weight given to the reward at each time step. Two properties of αn(a), Σn αn(a) = ∞ and Σn αn(a)² < ∞, guarantee that the update converges with probability 1; a constant α violates the second, so the estimate never converges to a single value and keeps tracking recent rewards.
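A sketch of the constant step-size update; with α fixed, older rewards are discounted geometrically, which is why this is an exponential recency-weighted average:

```python
def update_constant_alpha(Q, action, reward, alpha=0.1):
    """Constant step size: recent rewards carry exponentially more weight."""
    Q[action] += alpha * (reward - Q[action])
    return Q
```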
(3) Action selection rule: Optimistic initial values
The initial value estimates are essentially arbitrary and act as a hyperparameter. One trick is to set the initial values Q1(a) = C ∀a, where C > q*(a) ∀a, so every action looks better than it really is at first, which pushes the learner to try each one early.
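A sketch of optimistic initialization; C = 5 is an arbitrary illustrative choice, assumed larger than every true q*(a):

```python
import numpy as np

k = 10
C = 5.0              # optimistic constant, assumed > q*(a) for all a
Q = np.full(k, C)    # Q1(a) = C for every action
N = np.zeros(k)

# Even a purely greedy learner now explores: each arm's first real reward
# drags its estimate below C, so the remaining untried arms look best next.
```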
(4) Action selection rule: Upper-Confidence-Bound (UCB) selection
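The notes only name UCB; as a sketch, the standard rule from Sutton and Barto selects At = argmax_a [Qt(a) + c * sqrt(ln t / Nt(a))], preferring actions that are either high-value or rarely tried (c = 2 below is an illustrative choice):

```python
import numpy as np

def ucb(Q, N, t, c=2.0):
    """Upper-Confidence-Bound selection: the exploration bonus shrinks as N(a) grows."""
    untried = np.flatnonzero(N == 0)
    if untried.size:               # try every action at least once first
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / N)
    return int(np.argmax(Q + bonus))
```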
Double Bandit: video link: https://www.youtube.com/watch?feature=player_embedded&v=2M7mv4-BPCg