The Multi-Armed Bandit Problem

Author: best___me | Published: 2018-06-12 14:59

Bandit problems are a subset of tabular solution methods. They are called tabular because every state can be looked up in a table.

K-armed Bandit Problem:

One-armed bandit: a slot machine operated by pulling a long handle at the side.

There are K different actions. Each action pays out an amount of money sampled from a distribution conditioned on that action. There are usually T time steps. How do we maximize the total payout?
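To make the setup concrete, here is a minimal sketch of a K-armed bandit environment. The class name `GaussianBandit`, the unit-variance Gaussian reward distributions, and the helper `run_random_policy` are all assumptions made for illustration; the notes above only require that each reward is sampled from a distribution conditioned on the action.

```python
import random


class GaussianBandit:
    """K-armed bandit whose arm a pays a reward drawn from N(means[a], 1).

    The Gaussian form is an assumption for illustration only.
    """

    def __init__(self, means, seed=0):
        self.means = list(means)        # true action values, hidden from the agent
        self.rng = random.Random(seed)  # private RNG so runs are reproducible

    @property
    def k(self):
        return len(self.means)

    def pull(self, action):
        """Sample one reward from the distribution conditioned on `action`."""
        return self.rng.gauss(self.means[action], 1.0)


def run_random_policy(bandit, steps, seed=0):
    """Play `steps` rounds choosing arms uniformly at random; return the total reward."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        total += bandit.pull(rng.randrange(bandit.k))
    return total
```

A uniformly random policy like this is only a baseline: it ignores everything it observes, which is the gap the action-value methods below try to close.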

At is the action chosen at time step t, and Rt is the reward received for it.

q*(a) = E[Rt | At = a], which means "the value of an action a is the expected reward of that action (at any time step)."

Qt(a) is the estimate of q*(a) at time t.

How do we compute Qt(a)?

Value: in this setting it is distinct from reward. Value is the long-run metric, whereas reward is the immediate metric.

Action-value Methods: these work in two steps. First estimate the value of each action, then select a specific action.

1. Estimating Action values

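Approximate it with the sample mean: Qt(a) = (sum of rewards received when a was taken before time t) / (number of times a was taken before time t).

By the law of large numbers, Qt(a) converges almost surely to q*(a) as the number of times a is taken grows without bound.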
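The sample-average estimate can be computed either directly or incrementally. The sketch below shows both; the function names are hypothetical, and the incremental form is the standard running-mean identity rather than anything specific to these notes.

```python
def sample_average(rewards):
    """Direct estimate: sum of observed rewards divided by their count."""
    return sum(rewards) / len(rewards)


def incremental_average(rewards, initial=0.0):
    """Same estimate, updated one reward at a time with O(1) memory."""
    q, n = initial, 0
    for r in rewards:
        n += 1
        q += (r - q) / n  # Q_{n+1} = Q_n + (1/n) * (R_n - Q_n)
    return q
```

Because the first update uses a step of 1/1, the `initial` value is overwritten by the first reward, so both functions agree on any non-empty reward list.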

2. Selecting actions

(1) Action selection rule: Greedy

Choose the action with the largest estimated value: At = argmax_a Qt(a).
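A minimal sketch of greedy selection follows. The random tie-breaking is an assumption; the notes do not say how ties are resolved.

```python
import random


def greedy_action(q_values, rng=random):
    """Return an index with the largest estimate, breaking ties at random."""
    best = max(q_values)
    candidates = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(candidates)
```

Greedy selection never tries an action that currently looks worse, so a poor early estimate can lock it onto a suboptimal arm.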

(2) Action selection rule: ε-greedy. With probability ε, pick at random, choosing from all actions uniformly; otherwise take the greedy action.
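The ε-greedy rule can be sketched as below. The function name and the `rng` parameter are illustrative assumptions; the exploration branch samples uniformly over all actions, including the current greedy one.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon explore uniformly; otherwise exploit the best estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: uniform over all actions
    best = max(q_values)
    candidates = [a for a, q in enumerate(q_values) if q == best]
    return rng.choice(candidates)            # exploit: greedy with random tie-breaking
```

Setting `epsilon=0` recovers the purely greedy rule, and `epsilon=1` gives a uniformly random policy.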

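Exponential averaging: Qn+1 = Qn + α(Rn - Qn). Here α is a step-size coefficient; it can be replaced by a function αn(a) that sets the weight given to the reward at each time step.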

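Two properties of αn(a) guarantee that the update above converges with probability 1. A constant step size violates the second one, so the estimate does not settle on one specific value and instead keeps tracking recent rewards.

Properties: Σn αn(a) = ∞ and Σn αn(a)² < ∞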
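To see why a constant step size is called exponential averaging, the sketch below compares the one-step update with the explicit weights it places on each past reward. The function names and the symbol `alpha` for the step-size coefficient are assumptions for illustration.

```python
def constant_step_update(q, reward, alpha):
    """One update with constant step size: Q <- Q + alpha * (R - Q)."""
    return q + alpha * (reward - q)


def recency_weights(n, alpha):
    """Weights that n constant-step updates place on Q1 and on R_1..R_n.

    Q_{n+1} = (1 - alpha)^n * Q1 + sum_i alpha * (1 - alpha)^(n - i) * R_i
    """
    w_initial = (1 - alpha) ** n
    w_rewards = [alpha * (1 - alpha) ** (n - i) for i in range(1, n + 1)]
    return w_initial, w_rewards
```

The weights sum to 1 and decay geometrically with the age of the reward, so recent rewards dominate. With αn(a) = 1/n the update reduces to the plain sample average, which satisfies both convergence conditions.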

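(3) Action Selection Rule: Optimistic initial values

The initial value estimates are essentially arbitrary and act as a hyperparameter. One trick is to set Q1(a) = C for all a, where C > q*(a) for all a. Every action then looks better than it really is at first, so even a greedy agent is pushed to try each action before settling.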
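The effect of optimistic initial values can be illustrated with a small greedy agent. In this sketch the rewards are deterministic (each arm always pays its true mean) and ties go to the lowest index; both are simplifying assumptions chosen so the outcome is easy to follow, and `run_greedy` is a hypothetical helper.

```python
def run_greedy(true_means, initial_value, steps):
    """Greedy agent with sample-average estimates; returns per-arm pull counts.

    Rewards are noise-free (arm a always pays true_means[a]) purely to keep
    the illustration deterministic.
    """
    k = len(true_means)
    q = [initial_value] * k  # Q1(a) = C for every action a
    counts = [0] * k
    for _ in range(steps):
        action = max(range(k), key=lambda a: q[a])  # greedy; first maximum wins ties
        reward = true_means[action]
        counts[action] += 1
        q[action] += (reward - q[action]) / counts[action]  # sample-average update
    return counts
```

With true means `[0.2, 0.5, 0.9]`, starting from `initial_value=0.0` the agent locks onto arm 0 after its first positive reward, while starting from `initial_value=5.0` it is disappointed by each arm in turn, tries all three, and then stays on the best one.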

(4) Action Selection Rule: Upper-Confidence-Bound Selection
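The notes give no formula for this rule, so the sketch below assumes the commonly used form At = argmax_a [Qt(a) + c * sqrt(ln t / Nt(a))], where Nt(a) is the number of times a has been taken and c controls the amount of exploration. The function name, the default `c=2.0`, and the treatment of untried actions are assumptions.

```python
import math


def ucb_action(q_values, counts, t, c=2.0):
    """Pick argmax of Q(a) + c * sqrt(ln t / N(a)); untried arms are taken first.

    `t` is the current time step (t >= 1) and `counts[a]` is N(a).
    """
    for a, n in enumerate(counts):
        if n == 0:
            return a  # an untried action is treated as having the largest bound
    scores = [q + c * math.sqrt(math.log(t) / n) for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])
```

The bonus term shrinks as an action is tried more often and grows slowly with t, so rarely tried actions are revisited from time to time. Setting `c=0` reduces the rule to greedy selection once every arm has been tried.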

Article link: https://www.haomeiwen.com/subject/lysleftx.html