Full paper: https://arxiv.org/pdf/1204.5721.pdf
Introduction
- Applications: ad placement, website optimization, and packet routing.
- Goal: maximize the total payoff obtained in a sequence of allocations.
- trade-off between exploration and exploitation in sequential experiments
- Three fundamental formalizations, depending on the nature of the reward process: stochastic, adversarial, and Markovian.
- Three playing strategies:
- UCB algorithm (stochastic case)
- Exp3 randomized algorithm (adversarial case)
- Gittins indices (Markovian case)
Regret:
$$R_n = \max_{i=1,\dots,K} \sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t}$$
where:
- $K$: number of arms
- $i$: arm index, $i = 1, 2, \dots, K$
- $t$: time step of selecting arms, $t = 1, 2, \dots, n$
- $n$: total number of plays
- $I_t$: the arm selected at step $t$ (choice)
- $X_{i,t}$: unknown reward associated with arm $i$ at step $t$
- $\max_{i} \sum_{t} X_{i,t}$: total reward of the optimal fixed arm; $\sum_{t} X_{I_t,t}$: total reward received by the forecaster (reward)
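To make the definition concrete, here is a minimal NumPy sketch (the reward table, the forecaster's choices, and the random filler data are illustrative stand-ins, not anything from the paper) that evaluates $R_n$ for one run:

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 4, 1000
X = rng.random((K, n))                    # X[i, t]: reward of arm i at step t
choices = rng.integers(0, K, size=n)      # choices[t]: the arm I_t played at step t

forecaster_reward = X[choices, np.arange(n)].sum()  # sum_t X_{I_t, t}
best_fixed_arm = X.sum(axis=1).max()                # max_i sum_t X_{i, t}
regret = best_fixed_arm - forecaster_reward
print(regret)
```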
In general, both the rewards $X_{i,t}$ and the forecaster's choices $I_t$ might be stochastic. This motivates distinguishing between the following two notions of averaged regret:
Expected regret
$$\mathbb{E}[R_n] = \mathbb{E}\left[\max_{i=1,\dots,K} \sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t}\right]$$
The expectation of the regret with respect to the action which is optimal on the realized sequence of rewards.
Pseudo-regret
$$\bar{R}_n = \max_{i=1,\dots,K} \mathbb{E}\left[\sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t}\right]$$
A weaker notion of regret, since one competes against the action which is optimal only in expectation.
- Difference from the expected regret:
  - Expected regret: for each realization of the rewards, take the total reward of the single best fixed arm over all n steps, subtract the forecaster's total reward, and then take the expectation (the max sits inside the expectation).
  - Pseudo-regret: first fix an arm i, take the expectation of its total reward over n plays minus the forecaster's total reward, and finally take the max of these expectations over the K arms (the max sits outside the expectation).
  - In both definitions the comparator is a single fixed arm held for all n steps; it is not re-chosen at each step t.
Pseudo-regret ≤ Expected regret
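The inequality follows by moving the max inside the expectation: for each fixed arm $i$, the bracketed difference is pointwise at most its maximum over arms, and expectation preserves this:
$$\bar{R}_n = \max_{i=1,\dots,K} \mathbb{E}\left[\sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t}\right] \le \mathbb{E}\left[\max_{i=1,\dots,K}\left(\sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t}\right)\right] = \mathbb{E}[R_n]$$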
Stochastic bandits (UCB algorithm)
In the stochastic setting, each arm $i$ corresponds to an unknown probability distribution $\nu_i$, and playing arm $i$ at step $t$ yields a reward $X_{i,t}$ drawn i.i.d. from $\nu_i$.
- $\mu_i$: the mean of $\nu_i$ (mean reward of arm $i$).
- $\mu^* = \max_{i=1,\dots,K} \mu_i$: the mean reward of the optimal arm.
Lai and Robbins [125] introduced the technique of upper confidence bounds for the asymptotic analysis of regret:
[125] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, pp. 4–22, 1985.
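To show what an upper-confidence-bound strategy looks like in code, here is a minimal Python sketch of the classic UCB1 rule of Auer, Cesa-Bianchi, and Fischer (one member of the UCB family the survey analyzes); the Bernoulli test arms at the bottom are illustrative:

```python
import math
import random

def ucb1(n_rounds, arms):
    """UCB1: after playing each arm once, pick the arm maximizing
    empirical mean + sqrt(2 ln t / n_i). `arms` is a list of functions
    that each sample a reward in [0, 1]."""
    K = len(arms)
    counts = [0] * K   # n_i: number of times arm i was played
    sums = [0.0] * K   # cumulative reward of arm i
    total = 0.0
    for t in range(1, n_rounds + 1):
        if t <= K:
            i = t - 1  # initialization: play each arm once
        else:
            i = max(range(K), key=lambda j: sums[j] / counts[j]
                    + math.sqrt(2 * math.log(t) / counts[j]))
        x = arms[i]()  # observe the reward of the chosen arm
        counts[i] += 1
        sums[i] += x
        total += x
    return total

# Illustrative example: two Bernoulli arms with means 0.4 and 0.6.
arms = [lambda: float(random.random() < 0.4),
        lambda: float(random.random() < 0.6)]
print(ucb1(10000, arms))
```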
Adversarial bandits (Exp3 randomized algorithm)
- Adversary/opponent: the mechanism that sets the sequence of gains for each arm.
- Oblivious adversary: the mechanism is independent of the forecaster's actions.
- Nonoblivious adversary: the adversary adapts to the forecaster's past behavior.
  - E.g., in the rigged casino the owner observes the way a gambler plays in order to design even more evil sequences of gains.
In this adversarial setting, the goal is to obtain regret bounds, in high probability or in expectation, with respect to any possible randomization in the strategies used by the forecaster or the opponent, and irrespective of the opponent. In the case of a nonoblivious adversary this is not an easy task, and for this reason we usually start by bounding the pseudo-regret:
$$\bar{R}_n = \max_{i=1,\dots,K} \mathbb{E}\left[\sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t}\right]$$
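A minimal Python sketch of the loss-based Exp3 update (exponential weights on importance-weighted loss estimates; the variable names, the random loss table, and the fixed learning rate $\eta = \sqrt{2\ln K/(nK)}$ are standard but illustrative choices for a known horizon $n$):

```python
import math
import random

def exp3(n_rounds, losses, eta):
    """Exp3: draw I_t from the normalized weights, observe only the
    chosen arm's loss, and update that arm's weight with an
    importance-weighted (unbiased) loss estimate."""
    K = len(losses[0])
    weights = [1.0] * K
    total_loss = 0.0
    for t in range(n_rounds):
        s = sum(weights)
        probs = [w / s for w in weights]
        i = random.choices(range(K), weights=probs)[0]  # I_t ~ p_t
        loss = losses[t][i]            # only the chosen arm's loss is seen
        total_loss += loss
        # estimate: (loss / p_i) for the chosen arm, 0 for all others
        weights[i] *= math.exp(-eta * loss / probs[i])
    return total_loss

# Illustrative example: an oblivious adversary fixes all losses in advance.
random.seed(0)
n, K = 5000, 3
losses = [[random.random() for _ in range(K)] for _ in range(n)]
eta = math.sqrt(2 * math.log(K) / (n * K))
print(exp3(n, losses, eta))
```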
Markovian bandits (Gittins indices)
- Not much will be discussed in this monograph
- A special case of Markovian bandits: Bayesian bandits.
Terminology
- forecaster = player (i.e., the agent implementing a bandit strategy)