[Notes] Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Author: 半山来客 | Published 2018-10-22 13:43

    Full paper: https://arxiv.org/pdf/1204.5721.pdf

    Introduction

    • Applications: ad placement, website optimization, and packet routing.
    • Goal: maximize the total payoff obtained in a sequence of allocations.
    • trade-off between exploration and exploitation in sequential experiments
    • Three fundamental formalizations, depending on the nature of the reward process: stochastic, adversarial, and Markovian.
    • Three playing strategies:
      • UCB algorithm (stochastic case)
      • Exp3 randomized algorithm (adversarial case)
      • Gittins indices (Markovian case)

    Regret R_n after n plays:

    R_n = \max_{i=1,\dots,K} \sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t}

    WHERE:

    • K: number of arms
    • i: arm index, i = 1, 2, \dots, K
    • t: step index at which an arm is selected, t = 1, 2, \dots, n
    • n: total number of plays
    • I_t: the arm selected at step t (the choice)
    • X_{i,t}: unknown reward associated with arm i at step t
      ( \max_i \sum_t X_{i,t} -- the optimal payoff )
    • X_{I_t,t}: reward received by selecting arm I_t (the reward)
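    To make the definition concrete, here is a small numpy sketch (the reward table and the forecaster's choices are made-up placeholders, not anything from the paper) that computes R_n exactly as defined above:

        import numpy as np

        rng = np.random.default_rng(0)
        K, n = 3, 1000
        X = rng.random((K, n))              # X[i, t]: hypothetical reward of arm i at step t
        I = rng.integers(0, K, size=n)      # I[t]: hypothetical arm chosen at step t

        optimal = X.sum(axis=1).max()       # max_i sum_t X[i, t]: best single arm in hindsight
        received = X[I, np.arange(n)].sum() # sum_t X[I_t, t]: what the forecaster collected
        print("R_n =", optimal - received)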

    In general, both the rewards X_{i,t} and the forecaster's choices I_t may be stochastic. This leads to the following two notions of averaged regret:

    Expected regret:

    \mathbb{E}[R_n] = \mathbb{E}\left[ \max_{i=1,\dots,K} \sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t} \right]

    The expectation of the regret with respect to the action which is optimal on the sequence of reward realizations.
    Pseudo-regret:

    \bar{R}_n = \max_{i=1,\dots,K} \mathbb{E}\left[ \sum_{t=1}^{n} X_{i,t} - \sum_{t=1}^{n} X_{I_t,t} \right]

    A weaker notion of regret, since one competes against the action which is optimal only in expectation.

    • Difference from the expected regret:
      • Expected regret: over the realized rewards, take the largest total payoff among the K arms (i.e., the single best arm in hindsight), subtract the forecaster's total reward, and then take the expectation.
      • Pseudo-regret: first fix an arm i, sum its rewards over the n plays, subtract the forecaster's total reward, and take the expectation; finally take the maximum of these K per-arm expectations.

    ? This explanation doesn't quite feel right. Look at it again: can the comparison arm i change at each step t?

    Pseudo-regret ≤ Expected regret
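    The inequality holds because a maximum of expectations never exceeds the expectation of a maximum. A quick Monte Carlo check, with made-up Bernoulli arms and a uniformly random forecaster (both hypothetical, purely for illustration), makes the gap visible:

        import numpy as np

        rng = np.random.default_rng(1)
        K, n, runs = 2, 200, 5000
        means = np.array([0.5, 0.6])                    # hypothetical Bernoulli arm means

        max_terms, per_arm_terms = [], []
        for _ in range(runs):
            X = (rng.random((K, n)) < means[:, None]).astype(float)  # X[i, t] ~ Bernoulli(means[i])
            I = rng.integers(0, K, size=n)              # a uniformly random forecaster
            received = X[I, np.arange(n)].sum()
            max_terms.append(X.sum(axis=1).max() - received)   # realization of max_i (...)
            per_arm_terms.append(X.sum(axis=1) - received)     # realization of (...) for each i

        expected_regret = np.mean(max_terms)                    # E[ max_i (...) ]
        pseudo_regret = np.max(np.mean(per_arm_terms, axis=0))  # max_i E[ (...) ]
        print(f"pseudo-regret {pseudo_regret:.2f} <= expected regret {expected_regret:.2f}")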

    Stochastic bandits (UCB algorithm)

    Unknown probability distributions \nu_1, \dots, \nu_K on [0,1], one per arm; the reward at step t is drawn as X_{I_t,t} \sim \nu_{I_t}.
    \mu_i: the mean of \nu_i (mean reward of arm i).

    Pseudo-regret in the stochastic setting:

    \bar{R}_n = n\mu^* - \mathbb{E}\left[ \sum_{t=1}^{n} \mu_{I_t} \right], \qquad \mu^* = \max_{i=1,\dots,K} \mu_i
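    Equivalently, writing \Delta_i = \mu^* - \mu_i for the suboptimality gap of arm i and T_i(n) for the number of times arm i is played during the first n steps, the pseudo-regret decomposes arm by arm, so bounding \mathbb{E}[T_i(n)] for each suboptimal arm immediately bounds the regret:

    \bar{R}_n = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}[T_i(n)]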

    The technique of upper confidence bounds goes back to Lai and Robbins [125], who introduced it for the asymptotic analysis of regret:
    [125] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, pp. 4–22, 1985.
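    As a concrete illustration of the upper-confidence-bound idea, here is a minimal sketch of the classic UCB1 index (empirical mean plus \sqrt{2 \ln t / T_i}, due to Auer et al.); it is one member of the UCB family, not the exact pseudo-code of the monograph, and the reward table X is a made-up input:

        import numpy as np

        def ucb1(X):
            """Run a UCB1-style strategy on a reward table X[i, t] with rewards in [0, 1]."""
            K, n = X.shape
            counts = np.zeros(K)                 # T_i: number of times arm i has been played
            sums = np.zeros(K)                   # cumulative reward collected from arm i
            choices = np.empty(n, dtype=int)
            for t in range(n):
                if t < K:
                    i = t                        # play each arm once to initialize
                else:
                    # index = empirical mean + exploration bonus sqrt(2 ln t / T_i)
                    i = int(np.argmax(sums / counts + np.sqrt(2 * np.log(t) / counts)))
                counts[i] += 1
                sums[i] += X[i, t]
                choices[t] = i
            return choices

    Rarely played arms receive a large bonus, which forces exploration; as the counts grow, the index concentrates on the empirical means and exploitation takes over.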

    Adversarial bandits (Exp3 randomized algorithm)

    • Adversary/opponent: the mechanism that sets the sequence of gains for each arm.
    • Oblivious adversary: the mechanism is independent of the forecaster's actions.
    • Nonoblivious adversary: the adversary adapts to the forecaster's past behavior.
      E.g., in the rigged casino the owner observes the way a gambler plays in order to design even more evil sequences of gains.

    In this adversarial setting, the goal is to obtain regret bounds in high probability or in expectation, with respect to any possible randomization in the strategies used by the forecaster or the opponent, and irrespective of the opponent. In the case of a nonoblivious adversary this is not an easy task, and for this reason we usually start by bounding the pseudo-regret:


    \bar{R}_n = \max_{i=1,\dots,K} \mathbb{E}\left[ \sum_{t=1}^{n} X_{i,t} \right] - \mathbb{E}\left[ \sum_{t=1}^{n} X_{I_t,t} \right]
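    For reference, a minimal sketch of Exp3 in its usual form (Auer et al.): exponential weights mixed with uniform exploration, updated with importance-weighted gain estimates so that each arm's estimated gain stays unbiased despite bandit feedback. The exploration parameter gamma and the gain table X are illustrative placeholders:

        import numpy as np

        def exp3(X, gamma=0.1, seed=0):
            """Run an Exp3-style strategy on a gain table X[i, t] with gains in [0, 1]."""
            rng = np.random.default_rng(seed)
            K, n = X.shape
            w = np.ones(K)                                  # exponential weights
            choices = np.empty(n, dtype=int)
            for t in range(n):
                p = (1 - gamma) * w / w.sum() + gamma / K   # mix in uniform exploration
                i = int(rng.choice(K, p=p))
                g_hat = X[i, t] / p[i]                      # importance-weighted gain estimate
                w[i] *= np.exp(gamma * g_hat / K)
                w /= w.sum()                                # renormalize for numerical stability
                choices[t] = i
            return choices

    The division by p[i] compensates for only observing the gain of the arm actually played: in expectation the estimate equals the true gain of every arm.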

    Markovian bandits (Gittins indices)

    • Not much will be discussed in this monograph
    • A special case of Markovian bandits: Bayesian bandits.

    Terminology

    • forecaster = player (i.e., the agent implementing a bandit strategy)
