UCB 方法 —— The Upper Confidence Bound (UCB) method

regret minimization

Pull each arm once: set $t = n$ , where $T_i(t) = 1 \, \forall i \in [n]$
WHILE $t < T$ DO:
2.1 select $i_t = \arg\max_{i\in [n]} \Big\{\hat{\mu_{i,T_i(t)}} + \sqrt{\frac{\ln(2T)}{T_i(t)}}\Big\}$
2.2 pull arm $i_t$ once
2.3 $T_i(t+1) = T_i(t) \space\space \text{if}\space\space i \ne i_t\space \text{else}\space\space T_i(t)+1$
2.4 $t \leftarrow t+1$

最乐观地估计

悔值分析： $\text{Regret}^{\text{UCB}}(T) \precsim \sum_{i=2}^{n} \frac{\ln T}{\Delta_i}$ where $\Delta_i = \mu_1 - \mu_i$ , total $n-1$

Let $\mathcal{E} = \{\mu_i \in \hat{\mu_{i,T_i(t)}} \pm \sqrt{\frac{\ln(2T)}{T_i(t)}}\space \forall i \space \in [n] \space t \in [T]\}$
Claim
$Pr[\mathcal{E}] \geq 1 - \frac{1}{2T}$
When $\mathcal{E}$ happens
Claim for each $i = 2,3,...n$ , arm $i$ is pulled at most $\mathcal{O}(\frac{\ln T}{\Delta_i^2})$ times.

Proof. Suppose arm $i$ is pulled $T_i$ times. For the last time it was pulled:

$\mu_i + 2 \sqrt{\frac{2T}{T_{i-1}}}\geq \hat{\mu_{i,T_{i-1}}} \sqrt{\frac{\ln(2T)}{T_{i-1}}} \geq \hat{\mu_{1,t'}}+\sqrt{\frac{2T}{t'}} \geq \mu_1$

Therefore $\Delta_i = \mu_1 - \mu_i \leq 2 \sqrt{\frac{\ln(2T)}{T_{i-1}}}$ , so:
$T_i\precsim \frac{\ln(2T)}{\Delta_i^2}$

$\text{Regret}^{\text{UCB}}(T) \leq \mathbb{E}[Reg |\mathcal{E}] + Pr[\bar{\mathcal{E}}]\cdot T \precsim \sum_{i=2}^{n} \frac{\ln T}{\Delta_i^2} \Delta_i = \sum_{i=2}^{n} \frac{\ln T}{\Delta_i}$

Parameter independent bound
If arm $i$ is pulled $T_i$ times (when $\mathcal{E}$ happens) we have $\Delta_i \precsim \sqrt{\frac{\ln T}{T_i}}$

Therefore
$\mathbb{E}[Reg|\mathcal{E}] = \sum_{i=2}^{n} \Delta_i T_i \precsim \sum_{i=2}^{n} \sqrt{\ln T} \sqrt{T_i}$
By Cauchy Schwartz inequality
$\leq \sqrt{(n-1) \ln T} \sqrt{\sum_{i=2}^{n} T_i} \leq \sqrt{n \ln T \cdot T}$

total regret $\leq T$ average regret $\leq 1$

To summerize: $\Delta_i$ 降低，难度升高