UCB 方法 —— The Upper Confidence Bound (UCB) method
regret minimization
- Pull each arm once: set t = n, where T_i(t) = 1 \, \forall i \in [n]
-
WHILE t < T DO:
2.1 select i_t = \arg\max_{i\in [n]} \Big\{\hat{\mu_{i,T_i(t)}} + \sqrt{\frac{\ln(2T)}{T_i(t)}}\Big\}
2.2 pull arm i_t once
2.3 T_i(t+1) = T_i(t) \space\space \text{if}\space\space i \ne i_t\space \text{else}\space\space T_i(t)+1
2.4 t \leftarrow t+1
最乐观地估计
悔值分析:\text{Regret}^{\text{UCB}}(T) \precsim \sum_{i=2}^{n} \frac{\ln T}{\Delta_i} where \Delta_i = \mu_1 - \mu_i, total n-1
Let \mathcal{E} = \{\mu_i \in \hat{\mu_{i,T_i(t)}} \pm \sqrt{\frac{\ln(2T)}{T_i(t)}}\space \forall i \space \in [n] \space t \in [T]\}
Claim
Pr[\mathcal{E}] \geq 1 - \frac{1}{2T}
When \mathcal{E} happens
Claim for each i = 2,3,...n, arm i is pulled at most \mathcal{O}(\frac{\ln T}{\Delta_i^2}) times.
Proof. Suppose arm i is pulled T_i times. For the last time it was pulled:
\mu_i + 2 \sqrt{\frac{2T}{T_{i-1}}}\geq \hat{\mu_{i,T_{i-1}}} \sqrt{\frac{\ln(2T)}{T_{i-1}}} \geq \hat{\mu_{1,t'}}+\sqrt{\frac{2T}{t'}} \geq \mu_1
Therefore \Delta_i = \mu_1 - \mu_i \leq 2 \sqrt{\frac{\ln(2T)}{T_{i-1}}}, so:
T_i\precsim \frac{\ln(2T)}{\Delta_i^2}
\text{Regret}^{\text{UCB}}(T) \leq \mathbb{E}[Reg |\mathcal{E}] + Pr[\bar{\mathcal{E}}]\cdot T \precsim \sum_{i=2}^{n} \frac{\ln T}{\Delta_i^2} \Delta_i = \sum_{i=2}^{n} \frac{\ln T}{\Delta_i}
Parameter independent bound
If arm i is pulled T_i times (when \mathcal{E} happens) we have \Delta_i \precsim \sqrt{\frac{\ln T}{T_i}}
Therefore
\mathbb{E}[Reg|\mathcal{E}] = \sum_{i=2}^{n} \Delta_i T_i \precsim \sum_{i=2}^{n} \sqrt{\ln T} \sqrt{T_i}
By Cauchy Schwartz inequality
\leq \sqrt{(n-1) \ln T} \sqrt{\sum_{i=2}^{n} T_i} \leq \sqrt{n \ln T \cdot T}
total regret \leq T average regret \leq 1
To summerize: \Delta_i 降低,难度升高
网友评论