While sketching papers on bandit problems, I have also been trying to find a good research direction for myself. I began to wonder what would happen if we put a metric on the set of arms, that is, if we restricted which arm can be chosen next, so that nearby arms become more appealing to pick than distant ones.
Then I started to think about arms arranged on a cycle. I chose this structure because it treats all arms symmetrically when choosing the next one: from any arm we can only go left or right. I then ran into the problem of how to apply the UCB method I introduced before to this structured bandit problem. Before going into details, I first studied the most common baseline: pull every arm m times, then commit to the one with the highest estimated mean. After computing the regret of this method, it turns out not to be far from the best known result: the best known bound is of order $2\left(\sum_i 1/\Delta_i\right)\log T$, while mine is of order $2\left(\sum_i \Delta_i/\Delta^2\right)\log T$, where $\Delta$ is the minimal gap. They have the same order in $T$ but different gap-dependent coefficients.
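For concreteness, here is a minimal sketch of that baseline (explore-then-commit) on a K-armed bandit; the unit-variance Gaussian rewards and the specific choice of m are my own assumptions, not something fixed by the analysis above:

```python
import numpy as np

def explore_then_commit(means, m, T, seed=0):
    """Pull each of the K arms m times, then commit to the arm with
    the highest empirical mean for the remaining T - K*m rounds.
    Returns the (expected) regret of the realized play."""
    rng = np.random.default_rng(seed)
    K = len(means)
    # Exploration phase: m pulls per arm (assumed unit-variance Gaussian rewards).
    samples = rng.normal(np.repeat(means, m), 1.0)
    est = samples.reshape(K, m).mean(axis=1)
    committed = int(np.argmax(est))
    # Regret = exploration regret + commitment regret.
    gaps = max(means) - np.asarray(means, dtype=float)
    return m * gaps.sum() + (T - K * m) * gaps[committed]

# Example: explore_then_commit([0.5, 0.4, 0.3], m=100, T=10_000)
```

On a cycle this baseline still works, since one full sweep around the cycle visits every arm; the structure only matters once we try to be adaptive.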
The next problem is how to apply the UCB method to the cycle bandit. At round $t$ we have UCB estimates of all the true means. If I decide to walk directly to the arm with the maximal index, the subsequent analysis is hard to control. It is not easy to count the steps taken from the current arm, so we lose track of how much of the time horizon $T$ is spent traveling. Moreover, the arms along the path are pulled and updated as well, so it is hard to estimate how many times an arm has been pulled: it may be pulled because it has the best index, or merely because it lies on the path, and these two situations are hard to distinguish. If an arm is pulled because it lies on the path, its pull count also depends on how often the other arms have been pulled.
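A minimal sketch of this first heuristic, assuming unit-variance Gaussian rewards and a UCB1-style index (the index form and the step-by-step walk are my assumptions, not a worked-out algorithm):

```python
import numpy as np

def walk_to_max_ucb(means, T, seed=0):
    """Heuristic on a cycle of K arms: each round, recompute UCB indices
    and take one step (left or right) toward the arm with the largest
    index, pulling every arm visited along the way."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    pos, regret, best = 0, 0.0, max(means)

    for t in range(1, T + 1):
        # UCB1-style index; unpulled arms get +inf so each is tried once.
        with np.errstate(divide="ignore", invalid="ignore"):
            idx = np.where(counts > 0,
                           sums / counts + np.sqrt(2 * np.log(t) / counts),
                           np.inf)
        target = int(np.argmax(idx))
        if target != pos:
            # Move one step along the shorter arc of the cycle.
            d = (target - pos) % K
            pos = (pos + (1 if d <= K - d else -1)) % K
        # Every arm on the path gets pulled, as noted above.
        r = rng.normal(means[pos], 1.0)
        counts[pos] += 1
        sums[pos] += r
        regret += best - means[pos]
    return regret
```

Running this shows exactly the difficulty described: the pull counts of arms lying between the start and the current best-index arm are inflated by travel, not by their own indices.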
Alternatively, if I decide to move only one step per round, choosing whichever direction looks better, then how to choose the direction is itself a problem. If we choose the side whose neighbor has the larger mean estimate, the other side may contain a much better stretch of arms farther away. If we instead choose the side with the larger sum of all estimates, that is not very reasonable either. Either way, it remains a big problem to estimate how many times each arm has been pulled.
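A sketch of the two direction rules mentioned above, again under my own assumptions (UCB-style indices as input); splitting the cycle into two arcs at the current position is my reading of "sum of all estimates" on a side:

```python
import numpy as np

def step_direction(pos, idx, rule="neighbor"):
    """Pick a direction on a cycle (+1 = right, -1 = left) from an
    array of per-arm indices (e.g. UCB indices).
    rule="neighbor": compare the two adjacent arms' indices.
    rule="arc_sum":  compare the summed indices of the two arcs."""
    K = len(idx)
    if rule == "neighbor":
        right = idx[(pos + 1) % K]
        left = idx[(pos - 1) % K]
    else:
        # Split the other K-1 arms into right and left arcs of equal length.
        half = (K - 1) // 2
        right = sum(idx[(pos + j) % K] for j in range(1, half + 1))
        left = sum(idx[(pos - j) % K] for j in range(1, half + 1))
    return 1 if right >= left else -1
```

Comparing indices rather than raw empirical means at least keeps some optimism in the direction choice, but neither rule resolves the pull-count problem: an arm's count still mixes "chosen as target" pulls with "passed through" pulls.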
All of the above concerns the stochastic bandit problem. But perhaps because simple methods already achieve good bounds there, and because of the MaxSNP hardness associated with switching costs, the adversarial bandit problem seems to be much more actively explored these days.