1. Introduction

1.1 The Exploration-Exploitation Dilemma

Online decision-making involves a fundamental choice:
  Exploitation: make the best decision given current information
  Exploration: gather more information
The best long-term strategy may involve short-term sacrifices
Gather enough information to make the best overall decisions

1.2 Real-Life Examples

Restaurant Selection
  Exploitation: go to your favorite restaurant
  Exploration: try a new restaurant
Online Banner Advertisements
  Exploitation: show the most successful advert
  Exploration: show a different advert
Oil Drilling
  Exploitation: drill at the best known location
  Exploration: drill at a new location
Game Playing
  Exploitation: play the move you believe is best
  Exploration: play an experimental move

1.3 Five Strategy Principles

Naive Exploration
Add noise to the greedy policy (e.g. $\epsilon$-greedy)
Optimistic Initialization
Assume the best until proven otherwise
Optimism in the Face of Uncertainty
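
The first two principles above can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not a reference implementation: the function names `epsilon_greedy` and `optimistic_estimates`, the parameter `optimistic_value`, and the uniform tie-breaking rule are all hypothetical choices made for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return an action index for the current value estimates.

    With probability epsilon a uniformly random action is chosen (exploration);
    otherwise the action with the highest estimate is chosen (exploitation).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Assumption: ties between equally good actions are broken uniformly.
    candidates = [i for i, q in enumerate(q_values) if q == best]
    return rng.choice(candidates)


def optimistic_estimates(n_actions, optimistic_value=1.0):
    """Start every estimate at a high value so each action looks worth trying.

    Assumption: optimistic_value is at least as large as any reward
    the environment can actually produce.
    """
    return [optimistic_value] * n_actions
```

With `epsilon = 0` the rule reduces to the purely greedy policy; with `epsilon = 1` it picks actions uniformly at random. Starting from `optimistic_estimates` makes even the greedy policy try every action early on, because untried actions keep their high initial estimate until real rewards pull it down.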

2. Introducing the Multi-Armed Bandit

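A row of slot machines in Las Vegas
Wikipedia explains it as follows:
The name comes from imagining a gambler at a row of slot machines (sometimes known as "one-armed bandits"), who has to decide which machines to play, how many times to play each machine and in which order to play them, and whether to continue with the current machine or try a different one. In the problem, each machine provides a random reward from a probability distribution specific to that machine. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls.^[3]^[4] The crucial tradeoff the gambler faces at each trial is between "exploitation" of the machine that has the highest expected payoff and "exploration" to get more information about the expected payoffs of the other machines.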

2.1 Maximizing Cumulative Reward and Minimizing Total Regret

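Action space and reward distribution
At time step $t$, the agent selects an action $\alpha_t \in \mathcal{A}$, and the environment generates a reward $r_t \sim \mathcal{R}^{\alpha_t}$ from the unknown distribution $\mathcal{R}^{\alpha}(r) = \mathbb{P}[r \mid \alpha]$. The action space and reward distributions are written as the tuple $\langle \mathcal{A}, \mathcal{R} \rangle$, and each concrete observation is written as $\langle \alpha_t, r_t \rangle$.
Maximizing cumulative reward
$\max \sum_{\tau=1}^{t} r_\tau$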
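
The interaction loop and the cumulative reward can be made concrete with a small simulation. This is a minimal sketch assuming Bernoulli rewards (each arm pays 1 with a fixed probability and 0 otherwise); the names `BernoulliBandit`, `success_probs`, and `run` are hypothetical and chosen only for illustration.

```python
import random


class BernoulliBandit:
    """A bandit <A, R> where action a pays 1.0 with probability success_probs[a]."""

    def __init__(self, success_probs, seed=0):
        self.success_probs = list(success_probs)  # unknown to the agent
        self.rng = random.Random(seed)            # seeded for reproducibility

    def pull(self, action):
        """Sample a reward r_t from the reward distribution of the chosen action."""
        return 1.0 if self.rng.random() < self.success_probs[action] else 0.0


def run(bandit, policy, steps):
    """Play `steps` rounds and return (cumulative reward, list of (a_t, r_t))."""
    total = 0.0
    history = []
    for t in range(steps):
        action = policy(t)             # agent chooses an action at step t
        reward = bandit.pull(action)   # environment returns a reward
        history.append((action, reward))
        total += reward                # running sum of r_1 .. r_t
    return total, history
```

For example, `run(BernoulliBandit([0.2, 0.8]), lambda t: 1, 100)` always pulls the second arm and returns the sum of its sampled rewards together with the observed pairs. Any other policy that maps a step index to an action can be plugged in the same way.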