Concepts:
Given an MDP, an agent may have access to:
- an a-priori model of the environment, which is revealed to the agent before its interaction with the environment starts.
- a generative model: given a state and an action, it outputs a reward and a random next state drawn according to the transition function.
- a declarative model: a description of the MDP that specifies the transition probabilities and the reward function for all states, successor states, and actions.
Here we distinguish learning agents (which are given no a-priori model) from planning agents (which have access to a generative or a declarative model).
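Below is a minimal Python sketch (my illustration, not code from the source) of what these two planning-model interfaces could look like for the three-armed bandit described later in this note; the names `DECLARATIVE_MODEL` and `generative_model` are assumptions introduced here.

```python
import random

# Declarative model: the full distribution over (reward, next state) is
# written down explicitly for every state-action pair.
DECLARATIVE_MODEL = {
    # (state, action) -> list of (probability, reward); the bandit has a single state.
    ("start", "a_l"): [(0.5, 20), (0.5, 0)],
    ("start", "a_m"): [(0.1, 100), (0.5, 60), (0.4, 30)],
    ("start", "a_r"): [(0.5, 70), (0.5, 30)],
}

def generative_model(state, action):
    """Generative model: can only be sampled, one (reward, next_state) at a time."""
    outcomes = DECLARATIVE_MODEL[(state, action)]
    probs = [p for p, _ in outcomes]
    rewards = [r for _, r in outcomes]
    reward = random.choices(rewards, weights=probs, k=1)[0]
    return reward, "start"  # in a bandit the state never changes
```

A learning agent gets neither the table nor the sampler a-priori; it only sees the rewards returned by the real environment during interaction.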
The multi-armed bandit (MAB) problem
Figuratively, the agent faces a slot machine (a bandit) with multiple arms, and it has to decide which arm to pull based only on the payouts it received so far. Each arm provides a random reward according to a probability distribution that is specific to the arm and constant over time, but the agent has no initial information on the distribution.
The exploration-exploitation dilemma
On the one hand, since the agent has no a-priori knowledge of the environment, it has to explore its possibilities to learn from the feedback the environment provides. On the other hand, since it aims to maximize the accumulated reward over all runs, it has an incentive to exploit the knowledge it has gathered by executing the action it believes to be best. In other words, the agent learns from its trial-and-error interaction with the environment, and it has to make sure that the balance between exploration and exploitation is such that it learns the best action without sacrificing too much reward in early decisions.
In short: the agent must explore by trying different actions, giving up some reward in order to find the best one, but it must not give up too much. It has to balance the reward sacrificed during exploration against its goal of maximizing total reward.
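To make the dilemma concrete, here is a minimal epsilon-greedy sketch; epsilon-greedy is one standard way to balance exploration and exploitation, chosen here as an illustration rather than a strategy named in the text, and the function name and parameters are my assumptions.

```python
import random

def epsilon_greedy_bandit(pull, n_arms, n_rounds=1000, epsilon=0.1):
    """Repeatedly pull arms: explore with probability epsilon, otherwise
    exploit the arm whose estimated mean reward is currently highest."""
    counts = [0] * n_arms      # pulls per arm
    values = [0.0] * n_arms    # running estimate of each arm's mean reward
    total = 0.0
    for _ in range(n_rounds):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                     # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        r = pull(arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # incremental mean update
        total += r
    return values, total
```

A small epsilon keeps most pulls on the currently best arm while still occasionally sampling the others, which is exactly the trade-off described above.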
Figure: a three-armed MAB example. Edges are labeled p:r, where p is the transition probability and r the reward.
The MAB problem has three arms a_l, a_m, a_r (the left, middle, and right arms). Pulling the left arm yields a reward of 20 or 0, each with probability 0.5. Pulling the right arm yields 70 or 30, each with probability 0.5. Pulling the middle arm yields 100 with probability 0.1, 60 with probability 0.5, and 30 with probability 0.4. The expected rewards of the three arms are therefore:
Q*(a_l) = 10, Q*(a_m) = 52, Q*(a_r) = 50
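As a quick sanity check, these expectations can be reproduced from the declarative table, and the epsilon-greedy learner can be run against the generative model to estimate them by sampling. This snippet assumes the `DECLARATIVE_MODEL`, `generative_model`, and `epsilon_greedy_bandit` sketches defined earlier in this note.

```python
# Expected reward per arm: Q*(a) = sum_i p_i * r_i over the arm's outcomes.
q_star = {
    action: sum(p * r for p, r in outcomes)
    for (_, action), outcomes in DECLARATIVE_MODEL.items()
}
print(q_star)  # {'a_l': 10.0, 'a_m': 52.0, 'a_r': 50.0}

# Let the epsilon-greedy learner estimate the same values by sampling.
arms = ["a_l", "a_m", "a_r"]
estimates, total_reward = epsilon_greedy_bandit(
    pull=lambda arm: generative_model("start", arms[arm])[0],
    n_arms=3,
)
print(estimates)  # should approach [10, 52, 50] for a long enough run
```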