The four elements of reinforcement learning
- policy: A policy is a function or rule that takes the environment state as input and outputs an action. (Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states.)
- reward: The reward is the feedback the environment gives you after an action; it depends on both the environment state and the action taken. The reward represents the immediate payoff. (On each time step, the environment sends to the reinforcement learning agent a single number, a reward. The agent’s sole objective is to maximize the total reward it receives over the long run. The reward signal thus defines what are the good and bad events for the agent.)
- value function: The value function represents long-term return. It is usually written v(s) and denotes the expected future reward for an agent starting from state s. (Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.) A state can have a low reward but a high value, because the states reachable from it may yield high rewards. An analogy: (To make a human analogy, rewards are somewhat like pleasure (if high) and pain (if low), whereas values correspond to a more refined and farsighted judgment of how pleased or displeased we are that our environment is in a particular state.) When choosing an action, prefer the one that leads to states of higher value. (We seek actions that bring about states of highest value, not highest reward, because these actions obtain the greatest amount of reward for us over the long run.) The core of reinforcement learning is estimating the value function of states.
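- model of the environment: The model simulates the environment: given the current state and the action taken, it predicts the next state and the reward the agent receives. Models are mainly used for planning. Methods that assume we know how the environment works are called model-based; the alternative is model-free. Model-free methods have no such model and must estimate values through repeated trial and error.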
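
To make the four elements concrete, here is a minimal Python sketch on a made-up three-state chain. The state names, actions, reward numbers, and the discount factor `GAMMA` are all assumptions chosen for illustration; they do not come from the text above. `MODEL` plays the role of the model of the environment, `policy` is a fixed policy, and `evaluate_policy` estimates v(s) by repeatedly applying v(s) = r + γ·v(s').

```python
# Hypothetical toy problem: "start" -> "middle" -> "goal" (terminal).
GAMMA = 0.9  # discount factor (an assumption; the notes above do not mention one)

# Model of the environment: (state, action) -> (next_state, reward).
# Deterministic for simplicity; a real model may be stochastic.
MODEL = {
    ("start", "wait"): ("start", 0.0),
    ("start", "go"): ("middle", -1.0),   # low immediate reward
    ("middle", "wait"): ("middle", 0.0),
    ("middle", "go"): ("goal", 10.0),    # high reward later on
}

def policy(state):
    """Policy: maps a state to an action. This one always moves forward."""
    return "go"

def evaluate_policy(policy, model, gamma, sweeps=100):
    """Estimate v(s) for the given policy using the known model (model-based)."""
    values = {"start": 0.0, "middle": 0.0, "goal": 0.0}
    for _ in range(sweeps):
        for state in ("start", "middle"):  # "goal" is terminal, its value stays 0
            next_state, reward = model[(state, policy(state))]
            values[state] = reward + gamma * values[next_state]
    return values

def greedy_action(state, values, model, gamma):
    """Pick the action with the best reward-plus-discounted-value, not the best reward."""
    actions = [a for (s, a) in model if s == state]

    def lookahead(action):
        next_state, reward = model[(state, action)]
        return reward + gamma * values[next_state]

    return max(actions, key=lookahead)

values = evaluate_policy(policy, MODEL, GAMMA)
print(values)
print(greedy_action("start", values, MODEL, GAMMA))
```

In this toy setup, "go" from "start" has a lower immediate reward than "wait" (-1 versus 0), yet it is the action the value-based choice prefers, because it leads toward the high-reward transition. A model-free method would not have access to `MODEL` and would have to estimate the same values from sampled transitions instead.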