
[Chapter 1] Markov Decision Process

By 超级超级小天才 | Published 2021-05-28 14:28

    Markov Decision Process

    One of the most important problems in decision making is making sequential decisions, on which the agent's utility depends. At each time step, the agent selects an action to interact with the environment, causing it to transition to a new state; at the same time, the environment returns a reward to the agent. This process can be illustrated by the following figure.

    [Figure: the agent-environment interaction loop]

    We can take the following simple game as an example, in which the agent is at the lower-left corner and can move to the other blocks.

    [Figure: a grid-world example in which the agent moves between blocks]

    In this game, suppose that we can observe the location of the agent, which can be represented by a coordinate. We call the location of the agent the state, and the set of all possible states the state space. In this case, the agent always knows where it is, so we call this a fully observable environment. At each time step, the agent can choose to move in one of four directions: [UP, DOWN, LEFT, RIGHT]. These are the actions, and their set is called the action space, which contains all the possible actions the agent can perform. In this example, each state has a corresponding reward, so the reward is a function of the state, R(s), or sometimes more generally R(s,a,s'). When the agent performs an action to interact with the environment, the state transitions from one to another. As you can see here, the probability that the agent moves in the intended direction is 0.8, and it moves to the left or to the right of that direction with probability 0.1 each. The outcomes of actions are therefore stochastic, so we define the transition function P(s'|s,a) as the probability of reaching state s' when action a is taken in state s. If every state-action pair leads to a single successor state s' with P(s'|s,a)=1, we call the environment deterministic.

    Given this example, we can give the definition of the Markov Decision Process (MDP), which is a sequential decision problem model for a fully observable, stochastic environment with Markovian transitions and additive rewards.

    An MDP model can be defined by a tuple: (S,A,T,R), where:

    • S: state space
    • A: action space
    • T: transition model (transition function), defined by P(s'|s,a)
    • R: reward function, defined by R(s,a,s')
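
    To make the tuple concrete, here is a minimal Python sketch of a grid-world MDP like the one above. The grid size, the stay-in-place behaviour at the grid boundary, and the constant step reward are illustrative assumptions; only the four actions and the 0.8/0.1/0.1 transition probabilities come from the example.

```python
from itertools import product

# A sketch of the grid-world MDP as a tuple (S, A, T, R).
# Grid size, boundary behaviour, and the step reward are assumptions.

WIDTH, HEIGHT = 4, 3                                    # hypothetical grid dimensions
S = list(product(range(WIDTH), range(HEIGHT)))          # state space: (x, y) coordinates
A = ["UP", "DOWN", "LEFT", "RIGHT"]                     # action space

MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
PERPENDICULAR = {"UP": ["LEFT", "RIGHT"], "DOWN": ["LEFT", "RIGHT"],
                 "LEFT": ["UP", "DOWN"], "RIGHT": ["UP", "DOWN"]}

def move(s, direction):
    """Apply a move deterministically; stay in place when it would leave the grid."""
    x, y = s
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    return nxt if nxt in S else s

def T(s, a):
    """Transition function P(s'|s, a): probability 0.8 for the intended direction,
    0.1 for each of the two perpendicular directions."""
    probs = {}
    for direction, p in [(a, 0.8)] + [(d, 0.1) for d in PERPENDICULAR[a]]:
        s_next = move(s, direction)
        probs[s_next] = probs.get(s_next, 0.0) + p
    return probs

def R(s):
    """Reward as a function of the state; the constant -0.04 is a placeholder value."""
    return -0.04

mdp = (S, A, T, R)
```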

    Policy and Value

    How does an agent make a sequence of decisions? This can be intuitive: we use a policy function to represent the agent's decision in each possible state. That means the policy is a function from states to actions, which can be represented as a=\pi(s): for every state s, it outputs an appropriate action a.
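
    As a minimal sketch (with made-up coordinate states and actions, not the exact grid above), a policy over a finite state space can be stored simply as a table from states to actions:

```python
# A policy maps every state to an action, a = pi(s).
# The coordinates and actions below are made-up placeholder entries.

policy_table = {
    (0, 0): "UP",
    (0, 1): "UP",
    (0, 2): "RIGHT",
    # ...one entry for every state in the state space
}

def pi(s):
    """Look up the action the policy chooses in state s."""
    return policy_table[s]

print(pi((0, 0)))   # -> UP
```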

    Then you may ask: how do we evaluate a policy? Since the key to our problem is making a sequence of actions, the quality of a policy should be measured by the expected return: the expected utility of the possible state sequences generated by the policy. In other words, the value of a policy \pi at a state s, U^{\pi}(s) (sometimes also written V^{\pi}(s)), is the expected cumulative utility of executing \pi starting from s.

    Infinite Horizon and Discounting

    There are mainly two kinds of problems with respect to the horizon.

    One is called the finite horizon problem, in which there is a fixed number of time steps N, and after N steps nothing matters anymore. In this case, the value/return is usually the sum of the rewards over the sequence:

    U_h([s_0,s_1,…,s_N])=R(s_0)+R(s_1)+...+R(s_N)

    The finite horizon problem is usually more difficult, because the optimal action in a given state can change depending on how many steps remain to be executed, so the optimal policy is nonstationary.

    The infinite horizon problem, which has no fixed limit on the number of time steps, is more common, and we will focus on this kind of problem. For an infinite horizon problem, the optimal action depends only on the current state, so the optimal policy is stationary. However, it can be difficult to compute the utilities, since they are infinite accumulations. To solve this problem, we introduce the discounted reward:

    U_h([s_0,s_1,…])=R(s_0)+{\gamma}R(s_1)+{\gamma}^2 R(s_2)+...=\sum_{t=0}^{\infty}{{\gamma}^t R(s_t)}

    where \gamma is a discount factor between 0 and 1. This is based on the idea that states closer to the current one are more important, so they receive a higher weight in the accumulation. What's more, with discounted rewards, \gamma<1, and rewards bounded by \pm R_{max}, the utility is always finite:

    U_h([s_0,s_1,…])=\sum_{t=0}^{\infty}{\gamma^t R(s_t)} \leq \sum_{t=0}^{\infty}{\gamma^t R_{max}}=\frac{R_{max}}{1-\gamma}
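
    As a quick numerical check of this bound, here is a small sketch with assumed values \gamma=0.9 and R_{max}=1; the truncated worst-case discounted sum stays below and approaches R_{max}/(1-\gamma)=10.

```python
# Numerical check of the discounting bound, assuming gamma = 0.9 and R_max = 1.

gamma = 0.9
R_max = 1.0

# Truncated worst case: every reward equals R_max.
truncated_sum = sum(gamma**t * R_max for t in range(1000))

bound = R_max / (1 - gamma)      # geometric series limit

print(truncated_sum)             # ~10.0 (approaches the bound)
print(bound)                     # 10.0
```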

    Suppose that the sequence of states [s_0,s_1,…] is determined by a policy {\pi} and the current state s; then the expected utility of executing {\pi} starting from s is:

    U^{\pi}(s)=E[\sum_{t=0}^{\infty}{{\gamma}^t R(s_t)}]

    where s_0=s and the expectation is taken with respect to the distribution of state sequences determined by the policy {\pi}. Obviously, the optimal policy is the policy with the maximum expected utility:

    {\pi}_s^*=\arg\max_{\pi} U^{\pi}(s)
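
    In practice, U^{\pi}(s) can be estimated directly from its definition by sampling state sequences under {\pi}. The sketch below does this on an invented two-state toy chain; its states, transition probabilities, and rewards are placeholders, not the grid world above.

```python
import random

def estimate_utility(s0, pi, sample_next, R, gamma=0.9, episodes=10_000, horizon=200):
    """Monte Carlo estimate of U^pi(s0): average the truncated discounted return
    over state sequences sampled by following the policy pi."""
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):              # truncate the infinite sum
            ret += discount * R(s)
            s = sample_next(s, pi(s))
            discount *= gamma
        total += ret
    return total / episodes

# Toy chain: from state "a", the only action reaches "b" with probability 0.8,
# otherwise the agent stays in "a"; "b" is absorbing.
def sample_next(s, a):
    if s == "a":
        return "b" if random.random() < 0.8 else "a"
    return "b"

pi = lambda s: "go"                       # a single-action policy
R = lambda s: 1.0 if s == "b" else 0.0    # reward 1 only in state "b"

print(estimate_utility("a", pi, sample_next, R))
```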

    Of course, we want to find the optimal policy, so we need to calculate the maximum utility for each state. So here is another definition: the value function, which is the value of an optimal policy, as a function of the state:

    V(s)=U^{{\pi}^*}(s)

    Suppose we know the value function exactly, namely we know the maximum value of every state (for example, as shown in the following figure); then it is very easy to compute the optimal policy for all states:

    {\pi}^*(s)=\arg\max_{a\in A(s)}{\sum_{s'}{P(s'|s,a)V(s')}}

    [Figure: the value of each state in the grid world]

    This suggests that the most important thing we need to do is to calculate (or approximate) the value function.
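
    A minimal sketch of this policy-extraction step, using an invented two-state transition model and an assumed value function (placeholders, not the grid world above), could look like this:

```python
def extract_policy(states, actions, T, V):
    """Greedy policy from a known value function:
    pi*(s) = argmax_a sum_{s'} P(s'|s, a) * V(s')."""
    return {s: max(actions,
                   key=lambda a: sum(p * V[s_next] for s_next, p in T(s, a).items()))
            for s in states}

# Invented two-state example.
A = ["left", "right"]

def T(s, a):
    """P(s'|s, a) as a dict of next-state probabilities (toy numbers)."""
    return {"s1": 0.8, "s0": 0.2} if a == "right" else {"s0": 1.0}

V = {"s0": 0.0, "s1": 1.0}    # assumed value function

print(extract_policy(["s0", "s1"], A, T, V))   # {'s0': 'right', 's1': 'right'}
```

    Note that the extraction step itself only needs the transition model and the values, which is why approximating the value function is the central task.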
