
[Chapter 1] Markov Decision Process

By 超级超级小天才 | Published 2021-05-28 14:28

    Markov Decision Process

    One of the most important problems in decision making is making sequential decisions, on which the agent's utility depends. At each time step, the agent selects an action to interact with the environment, causing it to transition to a new state; at the same time, the environment returns a reward to the agent. This process can be illustrated by the following figure.

    [Figure: the agent-environment interaction loop]

    We can take the following simple game as an example, in which the agent is at the lower-left corner and can move to the other blocks.

    [Figure: a grid-world example in which the agent moves between blocks]

    In this game, suppose that we can observe the location of the agent, which can be represented by a coordinate. We call the location of the agent the state, and the set of all possible states the state space. In this case, the agent always knows where it is, so we call this a fully observable environment. At each time step, the agent can choose to move in one of four directions: [UP, DOWN, LEFT, RIGHT]. These are the actions, and their set is called the action space, which contains all the possible actions the agent can perform. In this example, each state has a corresponding reward, so the reward is a function of the state, R(s), or sometimes more generally R(s,a,s'). When the agent performs an action to interact with the environment, the state transitions from one to another. As you can see here, the probability that the agent moves in the intended direction is 0.8, and it moves to the left or to the right of that direction with probability 0.1 each. The outcomes of actions are therefore stochastic, so we define the transition function P(s'|s,a) as the probability of reaching state s' when action a is taken in state s. If every state-action pair leads to a single successor state s' with P(s'|s,a)=1, we call the environment deterministic.

    Given this example, we can give the definition of the Markov Decision Process (MDP), which is a sequential decision problem model for a fully observable, stochastic environment with Markovian transitions and additive rewards.

    An MDP model can be defined by a tuple: (S,A,T,R), where:

    • S: state space
    • A: action space
    • T: transition model (transition function), defined by P(s'|s,a)
    • R: reward function, defined by R(s,a,s')
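
    To make the tuple concrete, here is a minimal Python sketch of a grid-world MDP like the one above. The grid size, the stay-in-place behaviour at the grid boundary, and the constant step reward are illustrative assumptions; only the four actions and the 0.8/0.1/0.1 transition probabilities come from the example.

```python
from itertools import product

# A sketch of the grid-world MDP as a tuple (S, A, T, R).
# Grid size, boundary behaviour, and the step reward are assumptions.

WIDTH, HEIGHT = 4, 3                                    # hypothetical grid dimensions
S = list(product(range(WIDTH), range(HEIGHT)))          # state space: (x, y) coordinates
A = ["UP", "DOWN", "LEFT", "RIGHT"]                     # action space

MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
PERPENDICULAR = {"UP": ["LEFT", "RIGHT"], "DOWN": ["LEFT", "RIGHT"],
                 "LEFT": ["UP", "DOWN"], "RIGHT": ["UP", "DOWN"]}

def move(s, direction):
    """Apply a move deterministically; stay in place when it would leave the grid."""
    x, y = s
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    return nxt if nxt in S else s

def T(s, a):
    """Transition function P(s'|s, a): probability 0.8 for the intended direction,
    0.1 for each of the two perpendicular directions."""
    probs = {}
    for direction, p in [(a, 0.8)] + [(d, 0.1) for d in PERPENDICULAR[a]]:
        s_next = move(s, direction)
        probs[s_next] = probs.get(s_next, 0.0) + p
    return probs

def R(s):
    """Reward as a function of the state; the constant -0.04 is a placeholder value."""
    return -0.04

mdp = (S, A, T, R)
```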

    Policy and Value

    How does an agent make a sequence of decisions? This can be intuitive: we use a policy function to represent the agent's decision in each possible state. That means the policy is a function from states to actions, which can be represented as a=\pi(s): for every state s, it outputs an appropriate action a.
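
    As a minimal sketch (with made-up coordinate states and actions, not the exact grid above), a policy over a finite state space can be stored simply as a table from states to actions:

```python
# A policy maps every state to an action, a = pi(s).
# The coordinates and actions below are made-up placeholder entries.

policy_table = {
    (0, 0): "UP",
    (0, 1): "UP",
    (0, 2): "RIGHT",
    # ...one entry for every state in the state space
}

def pi(s):
    """Look up the action the policy chooses in state s."""
    return policy_table[s]

print(pi((0, 0)))   # -> UP
```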

    Then you may ask: how do we evaluate a policy? Since the key to our problem is making a sequence of actions, the quality of a policy should be measured by the expected return: the expected utility of the possible state sequences generated by the policy. In other words, the value of a policy \pi at a state s, U^{\pi}(s) (sometimes also written V^{\pi}(s)), is the expected cumulative utility of executing \pi starting from s.

    Infinite Horizon and Discounting

    There are mainly two kinds of problems with respect to the horizon.

    One is called the finite horizon problem, in which there is a fixed number of time steps N, and after N steps nothing matters anymore. In this case, the value/return is usually the sum of the rewards over the sequence:

    U_h([s_0,s_1,…,s_N])=R(s_0)+R(s_1)+...+R(s_N)

    The finite horizon problem is usually more difficult, because the optimal action in a given state can change depending on how many steps remain to be executed, so the optimal policy is nonstationary.

    The infinite horizon problem, which has no fixed limit on the number of time steps, is more common, and we will focus on this kind of problem. For an infinite horizon problem, the optimal action depends only on the current state, so the optimal policy is stationary. However, it can be difficult to compute the utilities, since they are infinite accumulations. To solve this problem, we introduce the discounted reward:

    U_h([s_0,s_1,…])=R(s_0)+{\gamma}R(s_1)+{\gamma}^2 R(s_2)+...=\sum_{t=0}^{\infty}{{\gamma}^t R(s_t)}

    where \gamma is a discount factor between 0 and 1. This is based on the idea that states closer to the current one are more important, so they receive a higher weight in the accumulation. What's more, with discounted rewards, \gamma<1, and rewards bounded by \pm R_{max}, the utility is always finite:

    U_h([s_0,s_1,…])=\sum_{t=0}^{\infty}{\gamma^t R(s_t)} \leq \sum_{t=0}^{\infty}{\gamma^t R_{max}}=\frac{R_{max}}{1-\gamma}
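
    As a quick numerical check of this bound, here is a small sketch with assumed values \gamma=0.9 and R_{max}=1; the truncated worst-case discounted sum stays below and approaches R_{max}/(1-\gamma)=10.

```python
# Numerical check of the discounting bound, assuming gamma = 0.9 and R_max = 1.

gamma = 0.9
R_max = 1.0

# Truncated worst case: every reward equals R_max.
truncated_sum = sum(gamma**t * R_max for t in range(1000))

bound = R_max / (1 - gamma)      # geometric series limit

print(truncated_sum)             # ~10.0 (approaches the bound)
print(bound)                     # 10.0
```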

    Suppose that the sequence of states [s_0,s_1,…] is determined by a policy {\pi} and the current state s; then the expected utility of executing {\pi} starting from s is:

    U^{\pi}(s)=E[\sum_{t=0}^{\infty}{{\gamma}^t R(s_t)}]

    where s_0=s and the expectation is taken with respect to the distribution of state sequences determined by the policy {\pi}. Obviously, the optimal policy is the policy with the maximum expected utility:

    {\pi}_s^*=\arg\max_{\pi} U^{\pi}(s)
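
    In practice, U^{\pi}(s) can be estimated directly from its definition by sampling state sequences under {\pi}. The sketch below does this on an invented two-state toy chain; its states, transition probabilities, and rewards are placeholders, not the grid world above.

```python
import random

def estimate_utility(s0, pi, sample_next, R, gamma=0.9, episodes=10_000, horizon=200):
    """Monte Carlo estimate of U^pi(s0): average the truncated discounted return
    over state sequences sampled by following the policy pi."""
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):              # truncate the infinite sum
            ret += discount * R(s)
            s = sample_next(s, pi(s))
            discount *= gamma
        total += ret
    return total / episodes

# Toy chain: from state "a", the only action reaches "b" with probability 0.8,
# otherwise the agent stays in "a"; "b" is absorbing.
def sample_next(s, a):
    if s == "a":
        return "b" if random.random() < 0.8 else "a"
    return "b"

pi = lambda s: "go"                       # a single-action policy
R = lambda s: 1.0 if s == "b" else 0.0    # reward 1 only in state "b"

print(estimate_utility("a", pi, sample_next, R))
```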

    Of course, we want to find the optimal policy, so we need to calculate the maximum utility for each state. So here is another definition: the value function, which is the value of an optimal policy, as a function of the state:

    V(s)=U^{{\pi}^*}(s)

    Suppose we know the value function exactly, namely we know the maximum value of every state (for example, as shown in the following figure); then it is very easy to compute the optimal policy for all states:

    {\pi}^*(s)=\arg\max_{a\in A(s)}{\sum_{s'}{P(s'|s,a)V(s')}}

    [Figure: the value of each state in the grid world]

    This suggests that the most important thing we need to do is to calculate (or approximate) the value function.
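
    A minimal sketch of this policy-extraction step, using an invented two-state transition model and an assumed value function (placeholders, not the grid world above), could look like this:

```python
def extract_policy(states, actions, T, V):
    """Greedy policy from a known value function:
    pi*(s) = argmax_a sum_{s'} P(s'|s, a) * V(s')."""
    return {s: max(actions,
                   key=lambda a: sum(p * V[s_next] for s_next, p in T(s, a).items()))
            for s in states}

# Invented two-state example.
A = ["left", "right"]

def T(s, a):
    """P(s'|s, a) as a dict of next-state probabilities (toy numbers)."""
    return {"s1": 0.8, "s0": 0.2} if a == "right" else {"s0": 1.0}

V = {"s0": 0.0, "s1": 1.0}    # assumed value function

print(extract_policy(["s0", "s1"], A, T, V))   # {'s0': 'right', 's1': 'right'}
```

    Note that the extraction step itself only needs the transition model and the values, which is why approximating the value function is the central task.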
