
    Chapter 3: Finite Markov Decision Processes

    Basic Definitions

    The MDP is the most basic formulation of a sequential decision process under the assumption of the Markov property.

    1. State: The state must include information about all aspects of the past agent-environment interaction that make a difference for the future.
    2. Action
    3. Reward: The reward defines what we want to achieve instead of how we want to achieve it.
    4. Dynamics: p(s', r | s, a), the probability of transitioning to state s' and receiving reward r after taking action a in state s
    5. Return: Return is defined as some function of the reward sequence
      For episodic tasks, we have G_t = R_{t+1} + \cdots + R_T
      For continuing tasks, we have G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
      They can be unified under the same framework as G_t = \sum_{k=0}^\infty \gamma^{k} R_{t+k+1} by adding an absorbing state with zero reward at the end of each episodic task
      The recursive form of the return is G_t = R_{t+1} + \gamma G_{t+1}, which forms the basis of the Bellman equations (see the sketch after this list)
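
    As a quick illustration of this recursive form, here is a minimal sketch (a hypothetical helper, not code from the book) that computes the returns of a finished episode backwards from its recorded reward sequence:

```python
def discounted_returns(rewards, gamma):
    """Compute G_t for every time step of one episode from its reward
    sequence, using the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0  # the return from the terminal (absorbing) state is zero
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # rewards[t] plays the role of R_{t+1}
        returns[t] = g
    return returns

# Example: rewards R_1, R_2, R_3 received during a 3-step episode
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # approximately [2.62, 1.8, 2.0]
```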

    Further notes:

    1. In the RL book, the reward obtained by taking action A_t in state S_t at time step t is denoted R_{t+1} rather than R_t, since it arrives together with the next state S_{t+1};
    2. RL beyond the MDP assumption is an important research topic (also discussed in the RL book)
    3. The representation of states and actions has a great influence on the learning process, but is beyond the scope of the RL book (many recent works focus on exactly this topic)
    4. The RL book focuses on a scalar reward signal, but some recent works study multi-objective reward signals in vector form

    Policies and Value Functions

    A policy is a mapping from states to the probabilities of selecting each possible action
    A value function gives the expected return when starting from a state or a state-action pair and following a particular policy thereafter
    Value functions are therefore defined w.r.t. particular policies, i.e., v_\pi (s) = \mathbb{E}_\pi [G_t | S_t = s], \quad q_\pi (s, a) = \mathbb{E}_\pi [G_t | S_t = s, A_t = a]
    Based on the simple relationships
    v_\pi (s) = \sum_a \pi (a | s) q_\pi (s, a), \quad q_\pi (s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_\pi (s') \big],
    we can derive the Bellman equations, which express the relationship between the value of a state (or state-action pair) and the values of its successor states (or state-action pairs):
    v_\pi (s) = \sum_a \pi (a | s) \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_\pi (s') \big]
    q_\pi (s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma \sum_{a'} \pi (a' | s') q_\pi (s', a') \big]
    The value function v_\pi is the unique solution of its Bellman equation, and can be computed exactly by solving a system of |\mathcal{S}| linear equations (one per state). Notice that this assumes the dynamics p(s', r | s, a) are known.
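
    As a minimal sketch of this point (assuming a small tabular MDP where the hypothetical array P[s, a, s'] holds p(s' | s, a), i.e., the dynamics marginalized over rewards, and R[s, a] holds the expected immediate reward), the Bellman equation for v_\pi can be solved exactly as a linear system:

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma):
    """Exact policy evaluation by solving the Bellman equation
    (I - gamma * P_pi) v = r_pi, one linear equation per state."""
    n_states = P.shape[0]
    # Policy-averaged dynamics: P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a)
    P_pi = np.einsum('sa,sax->sx', pi, P)
    # Policy-averaged expected reward: r_pi[s] = sum_a pi(a|s) R[s, a]
    r_pi = np.einsum('sa,sa->s', pi, R)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```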
    Another useful tool for visualizing these recursive relationships is the backup diagram.

    Optimal Policies and Optimal Value Functions

    Definition of a "better" policy: \pi \geq \pi' if and only if v_\pi (s) \geq v_{\pi'} (s) for all s \in \mathcal{S}.
    For finite MDPs there always exist optimal value functions v_*(s) and q_*(s, a) and corresponding optimal policies (potentially more than one). Intuitively, if a policy is not optimal, we can always improve the value of some state s by changing the policy at that specific state. The improvement in the value of s then propagates back to the values of all states that can reach s in the state-transition graph. In this way, we can always obtain a better policy and gradually reach an optimal policy.
    Based on the following simple relations,
    v_\pi (s) = \sum_a \pi (a | s) q_\pi (s, a) \rightarrow v_*(s) = \max_{a \in \mathcal{A}(s)} q_*(s, a)
    q_\pi (s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_\pi (s') \big] \rightarrow q_*(s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_*(s') \big],
    we obtain the Bellman optimality equations, which make no reference to any specific policy:
    v_*(s) = \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_*(s') \big]
    q_*(s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma \max_{a'} q_*(s', a') \big]
    Given v_*, an optimal policy can be derived by a greedy one-step lookahead over the state values (which requires the dynamics); given q_*, simply picking \arg\max_a q_*(s, a) in each state is optimal without any model.
    Solving the Bellman optimality equation exactly requires solving a system of |\mathcal{S}| nonlinear equations, under the assumptions of fully known system dynamics and the Markov property. Even when these two assumptions are satisfied, solving the equations is computationally infeasible when the state space is very large. Consequently, different RL methods mainly focus on how to solve the Bellman optimality equation approximately.
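
    As one concrete (still model-based) way to iterate towards a solution, the Bellman optimality backup can be applied repeatedly until the values stop changing; a minimal value-iteration sketch under the same hypothetical tabular setup as above:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate the Bellman optimality backup until the value function
    stops changing, then act greedily with respect to q_*."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    while True:
        # q[s, a] = R[s, a] + gamma * sum_{s'} p(s'|s, a) v(s')
        q = R + gamma * np.einsum('sax,x->sa', P, v)
        v_new = q.max(axis=1)  # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)  # optimal values and a greedy policy
        v = v_new
```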

    Further notes:
    The MDP formulation of RL makes it closely related to (stochastic) optimal control.

    Reinforcement learning adds to MDPs a focus on approximation and incomplete information for realistically large problems.

    The online nature of reinforcement learning makes it possible to approximate optimal policies in ways that put more effort into learning to make good decisions for frequently encountered states, at the expense of less effort for infrequently encountered states.
