
Lecture 2: Markov Decision Processes

By 魏鹏飞 | Published 2020-03-23 16:50

    Author: David Silver

    Outline

    1. Markov Processes
    2. Markov Reward Processes
    3. Markov Decision Processes
    4. Extensions to MDPs

    Introduction to MDPs

    • Markov decision processes formally describe an environment for reinforcement learning
    • Where the environment is fully observable
    • i.e. The current state completely characterises the process
    • Almost all RL problems can be formalised as MDPs, e.g.
      • Optimal control primarily deals with continuous MDPs
      • Partially observable problems can be converted into MDPs
      • Bandits are MDPs with one state

    Markov Property

    “The future is independent of the past given the present”

    • The state captures all relevant information from the history
    • Once the state is known, the history may be thrown away
    • i.e. The state is a sufficient statistic of the future

    State Transition Matrix

    For a Markov state s and successor state s', the state transition probability is defined by

    P_{ss'}=P[S_{t+1}=s'|S_t=s]

    State transition matrix P defines transition probabilities from all states s to all successor states s', where each row of the matrix sums to 1.
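
    Written out for n states (rows indexed by the current state s, columns by the successor state s'):

    P=\begin{bmatrix} P_{11} & \dots & P_{1n}\\ \vdots & & \vdots\\ P_{n1} & \dots & P_{nn} \end{bmatrix}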

    Markov Process

    A Markov process is a memoryless random process, i.e. a sequence of random states S_1, S_2, ... with the Markov property.

    Example: Student Markov Chain

    Example: Student Markov Chain Episodes

    Example: Student Markov Chain Transition Matrix

    Markov Reward Process

    A Markov reward process is a Markov chain with values.

    Example: Student MRP

    Return

    • The discount \gamma \in [0, 1] is the present value of future rewards
    • The value of receiving reward R after k + 1 time-steps is \gamma^k R.
    • This values immediate reward above delayed reward.
      • \gamma close to 0 leads to “myopic” evaluation
      • \gamma close to 1 leads to “far-sighted” evaluation
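
    The quantity being discounted is the return G_t=R_{t+1}+\gamma R_{t+2}+\dots=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}. A minimal Python sketch of this sum for a finite episode (the reward sequence below is illustrative, not taken from the Student MRP):

    # Discounted return G_t = R_{t+1} + gamma*R_{t+2} + ... for a finite episode.
    def discounted_return(rewards, gamma):
        g = 0.0
        for k, r in enumerate(rewards):  # r plays the role of R_{t+k+1}
            g += (gamma ** k) * r
        return g

    print(discounted_return([-2, -2, -2, 10], gamma=0.9))  # return of this sample reward sequence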

    Why discount?

    Most Markov reward and decision processes are discounted. Why?

    • Mathematically convenient to discount rewards
    • Avoids infinite returns in cyclic Markov processes
    • Uncertainty about the future may not be fully represented
    • If the reward is financial, immediate rewards may earn more interest than delayed rewards
    • Animal/human behaviour shows preference for immediate reward
    • It is sometimes possible to use undiscounted Markov reward processes (i.e. \gamma = 1), e.g. if all sequences terminate.

    Value Function

    The value function v(s) gives the long-term value of state s.
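
    Formally, v(s) is the expected return starting from state s:

    v(s)=E[G_t|S_t=s]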

    Example: Student MRP Returns

    Example: State-Value Function for Student MRP (1)

    Example: State-Value Function for Student MRP (2)

    Example: State-Value Function for Student MRP (3)

    Bellman Equation for MRPs

    The value function can be decomposed into two parts:

    • immediate reward R_{t+1}
    • discounted value of successor state \gamma v(S_{t+1})
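
    Taking expectations of this decomposition gives the Bellman equation for MRPs:

    v(s)=E[R_{t+1}+\gamma v(S_{t+1})|S_t=s]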

    Bellman Equation for MRPs (2)

    Example: Bellman Equation for Student MRP

    Bellman Equation in Matrix Form

    The Bellman equation can be expressed concisely using matrices,

    v = R+\gamma Pv

    where v is a column vector with one entry per state

    \begin{bmatrix} v(1) \\ \vdots\\ v(n) \end{bmatrix}=\begin{bmatrix} R_1 \\ \vdots\\ R_n \end{bmatrix} + \gamma \begin{bmatrix} P_{11} & \dots & P_{1n}\\ \vdots & & \vdots\\ P_{n1} & \dots & P_{nn} \end{bmatrix}\begin{bmatrix} v(1) \\ \vdots\\ v(n) \end{bmatrix}

    Solving the Bellman Equation

    • The Bellman equation is a linear equation
    • It can be solved directly:

    v=R+\gamma Pv
    (I-\gamma P)v=R
    v=(I-\gamma P)^{-1}R

    • Computational complexity is O(n^3) for n states
    • Direct solution only possible for small MRPs
    • There are many iterative methods for large MRPs, e.g.
      • Dynamic programming
      • Monte-Carlo evaluation
      • Temporal-Difference learning
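
    A minimal numpy sketch of the direct solution above; the 3-state transition matrix, rewards and discount are illustrative, not the Student MRP:

    import numpy as np

    # Illustrative 3-state MRP (not the Student MRP from the lecture).
    P = np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0]])   # each row sums to 1
    R = np.array([-1.0, -2.0, 0.0])   # expected immediate reward per state
    gamma = 0.9

    # Direct solution v = (I - gamma*P)^{-1} R
    v = np.linalg.solve(np.eye(3) - gamma * P, R)
    print(v)

    np.linalg.solve avoids forming the inverse explicitly, but it is still the same O(n^3) direct solution, so this only scales to small MRPs.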

    Markov Decision Process

    A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.

    Example: Student MDP

    Policies (1)

    • A policy fully defines the behaviour of an agent
    • MDP policies depend on the current state (not the history)
    • i.e. Policies are stationary (time-independent), A_t \sim \pi(\cdot|S_t),\forall t>0

    Policies (2)

    • Given an MDP M = <S,A,P,R,\gamma> and a policy \pi
    • The state sequence S_1, S_2, ... is a Markov process <S, P^{\pi}>
    • The state and reward sequence S_1, R_2, S_2, ... is a Markov reward process <S, P^{\pi}, R^{\pi}, \gamma>
    • where

    P^{\pi}_{s,s'}=\sum_{a\in A}\pi(a|s)P_{ss'}^a
    R_s^{\pi}=\sum_{a\in A}\pi(a|s)R_s^a
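
    A small numpy sketch of averaging the dynamics and rewards over a policy as above; the 2-state, 2-action numbers are made up for illustration:

    import numpy as np

    # Illustrative 2-state, 2-action MDP (all numbers are made up).
    # P_a[a, s, t] = P^a_{s s'},  R_a[a, s] = R^a_s,  pi[s, a] = pi(a|s)
    P_a = np.array([[[0.8, 0.2],
                     [0.1, 0.9]],
                    [[0.3, 0.7],
                     [0.6, 0.4]]])
    R_a = np.array([[ 1.0, 0.0],
                    [-1.0, 2.0]])
    pi = np.array([[0.5, 0.5],
                   [0.9, 0.1]])

    # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'},  R^pi_s = sum_a pi(a|s) R^a_s
    P_pi = np.einsum('sa,ast->st', pi, P_a)
    R_pi = np.einsum('sa,as->s', pi, R_a)
    print(P_pi, R_pi, sep="\n")

    The rows of P_pi again sum to 1, so <S, P^{\pi}, R^{\pi}, \gamma> is a valid Markov reward process.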

    Value Function
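
    For an MDP and a policy \pi, the state-value function and action-value function are the expected returns when following \pi:

    v_{\pi}(s)=E_{\pi}[G_t|S_t=s]
    q_{\pi}(s,a)=E_{\pi}[G_t|S_t=s,A_t=a]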

    Example: State-Value Function for Student MDP

    Bellman Expectation Equation

    The state-value function can again be decomposed into immediate reward plus discounted value of successor state,

    v_{\pi}(s)=E_{\pi}[R_{t+1}+\gamma v_{\pi}(S_{t+1})|S_t=s]

    The action-value function can similarly be decomposed,

    q_{\pi}(s,a)=E_{\pi}[R_{t+1}+\gamma q_{\pi}(S_{t+1},A_{t+1})|S_t=s,A_t=a]

    Bellman Expectation Equation for V^{\pi}

    Bellman Expectation Equation for Q^{\pi}

    Bellman Expectation Equation for v_{\pi} (2)

    Bellman Expectation Equation for q_{\pi} (2)
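
    Written out, these equations relate v_{\pi} and q_{\pi} through a one-step lookahead:

    v_{\pi}(s)=\sum_{a\in A}\pi(a|s)q_{\pi}(s,a)
    q_{\pi}(s,a)=R_s^a+\gamma\sum_{s'\in S}P_{ss'}^a v_{\pi}(s')
    v_{\pi}(s)=\sum_{a\in A}\pi(a|s)\left(R_s^a+\gamma\sum_{s'\in S}P_{ss'}^a v_{\pi}(s')\right)
    q_{\pi}(s,a)=R_s^a+\gamma\sum_{s'\in S}P_{ss'}^a\sum_{a'\in A}\pi(a'|s')q_{\pi}(s',a')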

    Example: Bellman Expectation Equation in Student MDP

    Bellman Expectation Equation (Matrix Form)

    The Bellman expectation equation can be expressed concisely using the induced MRP,

    v_{\pi}=R^{\pi}+\gamma P^{\pi}v_{\pi}

    with direct solution

    v_{\pi}=(I-\gamma P^{\pi})^{-1}R^{\pi}
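
    Continuing the numpy sketch from Policies (2), with P_pi and R_pi computed there and an assumed discount:

    gamma = 0.9
    v_pi = np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)  # v_pi = (I - gamma*P^pi)^{-1} R^pi
    print(v_pi)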

    Optimal Value Function

    • The optimal value function specifies the best possible performance in the MDP.
    • An MDP is “solved” when we know the optimal value function (both v_{*} and q_{*}).
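
    Formally, the optimal value functions are the best achievable over all policies:

    v_{*}(s)=\max_{\pi}v_{\pi}(s)
    q_{*}(s,a)=\max_{\pi}q_{\pi}(s,a)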

    Example: Optimal Value Function for Student MDP

    Example: Optimal Action-Value Function for Student MDP

    Optimal Policy

    Define a partial ordering over policies:

    \pi \geq \pi' \mbox{ if } v_{\pi}(s)\geq v_{\pi'}(s), \forall s

    Finding an Optimal Policy

    An optimal policy can be found by maximising over q_*(s,a),

    \pi_{*}(a|s) = \begin{cases} 1, & \mbox{if }a=\arg\max_{a\in A}q_{*}(s,a) \\ 0, & \mbox{otherwise} \end{cases}

    • There is always a deterministic optimal policy for any MDP
    • If we know q_*(s,a), we immediately have the optimal policy
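
    A minimal sketch of reading off this greedy policy from a tabular q_*; the array of action-values is an illustrative placeholder:

    import numpy as np

    # q_star[s, a]: illustrative optimal action-values for 3 states and 2 actions.
    q_star = np.array([[1.0, 0.5],
                       [0.2, 0.7],
                       [0.0, 0.0]])

    greedy_actions = q_star.argmax(axis=1)              # one maximising action per state
    pi_star = np.eye(q_star.shape[1])[greedy_actions]   # pi_*(a|s) as a one-hot table (rows = states)
    print(greedy_actions, pi_star, sep="\n")

    Ties in the argmax can be broken arbitrarily; any greedy choice is optimal.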

    Example: Optimal Policy for Student MDP

    Bellman Optimality Equation for v_*

    Bellman Optimality Equation for Q_*

    Bellman Optimality Equation for V^* (2)

    Bellman Optimality Equation for Q^* (2)
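
    Written out, the Bellman optimality equations are:

    v_*(s)=\max_{a}q_*(s,a)
    q_*(s,a)=R_s^a+\gamma\sum_{s'\in S}P_{ss'}^a v_*(s')
    v_*(s)=\max_{a}\left(R_s^a+\gamma\sum_{s'\in S}P_{ss'}^a v_*(s')\right)
    q_*(s,a)=R_s^a+\gamma\sum_{s'\in S}P_{ss'}^a\max_{a'}q_*(s',a')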

    Example: Bellman Optimality Equation in Student MDP

    Solving the Bellman Optimality Equation

    • Bellman Optimality Equation is non-linear
    • No closed form solution (in general)
    • Many iterative solution methods
      • Value Iteration
      • Policy Iteration
      • Q-learning
      • Sarsa
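
    As a sketch of the first method listed above, minimal value iteration on a made-up 2-state, 2-action MDP (same array layout as the sketch in Policies (2)):

    import numpy as np

    # Illustrative dynamics and rewards: P_a[a, s, t] = P^a_{s s'},  R_a[a, s] = R^a_s
    P_a = np.array([[[0.8, 0.2],
                     [0.1, 0.9]],
                    [[0.3, 0.7],
                     [0.6, 0.4]]])
    R_a = np.array([[ 1.0, 0.0],
                    [-1.0, 2.0]])
    gamma = 0.9

    # Repeatedly apply the optimality backup v(s) <- max_a [ R^a_s + gamma * sum_{s'} P^a_{ss'} v(s') ]
    v = np.zeros(2)
    for _ in range(1000):
        q = R_a + gamma * P_a @ v   # q[a, s] for every action and state
        v_new = q.max(axis=0)       # greedy over actions
        if np.max(np.abs(v_new - v)) < 1e-8:
            break
        v = v_new
    print(v, q.argmax(axis=0), sep="\n")  # optimal state values and a greedy action per state

    Each sweep backs up every state; the iterates converge because the backup is a \gamma-contraction.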

    Extensions to MDPs (no exam)

    • Infinite and continuous MDPs
    • Partially observable MDPs
    • Undiscounted, average reward MDPs

    Infinite MDPs (no exam)

    The following extensions are all possible:

    • Countably infinite state and/or action spaces
      • Straightforward
    • Continuous state and/or action spaces
      • Closed form for linear quadratic model (LQR)
    • Continuous time
      • Requires partial differential equations
      • Hamilton-Jacobi-Bellman (HJB) equation
      • Limiting case of Bellman equation as time-step → 0

    POMDPs (no exam)

    Belief States (no exam)

    Reductions of POMDPs (no exam)

    Ergodic Markov Process (no exam)

    Ergodic MDP (no exam)

    Average Reward Value Function (no exam)

    Questions?

    The only stupid question is the one you were afraid to ask but never did.
    - Rich Sutton

    Reference: UCL Course on RL (David Silver)
