
    Chapter 3: Finite Markov Decision Processes

    Basic Definitions

    The MDP is the most basic formulation of a sequential decision process under the assumption of the Markov property.

    1. State: The state must include information about all aspects of the past agent-environment interaction that make a difference for the future.
    2. Action
    3. Reward: The reward defines what we want to achieve instead of how we want to achieve it.
    4. Dynamics: p(s', r | s, a), the probability of transitioning to state s' and receiving reward r after taking action a in state s
    5. Return: Return is defined as some function of the reward sequence
      For episodic tasks, we have G_t = R_{t+1} + \cdots + R_T
      For continuing tasks, we have G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
      They can be unified under the same framework as G_t = \sum_{k=0}^\infty \gamma^{k} R_{t+k+1} by adding an absorbing state with zero reward at the end of each episodic task
      The recursive form of the return is G_t = R_{t+1} + \gamma G_{t+1}, which forms the basis of the Bellman equations (see the sketch after this list)
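
    As a quick illustration of this recursive form, here is a minimal sketch (a hypothetical helper, not code from the book) that computes the returns of a finished episode backwards from its recorded reward sequence:

```python
def discounted_returns(rewards, gamma):
    """Compute G_t for every time step of one episode from its reward
    sequence, using the recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0  # the return from the terminal (absorbing) state is zero
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # rewards[t] plays the role of R_{t+1}
        returns[t] = g
    return returns

# Example: rewards R_1, R_2, R_3 received during a 3-step episode
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # approximately [2.62, 1.8, 2.0]
```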

    Further notes:

    1. In the RL book, the reward obtained by taking action A_t in state S_t at time step t is denoted R_{t+1} rather than R_t, since it arrives together with the next state S_{t+1};
    2. RL beyond the MDP assumption is an important research topic (also discussed in the RL book)
    3. The representation of states and actions has a great influence on the learning process, but is beyond the scope of the RL book (many recent works focus on exactly this topic)
    4. The RL book focuses on a scalar reward signal, but some recent works study multi-objective reward signals in vector form

    Policies and Value Functions

    A policy is a mapping from states to the probabilities of selecting each possible action
    A value function gives the expected return when starting from a state or a state-action pair and following a particular policy thereafter
    Value functions are therefore defined w.r.t. particular policies, i.e., v_\pi (s) = \mathbb{E}_\pi [G_t | S_t = s], \quad q_\pi (s, a) = \mathbb{E}_\pi [G_t | S_t = s, A_t = a]
    Based on the simple relationships
    v_\pi (s) = \sum_a \pi (a | s) q_\pi (s, a), \quad q_\pi (s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_\pi (s') \big],
    we can derive the Bellman equations, which express the relationship between the value of a state (or state-action pair) and the values of its successor states (or state-action pairs):
    v_\pi (s) = \sum_a \pi (a | s) \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_\pi (s') \big]
    q_\pi (s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma \sum_{a'} \pi (a' | s') q_\pi (s', a') \big]
    The value function v_\pi is the unique solution of its Bellman equation, and can be computed exactly by solving a system of |\mathcal{S}| linear equations (one per state). Notice that this assumes the dynamics p(s', r | s, a) are known.
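
    As a minimal sketch of this point (assuming a small tabular MDP where the hypothetical array P[s, a, s'] holds p(s' | s, a), i.e., the dynamics marginalized over rewards, and R[s, a] holds the expected immediate reward), the Bellman equation for v_\pi can be solved exactly as a linear system:

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma):
    """Exact policy evaluation by solving the Bellman equation
    (I - gamma * P_pi) v = r_pi, one linear equation per state."""
    n_states = P.shape[0]
    # Policy-averaged dynamics: P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a)
    P_pi = np.einsum('sa,sax->sx', pi, P)
    # Policy-averaged expected reward: r_pi[s] = sum_a pi(a|s) R[s, a]
    r_pi = np.einsum('sa,sa->s', pi, R)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
```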
    Another useful tool for visualizing these recursive relationships is the backup diagram.

    Optimal Policies and Optimal Value Functions

    Definition of a "better" policy: \pi \geq \pi' if and only if v_\pi (s) \geq v_{\pi'} (s) for all s \in \mathcal{S}.
    For finite MDPs there always exist optimal value functions v_*(s) and q_*(s, a) and corresponding optimal policies (potentially more than one). Intuitively, if a policy is not optimal, we can always improve the value of some state s by changing the policy at that specific state. The improvement in the value of s then propagates back to the values of all states that can reach s in the state-transition graph. In this way, we can always obtain a better policy and gradually reach an optimal policy.
    Based on the following simple relations,
    v_\pi (s) = \sum_a \pi (a | s) q_\pi (s, a) \rightarrow v_*(s) = \max_{a \in \mathcal{A}(s)} q_*(s, a)
    q_\pi (s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_\pi (s') \big] \rightarrow q_*(s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_*(s') \big],
    we obtain the Bellman optimality equations, which make no reference to any specific policy:
    v_*(s) = \max_{a \in \mathcal{A}(s)} \sum_{s', r} p(s', r | s, a) \big[ r + \gamma v_*(s') \big]
    q_*(s, a) = \sum_{s', r} p(s', r | s, a) \big[ r + \gamma \max_{a'} q_*(s', a') \big]
    Given v_*, an optimal policy can be derived by a greedy one-step lookahead over the state values (which requires the dynamics); given q_*, simply picking \arg\max_a q_*(s, a) in each state is optimal without any model.
    Solving the Bellman optimality equation exactly requires solving a system of |\mathcal{S}| nonlinear equations, under the assumptions of fully known system dynamics and the Markov property. Even when these two assumptions are satisfied, solving the equations is computationally infeasible when the state space is very large. Consequently, different RL methods mainly focus on how to solve the Bellman optimality equation approximately.
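
    As one concrete (still model-based) way to iterate towards a solution, the Bellman optimality backup can be applied repeatedly until the values stop changing; a minimal value-iteration sketch under the same hypothetical tabular setup as above:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate the Bellman optimality backup until the value function
    stops changing, then act greedily with respect to q_*."""
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    while True:
        # q[s, a] = R[s, a] + gamma * sum_{s'} p(s'|s, a) v(s')
        q = R + gamma * np.einsum('sax,x->sa', P, v)
        v_new = q.max(axis=1)  # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, q.argmax(axis=1)  # optimal values and a greedy policy
        v = v_new
```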

    Further notes:
    The MDP formulation of RL makes it closely related to (stochastic) optimal control.

    Reinforcement learning adds to MDPs a focus on approximation and incomplete information for realistically large problems.

    The online nature of reinforcement learning makes it possible to approximate optimal policies in ways that put more effort into learning to make good decisions for frequently encountered states, at the expense of less effort for infrequently encountered states.
