
[Chapter 5] Reinforcement Learning

Author: 超级超级小天才 | Published 2021-05-30 10:04

    Function Approximation

    While we are learning the Q-functions, how do we represent or record the Q-values? For a discrete and finite state space and action space, we can use a big table of size |S| \times |A| to store the Q-values of all (s,a) pairs. However, if the state space or action space is very large, or, as is usually the case, continuous and infinite, a tabular method no longer works.
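    As a tiny illustration of the tabular case (the state and action counts below are made up for the example), the Q-table is just a |S| \times |A| array indexed by (s, a):

```python
import numpy as np

# Tabular Q-values for a toy problem: 12 states and 4 actions
# (both numbers are made up for this illustration).
n_states, n_actions = 12, 4
Q = np.zeros((n_states, n_actions))

# Reading or writing Q(s, a) is a plain table lookup.
s, a = 3, 1
Q[s, a] = 0.5
greedy_action = Q[s].argmax()   # action with the highest Q-value in state s
```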

    We need function approximation to represent the utility and Q-functions with some parameters \theta to be learned. Taking the grid environment as our example again, we can represent a state by a pair of coordinates (x,y); then one simple function approximation can look like this:

    \hat{U}_{\theta}(x, y) = \theta_0 + \theta_1 x + \theta_2 y

    Of course, you can design more complex functions when you have a much larger state space.
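    As a quick sketch, this linear approximation is just a dot product between a feature vector [1, x, y] and the parameters (the \theta values below are arbitrary placeholders):

```python
import numpy as np

# Arbitrary placeholder parameters [theta_0, theta_1, theta_2].
theta = np.array([0.1, 0.4, -0.2])

def u_hat(x, y, theta):
    """Linear approximation: U_hat_theta(x, y) = theta_0 + theta_1*x + theta_2*y."""
    return np.array([1.0, x, y]) @ theta

print(u_hat(2, 3, theta))   # estimated utility of state (2, 3)
```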

    In this case, our reinforcement learning agent instead learns the parameters \theta that approximate the evaluation functions (\hat{U}_{\theta} or \hat{Q}_{\theta}).

    For Monte Carlo learning, we can collect a set of training samples (trials) with inputs and labels, which turns this into a supervised learning problem. With squared error and a linear function, we get a standard linear regression problem.
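    A minimal sketch of this idea, assuming we have already collected Monte Carlo returns for a few visited states of the grid environment (the sample data below is made up):

```python
import numpy as np

# Made-up Monte Carlo samples: each row of `states` is a visited state (x, y),
# and G holds the observed return from that state onwards.
states = np.array([[1, 1], [2, 1], [3, 2], [4, 3]], dtype=float)
G = np.array([0.2, 0.35, 0.6, 0.9])

# Design matrix with a bias term: features [1, x, y] per state.
X = np.hstack([np.ones((len(states), 1)), states])

# Ordinary least squares: minimizes the squared error between
# U_hat_theta(x, y) and the Monte Carlo returns.
theta, *_ = np.linalg.lstsq(X, G, rcond=None)
print(theta)   # [theta_0, theta_1, theta_2]
```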

    For Temporal Difference learning, the agent adjusts the parameters to reduce the temporal difference (the TD error). The parameters are updated by gradient descent, as in the two update rules below (a small code sketch follows them):

    • For SARSA (on-policy method):

    \theta_i \leftarrow \theta_i + \alpha \left( R(s) + \gamma \hat{Q}_{\theta}(s', a') - \hat{Q}_{\theta}(s, a) \right) \frac{\partial \hat{Q}_{\theta}(s, a)}{\partial \theta_i}

    • For Q-learning (off-policy method):

    \theta_i \leftarrow \theta_i + \alpha \left( R(s) + \gamma \max_{a'} \hat{Q}_{\theta}(s', a') - \hat{Q}_{\theta}(s, a) \right) \frac{\partial \hat{Q}_{\theta}(s, a)}{\partial \theta_i}
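    Here is a small sketch of both semi-gradient updates with a linear \hat{Q}_{\theta}; the feature map phi, the step size, and the discount factor below are illustrative choices, not fixed by the text:

```python
import numpy as np

# Hypothetical linear Q: Q_hat_theta(s, a) = phi(s, a) . theta, where phi
# stacks [1, x, y] with a one-hot encoding of the action.
N_ACTIONS = 4
ALPHA, GAMMA = 0.1, 0.9

def phi(s, a):
    x, y = s
    one_hot = np.zeros(N_ACTIONS)
    one_hot[a] = 1.0
    return np.concatenate(([1.0, x, y], one_hot))

def q_hat(theta, s, a):
    return phi(s, a) @ theta

def sarsa_update(theta, s, a, r, s_next, a_next):
    """Semi-gradient SARSA: the target uses the action actually taken in s'."""
    td_error = r + GAMMA * q_hat(theta, s_next, a_next) - q_hat(theta, s, a)
    # For a linear Q, the gradient with respect to theta is just phi(s, a).
    return theta + ALPHA * td_error * phi(s, a)

def q_learning_update(theta, s, a, r, s_next):
    """Semi-gradient Q-learning: the target uses the greedy action in s'."""
    best = max(q_hat(theta, s_next, b) for b in range(N_ACTIONS))
    td_error = r + GAMMA * best - q_hat(theta, s, a)
    return theta + ALPHA * td_error * phi(s, a)

theta = np.zeros(3 + N_ACTIONS)
theta = q_learning_update(theta, (1, 1), 2, -0.04, (1, 2))
```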

    Going Deep

    One of the greatest advancements in reinforcement learning is combining it with deep learning. As stated above, we usually cannot use a tabular method to represent the evaluation functions; we need approximation! I know what you want to say: you must have thought that a deep network is a good function approximator. The network takes the state as input and outputs the Q-values or utilities, and that's it! Using deep networks in RL is deep reinforcement learning (DRL).

    Why do we need deep networks?

    • Firstly, for environments with a nearly infinite state space, a deep network can hold a large set of parameters \theta to be learned and can map a large set of states to their expected Q-values (a small network sketch follows this list).
    • Secondly, for some environments with complex observations, deep networks are practically the only option. For example, if the observation is an RGB image, we need a convolutional neural network (CNN) in the first layers to read it; if the observation is a piece of audio, we need a recurrent neural network (RNN) in the first layers.
    • Nowadays, designing and training a deep neural network has become much easier thanks to advances in hardware and software.
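    As a minimal sketch of such a function approximator, assuming PyTorch and a state given as a small feature vector (the layer sizes here are arbitrary), a Q-network simply maps a state to one Q-value per action:

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Small MLP mapping a state vector to one Q-value per action."""

    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),   # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q_net = QNetwork(state_dim=2, n_actions=4)
q_values = q_net(torch.tensor([[1.0, 2.0]]))   # Q-values for all 4 actions
print(q_values.argmax(dim=1))                  # greedy action
```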

    One of the DRL algorithms is the Deep Q-Network (DQN); we show its pseudo code here but will not go into it:

    [Image: pseudo code of the DQN algorithm]
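    Since the pseudo code itself is in the image above, the following is only a rough sketch of the standard DQN ingredients (a Q-network trained with experience replay and a target network); all sizes, hyper-parameters, and the random placeholder transitions are illustrative assumptions:

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Rough sketch of one DQN gradient step, not a transcription of the pseudo code.
GAMMA = 0.99
STATE_DIM, N_ACTIONS = 2, 4

def make_net():
    return nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

q_net = make_net()
target_net = make_net()
target_net.load_state_dict(q_net.state_dict())   # periodically re-synced copy of q_net
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Experience replay buffer, filled here with fake transitions (s, a, r, s', done).
replay = deque(maxlen=10_000)
for _ in range(100):
    replay.append((torch.rand(STATE_DIM), random.randrange(N_ACTIONS),
                   random.random(), torch.rand(STATE_DIM), random.random() < 0.1))

# Sample a mini-batch and unpack it into tensors.
batch = random.sample(list(replay), 32)
s = torch.stack([t[0] for t in batch])
a = torch.tensor([t[1] for t in batch])
r = torch.tensor([t[2] for t in batch])
s_next = torch.stack([t[3] for t in batch])
done = torch.tensor([float(t[4]) for t in batch])

# TD target r + gamma * max_a' Q_target(s', a'), cut off at terminal transitions.
with torch.no_grad():
    target = r + GAMMA * target_net(s_next).max(dim=1).values * (1.0 - done)
q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) for the actions taken

loss = nn.functional.mse_loss(q_sa, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```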
