Author: MasterXiong | Published 2020-09-02 21:52

Chapter 2: Multi-armed Bandits

Multi-armed bandits can be seen as the simplest form of reinforcement learning, where there is only a single state. The key question is how to estimate the action values. This chapter introduces some key concepts in RL, such as value estimation, the exploration-exploitation dilemma, and policy gradient, under this simplified setting.

Different methods for value estimation:

  1. Sample average: Q_{n+1} = \frac{1}{n} \sum_{i=1}^n R_i
    Pros: easy to understand and implement; unbiased estimate; guaranteed to converge to the true action value as the number of samples goes to infinity in a stationary environment
    Cons: if implemented naively by storing all past rewards, time and memory cost grow with the number of steps
  2. Incremental form: Q_{n+1} = Q_n + \frac{1}{n} (R_n - Q_n)
    This is equivalent to the sample average, but it introduces an important update rule in RL: NewEstimate \leftarrow OldEstimate + StepSize \, [Target - OldEstimate]. Note that the Target here is not the true action value q_*(a) but the single reward R_n, which is an unbiased yet noisy sample of it, so the update itself is noisy (see the sketch after this list).
    Pros: constant time and memory cost per step
    Cons: the step size decreases over time, so the method adapts poorly to nonstationary environments
  3. Constant step size: Q_{n+1} = Q_n + \alpha \ (R_n - Q_n)
    Pros: computes an exponentially weighted (recency-weighted) average over historical rewards, which makes it suitable for nonstationary environments
    Cons: biased estimate of the action value; no convergence guarantee (the estimate keeps fluctuating around the true value)
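
A minimal sketch of the incremental update rule in Python, covering both the sample-average (1/n) step size and the constant step size of item 3. The class and parameter names (ActionValueEstimator, alpha, initial_value) are my own, not from the book:

```python
import numpy as np

class ActionValueEstimator:
    """Incremental action-value estimates for a k-armed bandit.

    alpha=None uses the sample-average rule (step size 1/n);
    a float alpha uses a constant step size (recency-weighted average).
    """

    def __init__(self, k, alpha=None, initial_value=0.0):
        self.q = np.full(k, initial_value, dtype=float)  # current estimates Q(a)
        self.n = np.zeros(k, dtype=int)                  # visit counts N(a)
        self.alpha = alpha

    def update(self, action, reward):
        self.n[action] += 1
        step = 1.0 / self.n[action] if self.alpha is None else self.alpha
        # NewEstimate <- OldEstimate + StepSize * (Target - OldEstimate)
        self.q[action] += step * (reward - self.q[action])
```

Setting initial_value above a typical reward also gives the optimistic-initial-values exploration scheme discussed below.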

Different methods for exploration:

  1. \epsilon-greedy: select a random action with probability \epsilon and the greedy action otherwise; the book shows how different values of \epsilon affect the learning curve and mentions schemes where \epsilon decreases over time
  2. Optimistic initial values: encourage exploration at the early stage, but not a general method
    Initial values can also be viewed as a way to inject prior knowledge (cf. MAML)
  3. Upper-confidence-bound (UCB): Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}}
    Closely related to other methods such as Bayesian optimization; a small code sketch of \epsilon-greedy and UCB action selection follows this list
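
A small sketch of \epsilon-greedy and UCB action selection, assuming the estimates Q_t(a) and counts N_t(a) are kept as NumPy arrays; the function names and the tie-breaking rule for untried arms are my own choices:

```python
import numpy as np

def epsilon_greedy(q, epsilon, rng=None):
    """With probability epsilon pick a random arm, otherwise the greedy arm."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def ucb(q, n, t, c=2.0):
    """UCB selection: argmax_a [ Q_t(a) + c * sqrt(ln t / N_t(a)) ].

    Arms that have never been tried (N_t(a) == 0) are treated as
    maximizing and are selected first.
    """
    if np.any(n == 0):
        return int(np.argmax(n == 0))
    return int(np.argmax(q + c * np.sqrt(np.log(t) / n)))
```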

Gradient Bandit Algorithms

Instead of selecting actions based on learned action values, gradient bandit algorithms learn a numerical preference for each action and choose actions via a softmax over the preferences.
The derivation in the book shows that the preference update is an instance of stochastic gradient ascent on the expected reward.
Subtracting a baseline (e.g., the average of the rewards received so far) does not change the expected update but reduces its variance.
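
A sketch of the preference update under these assumptions (function names and the constant step size alpha are my own; the baseline is passed in, e.g. a running average of past rewards):

```python
import numpy as np

def softmax(h):
    """Convert numerical preferences H_t(a) into action probabilities pi_t(a)."""
    z = np.exp(h - h.max())
    return z / z.sum()

def gradient_bandit_update(h, action, reward, baseline, alpha=0.1):
    """One preference update:
    H_{t+1}(a) = H_t(a) + alpha * (R_t - baseline) * (1{a == A_t} - pi_t(a)).
    """
    pi = softmax(h)
    grad = -pi
    grad[action] += 1.0  # indicator 1{a == A_t} minus pi_t(a)
    return h + alpha * (reward - baseline) * grad
```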

Contextual Bandits

At each time step, the agent is presented with a bandit drawn at random from a set of different bandits, and the identity of that bandit (the context) is observed.
This differs from multi-armed bandits, as multiple states (bandits) are involved and the agent must learn a policy that maps states to actions.
It differs from full RL, as the action at each step only influences the immediate reward and has no effect on the next state or on future rewards.
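
A hypothetical sketch of a contextual-bandit loop that keeps a separate row of action-value estimates per bandit; the bandit interface (attribute `k`, method `pull(action) -> reward`) is assumed for illustration and not taken from the notes:

```python
import numpy as np

def run_contextual_bandit(bandits, steps, epsilon=0.1, alpha=0.1, seed=0):
    """Epsilon-greedy learning with one row of Q estimates per state (bandit)."""
    rng = np.random.default_rng(seed)
    k = bandits[0].k
    q = np.zeros((len(bandits), k))          # Q(s, a): one row per bandit
    for t in range(steps):
        s = int(rng.integers(len(bandits)))  # a random bandit, identity known
        if rng.random() < epsilon:           # epsilon-greedy within that state
            a = int(rng.integers(k))
        else:
            a = int(np.argmax(q[s]))
        r = bandits[s].pull(a)
        q[s, a] += alpha * (r - q[s, a])     # the action affects only the immediate reward
    return q
```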
