Author: David Silver
He was awarded the 2019 ACM Prize in Computing for breakthrough advances in computer game-playing.
Outline
- Introduction
- Incremental Methods
- Batch Methods
Large-Scale Reinforcement Learning
Reinforcement learning can be used to solve large problems, e.g.
- Backgammon: 10^20 states
- Computer Go: 10^170 states
- Helicopter: continuous state space
How can we scale up the model-free methods for prediction and control from the last two lectures?
Value Function Approximation
![](https://img.haomeiwen.com/i4905462/dc1877b1bfab1479.png)
Types of Value Function Approximation
![](https://img.haomeiwen.com/i4905462/b269597de481d282.png)
Which Function Approximator?
![](https://img.haomeiwen.com/i4905462/cda05bef558a4afc.png)
Gradient Descent
![](https://img.haomeiwen.com/i4905462/de2c717d15ae2d2a.png)
Value Function Approx. By Stochastic Gradient Descent
![](https://img.haomeiwen.com/i4905462/9332b7ec84529a2c.png)
Feature Vectors
![](https://img.haomeiwen.com/i4905462/e01199fd06478354.png)
Linear Value Function Approximation
![](https://img.haomeiwen.com/i4905462/0a56826eef165063.png)
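The linear case on this slide can be sketched in a few lines. This is an illustrative sketch (function names and the one-hot encoding are my own, not from the slides): with v̂(s, w) = x(s)·w, the gradient with respect to w is just the feature vector x(s), so the stochastic-gradient update is Δw = α (target − v̂(s, w)) x(s). Table lookup is the special case where x(s) is a one-hot indicator.

```python
def v_hat(w, x):
    """Linear value estimate: v̂(s, w) = x(s)·w."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sgd_update(w, x, target, alpha):
    """One stochastic-gradient step toward `target`.
    For linear v̂, the gradient w.r.t. w is just the feature vector x(s)."""
    error = target - v_hat(w, x)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]

# Table-lookup features: x(s) is a one-hot indicator, so the update
# reduces to adjusting a single table entry.
w = [0.0, 0.0, 0.0]
x = [0.0, 1.0, 0.0]          # one-hot feature vector for state s=1
w = sgd_update(w, x, target=4.0, alpha=0.5)
```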
Table Lookup Features
![](https://img.haomeiwen.com/i4905462/4b7ade57c803435f.png)
Incremental Prediction Algorithms
![](https://img.haomeiwen.com/i4905462/a761ddc18ea5dc3b.png)
Monte-Carlo with Value Function Approximation
![](https://img.haomeiwen.com/i4905462/dc2dc8a740559964.png)
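Monte-Carlo prediction with function approximation treats each return G_t as a supervised target for v̂(S_t, w). A minimal sketch (the episode encoding as a list of (features, reward) pairs is an assumption of mine, not the slides' notation):

```python
def mc_update(w, episode, alpha, gamma=1.0):
    """Monte-Carlo value-function update: use each return G_t as the
    regression target for v̂(S_t, w).
    `episode` is a list of (feature_vector, reward) pairs in time order."""
    # Compute returns by sweeping the episode backwards.
    G = 0.0
    targets = []
    for x, r in reversed(episode):
        G = r + gamma * G
        targets.append((x, G))
    # Apply the SGD updates in the original time order.
    for x, G in reversed(targets):
        error = G - sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
    return w
```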
TD Learning with Value Function Approximation
![](https://img.haomeiwen.com/i4905462/7bd539aef4681c78.png)
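TD learning replaces the return with the bootstrapped target R + γ·v̂(S′, w). A sketch of one semi-gradient TD(0) step under the same linear-feature assumption as above (names are illustrative); note the target is treated as a constant, i.e. no gradient flows through v̂(S′, w):

```python
def td0_update(w, x, r, x_next, alpha, gamma, done=False):
    """Semi-gradient TD(0) with linear features:
    w <- w + alpha * (R + gamma*v̂(S',w) - v̂(S,w)) * x(S)."""
    v = lambda feats: sum(wi * fi for wi, fi in zip(w, feats))
    target = r if done else r + gamma * v(x_next)
    error = target - v(x)           # TD error; target is held fixed
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]
```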
TD(λ) with Value Function Approximation
![](https://img.haomeiwen.com/i4905462/82dbf915de1bf5f8.png)
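Backward-view TD(λ) adds an eligibility trace. A sketch assuming linear features and an accumulating trace (my encoding, not the slides'): the trace decays by γλ and accumulates the current features, and every weight is nudged by α·δ along its trace.

```python
def td_lambda_step(w, z, x, r, x_next, alpha, gamma, lam, done=False):
    """One backward-view TD(λ) step with linear features and an
    accumulating eligibility trace:
      z <- gamma*lam*z + x(S);   w <- w + alpha * delta * z."""
    v = lambda feats: sum(wi * fi for wi, fi in zip(w, feats))
    delta = (r if done else r + gamma * v(x_next)) - v(x)
    z = [gamma * lam * zi + xi for zi, xi in zip(z, x)]
    w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
    return w, z
```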
Control with Value Function Approximation
![](https://img.haomeiwen.com/i4905462/4353b85b356caa8e.png)
Action-Value Function Approximation
![](https://img.haomeiwen.com/i4905462/638010bcf5be00e0.png)
Linear Action-Value Function Approximation
![](https://img.haomeiwen.com/i4905462/ff7da9375ef9d7ee.png)
Incremental Control Algorithms
![](https://img.haomeiwen.com/i4905462/6727d15ee4de14ae.png)
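For control, the same updates are applied to an action-value approximator q̂(s, a, w), with actions chosen ε-greedily. A sketch of semi-gradient Sarsa assuming a joint state-action feature vector x(s, a) (the helper names here are illustrative, not the lecture's):

```python
import random

def sarsa_update(w, x_sa, r, x_next_sa, alpha, gamma, done=False):
    """Semi-gradient Sarsa: q̂(s, a, w) = x(s, a)·w, updated toward
    R + gamma*q̂(S', A', w), where A' is the action actually taken next."""
    q = lambda feats: sum(wi * fi for wi, fi in zip(w, feats))
    target = r if done else r + gamma * q(x_next_sa)
    error = target - q(x_sa)
    return [wi + alpha * error * xi for wi, xi in zip(w, x_sa)]

def epsilon_greedy(w, candidate_features, eps):
    """Return the index of the greedy action with prob. 1-eps,
    a uniformly random action otherwise."""
    if random.random() < eps:
        return random.randrange(len(candidate_features))
    qs = [sum(wi * fi for wi, fi in zip(w, x)) for x in candidate_features]
    return max(range(len(qs)), key=qs.__getitem__)
```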
Linear Sarsa with Coarse Coding in Mountain Car
![](https://img.haomeiwen.com/i4905462/4bd74b2a51ae5eaa.png)
Linear Sarsa with Radial Basis Functions in Mountain Car
![](https://img.haomeiwen.com/i4905462/88b652ce9ebaa487.png)
Study of λ: Should We Bootstrap?
![](https://img.haomeiwen.com/i4905462/8787dca8130d83fc.png)
Baird’s Counterexample
![](https://img.haomeiwen.com/i4905462/f003af71f2398dd3.png)
Parameter Divergence in Baird’s Counterexample
![](https://img.haomeiwen.com/i4905462/f431aab6a8138a1d.png)
Convergence of Prediction Algorithms
![](https://img.haomeiwen.com/i4905462/4e3f66027bf59297.png)
Gradient Temporal-Difference Learning
![](https://img.haomeiwen.com/i4905462/14278683288aa5ad.png)
Convergence of Control Algorithms
![](https://img.haomeiwen.com/i4905462/44f94ec6eabfe0d3.png)
Batch Reinforcement Learning
- Gradient descent is simple and appealing
- But it is not sample efficient
- Batch methods seek to find the best-fitting value function
- Given the agent's experience ("training data")
Least Squares Prediction
![](https://img.haomeiwen.com/i4905462/5009210a7ada96a0.png)
Stochastic Gradient Descent with Experience Replay
![](https://img.haomeiwen.com/i4905462/13c350e72130b7d1.png)
Stochastic Gradient Descent with Experience Replay (2)
![](https://img.haomeiwen.com/i4905462/9dfa7c53b57adc43.png)
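The replay mechanism itself is simple: store transitions in a fixed-size buffer and do SGD on minibatches sampled uniformly from it, which de-correlates the updates. A minimal sketch of such a buffer (the class and field names are my own, not DQN's actual implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (s, a, r, s', done) transitions.
    Uniform sampling breaks the temporal correlation of online updates;
    the deque drops the oldest transition once capacity is reached."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```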
Experience Replay in Deep Q-Networks (DQN)
![](https://img.haomeiwen.com/i4905462/584ca4c9791db297.png)
DQN in Atari
![](https://img.haomeiwen.com/i4905462/e171bd7e22bd1068.png)
DQN Results in Atari
![](https://img.haomeiwen.com/i4905462/bda668cb140cee78.png)
How much does DQN help?
![](https://img.haomeiwen.com/i4905462/261475530cc27e01.png)
Linear Least Squares Prediction
- Experience replay finds the least squares solution
- But it may take many iterations
- Using linear value function approximation
- We can solve for the least squares solution directly
Linear Least Squares Prediction (2)
![](https://img.haomeiwen.com/i4905462/d44cc0b2351c9fab.png)
Linear Least Squares Prediction Algorithms
![](https://img.haomeiwen.com/i4905462/81057ac15cc97eee.png)
Linear Least Squares Prediction Algorithms (2)
![](https://img.haomeiwen.com/i4905462/40131dc82a357899.png)
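With linear features the TD fixed point can be computed in closed form (LSTD) instead of by iteration: solve A w = b with A = Σ_t x_t (x_t − γ x_{t+1})ᵀ and b = Σ_t x_t R_{t+1}. A sketch using NumPy (the small ridge term is my addition to keep A invertible on short trajectories; it is not part of the lecture's formulation):

```python
import numpy as np

def lstd(transitions, gamma, ridge=1e-6):
    """Least-Squares TD: solve A w = b in closed form, where
      A = sum_t x_t (x_t - gamma * x_{t+1})^T,   b = sum_t x_t * R_{t+1}.
    `transitions` is a list of (x, r, x_next) tuples of feature vectors."""
    d = len(transitions[0][0])
    A = ridge * np.eye(d)        # small ridge for numerical invertibility
    b = np.zeros(d)
    for x, r, x_next in transitions:
        x = np.asarray(x, dtype=float)
        x_next = np.asarray(x_next, dtype=float)
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.solve(A, b)
```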
Convergence of Linear Least Squares Prediction Algorithms
![](https://img.haomeiwen.com/i4905462/37cb5ef0daf801ce.png)
Least Squares Policy Iteration
![](https://img.haomeiwen.com/i4905462/de00a3d4d0d05cd7.png)
Least Squares Action-Value Function Approximation
![](https://img.haomeiwen.com/i4905462/6235ae3eb52d873d.png)
Least Squares Control
![](https://img.haomeiwen.com/i4905462/43b083ddcde264df.png)
Least Squares Q-Learning
![](https://img.haomeiwen.com/i4905462/48c11d8761f297cd.png)
Least Squares Policy Iteration Algorithm
![](https://img.haomeiwen.com/i4905462/d3626a9c44f9b03c.png)
Convergence of Control Algorithms
![](https://img.haomeiwen.com/i4905462/21e674cf11528c02.png)
Chain Walk Example
![](https://img.haomeiwen.com/i4905462/28e6c2da0fcbb400.png)
LSPI in Chain Walk: Action-Value Function
![](https://img.haomeiwen.com/i4905462/4eb418d37634a6a9.png)
LSPI in Chain Walk: Policy
![](https://img.haomeiwen.com/i4905462/af7ff5896e7efb91.png)
Questions?
![](https://img.haomeiwen.com/i4905462/5a04f726f55f2859.png)
Reference: *UCL Course on RL*