Lecture 6: Value Function Approx

Posted by: 魏鹏飞 | Published: 2020-04-22 11:51

Author: David Silver
He was awarded the 2019 ACM Prize in Computing for breakthrough advances in computer game-playing.

Outline

  1. Introduction
  2. Incremental Methods
  3. Batch Methods

Large-Scale Reinforcement Learning

Reinforcement learning can be used to solve large problems, e.g.

  • Backgammon: 10^{20} states
  • Computer Go: 10^{170} states
  • Helicopter: continuous state space

How can we scale up the model-free methods for prediction and control from the last two lectures?

Value Function Approximation

Types of Value Function Approximation

Which Function Approximator?

Gradient Descent

Value Function Approx. By Stochastic Gradient Descent

Feature Vectors

Linear Value Function Approximation

Table Lookup Features

Incremental Prediction Algorithms

Monte-Carlo with Value Function Approximation

TD Learning with Value Function Approximation

TD(λ) with Value Function Approximation

Control with Value Function Approximation

Action-Value Function Approximation

Linear Action-Value Function Approximation

Incremental Control Algorithms

Linear Sarsa with Coarse Coding in Mountain Car

Linear Sarsa with Radial Basis Functions in Mountain Car

Study of λ: Should We Bootstrap?

Baird’s Counterexample

Parameter Divergence in Baird’s Counterexample

Convergence of Prediction Algorithms

Gradient Temporal-Difference Learning

Convergence of Control Algorithms

Batch Reinforcement Learning

  • Gradient descent is simple and appealing
  • But it is not sample efficient
  • Batch methods seek to find the best fitting value function
  • Given the agent’s experience (“training data”)
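
To make the sample-efficiency point concrete, here is a minimal sketch of fitting a linear value function to a fixed set of stored experience by replaying it many times. It is an illustration under assumptions, not the lecture's own code: the function name `replay_sgd`, the toy `experience` data, the step size, and the use of plain (features, target) pairs in place of real returns are all hypothetical.

```python
import random

def replay_sgd(experience, n_features, alpha=0.1, n_updates=5000, seed=0):
    """Fit linear weights w to stored (features, target) pairs by replayed SGD."""
    rng = random.Random(seed)
    w = [0.0] * n_features
    for _ in range(n_updates):
        # Sample one stored pair from the agent's experience
        x, target = rng.choice(experience)
        prediction = sum(xi * wi for xi, wi in zip(x, w))
        error = target - prediction
        # Gradient step on squared error: w <- w + alpha * error * x
        w = [wi + alpha * error * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical toy experience: targets are generated exactly by w = [1.0, -2.0]
experience = [
    ([1.0, 0.0], 1.0),
    ([0.0, 1.0], -2.0),
    ([1.0, 1.0], -1.0),
    ([1.0, 2.0], -3.0),
]
w = replay_sgd(experience, n_features=2)
print([round(wi, 3) for wi in w])
```

Each stored pair is used many times here, whereas a purely incremental method would see it once and discard it.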

Least Squares Prediction

Stochastic Gradient Descent with Experience Replay

Experience Replay in Deep Q-Networks (DQN)

DQN in Atari

DQN Results in Atari

How much does DQN help?

Linear Least Squares Prediction

  • Experience replay finds the least squares solution
  • But it may take many iterations
  • Using linear value function approximation \hat{v}(s, w) = x(s)^T w
  • We can solve for the least squares solution directly
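
The direct solution can be sketched as follows, assuming the targets are given numbers standing in for the true values (for example, observed returns). With linear features, setting the expected update to zero gives w = (Σ x x^T)^{-1} Σ x v, which the code below computes. The helper names `solve_linear_system` and `least_squares_weights` and the toy data are hypothetical, and the small Gaussian elimination routine is only there to keep the example free of dependencies.

```python
def solve_linear_system(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("feature matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w

def least_squares_weights(experience, n_features):
    """Return w = (sum of x x^T)^-1 (sum of x * target) over the experience."""
    A = [[0.0] * n_features for _ in range(n_features)]
    b = [0.0] * n_features
    for x, target in experience:
        for i in range(n_features):
            b[i] += x[i] * target
            for j in range(n_features):
                A[i][j] += x[i] * x[j]
    return solve_linear_system(A, b)

# Hypothetical toy experience: (feature vector, target value) pairs
experience = [
    ([1.0, 0.0], 1.0),
    ([0.0, 1.0], -2.0),
    ([1.0, 1.0], -1.0),
    ([1.0, 2.0], -3.0),
]
w_direct = least_squares_weights(experience, n_features=2)
print(w_direct)
```

The cost is dominated by solving an N × N system for N features, so this is attractive when the number of features is small and avoids choosing a step size.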

Linear Least Squares Prediction (2)

Linear Least Squares Prediction Algorithms

Linear Least Squares Prediction Algorithms (2)

Convergence of Linear Least Squares Prediction Algorithms

Least Squares Policy Iteration

Least Squares Action-Value Function Approximation

Least Squares Control

Least Squares Q-Learning

Least Squares Policy Iteration Algorithm

Convergence of Control Algorithms

Chain Walk Example

LSPI in Chain Walk: Action-Value Function

LSPI in Chain Walk: Policy

Questions?

Reference: UCL Course on RL

Article link: https://www.haomeiwen.com/subject/jkspihtx.html