Author: David Silver
Outline
- Admin
- About Reinforcement Learning
- The Reinforcement Learning Problem
- Inside An RL Agent
- Problems within Reinforcement Learning
Many Faces of Reinforcement Learning

Branches of Machine Learning

Characteristics of Reinforcement Learning
What makes reinforcement learning different from other machine learning paradigms?
- There is no supervisor, only a reward signal
- Feedback is delayed, not instantaneous
- Time really matters (sequential, non-i.i.d. data)
- The agent's actions affect the subsequent data it receives
Examples of Reinforcement Learning
- Fly stunt manoeuvres in a helicopter
- Defeat the world champion at Backgammon
- Manage an investment portfolio
- Control a power station
- Make a humanoid robot walk
- Play many different Atari games better than humans
Rewards
- A reward R_t is a scalar feedback signal
- It indicates how well the agent is doing at step t
- The agent's job is to maximise cumulative reward (a small numeric sketch follows)

Reinforcement learning is based on the reward hypothesis: all goals can be described by the maximisation of expected cumulative reward.

Do you agree with this statement?
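The slides state this goal only in words. As a minimal sketch (Python, not from the slides, with an illustrative discount factor), cumulative reward can be made concrete as a discounted sum of the scalar reward signal:

```python
def discounted_return(rewards, gamma=0.9):
    """Return G = r_1 + gamma*r_2 + gamma^2*r_3 + ... for a reward sequence."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# A delayed reward still contributes, but less the later it arrives.
print(discounted_return([0, 0, 0, 10]))  # 10 * 0.9**3 ≈ 7.29
```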
Examples of Rewards
- Fly stunt manoeuvres in a helicopter
  - +ve reward for following desired trajectory
  - −ve reward for crashing
- Defeat the world champion at Backgammon
  - +/−ve reward for winning/losing a game
- Manage an investment portfolio
  - +ve reward for each $ in bank
- Control a power station
  - +ve reward for producing power
  - −ve reward for exceeding safety thresholds
- Make a humanoid robot walk
  - +ve reward for forward motion
  - −ve reward for falling over
- Play many different Atari games better than humans
  - +/−ve reward for increasing/decreasing score
Sequential Decision Making
- Goal: select actions to maximise total future reward
- Actions may have long-term consequences
- Reward may be delayed
- It may be better to sacrifice immediate reward to gain more long-term reward (see the sketch below)
- Examples:
  - A financial investment (may take months to mature)
  - Refuelling a helicopter (might prevent a crash in several hours)
  - Blocking opponent moves (might help winning chances many moves from now)
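A minimal sketch with made-up numbers (not from the slides) of why sacrificing immediate reward can pay off, in the spirit of the refuelling example:

```python
# Paying a small cost now (refuelling) avoids a large negative reward later
# (crashing), so the patient action sequence has the higher total reward.
greedy  = [ 1, 1, 1, -100]   # skip refuelling, crash at the end
patient = [-2, 1, 1,    5]   # pay a small cost now, land safely later

print(sum(greedy), sum(patient))  # -97 5
```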
Agent and Environment


History and State
- The history is the sequence of observations, actions and rewards: H_t = O_1, R_1, A_1, ..., A_{t-1}, O_t, R_t
- i.e. all observable variables up to time t
- i.e. the sensorimotor stream of a robot or embodied agent
- What happens next depends on the history:
  - The agent selects actions
  - The environment selects observations/rewards
- State is the information used to determine what happens next
- Formally, state is a function of the history: S_t = f(H_t) (sketched below)
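A minimal sketch (Python, not from the slides) of the history as a stream of (observation, action, reward) tuples, with two illustrative choices of state function f:

```python
history = []  # list of (observation, action, reward) tuples

def record(observation, action, reward):
    history.append((observation, action, reward))

def state_last_observation(h):
    """One possible state function: keep only the most recent observation."""
    return h[-1][0] if h else None

def state_full_history(h):
    """Another: use the entire history as the state (always sufficient, but large)."""
    return tuple(h)

record("o1", "a1", 0.0)
record("o2", "a2", 1.0)
print(state_last_observation(history))  # o2
```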
Environment State

Agent State

Information State
An information state (a.k.a. Markov state) contains all useful information from the history.

A state S_t is Markov if and only if P[S_{t+1} | S_t] = P[S_{t+1} | S_1, ..., S_t]

- "The future is independent of the past given the present"
- Once the state is known, the history may be thrown away
- i.e. the state is a sufficient statistic of the future (see the sketch below)
- The environment state S_t^e is Markov
- The history H_t is Markov
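A minimal sketch (illustrative transition probabilities, not from the slides): in a Markov chain the next state is drawn from a distribution that depends only on the current state, so earlier history adds nothing.

```python
import random

transitions = {
    "A": {"A": 0.1, "B": 0.9},
    "B": {"A": 0.5, "B": 0.5},
}

def next_state(current):
    states, probs = zip(*transitions[current].items())
    return random.choices(states, weights=probs)[0]

state, path = "A", ["A"]
for _ in range(5):
    state = next_state(state)   # uses the current state only, never path[:-1]
    path.append(state)
print(path)
```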
Rat Example

Fully Observable Environments

Partially Observable Environments
- Partial observability: the agent indirectly observes the environment:
  - A robot with camera vision isn't told its absolute location
  - A trading agent only observes current prices
  - A poker playing agent only observes public cards
- Now agent state ≠ environment state
- Formally this is a partially observable Markov decision process (POMDP)
- The agent must construct its own state representation S_t^a (one possibility is sketched below), e.g.
  - Complete history: S_t^a = H_t
  - Beliefs of environment state: S_t^a = (P[S_t^e = s^1], ..., P[S_t^e = s^n])
  - Recurrent neural network: S_t^a = σ(S_{t-1}^a W_s + O_t W_o)
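A minimal sketch (Python, not from the slides): one simple agent-state construction under partial observability is a window of the k most recent observations, comparable in spirit to frame stacking in Atari agents.

```python
from collections import deque

class WindowedAgentState:
    def __init__(self, k=3):
        self.window = deque(maxlen=k)    # keeps only the k latest observations

    def update(self, observation):
        self.window.append(observation)
        return tuple(self.window)        # the agent's state representation S^a_t

agent_state = WindowedAgentState(k=3)
for obs in ["o1", "o2", "o3", "o4"]:
    s = agent_state.update(obs)
print(s)  # ('o2', 'o3', 'o4')
```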
Major Components of an RL Agent
- An RL agent may include one or more of these components:
  - Policy: agent's behaviour function
  - Value function: how good is each state and/or action
  - Model: agent's representation of the environment
Policy
- A policy is the agent's behaviour
- It is a map from state to action (sketched below), e.g.
  - Deterministic policy: a = π(s)
  - Stochastic policy: π(a|s) = P[A_t = a | S_t = s]
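A minimal sketch (illustrative states and actions, not from the slides): a deterministic policy maps each state to one action; a stochastic policy defines a distribution over actions for each state and samples from it.

```python
import random

deterministic_policy = {"s1": "left", "s2": "right"}            # a = pi(s)

stochastic_policy = {                                            # pi(a|s)
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.1, "right": 0.9},
}

def act_deterministic(state):
    return deterministic_policy[state]

def act_stochastic(state):
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs)[0]

print(act_deterministic("s1"))   # always 'left'
print(act_stochastic("s1"))      # 'left' about 80% of the time
```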
Value Function
- A value function is a prediction of future reward
- Used to evaluate the goodness/badness of states
- And therefore to select between actions, e.g. v_π(s) = E_π[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | S_t = s] (a small estimation sketch follows)
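A minimal Monte Carlo sketch (illustrative toy environment, not from the slides): estimate the value of a state as the average discounted return of many episodes started from that state.

```python
import random

def rollout(start_state):
    """Toy episode: reward 1 per step, terminates with probability 0.1 each step."""
    rewards = []
    while True:
        rewards.append(1.0)
        if random.random() < 0.1:
            return rewards

def estimate_value(start_state, gamma=0.9, episodes=10_000):
    total = 0.0
    for _ in range(episodes):
        g = 0.0
        for r in reversed(rollout(start_state)):
            g = r + gamma * g          # G_t = R_{t+1} + gamma * G_{t+1}
        total += g
    return total / episodes

print(estimate_value("s0"))  # roughly 1 / (1 - 0.9 * 0.9) ≈ 5.26
```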
Model
- A model predicts what the environment will do next (a toy tabular model is sketched below)
- P predicts the next state, e.g. P_{ss'}^a = P[S_{t+1} = s' | S_t = s, A_t = a]
- R predicts the next (immediate) reward, e.g. R_s^a = E[R_{t+1} | S_t = s, A_t = a]
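A minimal sketch (illustrative numbers, not from the slides): a tabular model stores, for each (state, action) pair, a distribution over next states and the expected immediate reward.

```python
transition_model = {                      # approximates P[S' | S, A]
    ("s1", "go"): {"s2": 0.9, "s1": 0.1},
}
reward_model = {                          # approximates E[R | S, A]
    ("s1", "go"): 1.0,
}

def predict(state, action):
    return transition_model[state, action], reward_model[state, action]

print(predict("s1", "go"))  # ({'s2': 0.9, 's1': 0.1}, 1.0)
```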
Maze Example

Maze Example: Policy

Maze Example: Value Function

Maze Example: Model

Categorizing RL agents (1)
- Value Based
  - Value Function
- Policy Based
  - Policy
- Actor Critic
  - Policy
  - Value Function
Categorizing RL agents (2)
- Model Free
  - Policy and/or Value Function
- Model Based
  - Policy and/or Value Function
  - Model
RL Agent Taxonomy

Learning and Planning
Two fundamental problems in sequential decision making:
- Reinforcement Learning:
  - The environment is initially unknown
  - The agent interacts with the environment
  - The agent improves its policy
- Planning (a one-step lookahead sketch follows below):
  - A model of the environment is known
  - The agent performs computations with its model (without any external interaction)
  - The agent improves its policy
  - a.k.a. deliberation, reasoning, introspection, pondering, thought, search
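A minimal planning sketch (illustrative model and value estimates, not from the slides): with a known model the agent can improve its behaviour purely by computation, here via a one-step greedy lookahead through the model.

```python
transition_model = {("s1", "left"): "s1", ("s1", "right"): "s2"}
reward_model = {("s1", "left"): 0.0, ("s1", "right"): 1.0}
value = {"s1": 0.0, "s2": 5.0}            # assumed value estimates
gamma = 0.9

def plan(state, actions=("left", "right")):
    """Pick the action with the best model-predicted one-step return."""
    return max(actions,
               key=lambda a: reward_model[state, a]
                             + gamma * value[transition_model[state, a]])

print(plan("s1"))  # 'right': 1.0 + 0.9 * 5.0 beats 0.0 + 0.9 * 0.0
```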
Atari Example: Reinforcement Learning

Atari Example: Planning

Exploration and Exploitation (1)
- Reinforcement learning is like trial-and-error learning
- The agent should discover a good policy
  - From its experiences of the environment
  - Without losing too much reward along the way
Exploration and Exploitation (2)
- Exploration finds more information about the environment
- Exploitation exploits known information to maximise reward
- It is usually important to explore as well as exploit (an epsilon-greedy sketch follows below)
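A minimal epsilon-greedy sketch, one common way to balance exploration and exploitation (the action values below are illustrative, not from the slides):

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(list(action_values))        # explore
    return max(action_values, key=action_values.get)     # exploit

q = {"best_known_action": 0.8, "untried_action": 0.5}
print(epsilon_greedy(q))  # 'best_known_action' about 95% of the time
```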
Examples
- Restaurant Selection
  - Exploitation: go to your favourite restaurant
  - Exploration: try a new restaurant
- Online Banner Advertisements
  - Exploitation: show the most successful advert
  - Exploration: show a different advert
- Oil Drilling
  - Exploitation: drill at the best known location
  - Exploration: drill at a new location
- Game Playing
  - Exploitation: play the move you believe is best
  - Exploration: play an experimental move
Prediction and Control
- Prediction: evaluate the future
  - Given a policy
- Control: optimise the future (a toy example of both is sketched below)
  - Find the best policy
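A minimal sketch (illustrative two-state MDP, not from the slides): prediction evaluates a given policy, control searches for the best policy.

```python
states, actions, gamma = ["s1", "s2"], ["a", "b"], 0.9
# (state, action) -> (next_state, reward); s2 is absorbing with zero reward.
dynamics = {("s1", "a"): ("s1", 1.0), ("s1", "b"): ("s2", 2.0),
            ("s2", "a"): ("s2", 0.0), ("s2", "b"): ("s2", 0.0)}

def evaluate(policy, sweeps=200):
    """Prediction: iterative policy evaluation for a fixed policy."""
    v = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            nxt, r = dynamics[s, policy[s]]
            v[s] = r + gamma * v[nxt]
    return v

def control():
    """Control: brute-force search over deterministic policies for the best v(s1)."""
    candidates = [{"s1": a1, "s2": a2} for a1 in actions for a2 in actions]
    return max(candidates, key=lambda pi: evaluate(pi)["s1"])

print(evaluate({"s1": "b", "s2": "a"})["s1"])  # 2.0: leave s1 immediately
print(control())  # picks 'a' in s1, since staying earns 1 / (1 - 0.9) = 10
```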
Gridworld Example: Prediction

Gridworld Example: Control

Course Outline
- Part I: Elementary Reinforcement Learning
- Introduction to RL
- Markov Decision Processes
- Planning by Dynamic Programming
- Model-Free Prediction
- Model-Free Control
- Part II: Reinforcement Learning in Practice
- Value Function Approximation
- Policy Gradient Methods
- Integrating Learning and Planning
- Exploration and Exploitation
- Case study - RL in games
Reference: UCL Course on RL