Lecture 1: Introduction to Reinforcement Learning

Notes by 魏鹏飞 | Published 2020-03-18 12:45

Author:David Silver

Outline

  1. Admin
  2. About Reinforcement Learning
  3. The Reinforcement Learning Problem
  4. Inside An RL Agent
  5. Problems within Reinforcement Learning

Many Faces of Reinforcement Learning

Branches of Machine Learning

Characteristics of Reinforcement Learning

What makes reinforcement learning different from other machine learning paradigms?

  • There is no supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters (sequential, non-i.i.d. data)
  • Agent’s actions affect the subsequent data it receives

Examples of Reinforcement Learning

  • Fly stunt manoeuvres in a helicopter
  • Defeat the world champion at Backgammon
  • Manage an investment portfolio
  • Control a power station
  • Make a humanoid robot walk
  • Play many different Atari games better than humans

Rewards

  • A reward R_t is a scalar feedback signal
  • Indicates how well the agent is doing at step t
  • The agent’s job is to maximise cumulative reward

Reinforcement learning is based on the reward hypothesis:

Definition (Reward Hypothesis): All goals can be described by the maximisation of expected cumulative reward.

Do you agree with this statement?

Examples of Rewards

  1. Fly stunt manoeuvres in a helicopter
    • +ve reward for following desired trajectory
    • −ve reward for crashing
  2. Defeat the world champion at Backgammon
    • +/−ve reward for winning/losing a game
  3. Manage an investment portfolio
    • +ve reward for each $ in bank
  4. Control a power station
    • +ve reward for producing power
    • −ve reward for exceeding safety thresholds
  5. Make a humanoid robot walk
    • +ve reward for forward motion
    • −ve reward for falling over
  6. Play many different Atari games better than humans
    • +/−ve reward for increasing/decreasing score

Sequential Decision Making

  • Goal: select actions to maximise total future reward
  • Actions may have long term consequences
  • Reward may be delayed
  • It may be better to sacrifice immediate reward to gain more long-term reward
  • Examples:
    • A financial investment (may take months to mature)
    • Refuelling a helicopter (might prevent a crash in several hours)
    • Blocking opponent moves (might help winning chances many moves from now)

Agent and Environment

History and State

  • The history is the sequence of observations, actions, rewards:
    H_t=O_1,R_1,A_1,...,A_{t-1},O_t,R_t
  • i.e. all observable variables up to time t
  • i.e. the sensorimotor stream of a robot or embodied agent
  • What happens next depends on the history:
    • The agent selects actions
    • The environment selects observations/rewards
  • State is the information used to determine what happens next
  • Formally, state is a function of the history:
    S_t=f(H_t)
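
A minimal Python sketch of this loop, keeping the full history and deriving the state from it; `env`, `choose_action`, and the trivial state function below are hypothetical placeholders, not part of the lecture:

```python
# Minimal sketch of the agent-environment loop that keeps the full history H_t.
# `env` and `choose_action` are hypothetical placeholders, not defined in the lecture.

def last_observation(history):
    """A trivial choice of state function S_t = f(H_t): keep only the latest observation."""
    for kind, value in reversed(history):
        if kind == "obs":
            return value

def run_episode(env, choose_action, max_steps=100):
    history = []                                  # H_t = O_1, R_1, A_1, ..., O_t, R_t
    obs, reward = env.reset(), 0.0
    for t in range(max_steps):
        history.append(("obs", obs))
        history.append(("reward", reward))
        state = last_observation(history)         # S_t = f(H_t)
        action = choose_action(state)             # the agent selects actions
        history.append(("action", action))
        obs, reward, done = env.step(action)      # the environment selects observations/rewards
        if done:
            break
    return history
```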

Environment State

Agent State

Information State

An information state (a.k.a. Markov state) contains all useful information from the history.

  • “The future is independent of the past given the present”
    H_{1:t}\longrightarrow S_t\longrightarrow H_{t+1:\infty}
  • Once the state is known, the history may be thrown away
  • i.e. The state is a sufficient statistic of the future
  • The environment state S_t^e is Markov
  • The history H_t is Markov
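
For completeness, the standard definition behind these points: a state S_t is Markov if and only if
    P[S_{t+1}|S_t]=P[S_{t+1}|S_1,...,S_t]
i.e. the next state depends only on the current state, not on the rest of the history.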

Rat Example

Fully Observable Environments

Partially Observable Environments

  • Partial observability: agent indirectly observes environment:
    • A robot with camera vision isn’t told its absolute location
    • A trading agent only observes current prices
    • A poker playing agent only observes public cards
  • Now agent state \neq environment state
  • Formally this is a partially observable Markov decision process (POMDP)
  • Agent must construct its own state representation S_t^a, e.g.
    • Complete history: S_t^a = H_t
    • Beliefs of environment state: S_t^a = (P[S_t^e = s^1],...,P[S_t^e = s^n])
    • Recurrent neural network: S_t^a = \sigma(S_{t-1}^aW_s + O_tW_o)
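
As a rough illustration of the last option, here is a minimal NumPy sketch of the recurrent update S_t^a = \sigma(S_{t-1}^aW_s + O_tW_o); the state/observation sizes and random weights are made-up assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical dimensions: agent state of size 8, observation of size 4.
rng = np.random.default_rng(0)
W_s = rng.normal(size=(8, 8))    # recurrent weights acting on the previous agent state
W_o = rng.normal(size=(4, 8))    # weights acting on the current observation

def update_agent_state(s_prev, obs):
    """One step of S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)."""
    return sigmoid(s_prev @ W_s + obs @ W_o)

s = np.zeros(8)                  # initial agent state
obs = rng.normal(size=4)         # a single observation O_t
s = update_agent_state(s, obs)
```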

Major Components of an RL Agent

  • An RL agent may include one or more of these components:
    • Policy: agent’s behaviour function
    • Value function: how good is each state and/or action
    • Model: agent’s representation of the environment

Policy

  • A policy is the agent’s behaviour
  • It is a map from state to action, e.g.
    • Deterministic policy: a = \pi(s)
    • Stochastic policy: \pi(a|s) = P[A_t = a|S_t = s]
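
A small Python sketch of both kinds of policy, using a made-up two-state, two-action example (the states, actions, and probabilities are hypothetical):

```python
import random

# Deterministic policy: a = pi(s), here a simple lookup table.
deterministic_pi = {"s1": "left", "s2": "right"}

def act_deterministic(state):
    return deterministic_pi[state]

# Stochastic policy: pi(a|s) = P[A_t = a | S_t = s], here explicit probabilities.
stochastic_pi = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.3, "right": 0.7},
}

def act_stochastic(state):
    actions, probs = zip(*stochastic_pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]
```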

Value Function

  • Value function is a prediction of future reward
  • Used to evaluate the goodness/badness of states
  • And therefore to select between actions, e.g.
    v_{\pi}(s)=E_{\pi}[R_{t+1}+\gamma R_{t+2}+\gamma^2R_{t+3}+...|S_t=s]
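
As an illustration, the value of a state can be estimated by averaging sampled discounted returns; the reward sequences and discount factor below are hypothetical:

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^k * R_{t+1+k} over one sampled reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Hypothetical sampled reward sequences, all starting from the same state s.
episodes = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]

# Monte-Carlo style estimate of v_pi(s): average the discounted returns.
v_estimate = sum(discounted_return(ep) for ep in episodes) / len(episodes)
print(v_estimate)
```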

Model

  • A model predicts what the environment will do next
  • P predicts the next state
  • R predicts the next (immediate) reward, e.g.
    P_{ss'}^a=P[S_{t+1}=s'|S_t=s,A_t=a]
    R_s^a=E[R_{t+1}|S_t=s,A_t=a]
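
A minimal sketch of a tabular model that stores P_{ss'}^a and R_s^a explicitly; the states, actions, and numbers are made up:

```python
# Hypothetical tabular model.
# P[s][a] maps next states s' to probabilities P_{ss'}^a = P[S_{t+1}=s' | S_t=s, A_t=a].
P = {
    "s1": {"go": {"s1": 0.1, "s2": 0.9}},
    "s2": {"go": {"s1": 0.5, "s2": 0.5}},
}
# R[s][a] is the expected immediate reward R_s^a = E[R_{t+1} | S_t=s, A_t=a].
R = {
    "s1": {"go": 0.0},
    "s2": {"go": 1.0},
}

def next_state_distribution(state, action):
    return P[state][action]

def expected_reward(state, action):
    return R[state][action]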

Maze Example

Maze Example: Policy

Maze Example: Value Function

Maze Example: Model

Categorizing RL agents (1)

  1. Value Based
    • \color{gray}{No-Policy (Implicit)}
    • Value Function
  2. Policy Based
    • Policy
    • \color{gray}{No-Value-Function}
  3. Actor Critic
    • Policy
    • Value Function

Categorizing RL agents (2)

  1. Model Free
    • Policy and/or Value Function
    • \color{gray}{No-Model}
  2. Model Based
    • Policy and/or Value Function
    • Model

RL Agent Taxonomy

Learning and Planning

Two fundamental problems in sequential decision making

  1. Reinforcement Learning:

    • The environment is initially unknown
    • The agent interacts with the environment
    • The agent improves its policy
  2. Planning:

    • A model of the environment is known
    • The agent performs computations with its model (without any external interaction)
    • The agent improves its policy
    • a.k.a. deliberation, reasoning, introspection, pondering, thought, search

Atari Example: Reinforcement Learning

Atari Example: Planning

Exploration and Exploitation (1)

  • Reinforcement learning is like trial-and-error learning
  • The agent should discover a good policy
  • From its experiences of the environment
  • Without losing too much reward along the way

Exploration and Exploitation (2)

  • Exploration finds more information about the environment
  • Exploitation exploits known information to maximise reward
  • It is usually important to explore as well as exploit
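
One common way to trade the two off (not specific to this slide) is \epsilon-greedy action selection: exploit the best-known action most of the time, and explore a random action with small probability \epsilon. A minimal sketch with hypothetical value estimates:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action (explore),
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

# Hypothetical action-value estimates for one state.
q = {"left": 0.2, "right": 0.5, "stay": 0.1}
action = epsilon_greedy(q)
```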

Examples

  • Restaurant Selection
    \color{blue}{Exploitation} Go to your favourite restaurant
    \color{blue}{Exploration} Try a new restaurant

  • Online Banner Advertisements
    \color{blue}{Exploitation} Show the most successful advert
    \color{blue}{Exploration} Show a different advert

  • Oil Drilling
    \color{blue}{Exploitation} Drill at the best known location
    \color{blue}{Exploration} Drill at a new location

  • Game Playing
    \color{blue}{Exploitation} Play the move you believe is best
    \color{blue}{Exploration} Play an experimental move

Prediction and Control

  • Prediction: evaluate the future
    • Given a policy
  • Control: optimise the future
    • Find the best policy

Gridworld Example: Prediction

Gridworld Example: Control

Course Outline

  • Part I: Elementary Reinforcement Learning
    1. Introduction to RL
    2. Markov Decision Processes
    3. Planning by Dynamic Programming
    4. Model-Free Prediction
    5. Model-Free Control
  • Part II: Reinforcement Learning in Practice
    1. Value Function Approximation
    2. Policy Gradient Methods
    3. Integrating Learning and Planning
    4. Exploration and Exploitation
    5. Case study - RL in games

Reference: UCL Course on RL (David Silver)
