Reinforcement Learning Week 7 Course Notes

Author: 我的名字叫清阳 | Published 2015-09-30 02:54

    This week's tasks

    • Watch the Reward Shaping lectures.
    • Read Ng, Harada, Russell (1999) and Asmuth, Littman, Zinkov (2008).
    • Office hours on Friday, October 2nd, from 4-5 pm (EST).
    • Homework 6.

    Why change the reward function (RF)?


    Given an MDP, the reward function shapes the behavior of the learner/agent, so it ultimately specifies the behavior (or policy) we want for the MDP. Changing the rewards can make the MDP easier to represent and solve, in two senses:

    1. Semantics: what the agent is expected to do at each state;
    2. Efficiency: speed (experience and/or computation needed), space (complexity), and solvability.

    The question is: how can we change the RF without changing the optimal policy?

    How to Change RF

    Given an MDP described by <S, A, R, T, γ>, there are three ways to change R without changing the optimal policy. (Note: if we know T, it is not an RL problem anymore, so this part of the lecture is about MDPs, not RL specifically.)

    1. Multiply by a positive constant (positive and non-zero, because multiplying by 0 would erase all the reward information)
    2. Shift by a constant
    3. Nonlinear potential-based reward shaping

    1. Multiply by a positive constant

    Quiz 1
    • Q(s,a) is the solution of the Bellman equation with the old RF R(s,a).
    • R'(s,a) is a new RF, equal to the old RF multiplied by a positive constant c.
    • What is the new solution Q'(s,a) for the new RF R'(s,a), expressed in terms of the old Q(s,a)?

    Here is how to solve the problem:

    1. Q = R + γR + γ²R + γ³R + ...
    2. Q' = R' + γR' + γ²R' + γ³R' + ...
    3. Replace R' with c·R:
      Q' = (c·R) + γ(c·R) + γ²(c·R) + ...
         = c(R + γR + γ²R + ...)
         = c·Q
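
    As a quick sanity check (my own sketch, not part of the lecture), the scaling result can be verified with value iteration on a small, randomly generated MDP; the MDP sizes and the constant c below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, c = 3, 2, 0.9, 5.0

# Hypothetical random MDP: transition tensor T[s, a, s'] and reward R[s, a]
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

def q_value_iteration(R, T, gamma, iters=2000):
    """Iterate the Bellman optimality backup Q = R + gamma * T * max_a' Q(s', a')."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * T @ Q.max(axis=1)
    return Q

Q = q_value_iteration(R, T, gamma)
Q_scaled = q_value_iteration(c * R, T, gamma)

print(np.allclose(Q_scaled, c * Q))                               # True: Q' = c*Q
print(np.array_equal(Q_scaled.argmax(axis=1), Q.argmax(axis=1)))  # same greedy policy
```

    Because c > 0 scales every Q(s,a) by the same factor, the argmax over actions is unchanged, which is why the optimal policy survives the transformation.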

    2. Shift by a constant

    Quiz 2: Add a scalar

    Solution and proof of Quiz 2:
    1. Q = R + γR + γ²R + γ³R + ...
    2. Q' = R' + γR' + γ²R' + γ³R' + ...
    3. Replace R' with R + c:
      Q' = (R + c) + γ(R + c) + γ²(R + c) + ...
         = (R + γR + γ²R + ...) + (c + γc + γ²c + ...)
    4. The first part is Q and the second part is a geometric series, so
      Q' = Q + c/(1 - γ)
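
    The same kind of numeric check (again my own sketch on a made-up random MDP) confirms that shifting every reward by c shifts every Q value by c/(1 - γ) and leaves the greedy policy untouched:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma, c = 3, 2, 0.9, 2.0

# Hypothetical random MDP: transition tensor T[s, a, s'] and reward R[s, a]
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))

def q_value_iteration(R, T, gamma, iters=5000):
    """Iterate the Bellman optimality backup Q = R + gamma * T * max_a' Q(s', a')."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * T @ Q.max(axis=1)
    return Q

Q = q_value_iteration(R, T, gamma)
Q_shifted = q_value_iteration(R + c, T, gamma)

print(np.allclose(Q_shifted, Q + c / (1 - gamma)))                 # True: Q' = Q + c/(1-γ)
print(np.array_equal(Q_shifted.argmax(axis=1), Q.argmax(axis=1)))  # same greedy policy
```

    The shift c/(1 - γ) is identical for every state-action pair, so the ranking of actions, and hence the greedy policy, does not change.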

    3. Nonlinear potential-based reward shaping

    Quiz 3: potential-based reward shaping
    1. Q = R + γR + γ²R + γ³R + ...
    2. Q' = R' + γR' + γ²R' + γ³R' + ...
    3. Replace R' with R - ψ(s) + γψ(s'), where s, s', s'', ... are the successive states along the trajectory:
      Q' = (R - ψ(s) + γψ(s')) + γ(R - ψ(s') + γψ(s'')) + γ²(R - ψ(s'') + γψ(s''')) + ...
         = (R + γR + γ²R + ...) + (-ψ(s) + γψ(s') + γ(-ψ(s') + γψ(s'')) + γ²(-ψ(s'') + γψ(s''')) + ...)
    4. The first part is Q. In the second part the terms telescope: +γψ(s') cancels the -γψ(s') from the next step, +γ²ψ(s'') cancels the -γ²ψ(s'') from the step after, and so on, so only the very first term -ψ(s) and a tail term γⁿψ(sₙ) survive:
      Q' = Q - ψ(s) + γⁿψ(sₙ)
    5. Since γ is in (0,1), γⁿ → 0 as n → ∞, so the tail term vanishes and we are left with
      Q' = Q - ψ(s)
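
    Once more as a sketch of my own (random MDP and an arbitrary potential ψ, chosen only for illustration), value iteration on the shaped reward R - ψ(s) + γψ(s') reproduces Q' = Q - ψ(s) and leaves the greedy policy unchanged:

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 4, 2, 0.95

# Hypothetical random MDP and an arbitrary potential function psi(s)
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)
R = rng.random((n_states, n_actions))
psi = rng.random(n_states)

def q_value_iteration(R, T, gamma, iters=5000):
    """Iterate the Bellman optimality backup Q = R + gamma * T * max_a' Q(s', a')."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = R + gamma * T @ Q.max(axis=1)
    return Q

# Expected shaped reward: R(s,a) - psi(s) + gamma * E[psi(s') | s, a]
R_shaped = R - psi[:, None] + gamma * T @ psi

Q = q_value_iteration(R, T, gamma)
Q_shaped = q_value_iteration(R_shaped, T, gamma)

print(np.allclose(Q_shaped, Q - psi[:, None]))                    # True: Q' = Q - ψ(s)
print(np.array_equal(Q_shaped.argmax(axis=1), Q.argmax(axis=1)))  # same greedy policy
```

    Since ψ depends only on the state, subtracting it from every action's value in that state leaves the ordering of actions, and therefore the policy, intact.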

    Q-learning with potential

    Updating the Q function with potential-based reward shaping:

    1. The Q function still converges, and the shaped values satisfy Q(s,a) = Q*(s,a) - ψ(s).
    2. Choose the potential ψ(s) = max_a Q*(s,a). If we initialize Q(s,a) to zero, then Q*(s,a) - ψ(s) = Q*(s,a) - max_a Q*(s,a) = 0 means that a is an optimal action in s.
    3. So Q-learning with this potential is like initializing Q at Q*, which is where the speed-up comes from (see the sketch below).
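
    For concreteness, here is a minimal sketch (my own code, not from the lecture) of a single tabular Q-learning update that adds the potential-based shaping term F(s, s') = γψ(s') - ψ(s) to the environment reward; Q (an |S|×|A| array), ψ, the step size α, and the transition (s, a, r, s') are assumed to be supplied by a surrounding training loop:

```python
import numpy as np

def shaped_q_update(Q, psi, s, a, r, s_next, alpha, gamma):
    """One Q-learning step using the shaped reward r + gamma*psi(s') - psi(s)."""
    shaped_r = r + gamma * psi[s_next] - psi[s]        # add the shaping term F(s, s')
    td_target = shaped_r + gamma * np.max(Q[s_next])   # bootstrap from the next state
    Q[s, a] += alpha * (td_target - Q[s, a])           # standard TD update
    return Q
```

    With ψ(s) = max_a Q*(s,a), running this update from a zero-initialized table behaves like ordinary Q-learning started from Q*, which is exactly the point made in the list above.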

    What have we learned?

    Summary
    • Potential functions are a way to speed up the process of solving an MDP.
    • Reward shaping that is not potential-based can create suboptimal positive-reward loops that the agent keeps exploiting, so it may never do what we actually want; potential-based shaping avoids this problem.
    2015-09-29 first draft
    2015-12-04 reviewed and revised
    
