
Morvan Reinforcement Learning Notes 2 - Q Learning

Author: Tutan_dcb0 | Published 2019-02-11 21:28

Q-learning decision: look up the current state's row in the Q table and choose the action with the larger value (the action expected to bring the most reward).
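
As a toy illustration (the numbers below are made up, not from the post), the greedy choice for a state is simply the column with the largest value in that state's row of the Q table:

    import pandas as pd

    row = pd.Series({'left': 0.02, 'right': 0.11})   # hypothetical Q values for one state
    print(row.idxmax())                              # -> 'right', the greedy action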

Q-learning update:

[Figure: the tabular Q-learning update algorithm]
This one figure summarizes everything covered so far, and it is also the Q-learning algorithm. Every update uses both the Q "reality" (the target) and the Q estimate, and the appealing part of Q-learning is that the "reality" for Q(s1, a2) already contains a maximum estimate of Q(s2): the discounted maximum estimate for the next step, plus the reward just received, is treated as the "reality" of this step.

Finally, the parameters in this algorithm. Epsilon-greedy is a strategy used for decision-making: with epsilon = 0.9, 90% of the time the action is chosen according to the best value in the Q table, and 10% of the time an action is chosen at random. Alpha is the learning rate, which decides how much of this step's error is learned; alpha is a number less than 1. Gamma is the discount factor applied to future rewards.
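
Written out (my own transcription of the standard tabular rule, using the reward r, learning rate alpha, and discount gamma described above), the update is:

    Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

where r + gamma * max_a' Q(s', a') is the "Q reality" (target) and Q(s, a) is the "Q estimate".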

The code is as follows:

    """
    A simple example for Reinforcement Learning using table lookup Q-learning method.
    An agent "o" starts on the left of a 1-dimensional world; the treasure is at the rightmost location.
    Run this program to see how the agent improves its strategy for finding the treasure.
    View more on my tutorial page: https://morvanzhou.github.io/tutorials/
    """
    
    import numpy as np
    import pandas as pd
    import time
    
    np.random.seed(2)  # reproducible
    
    
    N_STATES = 6   # the length of the 1 dimensional world
    ACTIONS = ['left', 'right']     # available actions
    EPSILON = 0.9   # greedy policy: probability of choosing the best-known action
    ALPHA = 0.1     # learning rate
    GAMMA = 0.9    # discount factor
    MAX_EPISODES = 13   # maximum episodes
    FRESH_TIME = 0.3    # refresh time (seconds) between screen updates
    
    
    def build_q_table(n_states, actions):
        table = pd.DataFrame(
            np.zeros((n_states, len(actions))),     # q_table initial values
            columns=actions,    # actions' names
        )
        # print(table)    # show table
        return table
    
    
    def choose_action(state, q_table):
        # This is how to choose an action
        state_actions = q_table.iloc[state, :]
        if (np.random.uniform() > EPSILON) or ((state_actions == 0).all()):  # act non-greedily, or this state's action values are all still zero
            action_name = np.random.choice(ACTIONS)
        else:   # act greedy
            action_name = state_actions.idxmax()    # use idxmax rather than argmax, since argmax means something different in newer pandas versions
        return action_name
    
    
    def get_env_feedback(S, A):
        # This is how the agent interacts with the environment
        if A == 'right':    # move right
            if S == N_STATES - 2:   # terminate
                S_ = 'terminal'
                R = 1
            else:
                S_ = S + 1
                R = 0
        else:   # move left
            R = 0
            if S == 0:
                S_ = S  # reach the wall
            else:
                S_ = S - 1
        return S_, R
    
    
    def update_env(S, episode, step_counter):
        # This is how the environment is rendered after each move
        env_list = ['-']*(N_STATES-1) + ['T']   # '---------T' our environment
        if S == 'terminal':
            interaction = 'Episode %s: total_steps = %s' % (episode+1, step_counter)
            print('\r{}'.format(interaction), end='')
            time.sleep(2)
            print('\r                                ', end='')
        else:
            env_list[S] = 'o'
            interaction = ''.join(env_list)
            print('\r{}'.format(interaction), end='')
            time.sleep(FRESH_TIME)
    
    
    def rl():
        # main part of RL loop
        q_table = build_q_table(N_STATES, ACTIONS)
        for episode in range(MAX_EPISODES):
            step_counter = 0
            S = 0
            is_terminated = False
            update_env(S, episode, step_counter)
            while not is_terminated:
    
                A = choose_action(S, q_table)
                S_, R = get_env_feedback(S, A)  # take action & get next state and reward
                q_predict = q_table.loc[S, A]
                if S_ != 'terminal':
                    q_target = R + GAMMA * q_table.iloc[S_, :].max()   # next state is not terminal
                else:
                    q_target = R     # next state is terminal
                    is_terminated = True    # terminate this episode
    
                q_table.loc[S, A] += ALPHA * (q_target - q_predict)  # update
                S = S_  # move to next state
    
                update_env(S, episode, step_counter+1)
                step_counter += 1
        return q_table
    
    
    if __name__ == "__main__":
        q_table = rl()
        print('\r\nQ-table:\n')
        print(q_table)
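
        # Not part of the original post: a small optional check of what was learned.
        # The greedy policy is the action with the largest value in each state's
        # row of the Q table; with this environment it should settle on 'right'
        # for most states once training has converged.
        print('\nGreedy policy per state:')
        print(q_table.idxmax(axis=1))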
    
