
RL[0] - First Encounter

Author: db24cc | Published on 2017-12-02 22:11

    Contents

    • Background
    • Q-Learning with table
    • Q-Learning with network
    • Postscript

    Background

    RL is short for reinforcement learning, a subfield of machine learning. A more rigorous definition:

    Reinforcement learning (RL) is an area of machine learning inspired by behaviourist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

    The way I see it, RL is an optimal-solution search problem: through various tricks we get the computer, when faced with a scenario that demands action, to pick the most advantageous move, for example when playing a game.

    Q-Learning with table

    Q-learning is one branch of RL algorithms. The definition pulled from the wiki: Q-learning is a model-free reinforcement learning technique. Specifically, Q-learning can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP).
    I started out learning from Playing Atari with Deep Reinforcement Learning and Simple Reinforcement Learning with Tensorflow; the paper is mainly about DQN.
    The paper describes MDPs and the Bellman equation in detail; here is a brief distillation.
    At each step our agent can take one of a set of actions (A = {1, . . . , K}), and that action earns a corresponding reward. Which action to take depends on two factors:

    1. how good the state it leads to is
    2. the immediate reward the action brings

    The reward can be returned by the system or observed by the agent (much as a human player watches the game screen). Choosing an action means considering not only which action collects the most reward right now, but also which state that action leads to, because the subsequent state determines all subsequent rewards. So we should choose the action a that satisfies

    Q(s, a) = r + γ · max(Q(s', a'))

    where γ is a coefficient that weights the present against the future, s is the current state and s' is the state after the transition. This selection rule is the Bellman equation.
    The mathematical model and framework covering the whole environment, the actions and the rewards is the MDP (Markov decision process).
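
    To make the arithmetic concrete, here is a tiny made-up example (the two candidate actions and all the numbers are hypothetical, only meant to show how γ trades immediate reward against future value):

    gamma = 0.95

    # Action 1: immediate reward 1.0, but the best Q-value reachable from its next state is only 0.2
    q_a1 = 1.0 + gamma * 0.2    # 1.19

    # Action 2: no immediate reward, but it leads to a state whose best Q-value is 2.0
    q_a2 = 0.0 + gamma * 2.0    # 1.90

    # Action 2 wins: the Bellman target favours reaching a good state, not just grabbing reward now.
    print(q_a1)
    print(q_a2)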


    If the system returns the reward of the current action, then as long as we know the optimal value of every subsequent step, each step can simply follow the Bellman equation. The problem therefore turns into computing the optimal value for every state, storing those values, and letting the agent just look them up in a table when it acts.
    Below we take FrozenLake as an example and see how the table of Q-values is computed.

    The examples in this series all come from Simple Reinforcement Learning with Tensorflow.

    With OpenAI gym it is easy to simulate many such toy games.

    The FrozenLake environment consists of a 4x4 grid of blocks, each one either being the start block, the goal block, a safe frozen block, or a dangerous hole. The objective is to have an agent learn to navigate from the start to the goal without moving onto a hole. At any given time the agent can choose to move either up, down, left, or right.
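
    As a warm-up, the raw interaction loop with gym looks like the minimal sketch below. It uses the same old gym API as the rest of this post (env.reset() returns an integer state and env.step() returns a 4-tuple); the agent here just moves at random.

    import gym

    env = gym.make('FrozenLake-v0')
    state = env.reset()                                 # an integer in [0, 15]: which block the agent stands on
    game_over = False
    while not game_over:
        action = env.action_space.sample()              # one of the four moves, chosen at random
        state, reward, game_over, _ = env.step(action)  # reward is 1.0 only when the goal is reached
    print(reward)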

    Each state has 4 actions and there are 16 states in total, so the table is 16x4. In an MDP, decision making is partly random and partly under the control of the decision maker; some call this the ε-greedy approach. I refactored the original author's variable names;
    the code is below (comments starting with # are the original author's; comments marked '# #' or wrapped in """ are mine).

    # coding=utf-8
    import numpy as np
    import gym
    
    env = gym.make('FrozenLake-v0')
    
    q_table = np.zeros([env.observation_space.n, env.action_space.n])
    # Set learning parameters
    lr = .8
    y = .95
    num_episodes = 2000
    rewards = []
    for i in range(num_episodes):
        # Reset environment and get first new observation
        s = env.reset()
        reward_episode = 0
        game_over = False
        j = 0
        # The Q-Table learning algorithm
        while j < 99:
            j += 1
            # Choose an action by greedily (with noise) picking from Q table
            """
            randn return a sequence of numbers from the "standard normal" distribution, with when i becoming larger and
            larger, random take smaller and smaller impact of decision making, @very begging it just random choice 
            """
            action_to_be_taken = np.argmax(q_table[s, :] + np.random.randn(1, env.action_space.n) * (1. / (i + 1)))
            # Get new state and reward from environment
            new_state, reward, game_over, _ = env.step(action_to_be_taken)
            # Update Q-Table with new knowledge
            """
            Bellman equation: Q(s, a) = r + γ·max(Q(s', a')).
            The update below moves the old estimate q_table[s, action_to_be_taken] a fraction lr
            towards the target reward + y * max(q_table[new_state, :]). Because the values in
            q_table[new_state, :] keep changing as training goes on, the same entry has to be
            revisited and updated many times before the table settles down.
            """
            q_table[s, action_to_be_taken] = q_table[s, action_to_be_taken] + lr * (reward + y * np.max(q_table[new_state, :]) - q_table[s, action_to_be_taken])
            reward_episode += reward
            s = new_state
            if game_over:
                break
        rewards.append(reward_episode)
    
    print "Score over time: " + str(sum(rewards) / num_episodes)
    print "Final Q-Table Values"
    print q_table
    

    The original code uses 2000 episodes. I also tried 10000, 20000 and 30000; with more episodes the values in the Q-table converge and stop changing much.
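
    Once the table has settled, the exploration noise is no longer needed and acting really is just a table lookup, as described earlier. A minimal sketch, reusing the q_table and env from the code above:

    # Run one greedy episode with the learned table (no exploration noise).
    s = env.reset()
    game_over = False
    total_reward = 0
    steps = 0
    while not game_over and steps < 99:
        a = np.argmax(q_table[s, :])              # pure table lookup
        s, reward, game_over, _ = env.step(a)
        total_reward += reward
        steps += 1
    print(total_reward)                           # 1.0 means the goal was reached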

    Q-Learning with network

    The table approach is efficient, but for real-world problems the table can be monstrously large and impossible to keep in memory. Hence the other idea: instead of storing every Q-value, compute, for the current state s, the Q-value of each action; the action with the largest Q-value is the best choice.

    In the FrozenLake example we represent the current state with a 1x16 one-hot input, and the output is the Q-value of each of the 4 actions, so the weight matrix is 16x4. We train this matrix with TensorFlow,
    using the loss function

    loss = ∑(Q-target - Q)²

    The code is as follows:

    # coding=utf-8
    import matplotlib.pyplot as plt
    import numpy as np
    import gym
    import tensorflow as tf
    
    env = gym.make('FrozenLake-v0')
    
    tf.reset_default_graph()
    
    
    # These lines establish the feed-forward part of the network used to choose actions
    input_state = tf.placeholder(shape=[1, 16], dtype=tf.float32)
    xavier_init = tf.contrib.layers.xavier_initializer()
    W = tf.Variable(xavier_init([16, 4]))
    q_out = tf.matmul(input_state, W)
    predict = tf.argmax(q_out, 1)[0]
    
    # Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
    target_q = tf.placeholder(shape=[1, 4], dtype=tf.float32)
    loss = tf.reduce_sum(tf.square(target_q - q_out))
    trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    update_model = trainer.minimize(loss)
    
    init = tf.global_variables_initializer()
    
    # Set learning parameters
    y = .99
    e = 0.1
    num_episodes = 2000
    # create lists to contain total rewards and steps per episode
    rewards = []
    counts = []
    with tf.Session() as sess:
        sess.run(init)
        for i in range(num_episodes):
            # Reset environment and get first new observation
            s = env.reset()
            reward_episode = 0
            d = False
            j = 0
            # The Q-Network
            # # cap the number of steps per episode in case the agent gets stuck in a loop
            while j < 99:
                j += 1
                # Choose an action by greedily (with e chance of random action) from the Q-network
                # # np.identity(16)[s:s + 1] is a 1x16 one-hot row: entry i is 1 if i == s, else 0
                a, q_out_value = sess.run([predict, q_out], feed_dict={input_state: np.identity(16)[s:s + 1]})
                # # ε-greedy selection: with probability e take a random action instead
                if np.random.rand(1) < e:
                    a = env.action_space.sample()
                # Get new state and reward from environment
                new_state, r, d, _ = env.step(a)
                # Obtain the Q' values by feeding the new state through our network
                new_q = sess.run(q_out, feed_dict={input_state: np.identity(16)[new_state:new_state + 1]})
                # Obtain maxQ' and set our target value for chosen action.
                new_max_q = np.max(new_q)
                target_value = q_out_value
                # # overwrite only the entry of the action actually taken with its Bellman target
                target_value[0, a] = r + y * new_max_q
                # Train our network using target and predicted Q values
                _, W1 = sess.run([update_model, W], feed_dict={input_state: np.identity(16)[s:s + 1], target_q: target_value})
                reward_episode += r
                s = new_state
                if d:
                    # Reduce chance of random action as we train the model.
                    e = 1. / ((i / 50) + 10)
                    break
            rewards.append(reward_episode)
            counts.append(j)
    print W1
    print "Percent of succesful episodes: " + str(sum(rewards) / num_episodes) + "%"
    plt.plot(rewards)
    plt.plot(counts)
    

    The network in this example is so simple that it runs fine on a CPU; after roughly 750 episodes it already reaches a decent score. Borrowing the plots from the original post:


    [Plots from the original post: reward per episode and steps per episode.]
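
    A side note on how the two approaches relate: since the input is a one-hot vector, tf.matmul(input_state, W) simply picks out row s of W, so the trained 16x4 weight matrix is itself a Q-table. A minimal sketch that reads the final W1 printed by the code above:

    # one_hot(s) · W equals W[s, :], so an argmax over that row is the greedy action.
    s = 0                                # the start block, as an example
    greedy_action = np.argmax(W1[s, :])
    print(greedy_action)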

    Postscript

    This was a first encounter with RL; more interesting things will follow. For instance, reacting to raw game images the way a human player does, which calls for a convolutional neural network to extract image features instead of a one-hot-style array input; caching the agent's past training transitions and sampling random batches from them, which greatly strengthens training (experience replay); training with two networks (Double DQN); and splitting a single network into advantage (a) and value (v) streams (Dueling DQN).
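
    To give a flavour of experience replay, here is a minimal sketch of a replay buffer. The class and method names (ReplayBuffer, add, sample) are my own illustration, not taken from the DQN paper or the Medium series:

    import random
    from collections import deque


    class ReplayBuffer(object):
        """Cache past transitions and hand back random mini-batches for training."""

        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)    # the oldest transitions fall out automatically

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # sampling at random breaks the correlation between consecutive transitions
            return random.sample(self.buffer, batch_size)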

    Thanks to the Medium author for the painstaking walkthrough and to DeepMind for generously publishing the paper.
