
RL[0] - First Encounter

Author: db24cc | Published on 2017-12-02 22:11

    Contents

    • Background
    • Q-Learning with table
    • Q-Learning with network
    • Postscript

    Background

    RL is short for reinforcement learning, a subfield of machine learning. A more rigorous definition:

    Reinforcement learning (RL) is an area of machine learning inspired by behaviourist psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.

    The way I see it, RL is an optimal-solution search problem: through various tricks we get the computer, when faced with a scenario that demands action, to pick the most advantageous move, for example when playing a game.

    Q-Learning with table

    Q-learning is one branch of RL algorithms. The definition pulled from the wiki: Q-learning is a model-free reinforcement learning technique. Specifically, Q-learning can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP).
    I started out learning from Playing Atari with Deep Reinforcement Learning and Simple Reinforcement Learning with Tensorflow; the paper is mainly about DQN.
    The paper describes MDPs and the Bellman equation in detail; here is a brief distillation.
    At each step our agent can take one of a set of actions (A = {1, . . . , K}), and that action earns a corresponding reward. Which action to take depends on two factors:

    1. how good the state it leads to is
    2. the immediate reward the action brings

    The reward can be returned by the system or observed by the agent (much as a human player watches the game screen). Choosing an action means considering not only which action collects the most reward right now, but also which state that action leads to, because the subsequent state determines all subsequent rewards. So we should choose the action a that satisfies

    Q(s, a) = r + γ · max(Q(s', a'))

    where γ is a coefficient that weights the present against the future, s is the current state and s' is the state after the transition. This selection rule is the Bellman equation.
    The mathematical model and framework covering the whole environment, the actions and the rewards is the MDP (Markov decision process).
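
    To make the arithmetic concrete, here is a tiny made-up example (the two candidate actions and all the numbers are hypothetical, only meant to show how γ trades immediate reward against future value):

    gamma = 0.95

    # Action 1: immediate reward 1.0, but the best Q-value reachable from its next state is only 0.2
    q_a1 = 1.0 + gamma * 0.2    # 1.19

    # Action 2: no immediate reward, but it leads to a state whose best Q-value is 2.0
    q_a2 = 0.0 + gamma * 2.0    # 1.90

    # Action 2 wins: the Bellman target favours reaching a good state, not just grabbing reward now.
    print(q_a1)
    print(q_a2)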


    If the system returns the reward of the current action, then as long as we know the optimal value of every subsequent step, each step can simply follow the Bellman equation. The problem therefore turns into computing the optimal value for every state, storing those values, and letting the agent just look them up in a table when it acts.
    Below we take FrozenLake as an example and see how the table of Q-values is computed.

    The examples in this series all come from Simple Reinforcement Learning with Tensorflow.

    With OpenAI gym it is easy to simulate many such toy games.

    The FrozenLake environment consists of a 4x4 grid of blocks, each one either being the start block, the goal block, a safe frozen block, or a dangerous hole. The objective is to have an agent learn to navigate from the start to the goal without moving onto a hole. At any given time the agent can choose to move either up, down, left, or right.
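
    As a warm-up, the raw interaction loop with gym looks like the minimal sketch below. It uses the same old gym API as the rest of this post (env.reset() returns an integer state and env.step() returns a 4-tuple); the agent here just moves at random.

    import gym

    env = gym.make('FrozenLake-v0')
    state = env.reset()                                 # an integer in [0, 15]: which block the agent stands on
    game_over = False
    while not game_over:
        action = env.action_space.sample()              # one of the four moves, chosen at random
        state, reward, game_over, _ = env.step(action)  # reward is 1.0 only when the goal is reached
    print(reward)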

    Each state has 4 actions and there are 16 states in total, so the table is 16x4. In an MDP, decision making is partly random and partly under the control of the decision maker; some call this the ε-greedy approach. I refactored the original author's variable names;
    the code is below (comments starting with # are the original author's; comments marked '# #' or wrapped in """ are mine).

    # coding=utf-8
    import numpy as np
    import gym
    
    env = gym.make('FrozenLake-v0')
    
    q_table = np.zeros([env.observation_space.n, env.action_space.n])
    # Set learning parameters
    lr = .8
    y = .95
    num_episodes = 2000
    rewards = []
    for i in range(num_episodes):
        # Reset environment and get first new observation
        s = env.reset()
        reward_episode = 0
        game_over = False
        j = 0
        # The Q-Table learning algorithm
        while j < 99:
            j += 1
            # Choose an action by greedily (with noise) picking from Q table
            """
            randn return a sequence of numbers from the "standard normal" distribution, with when i becoming larger and
            larger, random take smaller and smaller impact of decision making, @very begging it just random choice 
            """
            action_to_be_taken = np.argmax(q_table[s, :] + np.random.randn(1, env.action_space.n) * (1. / (i + 1)))
            # Get new state and reward from environment
            new_state, reward, game_over, _ = env.step(action_to_be_taken)
            # Update Q-Table with new knowledge
            """
            Bellman equation: Q(s, a) = r + γ·max(Q(s', a')).
            The update below moves the old estimate q_table[s, action_to_be_taken] a fraction lr
            towards the target reward + y * max(q_table[new_state, :]). Because the values in
            q_table[new_state, :] keep changing as training goes on, the same entry has to be
            revisited and updated many times before the table settles down.
            """
            q_table[s, action_to_be_taken] = q_table[s, action_to_be_taken] + lr * (reward + y * np.max(q_table[new_state, :]) - q_table[s, action_to_be_taken])
            reward_episode += reward
            s = new_state
            if game_over:
                break
        rewards.append(reward_episode)
    
    print "Score over time: " + str(sum(rewards) / num_episodes)
    print "Final Q-Table Values"
    print q_table
    

    The original code uses 2000 episodes. I also tried 10000, 20000 and 30000; with more episodes the values in the Q-table converge and stop changing much.
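
    Once the table has settled, the exploration noise is no longer needed and acting really is just a table lookup, as described earlier. A minimal sketch, reusing the q_table and env from the code above:

    # Run one greedy episode with the learned table (no exploration noise).
    s = env.reset()
    game_over = False
    total_reward = 0
    steps = 0
    while not game_over and steps < 99:
        a = np.argmax(q_table[s, :])              # pure table lookup
        s, reward, game_over, _ = env.step(a)
        total_reward += reward
        steps += 1
    print(total_reward)                           # 1.0 means the goal was reached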

    Q-Learning with network

    The table approach is efficient, but for real-world problems the table can be monstrously large and impossible to keep in memory. Hence the other idea: instead of storing every Q-value, compute, for the current state s, the Q-value of each action; the action with the largest Q-value is the best choice.

    In the FrozenLake example we represent the current state with a 1x16 one-hot input, and the output is the Q-value of each of the 4 actions, so the weight matrix is 16x4. We train this matrix with TensorFlow,
    using the loss function

    loss = ∑(Q-target - Q)²

    The code is as follows:

    # coding=utf-8
    import matplotlib.pyplot as plt
    import numpy as np
    import gym
    import tensorflow as tf
    
    env = gym.make('FrozenLake-v0')
    
    tf.reset_default_graph()
    
    
    # These lines establish the feed-forward part of the network used to choose actions
    input_state = tf.placeholder(shape=[1, 16], dtype=tf.float32)
    xavier_init = tf.contrib.layers.xavier_initializer()
    W = tf.Variable(xavier_init([16, 4]))
    q_out = tf.matmul(input_state, W)
    predict = tf.argmax(q_out, 1)[0]
    
    # Below we obtain the loss by taking the sum of squares difference between the target and prediction Q values.
    target_q = tf.placeholder(shape=[1, 4], dtype=tf.float32)
    loss = tf.reduce_sum(tf.square(target_q - q_out))
    trainer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
    update_model = trainer.minimize(loss)
    
    init = tf.global_variables_initializer()
    
    # Set learning parameters
    y = .99
    e = 0.1
    num_episodes = 2000
    # create lists to contain total rewards and steps per episode
    rewards = []
    counts = []
    with tf.Session() as sess:
        sess.run(init)
        for i in range(num_episodes):
            # Reset environment and get first new observation
            s = env.reset()
            reward_episode = 0
            d = False
            j = 0
            # The Q-Network
            # # cap the number of steps per episode in case the agent gets stuck in a loop
            while j < 99:
                j += 1
                # Choose an action by greedily (with e chance of random action) from the Q-network
                # # np.identity(16)[s:s + 1] is a 1x16 one-hot row: entry i is 1 if i == s, else 0
                a, q_out_value = sess.run([predict, q_out], feed_dict={input_state: np.identity(16)[s:s + 1]})
                # # ε-greedy selection: with probability e take a random action instead
                if np.random.rand(1) < e:
                    a = env.action_space.sample()
                # Get new state and reward from environment
                new_state, r, d, _ = env.step(a)
                # Obtain the Q' values by feeding the new state through our network
                new_q = sess.run(q_out, feed_dict={input_state: np.identity(16)[new_state:new_state + 1]})
                # Obtain maxQ' and set our target value for chosen action.
                new_max_q = np.max(new_q)
                target_value = q_out_value
                # # overwrite only the entry of the action actually taken with its Bellman target
                target_value[0, a] = r + y * new_max_q
                # Train our network using target and predicted Q values
                _, W1 = sess.run([update_model, W], feed_dict={input_state: np.identity(16)[s:s + 1], target_q: target_value})
                reward_episode += r
                s = new_state
                if d:
                    # Reduce chance of random action as we train the model.
                    e = 1. / ((i / 50) + 10)
                    break
            rewards.append(reward_episode)
            counts.append(j)
    print W1
    print "Percent of succesful episodes: " + str(sum(rewards) / num_episodes) + "%"
    plt.plot(rewards)
    plt.plot(counts)
    

    The network in this example is so simple that it runs fine on a CPU; after roughly 750 episodes it already reaches a decent score. Borrowing the plots from the original post:


    [Plots from the original post: reward per episode and steps per episode.]
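
    A side note on how the two approaches relate: since the input is a one-hot vector, tf.matmul(input_state, W) simply picks out row s of W, so the trained 16x4 weight matrix is itself a Q-table. A minimal sketch that reads the final W1 printed by the code above:

    # one_hot(s) · W equals W[s, :], so an argmax over that row is the greedy action.
    s = 0                                # the start block, as an example
    greedy_action = np.argmax(W1[s, :])
    print(greedy_action)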

    Postscript

    This was a first encounter with RL; more interesting things will follow. For instance, reacting to raw game images the way a human player does, which calls for a convolutional neural network to extract image features instead of a one-hot-style array input; caching the agent's past training transitions and sampling random batches from them, which greatly strengthens training (experience replay); training with two networks (Double DQN); and splitting a single network into advantage (a) and value (v) streams (Dueling DQN).
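
    To give a flavour of experience replay, here is a minimal sketch of a replay buffer. The class and method names (ReplayBuffer, add, sample) are my own illustration, not taken from the DQN paper or the Medium series:

    import random
    from collections import deque


    class ReplayBuffer(object):
        """Cache past transitions and hand back random mini-batches for training."""

        def __init__(self, capacity=10000):
            self.buffer = deque(maxlen=capacity)    # the oldest transitions fall out automatically

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # sampling at random breaks the correlation between consecutive transitions
            return random.sample(self.buffer, batch_size)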

    Thanks to the Medium author for the painstaking walkthrough and to DeepMind for generously publishing the paper.
