Implementation Using a Policy Network and a Value Network

Author: 碧影江白 | Published 2018-02-21 20:28

    As we know, the famous AlphaGo is built from three components working together: a policy network, a value network, and Monte Carlo Tree Search. The value network evaluates the board position, while the policy network chooses the move.

    Monte Carlo Tree Search, a strategy for multi-round sequential games, will be covered later; today we make basic use of the two networks by applying them to CartPole. CartPole is a simple game: at every step we apply a force pushing the cart either left or right, and the game ends once the cart drifts more than 2.4 units from the center or the pole tilts more than 15 degrees.

    We implement this with the help of OpenAI Gym.
    Now to the main part:

    Policy network

    A policy network is a neural network that observes the current environment state and directly predicts the best action policy, i.e., the probability of each possible action, so that following it yields the largest expected reward.
    This gives us a plan for CartPole: from the input environment state, compute the probability of each action.
    Here we use a single hidden layer:

    H = 50
    
    observate = tf.placeholder(tf.float32, [None, 4], name="input_x")
    W1 = tf.get_variable("w1", shape=[4, H],
                         initializer=tf.contrib.layers.xavier_initializer())
    layer1 = tf.nn.relu(tf.matmul(observate, W1))
    W2 = tf.get_variable("w2", shape=[H, 1],
                         initializer=tf.contrib.layers.xavier_initializer())
    score = tf.matmul(layer1, W2)
    probability = tf.nn.sigmoid(score)
    

    Here H is the number of nodes in the hidden layer. The observation observate is not pixel data but a four-element array holding the cart position, the cart velocity, the pole angle, and the pole's angular velocity.
    We encode a push to the left as action 0 and a push to the right as action 1; probability is the predicted probability that the action is 1.
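
    To make these shapes concrete, here is a quick standalone check with the classic Gym API used throughout this post (a minimal sketch, separate from the training code):

    import gym

    env = gym.make('CartPole-v0')
    obs = env.reset()
    print(obs.shape)          # (4,): cart position, cart velocity, pole angle, pole angular velocity
    print(env.action_space)   # Discrete(2): action 0 pushes left, action 1 pushes right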

    The direction of training is then: for a given state, the higher an action's value, the larger its predicted probability should be.
    We give each action a reward of 1 if the game has not ended after taking it, and 0 otherwise, so in every state there is an action whose reward is 1.
    The value we actually want to learn toward is not just the immediate reward but the discounted return, i.e., the current reward plus the potential future rewards: R_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...

    gamma is a number between 0 and 1, which keeps this target from diverging:

    def discount_reward(r):
        # compute the discounted potential value of each step from the rewards r and gamma
        discount_r = np.zeros_like(r)
        running_add = 0
        for t in reversed(range(r.size)):
            running_add = running_add * gamma + r[t]
            discount_r[t] = running_add
        return discount_r
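
    For example, with gamma = 0.99 and three steps that each earned a reward of 1, the function above gives (a quick standalone check):

    import numpy as np

    gamma = 0.99
    r = np.array([1.0, 1.0, 1.0])
    # walking backwards: 1.0, then 1.0 * 0.99 + 1 = 1.99, then 1.99 * 0.99 + 1 = 2.9701
    print(discount_reward(r))   # [2.9701 1.99   1.    ]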
    

    This gives the potential (discounted) value of every action taken.
    The result we want from training is: the larger an action's value, the larger its probability, and vice versa. We therefore set the loss to the negative of the log-probability of the chosen action multiplied by that action's discounted value: loss = -mean(log p(action | state) * advantage).
    Minimizing this loss raises the probability of the actions that were actually taken in proportion to how valuable they turned out to be; since the actions that keep the game running are the valuable ones, this drives the policy toward the desired behavior.
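
    In the code below the log-probability is written as loglik = tf.log(input_y * (input_y - probability) + (1 - input_y) * (input_y + probability)), where input_y = 1 - action is the label stored for each step. A quick NumPy check (a standalone sketch, not part of the TensorFlow graph) shows that this expression is exactly the log-probability of the action that was taken:

    import numpy as np

    def loglik(y, p):
        # same algebraic form as the TensorFlow expression in the code below
        return np.log(y * (y - p) + (1 - y) * (y + p))

    p = 0.8                  # predicted probability of action 1
    print(loglik(0.0, p))    # action 1 was taken (y = 0): log(p)     = log(0.8)
    print(loglik(1.0, p))    # action 0 was taken (y = 1): log(1 - p) = log(0.2)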

    Putting all of the above together, the full Policy network implementation is:

    import numpy as np
    import tensorflow as tf
    import gym
    import os
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
    env = gym.make('CartPole-v0')
    
    env.reset()
    H = 50
    batch_size = 25
    learning_rate = 1e-1
    D = 4
    gamma = 0.99
    xs, ys, drs = [], [], []
    reward_sum = 0
    episode_number = 1
    total_episodes = 1000
    
    # compute the probability of action 1 from the current state via the hidden layer
    observate = tf.placeholder(tf.float32, [None, D], name="input_x")
    W1 = tf.get_variable("w1", shape=[D, H],
                         initializer=tf.contrib.layers.xavier_initializer())
    layer1 = tf.nn.relu(tf.matmul(observate, W1))
    W2 = tf.get_variable("w2", shape=[H, 1],
                         initializer=tf.contrib.layers.xavier_initializer())
    score = tf.matmul(layer1, W2)
    probability = tf.nn.sigmoid(score)
    
    # compute the loss and gradients from the predicted probability
    input_y = tf.placeholder(tf.float32, [None, 1], name="input_y")
    advantages = tf.placeholder(tf.float32, name="reward_signal")
    loglik = tf.log(input_y * (input_y - probability) +
                    (1 - input_y) * (input_y + probability))
    loss = -tf.reduce_mean(loglik * advantages)
    
    tvars = tf.trainable_variables()
    newGrads = tf.gradients(loss, tvars)
    
    
    # apply the accumulated gradients with Adam to train the two-layer network
    adam = tf.train.AdamOptimizer(learning_rate=learning_rate)
    W1grad = tf.placeholder(tf.float32, name="batch_grad1")
    W2grad = tf.placeholder(tf.float32, name="batch_grad2")
    batchGrad = [W1grad, W2grad]
    updateGrads = adam.apply_gradients(zip(batchGrad, tvars))
    
    
    def discount_reward(r):
        # compute the discounted potential value of each step from the rewards r and gamma
        discount_r = np.zeros_like(r)
        running_add = 0
        for t in reversed(range(r.size)):
            running_add = running_add * gamma + r[t]
            discount_r[t] = running_add
        return discount_r
    
    # run the training session
    with tf.Session() as sess:
        rendering = False
        init = tf.global_variables_initializer()
        sess.run(init)
        observation = env.reset()
        gradBuff = sess.run(tvars)
        for ix, grad in enumerate(gradBuff):
            gradBuff[ix] = grad * 0
        while episode_number <= total_episodes:
    
            if reward_sum / batch_size > 100 or rendering:
                rendering = True
                env.render()
    
            x = np.reshape(observation, [1, D])
    
            tfprob = sess.run(probability, feed_dict={observate: x})
            action = 1 if np.random.uniform() < tfprob else 0
            xs.append(x)
            y = 1 - action
            ys.append(y)
    
            observation, reward, done, info = env.step(action)
            reward_sum += reward
            drs.append(reward)
    
            if done:
                episode_number += 1
                epx = np.vstack(xs)
                epy = np.vstack(ys)
                epr = np.vstack(drs)
                xs, ys, drs = [], [], []
                discount_epr = discount_reward(epr)
                discount_epr -= np.mean(discount_epr)
                discount_epr /= np.std(discount_epr)
    
                tGrad = sess.run(newGrads, feed_dict={observate:epx,
                                                      input_y:epy,
                                                      advantages: discount_epr})
                for ix, grad in enumerate(tGrad):
                    gradBuff[ix] += grad
    
                if episode_number % batch_size == 0:
                    sess.run(updateGrads, feed_dict={W1grad:gradBuff[0],
                                                     W2grad:gradBuff[1]})
                    for ix, grad in enumerate(gradBuff):
                        gradBuff[ix] = grad * 0
                    print('Average reward for episode %d: %f.' % \
                          (episode_number, reward_sum/batch_size))
    
                    if reward_sum/batch_size >= 200:
                        print('Task solve in', episode_number, 'episodes!')
                        break
    
                    reward_sum = 0
    
                observation = env.reset()
    

    Observed results:

    After only a hundred-odd episodes, the average reward already reaches 100.

    Value network

    Unlike the policy network, the value network learns the expected value of each action, an approach known as Q-learning. The expected value, written Q, is the maximum total reward we can expect to collect from the current step through all subsequent steps.
    A brief introduction to using Q-learning: https://www.jianshu.com/p/1c0d5e83b066

    A Q matrix records the Q-value of every action in every state. In even a slightly more complex environment such as CartPole, however, there are far too many states to record them all in a single Q matrix, so we introduce DQN, i.e., a deeper neural network. DQN differs from plain Q-learning in that, instead of a simple lookup table, we train a neural network on the input states and the Q-value of each action.
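
    For reference, plain tabular Q-learning keeps exactly such a matrix and updates one entry per step with Q[s, a] <- Q[s, a] + alpha * (r + gamma * max_a' Q[s', a'] - Q[s, a]). A minimal sketch on a hypothetical toy problem (a handful of discrete states, not CartPole) looks like this; DQN replaces the table with the network described next:

    import numpy as np

    # hypothetical toy setup: 5 discrete states, 2 actions
    n_states, n_actions = 5, 2
    Q = np.zeros((n_states, n_actions))
    alpha, gamma = 0.1, 0.9

    def q_update(s, a, r, s_next):
        # one tabular Q-learning update step
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

    q_update(s=0, a=1, r=1.0, s_next=2)
    print(Q[0, 1])   # 0.1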

    The network takes the state as input and, after processing, outputs a value for each action; the action with the largest value is then selected.
    We use a two-layer neural network to map the state to the value of each action.

        # two-layer Q network: STATE inputs -> HIDDEN_SIZE hidden units -> one Q-value per action
        W1 = tf.Variable(tf.truncated_normal([STATE, HIDDEN_SIZE]))
        b1 = tf.Variable(tf.constant(0.01, shape=[HIDDEN_SIZE]))
        W2 = tf.Variable(tf.truncated_normal([HIDDEN_SIZE, ACTION]))
        b2 = tf.Variable(tf.constant(0.01, shape=[ACTION]))

        state_input = tf.placeholder("float", [None, STATE])
        h_layer = tf.nn.relu(tf.matmul(state_input, W1) + b1)
        Q_value = tf.matmul(h_layer, W2) + b2
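
    To make this snippet self-contained, the sketch below repeats it with CartPole's dimensions (STATE = 4, ACTION = 2) and HIDDEN_SIZE = 20 as in the full program further down, then queries the Q-values for one dummy state:

    import numpy as np
    import tensorflow as tf

    STATE, ACTION, HIDDEN_SIZE = 4, 2, 20

    W1 = tf.Variable(tf.truncated_normal([STATE, HIDDEN_SIZE]))
    b1 = tf.Variable(tf.constant(0.01, shape=[HIDDEN_SIZE]))
    W2 = tf.Variable(tf.truncated_normal([HIDDEN_SIZE, ACTION]))
    b2 = tf.Variable(tf.constant(0.01, shape=[ACTION]))

    state_input = tf.placeholder("float", [None, STATE])
    h_layer = tf.nn.relu(tf.matmul(state_input, W1) + b1)
    Q_value = tf.matmul(h_layer, W2) + b2

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        s = np.random.randn(1, STATE)                       # a dummy state
        q = sess.run(Q_value, feed_dict={state_input: s})   # Q-values, shape (1, ACTION)
        print(np.argmax(q[0]))                              # greedy action for this state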
    

    We then use a buffer to cache a number of experience tuples and, at each training step, randomly draw batch_size of them for training. To keep the data fresh, once the buffer exceeds its capacity the oldest entries are replaced by new ones:

    from collections import deque

    buffer = deque()

    def add(state, action, reward, next_state, done):
        # once the buffer is over capacity, evict the oldest experience before appending new data
        if len(buffer) > 100:
            buffer.popleft()
        buffer.append((state, action, reward, next_state, done))
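
    Training then draws a random minibatch from this buffer; a minimal sketch (random.sample and BATCH_SIZE = 32 are used the same way inside the full program below):

    import random

    BATCH_SIZE = 32

    if len(buffer) > BATCH_SIZE:
        minibatch = random.sample(buffer, BATCH_SIZE)
        state_batch = [data[0] for data in minibatch]
        action_batch = [data[1] for data in minibatch]
        reward_batch = [data[2] for data in minibatch]
        next_state_batch = [data[3] for data in minibatch]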
    

    The loss is defined as the mean squared difference between the Q-value assigned to the action actually taken and a target value y, where y = reward if the step ended the episode and y = reward + GAMMA * max_a Q(next_state, a) otherwise; minimizing it pulls the two closer together.

    action_input = tf.placeholder("float", [None, ACTION])   # one-hot encoding of the action taken
    y_input = tf.placeholder("float", [None])                 # target value y for each sample
    # the one-hot mask selects the Q-value of the action that was taken
    Q_action = tf.reduce_sum(tf.multiply(Q_value, action_input), reduction_indices=1)
    cost = tf.reduce_mean(tf.square(y_input - Q_action))
    optimizer = tf.train.AdamOptimizer(0.001).minimize(cost)
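
    Building the target batch y_batch follows directly from the definition above; a minimal sketch, reusing the minibatch from the sampling sketch and assuming a default TensorFlow session is active (the DQN class below does the same inside train_Q_network):

    import numpy as np

    GAMMA = 0.9
    Q_next = Q_value.eval(feed_dict={state_input: next_state_batch})

    y_batch = []
    for i in range(len(minibatch)):
        reward, done = minibatch[i][2], minibatch[i][4]
        y_batch.append(reward if done else reward + GAMMA * np.max(Q_next[i]))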
    

    Next we wrap all of this in a class and put the pieces together, which gives us an agent we can train and evaluate:

    import gym
    import tensorflow as tf
    import numpy as np
    import random
    from collections import deque
    
    GAMMA = 0.9
    INITIAL_EPSILON = 0.5
    FINAL_EPSILON = 0.01
    REPLAY_SIZE = 10000
    BATCH_SIZE = 32
    HIDDEN_SIZE = 20
    
    class DQN():
        def __init__(self, env):
            self.replay_buffer = deque()
            self.time_step = 0
            self.epsilon = INITIAL_EPSILON
            self.state_dim = env.observation_space.shape[0]
            self.action_dim = env.action_space.n
    
            self.create_Q_network()
            self.create_training_method()
    
            self.session = tf.InteractiveSession()
            self.session.run(tf.global_variables_initializer())
    
        def create_Q_network(self):
            W1 = self.weight_variable([self.state_dim, HIDDEN_SIZE])
            b1 = self.bias_variable([HIDDEN_SIZE])
            W2 = self.weight_variable([HIDDEN_SIZE,self.action_dim])
            b2 = self.bias_variable([self.action_dim])
    
            self.state_input = tf.placeholder("float",[None,self.state_dim])
    
            h_layer = tf.nn.relu(tf.matmul(self.state_input,W1) + b1)
    
            self.Q_value = tf.matmul(h_layer,W2) + b2
    
        def create_training_method(self):
            self.action_input = tf.placeholder("float",[None,self.action_dim]) # one-hot representation of the action
            self.y_input = tf.placeholder("float",[None])
            Q_action = tf.reduce_sum(tf.multiply(self.Q_value,self.action_input),reduction_indices = 1)
            self.cost = tf.reduce_mean(tf.square(self.y_input - Q_action))
            self.optimizer = tf.train.AdamOptimizer(0.0001).minimize(self.cost)
    
        def perceive(self,state,action,reward,next_state,done):
            one_hot_action = np.zeros(self.action_dim)
            one_hot_action[action] = 1
            self.replay_buffer.append((state,one_hot_action,reward,next_state,done))
            if len(self.replay_buffer) > REPLAY_SIZE:
                self.replay_buffer.popleft()
    
            if len(self.replay_buffer) > BATCH_SIZE:
                self.train_Q_network()
    
        def train_Q_network(self):
            self.time_step += 1
            minibatch = random.sample(self.replay_buffer,BATCH_SIZE)
            state_batch = [data[0] for data in minibatch]
            action_batch = [data[1] for data in minibatch]
            reward_batch = [data[2] for data in minibatch]
            next_state_batch = [data[3] for data in minibatch]
            y_batch = []
            Q_value_batch = self.Q_value.eval(feed_dict={self.state_input:next_state_batch})
            for i in range(0,BATCH_SIZE):
                done = minibatch[i][4]
                if done:
                    y_batch.append(reward_batch[i])
                else :
                    y_batch.append(reward_batch[i] + GAMMA * np.max(Q_value_batch[i]))
            self.optimizer.run(feed_dict={
                self.y_input:y_batch,
                self.action_input:action_batch,
                self.state_input:state_batch
            })
    
        def egreedy_action(self,state):
            value = self.Q_value.eval(feed_dict = {
                self.state_input:[state]
            })
            # decay epsilon linearly, but never below FINAL_EPSILON
            self.epsilon = max(FINAL_EPSILON, self.epsilon - (INITIAL_EPSILON - FINAL_EPSILON) / 10000)
            Q_value = value[0]
            if random.random() <= self.epsilon:
                return random.randint(0,self.action_dim - 1)
            else:
                return np.argmax(Q_value)
    
    
        def action(self,state):
            value = self.Q_value.eval(feed_dict = {
            self.state_input:[state]
            })
    
            return np.argmax(value[0])
    
        def weight_variable(self,shape):
            initial = tf.truncated_normal(shape)
            return tf.Variable(initial)
    
        def bias_variable(self,shape):
            initial = tf.constant(0.01, shape = shape)
            return tf.Variable(initial)
    
    ENV_NAME = 'CartPole-v0'
    EPISODE = 10000
    STEP = 300
    TEST = 10
    
    
    def main():
        env = gym.make(ENV_NAME)
        agent = DQN(env)
    
        for episode in range(EPISODE):
    
            state = env.reset()
    
            for step in range(STEP):
                action = agent.egreedy_action(state)
                next_state,reward,done,_ = env.step(action)
    
                agent.perceive(state,action,reward,next_state,done)
                state = next_state
                if done:
                    break
    
            if episode % 100 == 0:
                total_reward = 0
                for i in range(TEST):
                    state = env.reset()
                    for j in range(STEP):
                        if total_reward/TEST >= 160:
                            env.render()
                        action = agent.action(state)
                        state,reward,done,_ = env.step(action)
                        total_reward += reward
                        if done:
                            break
                ave_reward = total_reward/TEST
                print ('episode: ',episode,'Evaluation Average Reward:',ave_reward)
    
    
    if __name__ == '__main__':
        main()
    
