RL: my_learning_agent.py

Author: 魏鹏飞 | Published 2020-04-09 11:07

    Keywords:

    cross-entropy method, noisy_evaluation, BinaryActionLinearPolicy, do_rollout, mean, std
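
    Before the Gym-specific script, here is a minimal, self-contained sketch of the same cross-entropy idea maximizing a toy black-box function. It is not part of the original code; the toy_cem name and the quadratic objective are illustrative assumptions.

    import numpy as np

    def toy_cem(f, mean, std, batch_size=50, n_iter=20, elite_frac=0.2):
        # Same loop as cem() below: sample, rank, refit the Gaussian to the elites.
        n_elite = int(round(batch_size * elite_frac))
        for _ in range(n_iter):
            samples = mean + std * np.random.randn(batch_size, mean.size)
            scores = np.array([f(s) for s in samples])
            elites = samples[scores.argsort()[::-1][:n_elite]]
            mean, std = elites.mean(axis=0), elites.std(axis=0)
        return mean

    # Maximize f(x) = -||x - 3||^2; the returned mean should land close to [3, 3].
    best = toy_cem(lambda x: -np.sum((x - 3.0) ** 2), mean=np.zeros(2), std=np.ones(2))
    print(best)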

    _policies.py
    # Support code for cem.py
    
    class BinaryActionLinearPolicy(object):
        def __init__(self, theta):
            self.w = theta[:-1]
            self.b = theta[-1]
        def act(self, ob):
            y = ob.dot(self.w) + self.b
            a = int(y < 0)
            return a
    
    class ContinuousActionLinearPolicy(object):
        def __init__(self, theta, n_in, n_out):
            assert len(theta) == (n_in + 1) * n_out
            self.W = theta[0 : n_in * n_out].reshape(n_in, n_out)
            self.b = theta[n_in * n_out : None].reshape(1, n_out)
        def act(self, ob):
            a = ob.dot(self.W) + self.b
            return a
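
    For intuition: BinaryActionLinearPolicy thresholds a single linear score of the observation to choose between CartPole's two actions. A minimal usage sketch, assuming a 4-dimensional observation as in CartPole (the theta values are arbitrary):

    import numpy as np
    from _policies import BinaryActionLinearPolicy

    theta = np.array([0.1, -0.2, 0.5, 0.3, 0.0])  # 4 weights + 1 bias (arbitrary)
    policy = BinaryActionLinearPolicy(theta)

    ob = np.array([0.02, -0.01, 0.03, 0.04])      # example observation
    print(policy.act(ob))                         # 1 if ob.dot(w) + b < 0, else 0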
    
    my_learning_agent.py
    from __future__ import print_function
    
    import gym
    from gym import wrappers, logger
    import numpy as np
    from six.moves import cPickle as pickle
    import json, sys, os
    from os import path
    from _policies import BinaryActionLinearPolicy # Different file so it can be unpickled
    import argparse
    
    def cem(f, th_mean, batch_size, n_iter, elite_frac, initial_std=1.0):
        """
        Generic implementation of the cross-entropy method for maximizing a black-box function
        f: a function mapping from vector -> scalar
        th_mean: initial mean over input distribution
        batch_size: number of samples of theta to evaluate per batch
        n_iter: number of batches
        elite_frac: each batch, select this fraction of the top-performing samples
        initial_std: initial standard deviation over parameter vectors
        """
        n_elite = int(np.round(batch_size*elite_frac))
        th_std = np.ones_like(th_mean) * initial_std
    
        for _ in range(n_iter):
            # Sample batch_size parameter vectors from N(th_mean, diag(th_std**2))
            ths = np.array([th_mean + dth for dth in th_std[None, :] * np.random.randn(batch_size, th_mean.size)])
            # Score every sample with the (noisy) black-box objective
            ys = np.array([f(th) for th in ths])
            # Keep the top n_elite samples and refit the Gaussian to them
            elite_inds = ys.argsort()[::-1][:n_elite]
            elite_ths = ths[elite_inds]
            th_mean = elite_ths.mean(axis=0)
            th_std = elite_ths.std(axis=0)
            yield {'ys': ys, 'theta_mean': th_mean, 'y_mean': ys.mean()}
    
    def do_rollout(agent, env, num_steps, render=False):
        # Run one episode (at most num_steps) and return (total reward, episode length).
        total_rew = 0
        ob = env.reset()
        for t in range(num_steps):
            a = agent.act(ob)
            (ob, reward, done, _info) = env.step(a)
            total_rew += reward
            if render and t%3==0: env.render()
            if done: break
        return total_rew, t+1
    
    if __name__ == '__main__':
        logger.set_level(logger.INFO)
    
        parser = argparse.ArgumentParser()
        parser.add_argument('--display', action='store_true')
        parser.add_argument('target', nargs="?", default="CartPole-v0")
        args = parser.parse_args()
    
        env = gym.make(args.target)
        env.seed(0)
        np.random.seed(0)
        params = dict(n_iter=100, batch_size=10, elite_frac = 0.2)
        num_steps = 200
    
        def noisy_evaluation(theta):
            # Score a parameter vector by the return of a single (noisy) rollout.
            agent = BinaryActionLinearPolicy(theta)
            rew, T = do_rollout(agent, env, num_steps)
            return rew
    
        # Train the agent, rendering a rollout with the current mean parameters after each CEM iteration
        for (i, iterdata) in enumerate(cem(noisy_evaluation, np.zeros(env.observation_space.shape[0]+1), **params)):
            print('Iteration %2i. Episode mean reward: %7.3f'%(i, iterdata['y_mean']))
            agent = BinaryActionLinearPolicy(iterdata['theta_mean'])
            do_rollout(agent, env, 200, render=True)
    
    
        env.close()
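
    The pickle, json, and os imports are unused here; they suggest the snapshotting code from the original Gym cem.py example was dropped. A minimal sketch of how the final mean parameters could be saved and reloaded with the imports already present (the output directory and filename are hypothetical):

    # Sketch only: snapshot the agent built from the last iteration's mean parameters.
    outdir = '/tmp/cem-agent-results'             # hypothetical output directory
    if not path.isdir(outdir):
        os.makedirs(outdir)
    with open(path.join(outdir, 'agent.pkl'), 'wb') as f:
        pickle.dump(BinaryActionLinearPolicy(iterdata['theta_mean']), f)

    # Reload it later. BinaryActionLinearPolicy lives in _policies.py rather than in
    # this script so that such pickles can be loaded from another program.
    with open(path.join(outdir, 'agent.pkl'), 'rb') as f:
        agent = pickle.load(f)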
    
    
    # Results:
    python my_learning_agent.py CartPole-v0
    
    INFO: Making new env: CartPole-v0
    Iteration  0. Episode mean reward:  23.800
    Iteration  1. Episode mean reward:  92.000
    Iteration  2. Episode mean reward: 158.400
    Iteration  3. Episode mean reward: 179.100
    Iteration  4. Episode mean reward: 186.000
    Iteration  5. Episode mean reward: 188.300
    Iteration  6. Episode mean reward: 180.900
    Iteration  7. Episode mean reward: 188.700
    Iteration  8. Episode mean reward: 188.600
    Iteration  9. Episode mean reward: 185.300
    Iteration 10. Episode mean reward: 191.900
    Iteration 11. Episode mean reward: 193.000
    Iteration 12. Episode mean reward: 188.300
    Iteration 13. Episode mean reward: 183.400
    Iteration 14. Episode mean reward: 180.400
    Iteration 15. Episode mean reward: 197.100
    Iteration 16. Episode mean reward: 193.200
    Iteration 17. Episode mean reward: 188.500
    Iteration 18. Episode mean reward: 182.800
    Iteration 19. Episode mean reward: 193.900
    ......
    ......
    ......
    Iteration 81. Episode mean reward: 175.400
    Iteration 82. Episode mean reward: 183.500
    Iteration 83. Episode mean reward: 195.800
    Iteration 84. Episode mean reward: 191.300
    Iteration 85. Episode mean reward: 192.000
    Iteration 86. Episode mean reward: 196.300
    Iteration 87. Episode mean reward: 197.300
    Iteration 88. Episode mean reward: 184.000
    Iteration 89. Episode mean reward: 192.800
    Iteration 90. Episode mean reward: 184.000
    Iteration 91. Episode mean reward: 184.700
    Iteration 92. Episode mean reward: 184.500
    Iteration 93. Episode mean reward: 192.400
    Iteration 94. Episode mean reward: 196.000
    Iteration 95. Episode mean reward: 193.800
    Iteration 96. Episode mean reward: 183.400
    Iteration 97. Episode mean reward: 196.400
    Iteration 98. Episode mean reward: 192.400
    Iteration 99. Episode mean reward: 178.500
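
    CartPole-v0 caps episodes at 200 steps, so a mean reward in the 180-190s means the learned linear policy balances the pole for most of the episode. A quick check one could run before env.close() (a sketch reusing do_rollout; the 100 evaluation episodes are an arbitrary choice):

    # Sketch: evaluate the final mean policy over fresh episodes.
    final_agent = BinaryActionLinearPolicy(iterdata['theta_mean'])
    returns = [do_rollout(final_agent, env, num_steps)[0] for _ in range(100)]
    print('Mean return over 100 evaluation episodes: %.1f' % np.mean(returns))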
    

    At a glance: cem() maintains a Gaussian over the linear policy's parameters (4 weights + 1 bias); each iteration it samples batch_size candidate vectors, scores them with noisy_evaluation (one CartPole rollout each), and refits the mean and std to the top elite_frac of them. On CartPole-v0 the episode mean reward climbs from about 24 to the 180-190s within a handful of iterations.
