RL: frozenlake_value_iteration.py

Author: 魏鹏飞 | Published on 2020-04-13 17:02

Keywords: value_iteration, converged, extract_policy, evaluate_policy

frozenlake_value_iteration.py
"""
Solving FrozenLake environment using Value-Iteration.
Adapted by Bolei Zhou for IERG6130. Originally from Moustafa Alzantot (malzantot@ucla.edu)
"""
import numpy as np
import gym
from gym import wrappers
from gym.envs.registration import register

def run_episode(env, policy, gamma = 1.0, render = False):
    """ Evaluates policy by using it to run an episode and finding its
    total reward.
    args:
    env: gym environment.
    policy: the policy to be used.
    gamma: discount factor.
    render: boolean to turn rendering on/off.
    returns:
    total reward: real value of the total reward received by the agent under policy.
    """
    obs = env.reset()
    total_reward = 0
    step_idx = 0
    while True:
        if render:
            env.render()
        obs, reward, done, _ = env.step(int(policy[obs]))
        total_reward += (gamma ** step_idx * reward)
        step_idx += 1
        if done:
            break
    return total_reward


def evaluate_policy(env, policy, gamma = 1.0, n = 100):
    """ Evaluates a policy by running it n times.
    returns:
    average total reward
    """
    scores = [
            run_episode(env, policy, gamma = gamma, render = False)
            for _ in range(n)]
    return np.mean(scores)

def extract_policy(env, v, gamma = 1.0):
    """ Extract the policy given a value-function """
    policy = np.zeros(env.env.nS)
    for s in range(env.env.nS):
        q_sa = np.zeros(env.action_space.n)
        for a in range(env.action_space.n):
            for next_sr in env.env.P[s][a]:
                # next_sr is a tuple of (probability, next state, reward, done)
                p, s_, r, _ = next_sr
                q_sa[a] += (p * (r + gamma * v[s_]))
        policy[s] = np.argmax(q_sa)
    return policy


def value_iteration(env, gamma = 1.0):
    """ Value-iteration algorithm """
    v = np.zeros(env.env.nS)  # initialize value-function
    max_iterations = 100000
    eps = 1e-20
    for i in range(max_iterations):
        prev_v = np.copy(v)
        for s in range(env.env.nS):
            # Bellman optimality backup: V(s) <- max_a sum_s' p(s'|s,a) * (r + gamma * V(s'))
            q_sa = [sum([p * (r + gamma * prev_v[s_]) for p, s_, r, _ in env.env.P[s][a]]) for a in range(env.env.nA)]
            v[s] = max(q_sa)
        if (np.sum(np.fabs(prev_v - v)) <= eps):
            print ('Value-iteration converged at iteration# %d.' %(i+1))
            break
    return v


if __name__ == '__main__':

    env_name  = 'FrozenLake-v0' # 'FrozenLake8x8-v0'
    env = gym.make(env_name)
    gamma = 1.0
    optimal_v = value_iteration(env, gamma)
    policy = extract_policy(env, optimal_v, gamma)
    policy_score = evaluate_policy(env, policy, gamma, n=1000)
    print('Policy average score = ', policy_score)
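
Both value_iteration and extract_policy rely on the tabular transition model that Gym's toy-text environments expose: env.env.P[s][a] is a list of (probability, next_state, reward, done) tuples, one per possible outcome of taking action a in state s, and the backup V(s) <- max_a sum_s' p(s'|s,a) * (r + gamma * V(s')) is computed directly from it. A minimal sketch for inspecting that table (the state and action indices here are only illustrative):

import gym

env = gym.make('FrozenLake-v0')
# P[s][a] is the outcome distribution for taking action a in state s.
# Each entry is (probability, next_state, reward, done) -- the same
# tuple unpacked inside value_iteration and extract_policy above.
for prob, next_state, reward, done in env.env.P[0][1]:
    print(prob, next_state, reward, done)

On the default (slippery) FrozenLake-v0 this should show several possible outcomes per action, which is also why even the greedy policy does not reach the goal on every episode.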


# Results:
python frozenlake_value_iteration.py 

Value-iteration converged at iteration# 1373.

......
......
......
  (Down)
SFFF
FHFH
FFFH
HFFG
  (Left)
SFFF
FHFH
FFFH
HFFG
......
......
......
  (Up)
SFFF
FHFH
FFFH
HFFG
  (Left)
SFFF
FHFH
FFFH
HFFG
  (Left)
SFFF
FHFH
FFFH
HFFG
......
......
......

Policy average score =  0.742
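
The frames above are FrozenLake's text rendering: the 4x4 map plus the last action taken. To see the extracted policy over the whole grid at once, a short sketch like the following can be appended after the main block; it assumes Gym's usual FrozenLake action encoding (0=Left, 1=Down, 2=Right, 3=Up) and reuses the policy array computed by the script above:

arrows = ['<', 'v', '>', '^']  # assumed encoding: 0=Left, 1=Down, 2=Right, 3=Up
# `policy` is the flat array returned by extract_policy; 4x4 applies to FrozenLake-v0.
grid = np.array([arrows[int(a)] for a in policy]).reshape(4, 4)
print('\n'.join(' '.join(row) for row in grid))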
