Deep Reinforcement Learning (5): Policy Gradients (1). This post is based on the Berkeley CS285 lecture notes on the Policy Gradients algorithm. It also draws on a blog article to explain Policy Gradients in a more accessible way, and provides an implementation in PyTorch.
Policy Gradients (PG) is an on-policy algorithm that can learn stochastic policies. (In general, stochastic policies outperform deterministic policies.)
Concepts
Trajectory
Under a fixed policy, a trajectory is the sequence of states, actions, and rewards obtained by running that policy in the environment:
$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T, a_T, r_T)$$
Reward Function
Here a discount factor $\gamma \in (0, 1]$ is introduced, which expresses that rewards further in the future are worth less today (the idea of discounting to present value):
$$R(\tau) = \sum_{t=0}^{T} \gamma^{t}\, r_t$$
Performance Metric
The performance of a policy is measured by the expected return of the trajectories it generates:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]$$
Derivative (Gradient)
Because we want to improve performance by gradient ascent, we need the derivative of the performance metric with respect to the policy parameters $\theta$:
$$\nabla_\theta J(\theta) = \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]$$
We use an identity that helps us simplify this (the log-derivative trick):
$$\nabla_\theta P(\tau \mid \theta) = P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)$$
We can then write the gradient as an expectation:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log P(\tau \mid \theta)\, R(\tau)\big]$$
To evaluate this expression, the key question is: how do we compute $\nabla_\theta \log P(\tau \mid \theta)$? We can use the chain rule of probability to factor the trajectory probability.
We know that
$$P(\tau \mid \theta) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$$
Taking the logarithm of both sides, we get:
$$\log P(\tau \mid \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \Big[\log P(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)\Big]$$
Then we take the gradient; the initial-state distribution and the environment dynamics do not depend on $\theta$, so their terms vanish:
$$\nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$$
Finally, we obtain the policy gradient:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)\right]$$
Since this is an expectation, in practice we can simply estimate it with samples: run the current policy to collect a set of trajectories $\mathcal{D}$ and average over them:
$$\hat{g} = \frac{1}{|\mathcal{D}|} \sum_{\tau \in \mathcal{D}} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau)$$
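In PyTorch we do not implement this gradient by hand: we build a surrogate loss whose gradient is exactly the estimator above and let autograd do the work. A minimal sketch, with `log_probs` and `returns` standing in for tensors collected from sampled trajectories (these names are illustrative, not from the code below):

```python
import torch

# log_probs: log pi_theta(a_t | s_t) for every step of the sampled trajectories
# returns:   the return R(tau) (or reward-to-go) attached to each of those steps
log_probs = torch.randn(64, requires_grad=True)   # placeholders for this sketch
returns = torch.randn(64)

# gradient ascent on J(theta) == gradient descent on -J(theta)
surrogate_loss = -(log_probs * returns).mean()
surrogate_loss.backward()   # autograd now holds the sampled policy gradient
```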
Rewards To Go
When we actually compute the rewards, we sum the rewards starting from step $t$, because this matches common sense: we only care about the rewards earned after an action is taken, not the ones received before it. The reward-to-go from step $t$ is
$$\hat{R}_t = \sum_{t'=t}^{T} \gamma^{\,t'-t}\, r_{t'}$$
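A tiny worked example, assuming a three-step episode with rewards $r = (1, 1, 1)$ and $\gamma = 0.9$:
$$\hat{R}_2 = 1, \qquad \hat{R}_1 = 1 + 0.9 \cdot 1 = 1.9, \qquad \hat{R}_0 = 1 + 0.9 \cdot 1 + 0.9^2 \cdot 1 = 2.71$$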
Entropy Bonus
The code adds an entropy bonus. Its main purpose is to encourage the model to keep more uncertainty in its action distribution, so that during training it explores more possibilities. For an introduction to entropy, see this article. In reinforcement learning, an entropy bonus is used in many algorithms; it plays a role somewhat like regularization in supervised learning.
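For a categorical policy the entropy is $H = -\sum_a \pi_\theta(a \mid s)\log \pi_\theta(a \mid s)$. A small sketch of computing it from logits (the numbers are made up for illustration); the manual version matches what calculate_loss does later, and PyTorch's Categorical gives the same result:

```python
import torch
from torch.distributions import Categorical
from torch.nn.functional import log_softmax, softmax

logits = torch.tensor([[2.0, 0.5, -1.0]])   # one state, three actions

# manual computation, as done later in calculate_loss()
p = softmax(logits, dim=1)
entropy_manual = -(p * log_softmax(logits, dim=1)).sum(dim=1)

# equivalent built-in
entropy_builtin = Categorical(logits=logits).entropy()
print(entropy_manual, entropy_builtin)       # the two values agree
```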
Baseline
For a discussion of the baseline, see Deep Reinforcement Learning (5), the section "A Problem with PG".
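The key fact used by the code below is that subtracting from the reward-to-go any baseline $b_t$ that does not depend on the chosen action leaves the gradient estimator unbiased while reducing its variance:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(\hat{R}_t - b_t\big)\right]$$

In the implementation below, the baseline is simply the running mean of the rewards seen so far in the episode, rather than a learned value function.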
Code
Episode Loop
Each trajectory is one episode:
def play_episode(self, episode: int):
    """
    Plays an episode of the environment.
    episode: the episode counter
    Returns:
        sum_weighted_log_probs: the sum of the log-prob of an action multiplied by the reward-to-go from that state
        episode_logits: the logits of every step of the episode - needed to compute entropy for entropy bonus
        sum_of_rewards: sum of the rewards for the episode - needed for the average over 200 episode statistic
        episode: the updated episode counter
    """
    # reset the environment to a random initial state
    state = self.env.reset()
    # initialize the episode buffers
    episode_actions = torch.empty(size=(0,), dtype=torch.long, device=self.DEVICE)
    episode_logits = torch.empty(size=(0, self.env.action_space.n), device=self.DEVICE)
    average_rewards = np.empty(shape=(0,), dtype=np.float64)
    episode_rewards = np.empty(shape=(0,), dtype=np.float64)
    # episode loop
    while True:
        # render the environment (here the environment is a game)
        if not self.finished_rendering_this_epoch:
            self.env.render()
        # get the logits of every action from the agent - this is exactly what the policy outputs
        action_logits = self.agent(torch.tensor(state).float().unsqueeze(dim=0).to(self.DEVICE))
        # keep the history of the logits
        episode_logits = torch.cat((episode_logits, action_logits), dim=0)
        # with the logits in hand, sample an action from the distribution they define
        action = Categorical(logits=action_logits).sample()
        # keep the action - we will need it when computing the rewards
        episode_actions = torch.cat((episode_actions, action), dim=0)
        # take the action in the environment; the env returns the new state and the reward
        state, reward, done, _ = self.env.step(action=action.cpu().item())
        # save the reward
        episode_rewards = np.concatenate((episode_rewards, np.array([reward])), axis=0)
        # save the average reward up to this step; it is the mean of all rewards
        # from the start of the episode, so every reward except the last one
        # is counted repeatedly
        average_rewards = np.concatenate((average_rewards,
                                          np.expand_dims(np.mean(episode_rewards), axis=0)),
                                         axis=0)
        # the episode is over - note that `done` is given by the env: think of it as
        # the game ending or the time limit being reached
        if done:
            # increment the counter
            episode += 1
            # compute the rewards-to-go; this needs gamma and the per-step rewards
            discounted_rewards_to_go = PolicyGradient.get_discounted_rewards(rewards=episode_rewards,
                                                                             gamma=self.GAMMA)
            # subtract the baseline to reduce variance
            discounted_rewards_to_go -= average_rewards  # baseline - state specific average
            sum_of_rewards = np.sum(episode_rewards)
            # one hot encoding mask
            mask = one_hot(episode_actions, num_classes=self.env.action_space.n)
            # compute the log probability of the action taken at every step
            episode_log_probs = torch.sum(mask.float() * log_softmax(episode_logits, dim=1), dim=1)
            # weight the log probabilities by the rewards-to-go
            episode_weighted_log_probs = episode_log_probs * \
                torch.tensor(discounted_rewards_to_go).float().to(self.DEVICE)
            # the quantity we ultimately need
            sum_weighted_log_probs = torch.sum(episode_weighted_log_probs).unsqueeze(dim=0)
            # won't render again this epoch
            self.finished_rendering_this_epoch = True
            return sum_weighted_log_probs, episode_logits, sum_of_rewards, episode
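One practical note, in case you run this code against a recent Gym/Gymnasium release (this is an assumption about your setup, not part of the original code): newer versions return `(obs, info)` from `reset()` and five values from `step()`, so the two environment calls above need a small adaptation, for example:

```python
# Gymnasium / Gym >= 0.26 style; the 4-tuple step() above is the older Gym API
state, _ = self.env.reset()
state, reward, terminated, truncated, _ = self.env.step(action.cpu().item())
done = terminated or truncated
```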
Here is the code that computes the rewards-to-go:
@staticmethod
def get_discounted_rewards(rewards: np.ndarray, gamma: float) -> np.ndarray:
    """
    Calculates the sequence of discounted rewards-to-go.
    Args:
        rewards: the sequence of observed rewards
        gamma: the discount factor
    Returns:
        discounted_rewards: the sequence of the rewards-to-go
    """
    discounted_rewards = np.empty_like(rewards, dtype=np.float64)
    for i in range(rewards.shape[0]):
        gammas = np.full(shape=(rewards[i:].shape[0]), fill_value=gamma)
        # element-wise powers of gamma: gamma^0, gamma^1, gamma^2, ...
        discounted_gammas = np.power(gammas, np.arange(rewards[i:].shape[0]))
        discounted_reward = np.sum(rewards[i:] * discounted_gammas)
        discounted_rewards[i] = discounted_reward
    return discounted_rewards
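A quick sanity check of this function (a usage sketch; the printed values follow directly from the reward-to-go definition above):

```python
import numpy as np

rewards = np.array([1.0, 1.0, 1.0])
# With gamma = 0.5: [1 + 0.5 + 0.25, 1 + 0.5, 1] = [1.75, 1.5, 1.0]
print(PolicyGradient.get_discounted_rewards(rewards=rewards, gamma=0.5))
```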
Now we compute the loss. Note that what we computed above is a reward, something we want to maximize; to use gradient descent we turn the reward into a loss, which is as simple as multiplying by -1.
- When computing the entropy bonus we need probabilities, so softmax is used to turn the logits into probabilities.
- BETA is a hyperparameter (the weight of the entropy bonus).
def calculate_loss(self, epoch_logits: torch.Tensor, weighted_log_probs: torch.Tensor) -> (torch.Tensor, torch.Tensor):
    """
    Calculates the policy "loss" and the entropy bonus
    Args:
        epoch_logits: logits of the policy network we have collected over the epoch
        weighted_log_probs: log-probabilities of the actions taken, weighted by the rewards-to-go
    Returns:
        policy loss + the entropy bonus
        entropy: needed for logging
    """
    policy_loss = -1 * torch.mean(weighted_log_probs)
    # add the entropy bonus
    p = softmax(epoch_logits, dim=1)
    log_p = log_softmax(epoch_logits, dim=1)
    entropy = -1 * torch.mean(torch.sum(p * log_p, dim=1), dim=0)
    # the entropy is also multiplied by -1, turning the bonus into a loss term
    entropy_bonus = -1 * self.BETA * entropy
    return policy_loss + entropy_bonus, entropy
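Putting the two terms of calculate_loss together, the quantity minimized at the end of each epoch is roughly (in the notation of the derivation above, with $b_t$ the running-average baseline, $N$ the number of episodes in the batch, and $\bar{H}$ the average policy entropy over all steps collected in the epoch):

$$\mathcal{L}(\theta) = -\frac{1}{N}\sum_{i=1}^{N} \sum_{t} \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\,\big(\hat{R}_t^{(i)} - b_t^{(i)}\big) \;-\; \beta\, \bar{H}$$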
Finally, putting all of the code above together gives the complete training procedure.
- The formula involves an expectation, so we adjust the policy using the mean loss over multiple trajectories; self.BATCH_SIZE controls how many episodes are averaged in each batch.
def solve_environment(self):
    """
    The main interface for the Policy Gradient solver
    """
    # init the episode and the epoch
    episode = 0
    epoch = 0
    # init the epoch arrays
    # used for entropy calculation
    epoch_logits = torch.empty(size=(0, self.env.action_space.n), device=self.DEVICE)
    epoch_weighted_log_probs = torch.empty(size=(0,), dtype=torch.float, device=self.DEVICE)
    while True:
        # play one episode in the environment
        (episode_weighted_log_prob_trajectory,
         episode_logits,
         sum_of_episode_rewards,
         episode) = self.play_episode(episode=episode)
        # after each episode append the sum of total rewards to the deque
        self.total_rewards.append(sum_of_episode_rewards)
        # append the weighted log-probabilities of actions
        epoch_weighted_log_probs = torch.cat((epoch_weighted_log_probs, episode_weighted_log_prob_trajectory),
                                             dim=0)
        # append the logits - needed for the entropy bonus calculation
        epoch_logits = torch.cat((epoch_logits, episode_logits), dim=0)
        # if the epoch is over - we have epoch trajectories to perform the policy gradient
        if episode >= self.BATCH_SIZE:
            # reset the rendering flag
            self.finished_rendering_this_epoch = False
            # reset the episode count
            episode = 0
            # increment the epoch
            epoch += 1
            # compute the loss:
            #   the logits are used for the entropy bonus
            #   the weighted_log_probs carry the reward; multiplying by -1 turns it into a loss
            loss, entropy = self.calculate_loss(epoch_logits=epoch_logits,
                                                weighted_log_probs=epoch_weighted_log_probs)
            # zero the gradient
            self.adam.zero_grad()
            # backprop
            loss.backward()
            # update the parameters
            self.adam.step()
            # feedback
            print("\r", f"Epoch: {epoch}, Avg Return per Epoch: {np.mean(self.total_rewards):.3f}",
                  end="",
                  flush=True)
            self.writer.add_scalar(tag='Average Return over 100 episodes',
                                   scalar_value=np.mean(self.total_rewards),
                                   global_step=epoch)
            self.writer.add_scalar(tag='Entropy',
                                   scalar_value=entropy,
                                   global_step=epoch)
            # reset the epoch arrays
            # used for entropy calculation
            epoch_logits = torch.empty(size=(0, self.env.action_space.n), device=self.DEVICE)
            epoch_weighted_log_probs = torch.empty(size=(0,), dtype=torch.float, device=self.DEVICE)
            # check if solved
            if np.mean(self.total_rewards) > 200:
                print('\nSolved!')
                break
    # close the environment
    self.env.close()
    # close the writer
    self.writer.close()
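For completeness, here is a minimal sketch of the surrounding pieces these methods assume. The network architecture, hyperparameter values, and environment name below are illustrative guesses, not taken from the original post:

```python
from collections import deque
import gym
import numpy as np
import torch
from torch import nn
from torch.distributions import Categorical
from torch.nn.functional import log_softmax, softmax, one_hot
from torch.utils.tensorboard import SummaryWriter


class PolicyGradient:
    def __init__(self):
        self.DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.GAMMA = 0.99          # discount factor
        self.BETA = 0.1            # weight of the entropy bonus
        self.BATCH_SIZE = 10       # episodes per policy update
        self.env = gym.make('CartPole-v1')
        # a small MLP that maps a state to one logit per action
        self.agent = nn.Sequential(
            nn.Linear(self.env.observation_space.shape[0], 64),
            nn.ReLU(),
            nn.Linear(64, self.env.action_space.n),
        ).to(self.DEVICE)
        self.adam = torch.optim.Adam(self.agent.parameters(), lr=1e-3)
        self.total_rewards = deque(maxlen=100)
        self.finished_rendering_this_epoch = False
        self.writer = SummaryWriter()

    # play_episode, get_discounted_rewards, calculate_loss and
    # solve_environment are the methods shown above


if __name__ == '__main__':
    PolicyGradient().solve_environment()
```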
Training Results
After training for 100 epochs, these are the results:
![](https://img.haomeiwen.com/i25067830/c835f493c4970a26.png)
The entropy decreases as training progresses, which means the uncertainty over actions is gradually shrinking: the model becomes more and more certain about what it wants to do.
![](https://img.haomeiwen.com/i25067830/da51b1ee7a91673c.png)