Deep Reinforcement Learning (6): Policy Gradients (2)

Author: 数科每日 | Published 2022-02-12 03:48

Deep Reinforcement Learning (5) Policy Gradients (1) introduced the Policy Gradients algorithm based on the Berkeley CS285 lecture notes. This post, following a blog article, explains Policy Gradients again in a more accessible way and provides an implementation in PyTorch.

Policy Gradients (PG) is an on-policy algorithm that learns stochastic policies. (In general, stochastic policies are preferable to deterministic policies.)

Concepts

Trajectory

A sequence of states, actions, and rewards obtained by running a fixed policy:

\tau = (s_{0}, a_{0}, r_{0}, s_{1}, a_{1}, r_{1}, \ldots)

Reward Function

Here we introduce a discount factor \gamma, which expresses that rewards further in the future are worth less (the idea of discounting).

R(\tau)=\sum_{t=0}^{T} \gamma^{t} r_{t}
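For example (a small illustrative calculation), with \gamma = 0.9 and rewards r = (1, 1, 1):

R(\tau) = 1 + 0.9 \cdot 1 + 0.9^{2} \cdot 1 = 2.71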

Performance Metric

J(\pi) = \mathbb{E}_{\tau\sim\pi}[R(\tau)] = \int_{\tau} R(\tau) P(\tau \mid \pi) d \tau

Derivative (Gradient)

Since we want to improve performance via gradient ascent, we need the derivative of the performance with respect to the policy parameters \theta:

\nabla_{\theta} J(\pi)=\nabla_{\theta} \mathbb{E}_{\tau \sim \pi}[R(\tau)]=\int_{\tau} R(\tau) \nabla_{\theta} P(\tau \mid \pi) d \tau

Using an identity that helps us simplify (the log-derivative trick):

\frac{\partial \log (x)}{\partial x}=\frac{1}{x} \;\Rightarrow\; \partial x=x \,\partial \log (x)

we obtain:

\int_{\tau} R(\tau) \nabla_{\theta} P(\tau \mid \pi) d \tau=\int_{\tau} R(\tau) P(\tau \mid \pi) \nabla_{\theta} \log P(\tau \mid \pi) d \tau = \mathbb{E}_{\tau \sim \pi}\left[\nabla_{\theta} \log P(\tau \mid \pi) R(\tau)\right]

To evaluate this expression, the key question is how to compute \log P(\tau \mid \pi). We can use the chain rule of probability:

Given
P(\tau \mid \pi)=\rho\left(s_{0}\right) \prod_{t=0}^{T-1} P\left(s_{t+1} \mid s_{t}, a_{t}\right) \pi\left(a_{t} \mid s_{t}\right)

Taking the logarithm of both sides gives:

\log P(\tau \mid \pi)=\log \rho\left(s_{0}\right)+\sum_{t=0}^{T-1}\left[\log P\left(s_{t+1} \mid s_{t}, a_{t}\right)+\log \pi\left(a_{t} \mid s_{t}\right)\right]

Taking the gradient with respect to \theta:

\nabla_{\theta} \log P(\tau \mid \pi)=\nabla_{\theta} \log \rho\left(s_{0}\right)+\sum_{t=0}^{T-1}\left[\nabla_{\theta} \log P\left(s_{t+1} \mid s_{t}, a_{t}\right)+\nabla_{\theta} \log \pi\left(a_{t} \mid s_{t}\right)\right]

Neither the initial-state distribution \rho(s_{0}) nor the transition dynamics P(s_{t+1} \mid s_{t}, a_{t}) depend on the policy parameters \theta, so their gradients vanish, leaving:

\nabla_{\theta} \log P(\tau \mid \pi)=\sum_{t=0}^{T-1} \nabla_{\theta} \log \pi\left(a_{t} \mid s_{t}\right)

Finally, we arrive at:

\nabla_{\theta} J(\pi)=\mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T-1} \nabla_{\theta} \log \pi\left(a_{t} \mid s_{t}\right) R(\tau)\right]

Since this is an expectation, in practice we can simply estimate it by sampling trajectories (a Monte Carlo estimate).
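Concretely, the sample-based estimate replaces the expectation with a mean over collected trajectories. Below is a minimal, self-contained sketch of this estimate in PyTorch; the log-probabilities and returns are made up purely for illustration and are not taken from the author's code, where they would come from the policy network and the environment.

import torch

# log_probs[i, t] = log pi(a_t | s_t) for trajectory i; returns[i] = R(tau_i).
# Both tensors are made up for illustration only.
log_probs = torch.randn(4, 10, requires_grad=True)   # 4 trajectories, 10 steps each
returns = torch.tensor([1.0, 0.5, 2.0, -1.0])

# Sample mean of sum_t log pi(a_t|s_t) * R(tau); negating it turns
# gradient ascent on J into gradient descent on a loss.
loss = -(log_probs.sum(dim=1) * returns).mean()
loss.backward()    # gradients of the loss are now available for an optimizer step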

Rewards To Go

When actually computing the reward weight for the action at time step t, we sum only the rewards from t onward, which matches intuition: we only care about the reward obtained after taking the action.

\hat{R}_{t}=\sum_{t^{\prime}=t}^{T-1} \gamma^{t^{\prime}-t} r_{t^{\prime}}

Entropy Bonus

The code adds an entropy bonus, whose main purpose is to encourage the model to keep some uncertainty in its action distribution, so that it explores more possibilities during training. For an introduction to entropy, see this article. In reinforcement learning, an entropy bonus appears in many algorithms; it plays a role somewhat like regularization in supervised learning.
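As a quick illustration of why entropy rewards exploration (a standalone sketch, not part of the author's code): a near-uniform action distribution has high entropy, while a near-deterministic one has entropy close to zero, so adding the entropy term to the objective favors policies that keep exploring.

import torch
from torch.distributions import Categorical

# Entropy of a near-uniform vs. a near-deterministic action distribution.
uniform = Categorical(logits=torch.zeros(4))                       # all 4 actions equally likely
peaked = Categorical(logits=torch.tensor([10.0, 0.0, 0.0, 0.0]))   # almost always action 0

print(uniform.entropy())  # ~1.386 (= log 4), the maximum for 4 actions
print(peaked.entropy())   # close to 0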

Baseline

For a discussion of the baseline, see Deep Reinforcement Learning (5), the section on a problem with PG.

Code

Episode Loop

Each trajectory \tau is one episode:

def play_episode(self, episode: int):
    """
        Plays an episode of the environment.
        episode: the episode counter
        Returns:
            sum_weighted_log_probs: the sum of the log-prob of an action multiplied by the reward-to-go from that state
            episode_logits: the logits of every step of the episode - needed to compute entropy for entropy bonus
            finished_rendering_this_epoch: pass-through rendering flag
            sum_of_rewards: sum of the rewards for the episode - needed for the average over 200 episode statistic
    """
    # reset the environment to a random initial state
    state = self.env.reset()

    # initialize the per-episode buffers
    episode_actions = torch.empty(size=(0,), dtype=torch.long, device=self.DEVICE)
    episode_logits = torch.empty(size=(0, self.env.action_space.n), device=self.DEVICE)
    average_rewards = np.empty(shape=(0,), dtype=np.float64)
    episode_rewards = np.empty(shape=(0,), dtype=np.float64)

    # episode loop
    while True:

        # render the environment (here the environment is a game)
        if not self.finished_rendering_this_epoch:
            self.env.render()

        # get the logits for every action from the agent - this is what the policy outputs
        action_logits = self.agent(torch.tensor(state).float().unsqueeze(dim=0).to(self.DEVICE))

        # append the logits to the episode history
        episode_logits = torch.cat((episode_logits, action_logits), dim=0)

        # with the logits we can sample an action from the distribution they define
        action = Categorical(logits=action_logits).sample()

        # store the action - needed later to pick out the log-probabilities of the actions taken
        episode_actions = torch.cat((episode_actions, action), dim=0)

        # take the action in the environment; the env returns the new state and the reward
        state, reward, done, _ = self.env.step(action=action.cpu().item())

        # store the reward
        episode_rewards = np.concatenate((episode_rewards, np.array([reward])), axis=0)

        # store the running average reward up to this step (the mean of all rewards so far),
        # so every reward except the last one contributes to more than one average
        average_rewards = np.concatenate((average_rewards,
                                          np.expand_dims(np.mean(episode_rewards), axis=0)),
                                         axis=0)

        # the episode ends when the env signals done: think of the game being over or a time limit being reached
        if done:

            # increment the episode counter
            episode += 1

            # compute the rewards-to-go, which requires gamma and the per-step rewards
            discounted_rewards_to_go = PolicyGradient.get_discounted_rewards(rewards=episode_rewards,
                                                                             gamma=self.GAMMA)

            # subtract the baseline to reduce variance
            discounted_rewards_to_go -= average_rewards  # baseline - state specific average

            sum_of_rewards = np.sum(episode_rewards)

            # one hot encoding mask
            mask = one_hot(episode_actions, num_classes=self.env.action_space.n)

            # compute the log-probability of the action taken at every step
            episode_log_probs = torch.sum(mask.float() * log_softmax(episode_logits, dim=1), dim=1)

            # weight the log-probabilities by the rewards-to-go
            episode_weighted_log_probs = episode_log_probs * \
                torch.tensor(discounted_rewards_to_go).float().to(self.DEVICE)

            # the quantity we ultimately need
            sum_weighted_log_probs = torch.sum(episode_weighted_log_probs).unsqueeze(dim=0)

            # won't render again this epoch
            self.finished_rendering_this_epoch = True

            return sum_weighted_log_probs, episode_logits, sum_of_rewards, episode

Here is the code that computes the rewards-to-go:

@staticmethod
def get_discounted_rewards(rewards: np.ndarray, gamma: float) -> np.ndarray:
    """
        Calculates the sequence of discounted rewards-to-go.
        Args:
            rewards: the sequence of observed rewards
            gamma: the discount factor
        Returns:
            discounted_rewards: the sequence of the rewards-to-go
    """
    discounted_rewards = np.empty_like(rewards, dtype=np.float64)
    for i in range(rewards.shape[0]):
        gammas = np.full(shape=(rewards[i:].shape[0]), fill_value=gamma)

        # raise gamma to the powers 0, 1, 2, ... for the remaining steps
        discounted_gammas = np.power(gammas, np.arange(rewards[i:].shape[0]))

        discounted_reward = np.sum(rewards[i:] * discounted_gammas)
        discounted_rewards[i] = discounted_reward
    return discounted_rewards
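A quick usage example of the function above (the reward values and gamma are chosen purely for illustration; it assumes numpy is imported as np and the PolicyGradient class is defined):

rewards = np.array([1.0, 2.0, 3.0])
print(PolicyGradient.get_discounted_rewards(rewards=rewards, gamma=0.5))
# [2.75 3.5  3.  ]  e.g. the first entry is 1 + 0.5 * 2 + 0.25 * 3 = 2.75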

Now the loss. Note that what we computed above is a reward (something to maximize); to use gradient descent we turn it into a loss, which is as simple as multiplying by -1.

  • When computing the entropy bonus we need probabilities, so softmax is used to turn the logits into probabilities.
  • BETA is a hyperparameter that weights the entropy bonus.
def calculate_loss(self, epoch_logits: torch.Tensor, weighted_log_probs: torch.Tensor) -> (torch.Tensor, torch.Tensor):
    """
        Calculates the policy "loss" and the entropy bonus
        Args:
            epoch_logits: logits of the policy network we have collected over the epoch
            weighted_log_probs: log-probabilities of the actions taken, weighted by the rewards-to-go
        Returns:
            policy loss + the entropy bonus
            entropy: needed for logging
    """
    policy_loss = -1 * torch.mean(weighted_log_probs)

    # add the entropy bonus
    p = softmax(epoch_logits, dim=1)
    log_p = log_softmax(epoch_logits, dim=1)
    entropy = -1 * torch.mean(torch.sum(p * log_p, dim=1), dim=0)

    # the entropy is also multiplied by -1, turning the bonus into a loss term
    entropy_bonus = -1 * self.BETA * entropy

    return policy_loss + entropy_bonus, entropy

Putting all of the code above together gives the complete training loop.

  • The formula involves an expectation, so the policy is updated using the mean loss over multiple trajectories; self.BATCH_SIZE controls how many episodes are averaged in each batch.

def solve_environment(self):
    """
        The main interface for the Policy Gradient solver
    """
    # init the episode and the epoch
    episode = 0
    epoch = 0

    # init the epoch arrays
    # used for entropy calculation
    epoch_logits = torch.empty(size=(0, self.env.action_space.n), device=self.DEVICE)
    epoch_weighted_log_probs = torch.empty(size=(0,), dtype=torch.float, device=self.DEVICE)

    while True:

        # play one episode in the environment
        (episode_weighted_log_prob_trajectory,
         episode_logits,
         sum_of_episode_rewards,
         episode) = self.play_episode(episode=episode)

        # after each episode append the sum of total rewards to the deque
        self.total_rewards.append(sum_of_episode_rewards)

        # append the weighted log-probabilities of actions
        epoch_weighted_log_probs = torch.cat((epoch_weighted_log_probs, episode_weighted_log_prob_trajectory),
                                             dim=0)

        # append the logits - needed for the entropy bonus calculation
        epoch_logits = torch.cat((epoch_logits, episode_logits), dim=0)

        # if the epoch is over - we have epoch trajectories to perform the policy gradient
        if episode >= self.BATCH_SIZE:

            # reset the rendering flag
            self.finished_rendering_this_epoch = False

            # reset the episode count
            episode = 0

            # increment the epoch
            epoch += 1

            # compute the loss
            #   the logits are used for the entropy bonus
            #   the weighted log-probs carry the reward term; multiplying by -1 turns it into a loss
            loss, entropy = self.calculate_loss(epoch_logits=epoch_logits,
                                                weighted_log_probs=epoch_weighted_log_probs)

            # zero the gradient
            self.adam.zero_grad()

            # backprop
            loss.backward()

            # update the parameters
            self.adam.step()

            # feedback
            print("\r", f"Epoch: {epoch}, Avg Return per Epoch: {np.mean(self.total_rewards):.3f}",
                  end="",
                  flush=True)

            self.writer.add_scalar(tag='Average Return over 100 episodes',
                                   scalar_value=np.mean(self.total_rewards),
                                   global_step=epoch)

            self.writer.add_scalar(tag='Entropy',
                                   scalar_value=entropy,
                                   global_step=epoch)

            # reset the epoch arrays
            # used for entropy calculation
            epoch_logits = torch.empty(size=(0, self.env.action_space.n), device=self.DEVICE)
            epoch_weighted_log_probs = torch.empty(size=(0,), dtype=torch.float, device=self.DEVICE)

            # check if solved
            if np.mean(self.total_rewards) > 200:
                print('\nSolved!')
                break

    # close the environment
    self.env.close()

    # close the writer
    self.writer.close()
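The original post does not show the class constructor or the agent network. For completeness, here is a minimal, hypothetical sketch of an __init__ that provides the attributes the methods above rely on (DEVICE, GAMMA, BETA, BATCH_SIZE, env, agent, adam, total_rewards, finished_rendering_this_epoch, writer), together with the imports those methods use. The environment name, network size, and hyperparameter values are assumptions rather than the author's settings; LunarLander-v2 is only a guess consistent with the reward-200 solve threshold used above.

import gym
import numpy as np
import torch
from collections import deque
from torch.distributions import Categorical
from torch.nn.functional import log_softmax, one_hot, softmax
from torch.utils.tensorboard import SummaryWriter

def __init__(self, env_name: str = "LunarLander-v2"):
    # Hypothetical constructor: the attribute names match the methods above,
    # but all concrete values here are assumptions.
    self.DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    self.GAMMA = 0.99        # discount factor
    self.BETA = 0.1          # weight of the entropy bonus
    self.BATCH_SIZE = 10     # episodes per policy update

    self.env = gym.make(env_name)

    # a small MLP policy that maps a state to one logit per action (assumed architecture)
    self.agent = torch.nn.Sequential(
        torch.nn.Linear(self.env.observation_space.shape[0], 64),
        torch.nn.ReLU(),
        torch.nn.Linear(64, self.env.action_space.n),
    ).to(self.DEVICE)

    self.adam = torch.optim.Adam(self.agent.parameters(), lr=1e-3)
    self.total_rewards = deque(maxlen=100)   # rolling window of episode returns
    self.finished_rendering_this_epoch = False
    self.writer = SummaryWriter()            # TensorBoard logging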

Training Results

After 100 training epochs, here are the results:

(Figure: average return during training)

The entropy decreases over the course of training, which means the uncertainty in the actions is shrinking: the model becomes more and more confident about what to do.
