Deep Reinforcement Learning (6): Policy Gradients (2)

Author: 数科每日 | Published 2022-02-12 03:48

Deep Reinforcement Learning (5) Policy Gradients (1) introduced the Policy Gradients algorithm based on the Berkeley CS285 lecture notes. This post, following a blog article, explains Policy Gradients again in a more accessible way and provides an implementation in PyTorch.

Policy Gradients (PG) is an on-policy algorithm that learns stochastic policies. (In general, stochastic policies are preferable to deterministic policies.)

Concepts

Trajectory

A sequence of states, actions, and rewards obtained by running a fixed policy:

\tau = (s_{0}, a_{0}, r_{0}, s_{1}, a_{1}, r_{1}, \ldots)

Reward Function

Here we introduce a discount factor \gamma, which expresses that rewards further in the future are worth less (the idea of discounting).

R(\tau)=\sum_{t=0}^{T} \gamma^{t} r_{t}
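For example (a small illustrative calculation), with \gamma = 0.9 and rewards r = (1, 1, 1):

R(\tau) = 1 + 0.9 \cdot 1 + 0.9^{2} \cdot 1 = 2.71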

Performance Metric

J(\pi) = \mathbb{E}_{\tau\sim\pi}[R(\tau)] = \int_{\tau} R(\tau) P(\tau \mid \pi) d \tau

Derivative (Gradient)

Since we want to improve performance via gradient ascent, we need the derivative of the performance with respect to the policy parameters \theta:

\nabla_{\theta} J(\pi)=\nabla_{\theta} \mathbb{E}_{\tau \sim \pi}[R(\tau)]=\int_{\tau} R(\tau) \nabla_{\theta} P(\tau \mid \pi) d \tau

Using an identity that helps us simplify (the log-derivative trick):

\frac{\partial \log (x)}{\partial x}=\frac{1}{x} \;\Rightarrow\; \partial x=x \,\partial \log (x)

we obtain:

\int_{\tau} R(\tau) \nabla_{\theta} P(\tau \mid \pi) d \tau=\int_{\tau} R(\tau) P(\tau \mid \pi) \nabla_{\theta} \log P(\tau \mid \pi) d \tau = \mathbb{E}_{\tau \sim \pi}\left[\nabla_{\theta} \log P(\tau \mid \pi) R(\tau)\right]

To evaluate this expression, the key question is how to compute \log P(\tau \mid \pi). We can use the chain rule of probability:

Given
P(\tau \mid \pi)=\rho\left(s_{0}\right) \prod_{t=0}^{T-1} P\left(s_{t+1} \mid s_{t}, a_{t}\right) \pi\left(a_{t} \mid s_{t}\right)

Taking the logarithm of both sides gives:

\log P(\tau \mid \pi)=\log \rho\left(s_{0}\right)+\sum_{t=0}^{T-1}\left[\log P\left(s_{t+1} \mid s_{t}, a_{t}\right)+\log \pi\left(a_{t} \mid s_{t}\right)\right]

Taking the gradient with respect to \theta:

\nabla_{\theta} \log P(\tau \mid \pi)=\nabla_{\theta} \log \rho\left(s_{0}\right)+\sum_{t=0}^{T-1}\left[\nabla_{\theta} \log P\left(s_{t+1} \mid s_{t}, a_{t}\right)+\nabla_{\theta} \log \pi\left(a_{t} \mid s_{t}\right)\right]

Neither the initial-state distribution \rho(s_{0}) nor the transition dynamics P(s_{t+1} \mid s_{t}, a_{t}) depend on the policy parameters \theta, so their gradients vanish, leaving:

\nabla_{\theta} \log P(\tau \mid \pi)=\sum_{t=0}^{T-1} \nabla_{\theta} \log \pi\left(a_{t} \mid s_{t}\right)

Finally, we arrive at:

\nabla_{\theta} J(\pi)=\mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T-1} \nabla_{\theta} \log \pi\left(a_{t} \mid s_{t}\right) R(\tau)\right]

Since this is an expectation, in practice we can simply estimate it by sampling trajectories (a Monte Carlo estimate).
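Concretely, the sample-based estimate replaces the expectation with a mean over collected trajectories. Below is a minimal, self-contained sketch of this estimate in PyTorch; the log-probabilities and returns are made up purely for illustration and are not taken from the author's code, where they would come from the policy network and the environment.

import torch

# log_probs[i, t] = log pi(a_t | s_t) for trajectory i; returns[i] = R(tau_i).
# Both tensors are made up for illustration only.
log_probs = torch.randn(4, 10, requires_grad=True)   # 4 trajectories, 10 steps each
returns = torch.tensor([1.0, 0.5, 2.0, -1.0])

# Sample mean of sum_t log pi(a_t|s_t) * R(tau); negating it turns
# gradient ascent on J into gradient descent on a loss.
loss = -(log_probs.sum(dim=1) * returns).mean()
loss.backward()    # gradients of the loss are now available for an optimizer step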

Rewards To Go

When actually computing the reward weight for the action at time step t, we sum only the rewards from t onward, which matches intuition: we only care about the reward obtained after taking the action.

\hat{R}_{t}=\sum_{t^{\prime}=t}^{T-1} \gamma^{t^{\prime}-t} r_{t^{\prime}}

Entropy Bonus

The code adds an entropy bonus, whose main purpose is to encourage the model to keep some uncertainty in its action distribution, so that it explores more possibilities during training. For an introduction to entropy, see this article. In reinforcement learning, an entropy bonus appears in many algorithms; it plays a role somewhat like regularization in supervised learning.
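As a quick illustration of why entropy rewards exploration (a standalone sketch, not part of the author's code): a near-uniform action distribution has high entropy, while a near-deterministic one has entropy close to zero, so adding the entropy term to the objective favors policies that keep exploring.

import torch
from torch.distributions import Categorical

# Entropy of a near-uniform vs. a near-deterministic action distribution.
uniform = Categorical(logits=torch.zeros(4))                       # all 4 actions equally likely
peaked = Categorical(logits=torch.tensor([10.0, 0.0, 0.0, 0.0]))   # almost always action 0

print(uniform.entropy())  # ~1.386 (= log 4), the maximum for 4 actions
print(peaked.entropy())   # close to 0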

Baseline

For a discussion of the baseline, see Deep Reinforcement Learning (5), the section on a problem with PG.

Code

Episode Loop

Each trajectory \tau is one episode:

def play_episode(self, episode: int):
    """
        Plays an episode of the environment.
        episode: the episode counter
        Returns:
            sum_weighted_log_probs: the sum of the log-prob of an action multiplied by the reward-to-go from that state
            episode_logits: the logits of every step of the episode - needed to compute entropy for entropy bonus
            finished_rendering_this_epoch: pass-through rendering flag
            sum_of_rewards: sum of the rewards for the episode - needed for the average over 200 episode statistic
    """
    # reset the environment to a random initial state
    state = self.env.reset()

    # initialize the per-episode buffers
    episode_actions = torch.empty(size=(0,), dtype=torch.long, device=self.DEVICE)
    episode_logits = torch.empty(size=(0, self.env.action_space.n), device=self.DEVICE)
    average_rewards = np.empty(shape=(0,), dtype=np.float64)
    episode_rewards = np.empty(shape=(0,), dtype=np.float64)

    # episode loop
    while True:

        # render the environment (here the environment is a game)
        if not self.finished_rendering_this_epoch:
            self.env.render()

        # get the logits for every action from the agent - this is what the policy outputs
        action_logits = self.agent(torch.tensor(state).float().unsqueeze(dim=0).to(self.DEVICE))

        # append the logits to the episode history
        episode_logits = torch.cat((episode_logits, action_logits), dim=0)

        # with the logits we can sample an action from the distribution they define
        action = Categorical(logits=action_logits).sample()

        # store the action - needed later to pick out the log-probabilities of the actions taken
        episode_actions = torch.cat((episode_actions, action), dim=0)

        # take the action in the environment; the env returns the new state and the reward
        state, reward, done, _ = self.env.step(action=action.cpu().item())

        # store the reward
        episode_rewards = np.concatenate((episode_rewards, np.array([reward])), axis=0)

        # store the running average reward up to this step (the mean of all rewards so far),
        # so every reward except the last one contributes to more than one average
        average_rewards = np.concatenate((average_rewards,
                                          np.expand_dims(np.mean(episode_rewards), axis=0)),
                                         axis=0)

        # the episode ends when the env signals done: think of the game being over or a time limit being reached
        if done:

            # increment the episode counter
            episode += 1

            # compute the rewards-to-go, which requires gamma and the per-step rewards
            discounted_rewards_to_go = PolicyGradient.get_discounted_rewards(rewards=episode_rewards,
                                                                             gamma=self.GAMMA)

            # subtract the baseline to reduce variance
            discounted_rewards_to_go -= average_rewards  # baseline - state specific average

            sum_of_rewards = np.sum(episode_rewards)

            # one hot encoding mask
            mask = one_hot(episode_actions, num_classes=self.env.action_space.n)

            # compute the log-probability of the action taken at every step
            episode_log_probs = torch.sum(mask.float() * log_softmax(episode_logits, dim=1), dim=1)

            # weight the log-probabilities by the rewards-to-go
            episode_weighted_log_probs = episode_log_probs * \
                torch.tensor(discounted_rewards_to_go).float().to(self.DEVICE)

            # the quantity we ultimately need
            sum_weighted_log_probs = torch.sum(episode_weighted_log_probs).unsqueeze(dim=0)

            # won't render again this epoch
            self.finished_rendering_this_epoch = True

            return sum_weighted_log_probs, episode_logits, sum_of_rewards, episode

Here is the code that computes the rewards-to-go:

@staticmethod
def get_discounted_rewards(rewards: np.ndarray, gamma: float) -> np.ndarray:
    """
        Calculates the sequence of discounted rewards-to-go.
        Args:
            rewards: the sequence of observed rewards
            gamma: the discount factor
        Returns:
            discounted_rewards: the sequence of the rewards-to-go
    """
    discounted_rewards = np.empty_like(rewards, dtype=np.float64)
    for i in range(rewards.shape[0]):
        gammas = np.full(shape=(rewards[i:].shape[0]), fill_value=gamma)

        # raise gamma to the powers 0, 1, 2, ... for the remaining steps
        discounted_gammas = np.power(gammas, np.arange(rewards[i:].shape[0]))

        discounted_reward = np.sum(rewards[i:] * discounted_gammas)
        discounted_rewards[i] = discounted_reward
    return discounted_rewards
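A quick usage example of the function above (the reward values and gamma are chosen purely for illustration; it assumes numpy is imported as np and the PolicyGradient class is defined):

rewards = np.array([1.0, 2.0, 3.0])
print(PolicyGradient.get_discounted_rewards(rewards=rewards, gamma=0.5))
# [2.75 3.5  3.  ]  e.g. the first entry is 1 + 0.5 * 2 + 0.25 * 3 = 2.75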

Now the loss. Note that what we computed above is a reward (something to maximize); to use gradient descent we turn it into a loss, which is as simple as multiplying by -1.

  • When computing the entropy bonus we need probabilities, so softmax is used to turn the logits into probabilities.
  • BETA is a hyperparameter that weights the entropy bonus.
def calculate_loss(self, epoch_logits: torch.Tensor, weighted_log_probs: torch.Tensor) -> (torch.Tensor, torch.Tensor):
    """
        Calculates the policy "loss" and the entropy bonus
        Args:
            epoch_logits: logits of the policy network we have collected over the epoch
            weighted_log_probs: log-probabilities of the actions taken, weighted by the rewards-to-go
        Returns:
            policy loss + the entropy bonus
            entropy: needed for logging
    """
    policy_loss = -1 * torch.mean(weighted_log_probs)

    # add the entropy bonus
    p = softmax(epoch_logits, dim=1)
    log_p = log_softmax(epoch_logits, dim=1)
    entropy = -1 * torch.mean(torch.sum(p * log_p, dim=1), dim=0)

    # the entropy is also multiplied by -1, turning the bonus into a loss term
    entropy_bonus = -1 * self.BETA * entropy

    return policy_loss + entropy_bonus, entropy

Putting all of the code above together gives the complete training loop.

  • The formula involves an expectation, so the policy is updated using the mean loss over multiple trajectories; self.BATCH_SIZE controls how many episodes are averaged in each batch.

def solve_environment(self):
    """
        The main interface for the Policy Gradient solver
    """
    # init the episode and the epoch
    episode = 0
    epoch = 0

    # init the epoch arrays
    # used for entropy calculation
    epoch_logits = torch.empty(size=(0, self.env.action_space.n), device=self.DEVICE)
    epoch_weighted_log_probs = torch.empty(size=(0,), dtype=torch.float, device=self.DEVICE)

    while True:

        # play one episode in the environment
        (episode_weighted_log_prob_trajectory,
         episode_logits,
         sum_of_episode_rewards,
         episode) = self.play_episode(episode=episode)

        # after each episode append the sum of total rewards to the deque
        self.total_rewards.append(sum_of_episode_rewards)

        # append the weighted log-probabilities of actions
        epoch_weighted_log_probs = torch.cat((epoch_weighted_log_probs, episode_weighted_log_prob_trajectory),
                                             dim=0)

        # append the logits - needed for the entropy bonus calculation
        epoch_logits = torch.cat((epoch_logits, episode_logits), dim=0)

        # if the epoch is over - we have epoch trajectories to perform the policy gradient
        if episode >= self.BATCH_SIZE:

            # reset the rendering flag
            self.finished_rendering_this_epoch = False

            # reset the episode count
            episode = 0

            # increment the epoch
            epoch += 1

            # compute the loss
            #   the logits are used for the entropy bonus
            #   the weighted log-probs carry the reward term; multiplying by -1 turns it into a loss
            loss, entropy = self.calculate_loss(epoch_logits=epoch_logits,
                                                weighted_log_probs=epoch_weighted_log_probs)

            # zero the gradient
            self.adam.zero_grad()

            # backprop
            loss.backward()

            # update the parameters
            self.adam.step()

            # feedback
            print("\r", f"Epoch: {epoch}, Avg Return per Epoch: {np.mean(self.total_rewards):.3f}",
                  end="",
                  flush=True)

            self.writer.add_scalar(tag='Average Return over 100 episodes',
                                   scalar_value=np.mean(self.total_rewards),
                                   global_step=epoch)

            self.writer.add_scalar(tag='Entropy',
                                   scalar_value=entropy,
                                   global_step=epoch)

            # reset the epoch arrays
            # used for entropy calculation
            epoch_logits = torch.empty(size=(0, self.env.action_space.n), device=self.DEVICE)
            epoch_weighted_log_probs = torch.empty(size=(0,), dtype=torch.float, device=self.DEVICE)

            # check if solved
            if np.mean(self.total_rewards) > 200:
                print('\nSolved!')
                break

    # close the environment
    self.env.close()

    # close the writer
    self.writer.close()
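The original post does not show the class constructor or the agent network. For completeness, here is a minimal, hypothetical sketch of an __init__ that provides the attributes the methods above rely on (DEVICE, GAMMA, BETA, BATCH_SIZE, env, agent, adam, total_rewards, finished_rendering_this_epoch, writer), together with the imports those methods use. The environment name, network size, and hyperparameter values are assumptions rather than the author's settings; LunarLander-v2 is only a guess consistent with the reward-200 solve threshold used above.

import gym
import numpy as np
import torch
from collections import deque
from torch.distributions import Categorical
from torch.nn.functional import log_softmax, one_hot, softmax
from torch.utils.tensorboard import SummaryWriter

def __init__(self, env_name: str = "LunarLander-v2"):
    # Hypothetical constructor: the attribute names match the methods above,
    # but all concrete values here are assumptions.
    self.DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
    self.GAMMA = 0.99        # discount factor
    self.BETA = 0.1          # weight of the entropy bonus
    self.BATCH_SIZE = 10     # episodes per policy update

    self.env = gym.make(env_name)

    # a small MLP policy that maps a state to one logit per action (assumed architecture)
    self.agent = torch.nn.Sequential(
        torch.nn.Linear(self.env.observation_space.shape[0], 64),
        torch.nn.ReLU(),
        torch.nn.Linear(64, self.env.action_space.n),
    ).to(self.DEVICE)

    self.adam = torch.optim.Adam(self.agent.parameters(), lr=1e-3)
    self.total_rewards = deque(maxlen=100)   # rolling window of episode returns
    self.finished_rendering_this_epoch = False
    self.writer = SummaryWriter()            # TensorBoard logging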

Training Results

After 100 training epochs, here are the results:

(Figure: average return during training)

The entropy decreases over the course of training, which means the uncertainty in the actions is shrinking: the model becomes more and more confident about what to do.
