
Reinforcement Learning

Author: Lyudmilalala | Published on 2024-08-23 22:52

Model-Free vs. Model-Based

These are the two major classes of reinforcement learning algorithms. The difference between them is whether the agent can fully understand or learn a model of the environment.

Model-based methods build an understanding of the environment up front and can plan ahead with it. However, if the learned model is inconsistent with the real use scenario, they will not perform well.

Model-free methods abandon model learning, so they are easier to implement and adapt better to real-world scenarios. They are more popular now.

(Figure: major categories of RL algorithms)

Basic models and concepts

Process Model

(Figure: training process)

Math Model

A Markov chain is used to represent the stages an agent goes through.

The agent alternates between two stages: reaching a state and taking an action.

State (S): The current situation the agent is in.
Action (A): The decision or move the agent takes to get from a state S to another state S’.
Reward (R): The immediate gain or loss the agent receives after taking an action in a state.
Policy (π): The strategy that determines the next action A to take for the current state S.

Uncertainty

  • Under different policies π, the same state S may lead to a different next action A.
  • The Markov chain allows the environment to include randomness, so even when the same action A is taken twice in the same state S, the next state S' may be different.

Our goal is to find a policy π that yields the greatest total benefit in the future (by the end of the game).

Action-Value Function (Q): Evaluates the expected return for taking action A in state S. It determines which action A maximizes the long-term return from the current state S.

State-Value Function (V): Evaluates the expected return from state S (taking all actions it might take later into consideration). It represents how promising the current state S is compared to other states.
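For reference, with a discount rate γ (used in the backward pass below), the two value functions can be written in their standard form, which the text above does not spell out:

Q^\pi(S, A) = E_\pi[ R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots \mid S_0 = S, A_0 = A ]
V^\pi(S) = E_{A \sim \pi(\cdot \mid S)}[ Q^\pi(S, A) ]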

Monte Carlo Sampling

Forward pass: pick a state S and keep executing until the terminal state S' is reached.
Backward pass: starting from the terminal state S', work backwards and compute the accumulated value V of each state, applying a discount rate to the values of future states. Choose the policy whose accumulated value V is highest when it reaches the original state S.
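The backward pass is the standard return recursion (γ is the discount rate), and V(S) is then estimated as the average of the sampled returns:

G_t = R_{t+1} + \gamma G_{t+1}
V(S) \approx average of the sampled G_t with S_t = S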

Monte Carlo Estimation

Adds an incremental update rule, similar to gradient descent, so that the estimate can be adjusted without waiting for every rollout to reach the terminal state.

(Figure: incremental update with learning rate alpha)
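In standard notation, the incremental update with learning rate α is:

V(S_t) \leftarrow V(S_t) + \alpha ( G_t - V(S_t) )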

Temporal Difference (TD) Estimation

The forward pass runs for at most N steps; if the reached state Sn already has a V value, that V value is incorporated into the backward-pass computation.
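For N = 1 this gives the standard TD(0) update; for larger N, the single reward is replaced by the discounted sum of the first N rewards plus \gamma^N V(S_{t+N}):

V(S_t) \leftarrow V(S_t) + \alpha ( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) )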

SARSA

Uses Q-values instead of V-values in the computation.

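Written out in the standard form (A' is the action actually chosen by the current policy at S'):

Q(S, A) \leftarrow Q(S, A) + \alpha ( R + \gamma Q(S', A') - Q(S, A) )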

Q-learning

An off-policy RL algorithm that updates its Q-values using the maximum possible future reward, regardless of the action actually taken. It goes through all possible actions and chooses the best one.

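Written out in the standard form (the max over actions replaces SARSA's actually chosen A'):

Q(S, A) \leftarrow Q(S, A) + \alpha ( R + \gamma \max_a Q(S', a) - Q(S, A) )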

By comparison, SARSA is an on-policy RL algorithm. It updates its Q-values based on the actions actually taken by the policy.

e.g. the current policy has a 20% chance of choosing Action1 and an 80% chance of choosing Action2. Action1 has a larger total future reward than Action2. In the current step, the policy chooses Action2.
With SARSA, the agent updates its Q-values based on Action2, the action it takes in this step.
With Q-learning, the agent compares the total future rewards of all items in the action space (Action1 & Action2), then updates its Q-values as if it took Action1 in this step.

Epsilon Greedy

For a fraction of the time (e.g. 10%), choose the next action randomly instead of choosing the action with the maximum future Q. This adds exploration to the model.
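A minimal sketch of epsilon-greedy action selection over a Q-table, assuming NumPy; the table layout, the helper name, and the 10% epsilon are illustrative:

import numpy as np

def epsilon_greedy(q_table, state, epsilon=0.1):
    # With probability epsilon explore randomly, otherwise exploit the best known action.
    n_actions = q_table.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore
    return int(np.argmax(q_table[state]))     # exploit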

(Figure: which part of the update formula is the target and which part is the prediction)

DQN

Q-learning requires a lookup table, so it can only handle discrete states. To handle continuous states, such as speed or distance, DQN is needed.

In DQN, a function F(S) = A replaces the Q-table of Q-learning, in order to find the action A with the largest total future reward for state S.

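A minimal sketch of the network that plays the role of the Q-table, assuming PyTorch; the state dimension, hidden size, and number of actions are illustrative:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    # Maps a continuous state vector to one Q-value per discrete action.
    def __init__(self, state_dim=4, hidden_dim=64, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# Greedy action for a single state: argmax over the predicted Q-values.
# action = QNetwork()(torch.tensor(state, dtype=torch.float32)).argmax().item()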

Deterministic Policy

Outputs the action value directly from the state, rather than a probability distribution over actions (in contrast to policy gradient methods). This helps learning in continuous action spaces.
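A minimal sketch of such a deterministic actor, assuming PyTorch and a bounded continuous action; the class name, dimensions, and action bound are illustrative:

import torch
import torch.nn as nn

class DeterministicActor(nn.Module):
    # Outputs an action value directly instead of a probability distribution.
    def __init__(self, state_dim=3, hidden_dim=64, action_dim=1, max_action=2.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Tanh(),                          # squash to [-1, 1]
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * self.net(state)  # scale to the action bound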

Replay Buffer

At each step, store the state S, action A, next state S', reward R in a buffer.

After the buffer size reaches batch_size, sample batch_size rows of data from the buffer at each step and train on them together with the current transition. (Mini-batch GD)

This makes the training partially off-policy.

Benefits:

  1. The model may converge faster.
  2. A large variety of inputs helps prevent the model from overfitting.
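A minimal sketch of such a buffer in plain Python; the capacity and the (S, A, R, S', done) field layout are illustrative:

import random
from collections import deque

class ReplayBuffer:
    # Stores (state, action, reward, next_state, done) transitions and samples random mini-batches.
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)

# Training only starts sampling once len(buffer) >= batch_size.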

Fix Q-Target (Target Network)

The target in DQN contains a deep Q-network whose parameters keep changing during training.
Because the target keeps changing, the deep Q-network learns inefficiently and does not converge easily.
Solution: keep the parameters of a target network (targetQ) fixed for N training steps, record the changing parameters separately, and copy them into targetQ once every N training steps.

The target network is usually updated with a soft update: pick a rate t (usually 0.005), take a weighted average of the new network parameters and the old target network parameters, and assign it to the target network, i.e.
Q target params = t * (Q params) + (1 - t) * (Q target params)
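A minimal sketch of this soft update, assuming PyTorch modules named q_net and target_q_net (the names and the tau value are illustrative):

import torch

def soft_update(q_net, target_q_net, tau=0.005):
    # target_params <- tau * params + (1 - tau) * target_params
    with torch.no_grad():
        for param, target_param in zip(q_net.parameters(), target_q_net.parameters()):
            target_param.mul_(1.0 - tau)
            target_param.add_(tau * param)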

Double DQN

Policy Gradient (PG)

Actor-Critic (AC)

Combines the two families of RL algorithms: value-based methods (e.g. Q-Learning) and action-probability-based methods (e.g. Policy Gradient).

The Actor descends from Policy Gradient and can choose suitable actions in a continuous action space.

The Critic descends from Q-Learning and can update at every single step, much as TD improves on Monte Carlo. The traditional Policy Gradient must update once per episode, so it learns less efficiently.

The Actor network's performance is scored by the Critic network.
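In a common one-step formulation (not spelled out in the original), the Critic's TD error δ serves as the score for the Actor:

\delta = R + \gamma V(S') - V(S)
V(S) \leftarrow V(S) + \alpha_{critic} \, \delta
\theta \leftarrow \theta + \alpha_{actor} \, \delta \, \nabla_\theta \log \pi_\theta(A \mid S)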

Deep Deterministic Policy Gradient (DDPG)

Further combines the ideas of DQN, AC, and the target network (Target Q). There are four neural networks in total: Actor, Target Actor, Critic, Target Critic.
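A sketch of how the four networks interact in the standard DDPG update (not detailed in the original): the target networks build the Critic's learning target, the Actor is trained to maximize the Critic's score, and both target networks are soft-updated as described above.

y = R + \gamma \, Q_{target}(S', Actor_{target}(S'))
L_{critic} = ( y - Q(S, A) )^2
L_{actor} = - Q(S, Actor(S))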

Setting Up the Experiment Environment

This can be divided into three parts:

  1. Connect the environment env with the policy's actions to form an iterative optimization loop
    A simple tutorial on importing an environment with gym
  2. Define a custom environment
    A simple tutorial
    For the registration step, a simple project can register the environment directly inside the project code.
    Assuming the environment file env1.py and the algorithm file dqn.py that uses it sit in the same directory, the environment can be registered by adding the following code at the top of dqn.py:
from gym.envs.registration import register
register(
    id="qkd-v1",
    entry_point="env1:Env1",  # env1为文件名,Env1为env1.py中继承了gym.Env的class
)
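Once registered, the environment can be created and run like any built-in gym environment. A minimal sketch, assuming gym >= 0.26 (the reset/step signatures differ in older versions) and a random policy as a stand-in for the agent:

import gym

env = gym.make("qkd-v1")
obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()                           # placeholder for the agent's policy
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()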

When using it yourself, you can simply register the environment directly in your own code as shown above.

  3. Define a custom agent and its policy
    Write a DQN network to solve the CartPole problem
