I've read quite a few papers and open-source implementations in this area, but mixed together they still feel a bit muddled, so here are some short notes focused on the distributed architectures.
I'm about to hit King rank, by the way.
1. PAAC
< EFFICIENT PARALLEL METHODS FOR DEEP REINFORCEMENT LEARNING >
link: https://arxiv.org/pdf/1705.04862.pdf
keys: sync, parallel, batch, actor-critic
structure
- Learner: holds the single copy of the network; it chooses the actions and performs the learning updates
- Workers: multiple workers, each in charge of several envs (the envs inside a worker step sequentially). All workers run simultaneously, feed their data back to the Learner through shared variables, and fetch the next actions from the Learner; after max_local_steps of this the current batch is complete and the network is updated
each loop in the implementation
workers — [states] —> Learner
<— [actions] —
// after max_local_steps:
batch data —> update network
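A minimal sketch of this synchronous loop (hypothetical learner/worker API, not the actual PAAC code; for brevity each worker steps a single env here, whereas the paper gives each worker several sequentially stepped envs):

```python
import numpy as np

# Minimal sketch of PAAC's synchronous loop (hypothetical learner/worker API).
def paac_loop(learner, workers, max_local_steps):
    states = np.stack([w.reset() for w in workers])
    while True:
        batch = []
        for _ in range(max_local_steps):
            # one forward pass on the single network for ALL envs at once
            actions = learner.choose_actions(states)
            # workers run in parallel and report back through shared variables
            results = [w.step(a) for w, a in zip(workers, actions)]
            next_states, rewards, dones = map(np.stack, zip(*results))
            batch.append((states, actions, rewards, dones))
            states = next_states
        # after max_local_steps the collected batch updates the network once
        learner.update(batch)
```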
2. GA3C
< REINFORCEMENT LEARNING THROUGH ASYNCHRONOUS ADVANTAGE ACTOR-CRITIC ON A GPU >
link: https://openreview.net/pdf?id=r1VGvBcxl
keys: async, GPU, actor-critic
structure
- agents: the processes that interact with the environment, collecting experience and acting according to the learned policy. Multiple agents run concurrently, each with its own env. They push policy requests onto the prediction queue and periodically submit a batch of experience data to the training queue.
- predictors: query the network for policies; dequeue the policy requests, forward them to the model network on the GPU, and return the resulting policies to the requesting agents
- trainers: dequeue the experience submitted by agents and hand it to the GPU for network updates
Again there is only a single network model; it owns the training_queue and the prediction_queue.
each loop in the implementation
In the reference implementation, the Network class exposes exactly two methods, predict_p_and_v(self, x) and train(self, x, y_r, a, trainer_id). In other words there is only one copy of the network; its prediction and training methods are simply exposed for the corresponding ThreadPredictor and ThreadTrainer threads to call.
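Roughly, that single network object reduces to something like this (a sketch, not the real GA3C NetworkVP class; the TF 1.x session and graph ops are assumed to be built elsewhere):

```python
# Sketch of the single shared network object: only these two methods are
# exposed, so predictors and trainers never build their own copies of the graph.
class Network:
    def __init__(self, sess, policy_op, value_op, train_op, x_ph, y_r_ph, a_ph):
        self.sess = sess
        self.policy_op, self.value_op, self.train_op = policy_op, value_op, train_op
        self.x_ph, self.y_r_ph, self.a_ph = x_ph, y_r_ph, a_ph

    def predict_p_and_v(self, x):
        # called by ThreadPredictor with a batch of states collected from agents
        return self.sess.run([self.policy_op, self.value_op],
                             feed_dict={self.x_ph: x})

    def train(self, x, y_r, a, trainer_id):
        # called by ThreadTrainer with a batch of (state, discounted return, action)
        self.sess.run(self.train_op,
                      feed_dict={self.x_ph: x, self.y_r_ph: y_r, self.a_ph: a})
```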
// entities
server: model, agents, predictors, trainers, training_queue, prediction_queue
agents: env, wait_queue(wait for prediction reply)
agents — [prediction_queue] —> predictors — call —> server.model.predict()
|_ [training_queue] —> trainers — call —> server.model.train()
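The predictor and trainer threads are then just dequeue loops around those two calls; a sketch assuming thread-safe queues and the server fields listed above:

```python
import threading
import numpy as np

# Sketch of the two GA3C thread loops, assuming server.prediction_q /
# server.training_q are thread-safe queues and each agent exposes a wait_q
# for prediction replies (as in the entity list above).
class PredictorThread(threading.Thread):
    def __init__(self, server, batch_size=32):
        super().__init__(daemon=True)
        self.server, self.batch_size = server, batch_size

    def run(self):
        while True:
            # gather up to batch_size pending (agent_id, state) requests
            agent_id, state = self.server.prediction_q.get()       # blocks
            ids, states = [agent_id], [state]
            while len(ids) < self.batch_size and not self.server.prediction_q.empty():
                agent_id, state = self.server.prediction_q.get()
                ids.append(agent_id)
                states.append(state)
            # one batched GPU call serves the whole group of agents
            p, v = self.server.model.predict_p_and_v(np.stack(states))
            for i, agent_id in enumerate(ids):
                self.server.agents[agent_id].wait_q.put((p[i], v[i]))


class TrainerThread(threading.Thread):
    def __init__(self, server, trainer_id):
        super().__init__(daemon=True)
        self.server, self.trainer_id = server, trainer_id

    def run(self):
        while True:
            # dequeue one experience batch submitted by an agent, train on the GPU
            x, y_r, a = self.server.training_q.get()
            self.server.model.train(x, y_r, a, self.trainer_id)
```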
3. Ape-X
< DISTRIBUTED PRIORITIZED EXPERIENCE REPLAY >
link: https://arxiv.org/pdf/1803.00933.pdf
keys: prioritized experience replay
structure
- multiple actors: each actor has its own environment and a copy of the shared network model (periodically pulling fresh parameters from the learner). They generate experience independently of one another and push it into the replay memory. Each actor keeps a small local buffer, computes the priorities locally, and periodically submits the buffered experience to the replay memory
- learner: samples data from the replay memory, updates the priorities of the sampled experience, and updates the network parameters
- shared, distributed, prioritized experience replay memory: stores the experience data along with its (initial) priorities
The generation of experience and the sampling of experience are thus completely decoupled from each other.
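A compressed sketch of this actor/learner split (hypothetical env/net/replay API; in the paper the initial priorities come from local n-step TD errors):

```python
import numpy as np

# Sketch of the Ape-X division of labour (hypothetical env/net/replay API).
def actor_loop(env, local_net, learner, replay,
               local_buffer_size=50, sync_every=400):
    state, step, local_buffer = env.reset(), 0, []
    while True:
        action = local_net.act(state)
        next_state, reward, done, _ = env.step(action)
        local_buffer.append((state, action, reward, next_state, done))
        state = env.reset() if done else next_state
        step += 1
        if len(local_buffer) >= local_buffer_size:
            # initial priorities are computed locally (e.g. abs n-step TD error),
            # so the learner never has to touch fresh, unprioritized experience
            priorities = np.abs(local_net.td_errors(local_buffer))
            replay.add(local_buffer, priorities)
            local_buffer = []
        if step % sync_every == 0:
            local_net.set_weights(learner.get_weights())    # periodic pull

def learner_loop(learner, replay, batch_size=512):
    while True:
        idx, batch, weights = replay.sample(batch_size)     # prioritized sample
        td_errors = learner.update(batch, weights)          # one gradient step
        replay.update_priorities(idx, np.abs(td_errors))    # refresh priorities
```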
Uber's reimplementation
The Uber team's Ape-X reimplementation uses QueueRunner: flexible multi-threading plus stream-style queue operations.
PrioritizedReplayBufferWrapper: env with local prioritized replay buffer
QueueRunner ——> runs the multiple actor threads
|—> dequeues each actor's local buffer and enqueues the data into the PrioritizedReplayBuffer
QueueRunner is used to parallelize the actors in a dataflow style.
In this implementation the network is shared directly: it is passed to the actors as an argument, so there is no periodic parameter pulling.
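For reference, the generic TF 1.x QueueRunner pattern looks roughly like this (a toy sketch with fake data, not Uber's actual code):

```python
import numpy as np
import tensorflow as tf  # TF 1.x API

# Generic tf.train.QueueRunner pattern for parallelizing actors: each enqueue
# op wraps "step my env / drain my local buffer" (faked here with random data
# via tf.py_func), and the runner keeps several threads feeding the queue
# while the learner's loop dequeues batches.
def actor_step():
    # stand-in for one actor interaction; returns a fake 84x84x4 observation
    return np.random.rand(84, 84, 4).astype(np.float32)

obs = tf.py_func(actor_step, [], tf.float32)
obs.set_shape([84, 84, 4])

queue = tf.FIFOQueue(capacity=1000, dtypes=[tf.float32], shapes=[[84, 84, 4]])
enqueue_op = queue.enqueue(obs)
qr = tf.train.QueueRunner(queue, [enqueue_op] * 8)    # 8 concurrent actor threads
tf.train.add_queue_runner(qr)

batch = queue.dequeue_many(32)                        # what the learner consumes

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    print(sess.run(batch).shape)                      # (32, 84, 84, 4)
    coord.request_stop()
    coord.join(threads)
```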
others
Deep Deterministic Policy Gradient (DDPG)
Distributed Stochastic Gradient Descent
Distributed Importance Sampling
Prioritized Experience Replay
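Of these, prioritized experience replay is the piece Ape-X directly scales up; a minimal sketch of proportional prioritized sampling with importance weights:

```python
import numpy as np

# Proportional prioritized sampling: P(i) = p_i^alpha / sum_k p_k^alpha,
# with importance-sampling weights w_i = (N * P(i))^(-beta), normalized
# by the largest weight in the batch.
def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4):
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    probs = p / p.sum()
    idx = np.random.choice(len(probs), size=batch_size, p=probs)
    weights = (len(probs) * probs[idx]) ** (-beta)
    weights /= weights.max()          # scale so the largest weight is 1
    return idx, weights
```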
4. IMPALA
< IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures >
link: https://arxiv.org/pdf/1802.01561.pdf
keys: scalable, v-trace
structure
Decouples the generation of experience data from parameter learning; two deployment modes:
- Single Learner
Actors generate data asynchronously and send it to the learner through a queue, pulling the latest network parameters from the learner before each iteration.
- Multiple Synchronous Learners
The policy parameters are distributed across multiple learners and updated synchronously.
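A sketch of the single-learner variant (hypothetical queue/learner API); parameters are only pulled at unroll boundaries, and the resulting policy lag is what V-trace corrects for:

```python
# Sketch of the single-learner setup: actors run whole unrolls with a possibly
# stale policy, hand trajectories to the learner through a queue, and the
# learner consumes them in batches.
def impala_actor(env, actor_net, learner, trajectory_queue, unroll_length=20):
    state = env.reset()
    while True:
        actor_net.set_weights(learner.get_weights())    # pull latest params
        trajectory = []
        for _ in range(unroll_length):
            action, logits = actor_net.act(state)       # behaviour policy mu
            next_state, reward, done, _ = env.step(action)
            trajectory.append((state, action, reward, done, logits))
            state = env.reset() if done else next_state
        trajectory_queue.put(trajectory)                # async hand-off

def impala_learner(learner, trajectory_queue, batch_size=32):
    while True:
        batch = [trajectory_queue.get() for _ in range(batch_size)]
        learner.update(batch)   # V-trace corrected actor-critic update
```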
implementation
I haven't fully figured out this implementation yet; the network sharing and related parts are all wrapped away, and I still need to study the more advanced TF machinery it uses: MonitoredSession, replica_str, etc.
using TensorFlow Serving: multiple actors and one learner
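Independently of the TF plumbing, the algorithmic core behind the "v-trace" key is small enough to sketch in NumPy; a minimal version of the target computation (episode terminations and the optional lambda omitted for brevity):

```python
import numpy as np

# Minimal NumPy sketch of the V-trace targets from the IMPALA paper.
# Inputs are per-timestep arrays over one unrolled trajectory; `rhos` are the
# importance ratios pi(a|x)/mu(a|x) before truncation.
def vtrace_targets(rewards, values, bootstrap_value, rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    T = len(rewards)
    clipped_rhos = np.minimum(rho_bar, rhos)
    cs = np.minimum(c_bar, rhos)
    values_tp1 = np.append(values[1:], bootstrap_value)
    # delta_t V = rho_t (r_t + gamma V(x_{t+1}) - V(x_t))
    deltas = clipped_rhos * (rewards + gamma * values_tp1 - values)
    # backward recursion: v_s - V(x_s) = delta_s + gamma c_s (v_{s+1} - V(x_{s+1}))
    acc = 0.0
    vs_minus_v = np.zeros(T)
    for t in reversed(range(T)):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    vs = vs_minus_v + values
    # policy-gradient advantages use v_{s+1} as the bootstrap target
    vs_tp1 = np.append(vs[1:], bootstrap_value)
    pg_advantages = clipped_rhos * (rewards + gamma * vs_tp1 - values)
    return vs, pg_advantages
```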