Distributed TensorFlow

Author: 陈泽_06aa | Published 2020-04-28 18:22
    • MirroredStrategy

    Single machine, multiple GPUs.
    Synchronous training.
    Each variable in the model is mirrored across all the replicas. Together, these variables form a single conceptual variable called MirroredVariable. These variables are kept in sync with each other by applying identical updates.
    Uses an all-reduce algorithm to synchronize and aggregate updates across devices.
    Supported communication implementations: HierarchicalCopyAllReduce, ReductionToOneDevice, and NcclAllReduce (the default); custom implementations of tf.distribute.CrossDeviceOps are also supported. A minimal sketch follows.
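    A minimal sketch, assuming a machine with two GPUs; the explicit device list and cross_device_ops choice are illustrative, not required:

    import tensorflow as tf

    # Assumes two local GPUs; omit `devices` to use all visible GPUs.
    strategy = tf.distribute.MirroredStrategy(
        devices=["/gpu:0", "/gpu:1"],
        cross_device_ops=tf.distribute.NcclAllReduce())  # the default, shown explicitly

    with strategy.scope():
        # Variables created here become MirroredVariables.
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
        model.compile(optimizer="sgd", loss="mse")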

    • TPUStrategy
    • MultiWorkerMirroredStrategy

    Multiple machines, multiple GPUs.
    Synchronous training.
    It creates copies of all variables in the model on each device across all workers.
    MultiWorkerMirroredStrategy currently allows you to choose between two different implementations of collective ops. CollectiveCommunication.RING implements ring-based collectives using gRPC as the communication layer. CollectiveCommunication.NCCL uses Nvidia's NCCL to implement collectives. CollectiveCommunication.AUTO defers the choice to the runtime.
    The best choice of collective implementation depends on the number and kind of GPUs, and the network interconnect in the cluster. A minimal sketch follows.
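    A minimal sketch, assuming TF_CONFIG is already set on every worker (see the TF_CONFIG section below) and using the TF 2.1/2.2-era experimental API; later releases move the strategy out of tf.distribute.experimental and rename the enum:

    import tensorflow as tf

    # TF_CONFIG must be set before the strategy is created.
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
        communication=tf.distribute.experimental.CollectiveCommunication.NCCL)

    with strategy.scope():
        # Variables are replicated on every device of every worker.
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
        model.compile(optimizer="sgd", loss="mse")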

    • CentralStorageStrategy

    Experimental (not yet part of the stable API).
    Single machine, multiple GPUs.
    Synchronous training.
    Variables are not mirrored; instead, they are placed on the CPU, and operations are replicated across all local GPUs. If there is only one GPU, all variables and operations will be placed on that GPU.
    Updates to variables on the replicas are aggregated before being applied to the variables. A minimal sketch follows.
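    A minimal sketch using the experimental API; the optional constructor arguments for choosing compute and parameter devices are left at their defaults:

    import tensorflow as tf

    # Variables live on the CPU; computation is replicated across local GPUs.
    strategy = tf.distribute.experimental.CentralStorageStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
        model.compile(optimizer="sgd", loss="mse")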

    • ParameterServerStrategy

    Multiple machines, multiple GPUs.
    Asynchronous training across workers: each worker computes gradients and updates the variables on the parameter servers independently.
    In this setup, some machines are designated as workers and some as parameter servers. Each variable of the model is placed on one parameter server. Computation is replicated across all GPUs of all the workers. A rough sketch follows.
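    A rough sketch for the Estimator-based workflow supported at the time of writing, assuming TF_CONFIG already describes the "worker" and "ps" tasks (see below); my_model_fn is a placeholder, and newer TF releases use a different, cluster_resolver-based constructor:

    import tensorflow as tf

    # Each variable is placed on a parameter server; workers run the computation.
    strategy = tf.distribute.experimental.ParameterServerStrategy()

    config = tf.estimator.RunConfig(train_distribute=strategy)
    # my_model_fn stands in for a user-defined Estimator model_fn.
    estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=config)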

    • OneDeviceStrategy

    Single machine, single device.
    You can use this strategy to test your code before switching to other strategies that actually distribute the computation to multiple devices/machines. A minimal sketch follows.
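    A minimal sketch; the device string is illustrative, and "/cpu:0" works on machines without a GPU:

    import tensorflow as tf

    # Pins all variables and computation to a single device.
    strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
        model.compile(optimizer="sgd", loss="mse")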

    Setting up the TF_CONFIG environment variable

    For multi-worker training, as mentioned before, you need to set the TF_CONFIG environment variable for each binary running in your cluster. For example:
    import json
    import os

    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {
            "worker": ["host1:port", "host2:port", "host3:port"],
            "ps": ["host4:port", "host5:port"]
        },
        "task": {"type": "worker", "index": 1}
    })
    This TF_CONFIG specifies that there are three worker tasks and two ps tasks in the cluster, along with their hosts and ports. The "task" part specifies the role of the current task in the cluster: worker 1 (the second worker). Valid roles in a cluster are "chief", "worker", "ps", and "evaluator". There should be no "ps" job except when using ParameterServerStrategy.
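    As a small illustration (plain Python, not part of the TensorFlow API), each process can inspect its own role by parsing TF_CONFIG:

    import json
    import os

    tf_config = json.loads(os.environ["TF_CONFIG"])
    task_type = tf_config["task"]["type"]    # e.g. "worker"
    task_index = tf_config["task"]["index"]  # e.g. 1
    print("Running as %s #%d in a cluster with %d workers"
          % (task_type, task_index, len(tf_config["cluster"]["worker"])))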

    References:

    1. How to build a high-performance DGX-1 AI cluster
    2. Understanding NCCL, Nvidia's multi-GPU communication framework. (NCCL 1.0 supports only single-machine multi-GPU setups, with GPUs communicating over PCIe, NVLink, or GPU Direct P2P. NCCL 2.0 adds multi-machine multi-GPU support, with inter-node communication over Sockets (Ethernet) or InfiniBand with GPU Direct RDMA.)
    3. MPI Scatter, Gather, and Allgather
    4. The evolution of the distributed TensorFlow programming model
    5. Essential knowledge for distributed TensorFlow training, explained in one article
    6. Goodbye Horovod, Hello CollectiveAllReduce
    7. A technology map of machine learning: hardware compute. (The in-house EXSPARCL (Extremely Scalable and high Performance Alibaba gRoup Communication Library) provides general collective communication functionality and is compatible with NVIDIA's NCCL. ExSparcl is specifically optimized for the high-speed interconnect architectures and multi-NIC features of large-scale AI clusters, making full use of the interconnect bandwidth between devices to ensure linear scaling of communication and workload performance. By detecting the physical interconnect topology of the cluster and its hosts and choosing optimal network routes, it implements a novel congestion-free algorithm that ensures high-speed communication both within and between nodes.)
    8. NVIDIA Collective Communication Library (NCCL) Documentation
