Distributed TensorFlow

Author: 陈泽_06aa | Published 2020-04-28 18:22
    • MirroredStrategy

    Single machine, multiple GPUs.
    Synchronous training.
    Each variable in the model is mirrored across all the replicas. Together, these variables form a single conceptual variable called MirroredVariable. These variables are kept in sync with each other by applying identical updates.
    Uses an all-reduce algorithm to synchronize and aggregate updates across devices.
    Supported communication implementations: HierarchicalCopyAllReduce, ReductionToOneDevice, and NcclAllReduce (the default); custom implementations of tf.distribute.CrossDeviceOps are also supported. A minimal sketch follows.
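    A minimal sketch, assuming a machine with two GPUs; the explicit device list and cross_device_ops choice are illustrative, not required:

    import tensorflow as tf

    # Assumes two local GPUs; omit `devices` to use all visible GPUs.
    strategy = tf.distribute.MirroredStrategy(
        devices=["/gpu:0", "/gpu:1"],
        cross_device_ops=tf.distribute.NcclAllReduce())  # the default, shown explicitly

    with strategy.scope():
        # Variables created here become MirroredVariables.
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
        model.compile(optimizer="sgd", loss="mse")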

    • TPUStrategy
    • MultiWorkerMirroredStrategy

    Multiple machines, multiple GPUs.
    Synchronous training.
    It creates copies of all variables in the model on each device across all workers.
    MultiWorkerMirroredStrategy currently allows you to choose between two different implementations of collective ops. CollectiveCommunication.RING implements ring-based collectives using gRPC as the communication layer. CollectiveCommunication.NCCL uses Nvidia's NCCL to implement collectives. CollectiveCommunication.AUTO defers the choice to the runtime.
    The best choice of collective implementation depends on the number and kind of GPUs, and the network interconnect in the cluster. A minimal sketch follows.
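    A minimal sketch, assuming TF_CONFIG is already set on every worker (see the TF_CONFIG section below) and using the TF 2.1/2.2-era experimental API; later releases move the strategy out of tf.distribute.experimental and rename the enum:

    import tensorflow as tf

    # TF_CONFIG must be set before the strategy is created.
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
        communication=tf.distribute.experimental.CollectiveCommunication.NCCL)

    with strategy.scope():
        # Variables are replicated on every device of every worker.
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
        model.compile(optimizer="sgd", loss="mse")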

    • CentralStorageStrategy

    Experimental (not yet part of the stable API).
    Single machine, multiple GPUs.
    Synchronous training.
    Variables are not mirrored; instead, they are placed on the CPU, and operations are replicated across all local GPUs. If there is only one GPU, all variables and operations will be placed on that GPU.
    Updates to variables on the replicas are aggregated before being applied to the variables. A minimal sketch follows.
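    A minimal sketch using the experimental API; the optional constructor arguments for choosing compute and parameter devices are left at their defaults:

    import tensorflow as tf

    # Variables live on the CPU; computation is replicated across local GPUs.
    strategy = tf.distribute.experimental.CentralStorageStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
        model.compile(optimizer="sgd", loss="mse")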

    • ParameterServerStrategy

    Multiple machines, multiple GPUs.
    Asynchronous training across workers: each worker computes gradients and updates the variables on the parameter servers independently.
    In this setup, some machines are designated as workers and some as parameter servers. Each variable of the model is placed on one parameter server. Computation is replicated across all GPUs of all the workers. A rough sketch follows.
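    A rough sketch for the Estimator-based workflow supported at the time of writing, assuming TF_CONFIG already describes the "worker" and "ps" tasks (see below); my_model_fn is a placeholder, and newer TF releases use a different, cluster_resolver-based constructor:

    import tensorflow as tf

    # Each variable is placed on a parameter server; workers run the computation.
    strategy = tf.distribute.experimental.ParameterServerStrategy()

    config = tf.estimator.RunConfig(train_distribute=strategy)
    # my_model_fn stands in for a user-defined Estimator model_fn.
    estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=config)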

    • OneDeviceStrategy

    Single machine, single device.
    You can use this strategy to test your code before switching to other strategies that actually distribute the computation to multiple devices/machines. A minimal sketch follows.
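    A minimal sketch; the device string is illustrative, and "/cpu:0" works on machines without a GPU:

    import tensorflow as tf

    # Pins all variables and computation to a single device.
    strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
        model.compile(optimizer="sgd", loss="mse")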

    Setting up the TF_CONFIG environment variable

    For multi-worker training, as mentioned before, you need to set the TF_CONFIG environment variable for each binary running in your cluster. For example:
    import json
    import os

    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {
            "worker": ["host1:port", "host2:port", "host3:port"],
            "ps": ["host4:port", "host5:port"]
        },
        "task": {"type": "worker", "index": 1}
    })
    This TF_CONFIG specifies that there are three worker tasks and two ps tasks in the cluster, along with their hosts and ports. The "task" part specifies the role of the current task in the cluster: worker 1 (the second worker). Valid roles in a cluster are "chief", "worker", "ps", and "evaluator". There should be no "ps" job except when using ParameterServerStrategy.
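    As a small illustration (plain Python, not part of the TensorFlow API), each process can inspect its own role by parsing TF_CONFIG:

    import json
    import os

    tf_config = json.loads(os.environ["TF_CONFIG"])
    task_type = tf_config["task"]["type"]    # e.g. "worker"
    task_index = tf_config["task"]["index"]  # e.g. 1
    print("Running as %s #%d in a cluster with %d workers"
          % (task_type, task_index, len(tf_config["cluster"]["worker"])))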

    References:

    1. How to build a high-performance DGX-1 AI cluster
    2. Understanding NCCL, Nvidia's multi-GPU communication framework. (NCCL 1.0 supports only single-machine multi-GPU setups, with GPUs communicating over PCIe, NVLink, or GPU Direct P2P. NCCL 2.0 adds multi-machine multi-GPU support, with inter-node communication over Sockets (Ethernet) or InfiniBand with GPU Direct RDMA.)
    3. MPI Scatter, Gather, and Allgather
    4. The evolution of the distributed TensorFlow programming model
    5. Essential knowledge for distributed TensorFlow training, explained in one article
    6. Goodbye Horovod, Hello CollectiveAllReduce
    7. A technology map of machine learning: hardware compute. (The in-house EXSPARCL (Extremely Scalable and high Performance Alibaba gRoup Communication Library) provides general collective communication functionality and is compatible with NVIDIA's NCCL. ExSparcl is specifically optimized for the high-speed interconnect architectures and multi-NIC features of large-scale AI clusters, making full use of the interconnect bandwidth between devices to ensure linear scaling of communication and workload performance. By detecting the physical interconnect topology of the cluster and its hosts and choosing optimal network routes, it implements a novel congestion-free algorithm that ensures high-speed communication both within and between nodes.)
    8. NVIDIA Collective Communication Library (NCCL) Documentation
