
TensorFlow Training - Distribute

Author: 左心Chris | Published 2019-10-08 19:16

    https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/distributed.md

    http://sharkdtu.com/posts/dist-tf-evolution.html

    http://download.tensorflow.org/paper/whitepaper2015.pdf

    https://segmentfault.com/a/1190000008376957

    PS-Worker: in-graph vs. between-graph replication
    All-Reduce evolution: naive all-reduce → half-PS (dedicated nodes doing the reduce and broadcast) → butterfly → ring all-reduce
    MirroredStrategy
    MultiWorkerMirroredStrategy
    ParameterServerStrategy

    Comparison of all-reduce and PS-worker

    https://zhuanlan.zhihu.com/p/50116885

    1 Different distribution strategies

    All-reduce

    Adds the parameters (in practice, the gradients) from all replicas together, then synchronizes the result to every machine.
    https://zhuanlan.zhihu.com/p/79030485
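    To make the all-reduce semantics concrete, here is a minimal single-process simulation of ring all-reduce (reduce-scatter followed by all-gather). The function name and chunk layout are illustrative, not from any TensorFlow API:

    ```python
    import numpy as np

    def ring_all_reduce(node_chunks):
        """Toy single-process simulation of ring all-reduce.

        node_chunks[i][c] is chunk c of node i's gradient; with n nodes,
        each gradient is pre-split into n chunks.
        """
        n = len(node_chunks)
        # Phase 1: reduce-scatter. In step s, node i sends chunk (i - s) % n
        # to node (i + 1) % n, which adds it to its own copy. After n - 1
        # steps, node i holds the fully summed chunk (i + 1) % n.
        for s in range(n - 1):
            sends = [(i, (i - s) % n, node_chunks[i][(i - s) % n].copy())
                     for i in range(n)]
            for i, c, payload in sends:
                node_chunks[(i + 1) % n][c] += payload
        # Phase 2: all-gather. Each node forwards its completed chunk around
        # the ring, so every node ends up with every summed chunk.
        for s in range(n - 1):
            sends = [(i, (i + 1 - s) % n, node_chunks[i][(i + 1 - s) % n].copy())
                     for i in range(n)]
            for i, c, payload in sends:
                node_chunks[(i + 1) % n][c] = payload

    # Three nodes, each contributing a gradient filled with its own index.
    grads = [[np.full(2, float(i)) for _ in range(3)] for i in range(3)]
    ring_all_reduce(grads)
    print(grads[0][0])  # [3. 3.] on every node: 0 + 1 + 2
    ```

    Each node sends and receives only 2(n-1)/n of the gradient size per reduction, which is why ring all-reduce scales better than funneling everything through a central node.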

    MirroredStrategy

    Supports synchronous distributed training on multiple GPUs on one machine.
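    A minimal sketch of MirroredStrategy with Keras; the toy model and shapes are illustrative:

    ```python
    import tensorflow as tf

    # One replica per local GPU; gradients are combined with all-reduce.
    strategy = tf.distribute.MirroredStrategy()
    print("Replicas in sync:", strategy.num_replicas_in_sync)

    # Variables created inside the scope are mirrored on every GPU.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")

    # Training proceeds as usual; each batch is split across the replicas.
    # model.fit(dataset, epochs=10)
    ```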

    MultiWorkerMirroredStrategy

    It implements synchronous distributed training across multiple workers, each potentially with multiple GPUs.
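    A sketch of the multi-worker setup: every worker runs the same program, and the cluster layout comes from the TF_CONFIG environment variable. The hostnames below are placeholders, and in TF 2.0 the strategy lives under tf.distribute.experimental:

    ```python
    import json
    import os
    import tensorflow as tf

    # Normally set by the cluster scheduler; shown inline for illustration.
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {"worker": ["worker0.example.com:12345",
                               "worker1.example.com:12345"]},
        "task": {"type": "worker", "index": 0},  # this process's role
    })

    # Synchronous all-reduce training across all workers' GPUs.
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
        model.compile(optimizer="sgd", loss="mse")
    ```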

    ParameterServerStrategy

    Supports parameter-server training on multiple machines. In this setup, some machines are designated as workers and some as parameter servers. Each variable of the model is placed on one parameter server. Computation is replicated across all GPUs of all the workers.
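    A sketch of the corresponding TF_CONFIG for a PS cluster; in TF 2.0 this strategy also lives under tf.distribute.experimental and is typically used with Estimator. Hostnames are placeholders:

    ```python
    import json
    import os
    import tensorflow as tf

    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {
            "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
            "ps": ["ps0.example.com:2222"],
        },
        "task": {"type": "worker", "index": 0},
    })

    # Variables are placed on the ps tasks; computation is replicated
    # across the workers' GPUs.
    strategy = tf.distribute.experimental.ParameterServerStrategy()
    ```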

    2 Evolution

    http://sharkdtu.com/posts/dist-tf-evolution.html

    1 Basic components

    client / master / worker
    A server (host:port) corresponds one-to-one with a task; a cluster is composed of servers, and a group of tasks is called a job. Each server runs two services, a master service and a worker service. A client connects through a session to the master service of any server in the cluster, which partitions the work and dispatches the tasks.
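    These components map directly onto the low-level TF 1.x API from the linked deploy doc; a minimal sketch, with placeholder addresses:

    ```python
    import tensorflow as tf

    # A cluster spec maps job names to task addresses; each host:port is
    # one task, backed by one server.
    cluster = tf.train.ClusterSpec({
        "ps": ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    })

    # Each process starts one server for its (job_name, task_index); the
    # server exposes both a master service and a worker service.
    server = tf.train.Server(cluster, job_name="worker", task_index=0)
    server.join()  # block and serve (what a ps task typically does)
    ```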

    2 PS-based distributed TensorFlow programming model

    The set of parameter-server tasks forms the ps job
    The set of worker tasks forms the worker job

    Low-level distributed programming model
    • Startup
      Start three processes (two workers, one ps):
      define a cluster,
      then launch the three servers
    • In-graph replication
      Model parameters are placed on the ps; replicas of different parts of the graph live on different workers.
      A single client is created and executes the steps above. Suitable for single-machine multi-GPU, not for large-scale multi-machine distributed training.
    • Between-graph replication
      Each worker creates its own client; all clients build the same graph, and parameters are still placed on the ps. If one worker node dies, the system keeps running (see the sketch after this list).
      https://github.com/tensorflow/examples/blob/master/community/en/docs/deploy/distributed.md
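    A minimal between-graph sketch in the TF 1.x style of the linked doc: every worker runs this same script with its own task_index. The addresses and the toy model are placeholders:

    ```python
    import tensorflow as tf

    cluster = tf.train.ClusterSpec({
        "ps": ["ps0.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
    })
    server = tf.train.Server(cluster, job_name="worker", task_index=0)

    # replica_device_setter places variables on the ps job and ops on this
    # worker, so each client builds the same graph independently.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:0", cluster=cluster)):
        x = tf.placeholder(tf.float32, shape=[None, 10])
        w = tf.Variable(tf.zeros([10, 1]))
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

    with tf.Session(server.target) as sess:
        sess.run(tf.global_variables_initializer())
        # sess.run(train_op, feed_dict=...) inside the training loop
    ```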
    High-level distributed programming model

    Use the Estimator and Dataset high-level APIs.
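    A sketch of wiring a distribution strategy into Estimator via RunConfig; the toy regressor and input function are illustrative:

    ```python
    import tensorflow as tf

    # The strategy is passed through RunConfig; Estimator then handles
    # replication, checkpointing, and the training loop.
    strategy = tf.distribute.MirroredStrategy()
    config = tf.estimator.RunConfig(train_distribute=strategy)

    def input_fn():
        features = {"x": tf.constant([[1.0], [2.0], [3.0], [4.0]])}
        labels = tf.constant([[2.0], [4.0], [6.0], [8.0]])
        return tf.data.Dataset.from_tensor_slices(
            (features, labels)).repeat().batch(2)

    estimator = tf.estimator.LinearRegressor(
        feature_columns=[tf.feature_column.numeric_column("x")],
        config=config)
    estimator.train(input_fn=input_fn, steps=10)
    ```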

    3 All-Reduce-based distributed TensorFlow architecture
