Project Adam: Building an Efficient and Scalable Deep Learning Training System


Author: 世间五彩我执纯白 | Published 2016-04-06 15:51

    1. Introduction

    Main contributions:

    • Optimizing and balancing both computation and communication through whole system co-design
    • Achieving high performance and scalability by exploiting the ability of machine learning training to tolerate inconsistencies well
    • Demonstrating that system efficiency, scaling, and asynchrony all contribute to improvements in trained model accuracy

    2. System architecture

    • The shared model is updated asynchronously through a parameter server
    • Adam is a general-purpose system, because SGD can train any DNN model based on back-propagation
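
    A minimal Python sketch of this asynchronous data-parallel pattern (hypothetical names, not the paper's code): each worker computes a gradient on its own mini-batch and pushes the update to the parameter server without waiting for the other workers.

    ```python
    import threading
    import numpy as np

    class ParameterServer:
        """Holds the shared model; workers push updates asynchronously."""
        def __init__(self, dim):
            self.w = np.zeros(dim)

        def push_update(self, delta):
            # Updates are applied without global coordination; the system relies on
            # SGD tolerating the resulting small inconsistencies.
            self.w += delta

        def pull(self):
            return self.w.copy()

    def worker(ps, data, labels, lr=0.01, steps=200):
        for _ in range(steps):
            w = ps.pull()
            i = np.random.randint(len(data))
            x, y = data[i], labels[i]
            grad = (w @ x - y) * x          # gradient of a squared-error linear model
            ps.push_update(-lr * grad)      # push immediately; no barrier across workers

    # Hypothetical usage: several threads training the same shared model.
    dim = 8
    ps = ParameterServer(dim)
    data = np.random.randn(256, dim)
    labels = data @ np.random.randn(dim)
    threads = [threading.Thread(target=worker, args=(ps, data, labels)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    ```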

    2.1. Fast data server

    • Some machines act as dedicated data serving machines that serve the training data, which reduces the load on the model training machines

    2.2. Model training

    • We partition our models vertically across the model worker machines; vertical partitioning minimizes cross-machine communication for the convolutional layers
    • Each machine runs multiple threads that share a single copy of the model and update this local shared model without locks
    • Other optimizations: pass a pointer rather than copy data; exploit cache locality
    • Mitigating stragglers: to avoid fast machines waiting for data from slow machines, threads are allowed to process multiple images in parallel, and an epoch is considered finished once a certain fraction of the images has been processed
    • Communication with the PS uses two strategies. For convolutional layers, updates are accumulated locally and sent to the PS periodically, and the PS adds them directly to the global parameters; this works because of weight sharing. For fully connected layers, which have far more parameters, the activation and error gradient vectors are sent to the PS instead and the matrix multiplication is performed on the PS, which reduces communication overhead (see the sketch after this list).
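
    To see why shipping the two vectors is cheaper for fully connected layers: for a layer with an n_out x n_in weight matrix, the full weight update has n_out * n_in entries, but it is the outer product of an n_out error-gradient vector and an n_in activation vector, so sending the two vectors and forming the product on the parameter server cuts traffic from O(n_out * n_in) to O(n_out + n_in). A minimal numpy sketch with illustrative sizes (not the paper's code):

    ```python
    import numpy as np

    n_in, n_out = 2048, 1024                 # fully connected layer dimensions (illustrative)
    activation = np.random.randn(n_in)       # layer input, computed on the worker
    error_grad = np.random.randn(n_out)      # back-propagated error, computed on the worker

    # Strategy 1 (used for convolutional layers): ship the accumulated update matrix.
    full_update = np.outer(error_grad, activation)         # n_out * n_in values sent

    # Strategy 2 (used for fully connected layers): ship the two vectors and let
    # the parameter server form the outer product locally.
    sent = np.concatenate([error_grad, activation])        # n_out + n_in values sent
    ps_side_update = np.outer(sent[:n_out], sent[n_out:])  # reconstructed on the PS

    assert np.allclose(full_update, ps_side_update)
    print(f"values sent: {full_update.size} vs {sent.size}")   # 2,097,152 vs 3,072
    ```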

    2.3. Global Parameter Server

    • Hash-based storage: parameter shards are hashed into storage buckets that are distributed equally among the parameter server machines (see the sketch after this list)
    • Batched updates: all updates in a batch are applied to one block of parameters before moving on to the next block in the shard
    • Lock free: lock-free data structures are used for queues and hash tables, and memory allocation is lock free as well
    • Inconsistency: DNN models are capable of learning even in the presence of a small number of lost updates, so a small number of delayed updates can be tolerated
    • Fault tolerance: each parameter shard has three copies; the primary uses a two-phase commit protocol when sending updates to the secondary machines
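
    A minimal sketch of the shard-to-bucket placement described above, with hypothetical shard IDs, bucket count, and machine count (not the paper's code): shards hash into a fixed set of buckets, and the buckets are spread evenly across the parameter server machines, so every update for a given shard routes to a single owning machine.

    ```python
    import hashlib

    NUM_BUCKETS = 64          # fixed number of storage buckets (illustrative)
    NUM_PS_MACHINES = 8       # number of parameter server machines (illustrative)

    def bucket_of(shard_id: str) -> int:
        """Hash a parameter shard into a storage bucket."""
        digest = hashlib.md5(shard_id.encode()).hexdigest()
        return int(digest, 16) % NUM_BUCKETS

    def machine_of(bucket: int) -> int:
        """Buckets are distributed equally across the parameter server machines."""
        return bucket % NUM_PS_MACHINES

    # Hypothetical usage: route a shard's update to its owning PS machine.
    shard = "fc7/weights/shard-42"
    b = bucket_of(shard)
    print(f"shard {shard!r} -> bucket {b} -> PS machine {machine_of(b)}")
    ```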
