1. Introduction
Main contributions:
- Optimizing and balancing both computation and communication through whole system co-design
- Achieving high performance and scalability by exploiting the ability of machine learning training to tolerate inconsistencies well
- Demonstrating that system efficiency, scaling, and asynchrony all contribute to improvements in trained model accuracy
2. System architecture
- The shared model is updated asynchronously through a parameter server.
- Adam is a general-purpose system, because SGD can train any DNN model that is based on backpropagation (BP).
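The asynchronous update pattern can be sketched as below. This is a toy single-process simulation under stated assumptions, not Adam's implementation: `ParameterServer`, `gradient`, and `train_async` are hypothetical names, the "model" is a single weight fitted to y = 3x, and asynchrony is imitated by letting each worker compute its gradient from a stale copy of the weight.

```python
class ParameterServer:
    """Holds the shared model (here a single weight) and applies updates as they arrive."""

    def __init__(self, w0):
        self.w = w0

    def pull(self):
        return self.w

    def push(self, delta):
        # No barrier and no version check: updates are simply added.
        self.w += delta


def gradient(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2.0 * (w * x - y) * x


def train_async(num_workers=4, steps=200, lr=0.05):
    # Toy data set: y = 3x for x in 0.1 .. 1.0 (illustrative only).
    data = [(i / 10.0, 3.0 * i / 10.0) for i in range(1, 11)]
    ps = ParameterServer(0.0)
    # Each worker keeps the copy of the weight it last pulled.
    stale = [ps.pull() for _ in range(num_workers)]
    for step in range(steps):
        k = step % num_workers                 # workers take turns
        x, y = data[step % len(data)]
        # The gradient uses worker k's stale copy, which misses the
        # updates pushed by the other workers since k last pulled.
        ps.push(-lr * gradient(stale[k], x, y))
        stale[k] = ps.pull()
    return ps.w


if __name__ == "__main__":
    print(train_async())
```

Even though every gradient is computed from an out-of-date weight, the shared weight still approaches 3 in this toy setting, which is the intuition behind tolerating asynchrony.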
2.1. Fast data server
- Some machines act as dedicated data serving machines; they provide the training data, which reduces the load on the model training machines.
2.2. Model training
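- We partition our models vertically across the model worker machines; vertical partitioning minimizes cross-machine communication for the convolutional layers.
- Each machine runs multiple threads that share a single copy of the model and update the local shared model without locks.
- Other optimizations: pass a pointer rather than copy data; cache locality.
- Mitigating stragglers: to keep fast machines from waiting on data from slow machines, threads are allowed to process multiple images in parallel; an epoch is considered finished as soon as a certain number of images has been processed.
- PS communication: two strategies. (1) Accumulate updates locally and send them to the PS periodically; the PS adds them directly to the global parameters. This works well for convolutional layers because of weight sharing. (2) For fully connected layers, which have many more parameters, send the activation and gradient vectors to the PS and compute the matrix multiplication on the PS, which reduces communication overhead.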
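A rough sketch of why sending vectors helps for fully connected layers, under stated assumptions: for a layer with n_in inputs and n_out outputs, a weight update has n_in * n_out values, while K activation vectors plus K error-gradient vectors have only K * (n_in + n_out) values, and the receiver can rebuild the same weight gradient as a sum of outer products. The function names and the layer sizes are hypothetical and not taken from the paper.

```python
def outer_sum(activations, errors):
    """Rebuild the weight gradient on the receiving side.

    activations: one input activation vector per image (length n_in)
    errors:      one error-gradient vector per image (length n_out)
    Returns an n_out x n_in matrix: the sum over images of outer(error, activation).
    """
    rows, cols = len(errors[0]), len(activations[0])
    grad = [[0.0] * cols for _ in range(rows)]
    for a, e in zip(activations, errors):
        for i in range(rows):
            for j in range(cols):
                grad[i][j] += e[i] * a[j]
    return grad


def floats_sent(n_in, n_out, batch):
    """Number of values transferred under each strategy."""
    return {
        "weight_update": n_in * n_out,       # send the full update matrix
        "vectors": batch * (n_in + n_out),   # send activations and error gradients
    }


if __name__ == "__main__":
    # Illustrative sizes only.
    print(floats_sent(n_in=2048, n_out=2048, batch=16))
    print(outer_sum([[1.0, 2.0], [0.0, 1.0]], [[3.0, 4.0], [1.0, 1.0]]))
```

The vector strategy wins whenever batch * (n_in + n_out) is smaller than n_in * n_out, which holds for large fully connected layers and modest batch sizes; the cost is that the matrix multiplication moves to the PS.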
2.3. Global Parameter Server
- Hashed storage. Shards are hashed into storage buckets that are distributed equally among the parameter server machines.
- Batch updates. Apply all updates in a batch to a block of parameters before moving to the next block in the shard.
- Lock free. We use lock free data structures for queues and hash tables. In addition, we implement lock free memory allocation.
- Inconsistency. DNN models are capable of learning even in the presence of small amounts of lost updates; a small number of delayed updates can be tolerated as well.
- Fault tolerance. Each parameter shard has 3 copies; the primary uses a 2-phase commit protocol when sending updates to the secondary machines.
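The hashed storage and batch update points can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the notes do not say which hash function is used or how buckets are assigned to machines, so CRC32 and round-robin assignment are stand-ins, and `bucket_for`, `machine_for`, and `apply_batch` are hypothetical names.

```python
import zlib
from collections import Counter, defaultdict


def bucket_for(shard_id, num_buckets):
    """Hash a shard identifier into a storage bucket (CRC32 is an assumed stand-in)."""
    return zlib.crc32(shard_id.encode("utf-8")) % num_buckets


def machine_for(bucket, num_machines):
    """Spread buckets equally over machines (round-robin is an assumed policy)."""
    return bucket % num_machines


def apply_batch(shard, updates, block_size):
    """Apply a batch of (index, delta) updates block by block.

    All updates that touch one block of parameters are applied before
    moving on to the next block, so each block is visited once per batch.
    """
    by_block = defaultdict(list)
    for index, delta in updates:
        by_block[index // block_size].append((index, delta))
    for block in sorted(by_block):
        for index, delta in by_block[block]:
            shard[index] += delta
    return shard


if __name__ == "__main__":
    # 12 buckets over 3 machines: each machine holds 4 buckets.
    print(Counter(machine_for(b, 3) for b in range(12)))
    print(bucket_for("layer3/shard7", 12))
    print(apply_batch([0.0] * 8, [(5, 1.0), (0, 2.0), (5, 0.5), (1, -1.0)], block_size=4))
```

Grouping updates by block keeps the working set of each pass small, which is the cache-friendly behavior the batch update point is after.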