Distributed Deep Learning Training: FireCaffe as a Case Study

Author: MatrixOnEarth | Published 2020-03-09 13:17


I wrote this article in 2015. Rereading it recently, I found that the landscape has hardly changed at all.

Paper

Forrest N. Iandola et al., FireCaffe: near-linear acceleration of deep neural network training on compute clusters, 2016.1

Problem statements from data scientists

Four key pain points summarized by Jeff Dean from Google:

- DNN researchers and users want results of experiments quickly.
- There is a "patience threshold": no one wants to wait more than a few days or a week for a result.
- This significantly affects the scale of problems that can be tackled.
- We sometimes optimize for experiment turnaround time, rather than for the absolute minimal system resources needed to perform the experiments.

[Figure: turnaround time impacts for DL projects]

Problem analysis

The speed and scalability of distributed algorithms are almost always limited by the overhead of communicating between servers; DNN training is no exception to this rule.
So the design focuses on reducing communication overhead, in two ways:

1. Upgrade to high-throughput interconnects such as InfiniBand (IB).
2. Decrease the volume of data transmitted during training, which includes:
- Balancing carefully between data parallelism and model parallelism.
- Increasing the batch size to reduce how often workers communicate, and identifying hyperparameters suitable for large batch sizes.
- Balancing the communication volume among nodes to avoid dependence on a single node.

Key take-aways

Parallelism Scheme: Model parallelism or Data Parallelism

Model parallelism

[Figure: model parallelism]

Each worker gets a subset of the model parameters, and the workers communicate by exchanging data gradients and activations. The volume of data exchanged grows with the size of the activations and with the batch size.

Data parallelism

[Figure: data parallelism]

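Each worker gets a subset of the batch, and the workers then communicate by exchanging weight gradient updates $\nabla W$. $W$ and $\nabla W$ have the same data quantity:
$|W| = \sum_{L=1}^{\#layers} ch_L * numFilt_L * filterW_L * filterH_L$
Here $ch_L$ is the number of input channels of layer $L$, $numFilt_L$ is its number of filters, and $filterW_L$ and $filterH_L$ are the filter width and height.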
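As a rough illustration of the $|W|$ formula, the sketch below sums the weight count over a list of layers. The layer shapes are hypothetical and chosen only to show the arithmetic; they are not taken from the paper. A fully connected layer is treated here as a layer with 1x1 filters, which is an assumption made for this example.

```python
def weight_count(layers):
    """Sum ch * numFilt * filterW * filterH over all layers.

    Each layer is a tuple (ch, num_filt, filter_w, filter_h).
    """
    return sum(ch * num_filt * fw * fh for ch, num_filt, fw, fh in layers)


# Hypothetical layer shapes, for illustration only.
example_layers = [
    (3, 64, 3, 3),       # a small convolutional layer
    (64, 128, 3, 3),     # another convolutional layer
    (4096, 1000, 1, 1),  # a fully connected layer modeled with 1x1 filters
]

num_weights = weight_count(example_layers)
# Assuming 32-bit floats, each weight (and each gradient entry) takes 4 bytes.
gradient_bytes = num_weights * 4
print(f"|W| = {num_weights} weights, about {gradient_bytes / 1e6:.1f} MB per gradient exchange")
```

In this toy example the fully connected layer dominates the total, even though it is a single layer.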

Convolutional layers and fully connected (fc) layers differ in their data-to-weight ratio: convolutional layers have relatively few weights and large activations, whereas fc layers have many weights and small activations. So they can use different parallelism schemes.

[Figure: different models prefer different parallelism schemes]

So a basic conclusion is: convolutional layers suit data parallelism, and fc layers suit model parallelism.
Furthermore, for more advanced CNNs such as GoogLeNet and ResNet, we can use data parallelism directly, as this paper does.

Gradient Aggregation Scheme: Parameter Server or Reduction Tree

The following figure shows how the parameter server and the reduction tree work in data parallelism.

[Figure: gradient aggregation scheme]

Parameter Server

Parameter communication time with respect to the number of workers $p$ in the parameter server scheme:
$param\_server\_communication\_time=\dfrac{|\nabla W| * p}{BW}$
The communication time scales linearly with the number of workers, so a single parameter server becomes a scalability bottleneck.
Microsoft Adam and Google DistBelief relieve this issue by defining a pool of nodes that collectively behave as a parameter server. The bigger the parameter server hierarchy gets, the more it looks like a reduction tree.

Reduction Tree

The idea is the same as allreduce in the message-passing model. Parameter communication time with respect to the number of workers $p$ in the reduction tree scheme:
$reduction\_tree\_communication\_time=\dfrac{|\nabla W| * 2\log_{2}(p)}{BW}$
It scales logarithmically with the number of workers.
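To make the difference between the two formulas concrete, here is a minimal sketch that evaluates both for a growing number of workers. The gradient size and bandwidth are hypothetical values picked for illustration, and the function names are my own.

```python
import math


def param_server_time(grad_bytes, workers, bandwidth):
    """Communication time for a single parameter server: |grad W| * p / BW."""
    return grad_bytes * workers / bandwidth


def reduction_tree_time(grad_bytes, workers, bandwidth):
    """Communication time for a reduction tree: |grad W| * 2 * log2(p) / BW."""
    return grad_bytes * 2 * math.log2(workers) / bandwidth


# Hypothetical numbers: 50 MB of gradients, 1 GB/s of bandwidth.
GRAD_BYTES = 50e6
BANDWIDTH = 1e9

for p in (2, 4, 8, 32, 128):
    ps = param_server_time(GRAD_BYTES, p, BANDWIDTH)
    tree = reduction_tree_time(GRAD_BYTES, p, BANDWIDTH)
    print(f"p={p:4d}  parameter server: {ps:6.2f} s  reduction tree: {tree:6.2f} s")
```

Under these formulas the two schemes cost the same at $p=2$ and $p=4$, and the reduction tree pulls ahead as $p$ grows beyond that.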

[Figure: reduction tree scalability]

Batch size selection

A larger batch size leads to less frequent communication and therefore enables more scalability in a distributed setting. But for a larger batch size, we need to identify a suitable hyperparameter setting to maintain the speed and accuracy of DNN training.
Hyperparameters include:

- Initial learning rate $initial\_lr$
- Learning rate update scheme
- Weight decay $\omega$
- Momentum $\mu$

Weight update rule used, where $i$ is the iteration index:
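For illustration, here is a minimal sketch of one common form of such an update: SGD with momentum and weight decay, in the style of Caffe's SGD solver. This is an assumption for the sake of an example and not necessarily the exact rule used in the paper; the function name and default values are hypothetical.

```python
def sgd_momentum_step(weights, grads, velocity, lr, momentum=0.9, weight_decay=0.0002):
    """One SGD step with momentum (mu) and weight decay (omega).

    v_{i+1} = mu * v_i - lr * (grad_i + omega * w_i)
    w_{i+1} = w_i + v_{i+1}
    """
    new_velocity = [
        momentum * v - lr * (g + weight_decay * w)
        for w, g, v in zip(weights, grads, velocity)
    ]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Two steps on a single weight, with weight decay switched off for readability.
w, v = [1.0], [0.0]
w, v = sgd_momentum_step(w, [0.5], v, lr=0.1, momentum=0.9, weight_decay=0.0)
w, v = sgd_momentum_step(w, [0.5], v, lr=0.1, momentum=0.9, weight_decay=0.0)
print(w, v)
```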

Learning rate update rule:
$lr = lr_{0}(1 - \dfrac{iter}{max\_iter})^{\alpha}, \alpha = 0.5$
On how to get hyperparameters according to batch size, I will write another article for this.
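The learning rate update rule above can be sketched in a few lines. The function name and the example values of the initial learning rate and `max_iter` are hypothetical.

```python
def poly_lr(initial_lr, iteration, max_iter, alpha=0.5):
    """Polynomial decay: lr = lr_0 * (1 - iter / max_iter) ** alpha."""
    if not 0 <= iteration <= max_iter:
        raise ValueError("iteration must be in [0, max_iter]")
    return initial_lr * (1.0 - iteration / max_iter) ** alpha


# Hypothetical schedule: start at 0.1 and decay over 1000 iterations.
for it in (0, 250, 500, 750, 1000):
    print(f"iter={it:4d}  lr={poly_lr(0.1, it, 1000):.4f}")
```

With $\alpha = 0.5$ the rate decays slowly at first and drops to zero only at the final iteration.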

Results

Final results on a GPU cluster with GoogLeNet.

[Figure: results]

Further thoughts

The schemes above are essentially lossless. To reduce communication overhead further, people have started to try lossy schemes that trade off training speed against accuracy. Typical examples:
1. Reduce parameter size using 16-bit floating point - Google
2. Use 16-bit weights and 8-bit activations.
3. 1-bit gradients in backpropagation - Microsoft
4. Discard gradients whose numerical values fall below a certain threshold - Amazon
5. Compress (e.g. using PCA) weights before transmitting
6. Network pruning/encoding/quantization - Intel, DeePhi
Use new low-level technologies to reduce communication overhead:
1. RDMA rather than traditional TCP/IP?
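As one example of the lossy direction, the idea of discarding small gradients (item 4 in the first list) can be sketched as follows. This is a simplified illustration under my own assumptions, not the implementation used by any of the vendors named above; the function name, the threshold, and the gradient values are hypothetical.

```python
def sparsify_gradients(grads, threshold):
    """Keep only gradients whose magnitude is at least `threshold`.

    Returns (index, value) pairs, which is all a worker would need to transmit.
    """
    return [(i, g) for i, g in enumerate(grads) if abs(g) >= threshold]


# Hypothetical gradient vector and threshold.
gradients = [0.5, -0.01, 0.002, -0.3, 0.0004, 0.12]
to_send = sparsify_gradients(gradients, threshold=0.1)
print(f"sending {len(to_send)} of {len(gradients)} values: {to_send}")
```

The saving depends on how many gradients fall below the threshold, and each transmitted value now carries an index as well, so the scheme only pays off when the surviving set is small.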

References

https://www.slideshare.net/AIFrontiers/jeff-dean-trends-and-developments-in-deep-learning-research
