Prerequisites
- (For multi-node, multi-GPU runs) A distributed or network file share is mounted so that code and data live at identical paths on every node when Keras runs. Alternatively, copying identical code and data to each host may work, but I have not tested that approach.
- Horovod and Keras are installed successfully.
Keras code structure for Horovod-based distributed training:
import tensorflow as tf
import keras
from keras import backend as K
import horovod.keras as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))
# Build model...
loss = ...
model = net()
model.summary()

# Horovod: scale the learning rate by the number of workers
opt = keras.optimizers.Adam(lr=1.0 * hvd.size())
# Horovod: wrap the optimizer so gradients are averaged across workers
opt = hvd.DistributedOptimizer(opt)
model.compile(optimizer=opt, loss=loss, metrics=['mse'])
callbacks = [
    # Horovod: broadcast initial variable states from rank 0 to all other processes.
    # This is necessary to ensure consistent initialization of all workers when
    # training is started with random weights or restored from a checkpoint.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),

    # Horovod: average metrics among workers at the end of every epoch.
    #
    # Note: This callback must be in the list before the ReduceLROnPlateau,
    # TensorBoard or other metrics-based callbacks.
    hvd.callbacks.MetricAverageCallback(),

    # Horovod: using `lr = 1.0 * hvd.size()` from the very beginning leads to worse final
    # accuracy. Scale the learning rate `lr = 1.0` ---> `lr = 1.0 * hvd.size()` during
    # the first five epochs. See https://arxiv.org/abs/1706.02677 for details.
    hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1),

    # Reduce the learning rate if training plateaus.
    keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1),
]
# Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them.
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))
# Alternatively, use model.fit_generator()
model.fit(train_shuffled, train_mask_shuffled, batch_size=10, epochs=200, verbose=1, shuffle=True,
          validation_data=(val_shuffled, val_mask_shuffled), callbacks=callbacks)
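As a sketch of the `model.fit_generator()` alternative mentioned in the comment above: the generator below yields shuffled mini-batches indefinitely, as `fit_generator` expects. The function name `batch_generator` and its arguments are illustrative, not part of the original code; a common Horovod convention (assumed here, not stated in the source) is to divide `steps_per_epoch` by `hvd.size()` so each worker processes a distinct fraction of an epoch.

```python
import numpy as np

def batch_generator(x, y, batch_size):
    """Endless generator of shuffled mini-batches for model.fit_generator().

    x / y stand in for arrays like train_shuffled / train_mask_shuffled."""
    n = len(x)
    while True:
        idx = np.random.permutation(n)          # reshuffle every epoch
        for start in range(0, n, batch_size):
            sel = idx[start:start + batch_size]
            yield x[sel], y[sel]

# Hypothetical usage (names from the listing above):
# model.fit_generator(
#     batch_generator(train_shuffled, train_mask_shuffled, 10),
#     steps_per_epoch=len(train_shuffled) // (10 * hvd.size()),
#     epochs=200, callbacks=callbacks,
#     validation_data=batch_generator(val_shuffled, val_mask_shuffled, 10),
#     validation_steps=len(val_shuffled) // (10 * hvd.size()))
```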
Launch scripts:
Single-node, multi-GPU:
#!/usr/bin/env bash
mpirun -np 2 \
-H localhost:2 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-mca pml ob1 -mca btl ^openib \
python horovod_keras.py
Multi-node, multi-GPU:
#!/usr/bin/env bash
mpirun -np 3 \
-H node1:2,node2:1 \
-bind-to none -map-by slot \
-x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
-x NCCL_SOCKET_IFNAME=^lo,docker0,cni0,virbr0 \
-mca pml ob1 -mca btl ^openib \
-mca btl_tcp_if_exclude lo,docker0,cni0,virbr0 \
python horovod_keras.py
Note: `-H node1:2,node2:1` specifies host:process-count pairs (one process per GPU), so the counts must sum to the value passed to `-np`, i.e. 3 here.
Areas for improvement:
- Currently, every host loads the entire dataset when running Keras with Horovod. Two possible fixes: use fit_generator() so each worker streams only the batches it actually trains on, or use TensorFlow's input-pipeline utilities to shard the data so that each host loads a different subset.
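The per-host sharding idea above can be sketched as a strided split keyed on the worker's rank. In a real run, `rank` and `size` would come from `hvd.rank()` and `hvd.size()`; they are plain ints here (an assumption of this sketch) so it runs without Horovod installed.

```python
import numpy as np

def shard(data, rank, size):
    """Strided split: worker `rank` keeps every `size`-th sample.

    `rank` / `size` stand in for hvd.rank() / hvd.size()."""
    return data[rank::size]

# Example: 10 samples split across 3 workers; every sample lands on
# exactly one worker, so no host needs to load the full dataset.
data = np.arange(10)
shards = [shard(data, r, 3) for r in range(3)]
```

A strided split keeps shard sizes balanced to within one sample even when the dataset size is not divisible by the worker count.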