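This post draws on the article "Pytorch自动混合精度(AMP)介绍与使用" (an introduction to PyTorch automatic mixed precision and its usage) and on the official PyTorch documentation. If it infringes any rights, please contact the author to have it removed.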
1. Introducing AMP
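Mainstream deep learning frameworks generally default to 32-bit floating-point arithmetic, and PyTorch is no exception: a Tensor created in PyTorch is a FloatTensor by default, stored as 32-bit single-precision floats. Deep learning workloads are usually compute-intensive, so is there a way to reduce the amount of computation while keeping the loss of accuracy acceptable?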
>>> import torch
>>> t1 = torch.zeros(2, 2)
>>> t1.type()
'torch.FloatTensor'
>>> t2 = torch.Tensor([0, 0])
>>> t2.type()
'torch.FloatTensor'
# Tensor types in PyTorch
torch.FloatTensor(32bit floating point)
torch.DoubleTensor(64bit floating point)
torch.HalfTensor(16bit floating point, float16: 1 sign, 5 exponent, 10 significand bits)
torch.BFloat16Tensor(16bit floating point, bfloat16: 1 sign, 8 exponent, 7 significand bits)
torch.ByteTensor(8bit integer(unsigned))
torch.CharTensor(8bit integer(signed))
torch.ShortTensor(16bit integer(signed))
torch.IntTensor(32bit integer(signed))
torch.LongTensor(64bit integer(signed))
torch.BoolTensor(Boolean)
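To make the list above more concrete, the following minimal sketch prints the storage size and numeric range of the four floating-point dtypes. It only uses torch.finfo and Tensor.element_size(); the variable names are illustrative and not from the original post.

```python
import torch

# Floating-point dtypes from the list above.
floating_dtypes = (torch.float64, torch.float32, torch.float16, torch.bfloat16)

# Bytes used by a single element of each dtype.
bytes_per_element = {
    dtype: torch.zeros(1, dtype=dtype).element_size() for dtype in floating_dtypes
}

for dtype in floating_dtypes:
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} bytes={bytes_per_element[dtype]} "
          f"max={info.max:.3e} eps={info.eps:.3e}")
```

float16 and bfloat16 both take 2 bytes, but they split those bits differently: float16 has a much smaller maximum value, while bfloat16 keeps roughly the range of float32 at a coarser resolution.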
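In 2017 NVIDIA proposed a solution based on mixed-precision training. It combines single precision (Float32) with half precision (Float16) during network training and, with the same hyperparameters, reaches almost the same accuracy as FP32. The algorithms were shipped in the apex package. In PyTorch 1.6 the approach was wrapped further and became the torch.cuda.amp package.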
torch.cuda.amp provides convenience methods for mixed precision, where some operations use the torch.float32 (float) datatype and other operations use torch.float16 (half). Some ops, like linear layers and convolutions, are much faster in float16. Other ops, like reductions, often require the dynamic range of float32. Mixed precision tries to match each op to its appropriate datatype, which can reduce your network’s runtime and memory footprint.
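AMP (automatic mixed precision) means training with a mix of Float32 and Float16 tensors. Which precision a given operation uses is decided by the framework, which converts automatically when needed. With torch.cuda.amp the model and optimizer are still created in the default precision (as the example in section 2.3 notes), and the casts happen per operation. Compared with computing everything in Float32, this brings several advantages: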
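- Lower GPU memory usage, so the batch size can be made larger;
- Faster computation, which speeds up both training and inference;
- It follows the trend toward low-precision computing.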
Why not simply use Float16 everywhere? Because in some cases Float16 causes underflow and rounding errors.
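Both failure modes can be reproduced without a GPU. The sketch below is an illustration that is not part of the original post: it uses the standard library's struct module, whose "e" format is IEEE 754 half precision, to round Python floats to float16. The helper name to_fp16 and the sample values are hypothetical.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Underflow: a tiny gradient is below the smallest positive float16 value
# (about 6e-8), so it becomes exactly zero.
tiny_gradient = 1e-8
underflowed = to_fp16(tiny_gradient)

# Rounding error: near 1.0 the spacing between float16 values is 2**-10
# (about 0.001), so a small update is rounded away entirely.
weight, update = 1.0, 1e-4
rounded_sum = to_fp16(weight + update)

print(underflowed, rounded_sum)
```

In the first case the gradient information is lost; in the second the weight never changes, no matter how many such updates are applied.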
2. How to use it (PyTorch 1.6+)
Mixed precision primarily benefits Tensor Core-enabled architectures (Volta, Turing, Ampere). This recipe should show significant (2-3X) speedup on those architectures. On earlier architectures (Kepler, Maxwell, Pascal), you may observe a modest speedup. Run nvidia-smi to display your GPU’s architecture.
Automatic mixed precision training uses two classes together: torch.cuda.amp.autocast and torch.cuda.amp.GradScaler.
2.1 Adding autocast
Instances of torch.cuda.amp.autocast enable autocasting for chosen regions. Autocasting automatically chooses the precision for GPU operations to improve performance while maintaining accuracy.
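autocast can be used either as a context manager or as a decorator. Inside an autocast region, eligible CUDA ops cast their tensor inputs to half precision, which speeds up computation with little or no loss of training accuracy. There is no need to call .half() by hand; the framework performs the conversions automatically.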
for epoch in range(0):  # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        # Runs the forward pass under autocast.
        with torch.cuda.amp.autocast():
            output = net(input)
            # output is float16 because linear layers autocast to float16.
            assert output.dtype is torch.float16
            loss = loss_fn(output, target)
            # loss is float32 because mse_loss layers autocast to float32.
            assert loss.dtype is torch.float32
        # Exits autocast before backward().
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        loss.backward()
        opt.step()
        opt.zero_grad()  # set_to_none=True here can modestly improve performance
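The snippet above relies on net, data and opt being defined elsewhere. As a self-contained illustration of the two usage styles (context manager and decorator), here is a minimal sketch. It rests on assumptions beyond the original post: it uses the device-generic torch.autocast (available in later PyTorch releases, 1.10 and newer) instead of torch.cuda.amp.autocast, and it falls back to CPU with bfloat16 when no GPU is present so that it can run anywhere. The layer sizes and the name forward_with_decorator are made up.

```python
import torch
from torch import nn

# Assumption: float16 on a CUDA device (as in the text), bfloat16 on CPU.
use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

net = nn.Linear(8, 4).to(device)
x = torch.randn(2, 8, device=device)

# Outside any autocast region the layer runs in its default precision.
outside_dtype = net(x).dtype

# Context-manager style: ops inside the block are autocast.
with torch.autocast(device_type=device, dtype=amp_dtype):
    inside_dtype = net(x).dtype

# Decorator style: the whole function body runs under autocast.
@torch.autocast(device_type=device, dtype=amp_dtype)
def forward_with_decorator(inputs):
    return net(inputs)

decorated_dtype = forward_with_decorator(x).dtype

# The parameters themselves are not converted; only the op inputs are cast.
param_dtype = net.weight.dtype
print(outside_dtype, inside_dtype, decorated_dtype, param_dtype)
```

Note that no .half() call appears anywhere: the layer's weights stay in float32 and the cast happens per op inside the autocast region.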
2.2 Adding GradScaler
Instances of torch.cuda.amp.GradScaler help perform the steps of gradient scaling conveniently. Gradient scaling improves convergence for networks with float16 gradients by minimizing gradient underflow.
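The idea behind gradient scaling can be shown with plain numbers. The sketch below is only an illustration, not GradScaler itself: it rounds values to float16 with the standard library's struct module (the helper to_fp16 and the sample gradient are hypothetical) and uses 2**16, which is GradScaler's default initial scale.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

true_grad = 1e-8      # too small for float16
scale = 2.0 ** 16     # default init_scale of GradScaler

# Without scaling, the gradient underflows to zero in float16.
unscaled_fp16 = to_fp16(true_grad)

# Scaling the loss scales every gradient by the same factor (chain rule),
# which moves the value into float16's representable range.
scaled_fp16 = to_fp16(true_grad * scale)

# Unscaling in higher precision recovers the gradient, up to float16 rounding.
recovered = scaled_fp16 / scale

print(unscaled_fp16, scaled_fp16, recovered)
```

Because the loss is multiplied by the scale before backward(), every gradient is multiplied by it too, and dividing by the same factor before the optimizer step leaves the update unchanged apart from rounding.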
# Constructs scaler once, at the beginning of the convergence run, using default args.
# If your network fails to converge with default GradScaler args, please file an issue.
# The same GradScaler instance should be used for the entire convergence run.
# If you perform multiple convergence runs in the same script, each run should use
# a dedicated fresh GradScaler instance. GradScaler instances are lightweight.
scaler = torch.cuda.amp.GradScaler()

for epoch in range(0):  # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)
        # Scales loss. Calls backward() on scaled loss to create scaled gradients.
        scaler.scale(loss).backward()
        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(opt)
        # Updates the scale for next iteration.
        scaler.update()
        opt.zero_grad()  # set_to_none=True here can modestly improve performance
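The comments above describe a small state machine: unscale, skip the step if any gradient is inf or NaN, then adjust the scale. The following pure-Python sketch models that logic in a simplified form. TinyScaler is a hypothetical class written for this illustration and is not the real GradScaler implementation; the default values mirror GradScaler's documented defaults (init_scale=2**16, growth_factor=2.0, backoff_factor=0.5, growth_interval=2000).

```python
import math

class TinyScaler:
    """Hypothetical, simplified model of dynamic loss scaling."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.good_steps = 0
        self.found_inf = False

    def step(self, scaled_grads, apply_update):
        # Unscale, then run the optimizer update only if all gradients are finite.
        grads = [g / self.scale for g in scaled_grads]
        self.found_inf = any(math.isinf(g) or math.isnan(g) for g in grads)
        if not self.found_inf:
            apply_update(grads)

    def update(self):
        # Shrink the scale after an overflow; grow it after a run of clean steps.
        if self.found_inf:
            self.scale *= self.backoff_factor
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps == self.growth_interval:
                self.scale *= self.growth_factor
                self.good_steps = 0

weights = [1.0]

def sgd(grads, lr=0.1):
    weights[0] -= lr * grads[0]

scaler = TinyScaler(growth_interval=2)  # short interval to keep the demo small

scaler.step([float("inf")], sgd)        # overflow: the update is skipped
scaler.update()
scale_after_overflow = scaler.scale

for _ in range(2):                      # two clean steps with a true gradient of 0.5
    scaler.step([0.5 * scaler.scale], sgd)
    scaler.update()

print(scale_after_overflow, scaler.scale, weights[0])
```

The skipped step leaves the weights untouched, which is why an occasional overflow costs one iteration rather than corrupting the model.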
2.3 A typical mixed-precision training example
import torch.optim as optim
from torch.cuda.amp import GradScaler, autocast

# Creates model and optimizer in default precision
model = Net().cuda()
optimizer = optim.SGD(model.parameters(), ...)

# Creates a GradScaler once at the beginning of training.
scaler = GradScaler()

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()
        # Runs the forward pass with autocasting.
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
        # Scales loss. Calls backward() on scaled loss to create scaled gradients.
        # Backward passes under autocast are not recommended.
        # Backward ops run in the same dtype autocast chose for corresponding forward ops.
        scaler.scale(loss).backward()
        # scaler.step() first unscales the gradients of the optimizer's assigned params.
        # If these gradients do not contain infs or NaNs, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(optimizer)
        # Updates the scale for next iteration.
        scaler.update()
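The template above leaves Net, loss_fn, data and the optimizer arguments open. A complete toy version might look like the sketch below. Everything specific in it is an assumption made for illustration: the linear-regression task, the learning rate, the use of the device-generic torch.autocast, and the CPU fallback with bfloat16 and a disabled GradScaler so that the script runs without a GPU. Newer PyTorch releases may emit a deprecation warning for torch.cuda.amp.GradScaler and point to torch.amp.GradScaler instead.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Assumption: float16 plus gradient scaling on CUDA, bfloat16 without scaling on CPU.
use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

# Model and optimizer are created in default (float32) precision.
model = nn.Linear(16, 1).to(device)
optimizer = optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

# One scaler for the whole run; with enabled=False its methods are pass-throughs.
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

# Toy regression data: the target is the sum of the input features.
x = torch.randn(256, 16, device=device)
y = x.sum(dim=1, keepdim=True)

losses = []
for step in range(20):
    optimizer.zero_grad()
    with torch.autocast(device_type=device, dtype=amp_dtype):
        output = model(x)                   # low-precision forward pass
        loss = loss_fn(output.float(), y)   # loss computed in float32
    scaler.scale(loss).backward()           # backward runs outside autocast
    scaler.step(optimizer)                  # unscales, skips the step on inf/NaN
    scaler.update()                         # adjusts the scale for the next step
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The explicit output.float() keeps the loss in float32 on every backend; on CUDA, autocast already runs mse_loss in float32, as the comments in section 2.1 point out.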
3. Distributed multi-GPU training
The issues described in this section only affect autocast; GradScaler is used in the same way as before.
3.1 DataParallel in a single process
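torch.nn.DataParallel spawns a thread on each device to run the forward pass. autocast state is thread-local, so according to the PyTorch 1.6 documentation that this post follows, the code below does not autocast inside the replicas. (Later versions of the PyTorch documentation state that DataParallel propagates the autocast state to its worker threads, so check the behavior for the version you use.)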
model = MyModel()
dp_model = nn.DataParallel(model)

# Sets autocast in the main thread
with autocast():
    # dp_model's internal threads won't autocast. The main thread's autocast state has no effect.
    output = dp_model(input)
    # loss_fn still autocasts, but it's too late...
    loss = loss_fn(output)
The fix is simple: apply autocast to the model's forward method, either as a decorator or as a context manager inside the method.
class MyModel(nn.Module):
    ...
    @autocast()
    def forward(self, input):
        ...

# Alternatively
class MyModel(nn.Module):
    ...
    def forward(self, input):
        with autocast():
            ...
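The thread-local behavior, and why decorating forward helps, can be observed without multiple GPUs. The sketch below is an illustration under stated assumptions: a plain threading.Thread stands in for a worker thread that does not inherit the caller's autocast state, CPU autocast with bfloat16 (via torch.autocast, available in later PyTorch releases) replaces torch.cuda.amp.autocast so the code runs anywhere, and the class and function names are hypothetical.

```python
import threading

import torch
from torch import nn

class PlainModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)

class AutocastModel(PlainModel):
    # autocast is entered inside forward, so it is active in whichever thread runs it.
    @torch.autocast(device_type="cpu", dtype=torch.bfloat16)
    def forward(self, x):
        return self.fc(x)

def dtype_in_worker(model, x):
    """Run the forward pass in a fresh thread and report the output dtype."""
    result = {}

    def work():
        with torch.no_grad():
            result["dtype"] = model(x).dtype

    worker = threading.Thread(target=work)
    worker.start()
    worker.join()
    return result["dtype"]

x = torch.randn(3, 4)

# autocast is enabled in the main thread only.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    main_dtype = PlainModel()(x).dtype
    plain_worker_dtype = dtype_in_worker(PlainModel(), x)
    decorated_worker_dtype = dtype_in_worker(AutocastModel(), x)

print(main_dtype, plain_worker_dtype, decorated_worker_dtype)
```

The plain model autocasts in the main thread but not in the worker, while the model with the decorated forward autocasts in both, which is the effect the fix above relies on.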
3.2 DDP one GPU per process
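The documentation for torch.nn.parallel.DistributedDataParallel recommends one GPU per process for best performance. In that case DistributedDataParallel does not spawn threads internally, so the use of autocast and GradScaler is not affected.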
3.3 DDP multiple GPUs per process
Here torch.nn.parallel.DistributedDataParallel may, like torch.nn.DataParallel, spawn a thread on each device to run the forward pass. The fix is the same as for DataParallel; see section 3.1.