PyTorch Multi-GPU Training

Author: 顾北向南 | Published 2021-07-10 22:59

    References:

    https://fyubang.com/2019/07/23/distributed-training3/

    PyTorch multi-GPU parallel training – Zhihu (zhihu.com)

    PyTorch DistributedDataParallel usage notes – yuanye_yuanye's blog (CSDN)

    GitHub - tczhangzhi/pytorch-distributed: A quickstart and benchmark for pytorch distributed training.

    1. Parallel training with DistributedDataParallel

    • Launch command: CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 torch_ddp.py
    • DistributedDataParallel runs faster than DataParallel and distributes GPU memory more evenly across the cards. The launcher starts one process per GPU and hands each process its local rank; a sketch of how to read that rank follows this list.
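    The script below does not parse the local rank explicitly (it reuses the global rank, which only coincides with the local rank on a single node). Below is a minimal, hedged sketch of how the rank is usually obtained, assuming the torch.distributed.launch / torchrun convention of either passing a --local_rank argument or exporting a LOCAL_RANK environment variable (which of the two applies depends on the PyTorch version); get_local_rank is a hypothetical helper name:

    import argparse
    import os

    def get_local_rank() -> int:
        """Read the local rank handed out by the launcher.

        Older torch.distributed.launch passes it as a --local_rank argument;
        newer launchers (torchrun, launch --use_env) export LOCAL_RANK instead.
        """
        parser = argparse.ArgumentParser()
        parser.add_argument("--local_rank", type=int, default=-1)
        args, _ = parser.parse_known_args()
        if args.local_rank >= 0:
            return args.local_rank
        return int(os.environ.get("LOCAL_RANK", "0"))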
    import torch
    import torch.nn as nn
    from torch.utils.data import Dataset, DataLoader
    from torch.utils.data.distributed import DistributedSampler

    # 1) Initialize the process group; torch.distributed.launch sets the
    #    RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT environment variables it reads
    torch.distributed.init_process_group(backend="nccl")
    
    input_size = 5
    output_size = 2
    batch_size = 30
    data_size = 90
    
    # 2) Pin each process to its own GPU. get_rank() returns the global rank;
    #    on a single node this equals the local rank (for multi-node training,
    #    use the local rank provided by the launcher instead)
    local_rank = torch.distributed.get_rank()
    torch.cuda.set_device(local_rank)
    device = torch.device("cuda", local_rank)
    
    class RandomDataset(Dataset):
        def __init__(self, size, length):
            self.len = length
            # Keep the data on the CPU; each batch is moved to this process's GPU in the loop
            self.data = torch.randn(length, size)
    
        def __getitem__(self, index):
            return self.data[index]
    
        def __len__(self):
            return self.len
    
    dataset = RandomDataset(input_size, data_size)
    # 3) Use DistributedSampler so each process gets a distinct shard of the data
    rand_loader = DataLoader(dataset=dataset,
                             batch_size=batch_size,
                             sampler=DistributedSampler(dataset))
    
    class Model(nn.Module):
        def __init__(self, input_size, output_size):
            super(Model, self).__init__()
            self.fc = nn.Linear(input_size, output_size)
    
        def forward(self, input):
            output = self.fc(input)
            print("  In Model: input size", input.size(),
                  "output size", output.size())
            return output
        
    model = Model(input_size, output_size)
    
    # 4) Move the model to its GPU before wrapping it
    model.to(device)
        
    # 5) Wrap the model with DistributedDataParallel; under DDP each process
    #    drives exactly one GPU, identified by its local rank
    model = torch.nn.parallel.DistributedDataParallel(model,
                                                      device_ids=[local_rank],
                                                      output_device=local_rank)
       
    for data in rand_loader:
        # Each process sees only its shard of the data; move the batch to its GPU
        input_var = data.to(device)
        output = model(input_var)
        print("Outside: input size", input_var.size(), "output_size", output.size())
    
    • When the model contains BatchNormalization layers, distributed training with DistributedDataParallel is problematic: each process computes batch statistics only over its own local mini-batch.
    • Workarounds: 1) remove the BN layers; 2) replace them with SyncBatchNorm, which synchronizes the statistics across processes (a sketch follows below).
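    The original post does not show the conversion code. A minimal sketch, assuming the stock torch.nn.SyncBatchNorm.convert_sync_batchnorm helper (which swaps every BatchNorm*d layer for SyncBatchNorm) and reusing device and local_rank from the script above; bn_model is a toy stand-in for a real network:

    import torch
    import torch.nn as nn

    # Toy model with a BatchNorm layer
    bn_model = nn.Sequential(nn.Linear(5, 8), nn.BatchNorm1d(8), nn.ReLU(), nn.Linear(8, 2))

    # Replace all BatchNorm*d layers with SyncBatchNorm, then move and wrap as before
    bn_model = nn.SyncBatchNorm.convert_sync_batchnorm(bn_model)
    bn_model.to(device)
    bn_model = torch.nn.parallel.DistributedDataParallel(bn_model,
                                                         device_ids=[local_rank],
                                                         output_device=local_rank)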
