PyTorch Pocket Reference Notes, Part 11

By 深思海数_willschang | Published 2021-08-27 10:06
PyTorch Pocket Reference

Chapter 6: PyTorch Acceleration and Optimization (Performance), Part 2

Model Parallel Processing

The model is partitioned into several pieces across the GPUs, and batches of data are streamed through those pieces so that the computation runs in parallel.

Model parallel processing is often reserved for cases in which the model does not fit on a single GPU.

Figure: Model parallel processing

As the figure above shows, the model is split into N parts (where N is the number of GPUs). The input data is fed to the GPUs in a pipelined fashion, so only the first batches run sequentially; after that, all GPUs compute in parallel.

When we pipeline the data, only the first N batches are run in sequence, and then each subsequent run activates all the GPUs.

Model parallelism is not as straightforward as data parallelism: the model itself has to be restructured. You must define how the model is split across devices and how the data pipeline is built.
Example code:

import torch
from torch import nn, optim
from torchvision.models import AlexNet


class TwoGPUAlexNet(AlexNet):
    """
    Subclass of torchvision's AlexNet. In __init__() we describe which
    pieces of the model go on GPU 0 and which go on GPU 1; in forward()
    we pipeline micro-batches through the two GPUs.
    """
    def __init__(self, num_classes=1000, split_size=20):
        super().__init__(num_classes=num_classes)
        # First pipeline stage on GPU 0: convolutional features,
        # pooling, and flattening.
        self.seq1 = nn.Sequential(self.features, self.avgpool,
                                  nn.Flatten()).to('cuda:0')
        # Second pipeline stage on GPU 1: the fully connected classifier.
        self.seq2 = self.classifier.to('cuda:1')
        self.split_size = split_size

    def forward(self, x):
        # Split the batch into micro-batches and pipeline them: while
        # GPU 1 classifies micro-batch i, GPU 0 is already processing
        # micro-batch i+1.
        splits = iter(x.split(self.split_size, dim=0))
        s_next = next(splits)
        s_prev = self.seq1(s_next).to('cuda:1')
        ret = []

        for s_next in splits:
            ret.append(self.seq2(s_prev))
            s_prev = self.seq1(s_next).to('cuda:1')

        ret.append(self.seq2(s_prev))
        return torch.cat(ret)

model = TwoGPUAlexNet()
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.001)

# Train the model; inputs start on the first GPU, while the labels must
# live on the last GPU (cuda:1), where the model's outputs end up.
for epoch in range(n_epochs):
    for inputs, labels in dataloader:
        # Send the inputs and labels to their respective GPUs
        inputs, labels = inputs.to('cuda:0'), labels.to('cuda:1')
        optimizer.zero_grad()
        outputs = model(inputs)
        loss_fn(outputs, labels).backward()
        optimizer.step()
Combining Data and Model Parallelism

In this case, you will wrap your model using DDP to distribute your data batches among multiple processes. Each process will use multiple GPUs, and your model will be partitioned among each of those GPUs.

Compared with the previous pattern, two changes are needed:

  1. Modify the multi-GPU model class so that it accepts the devices as inputs.
  2. The output-device setting in forward() can be omitted, because DDP decides on its own where input and output data are placed.

class Simple2GPUModel(nn.Module):
    """A toy model split across two GPUs: net1 lives on dev0, net2 on dev1."""
    def __init__(self, dev0, dev1):
        super(Simple2GPUModel, self).__init__()
        self.dev0 = dev0
        self.dev1 = dev1
        self.net1 = torch.nn.Linear(10, 10).to(dev0)
        self.relu = torch.nn.ReLU()
        self.net2 = torch.nn.Linear(10, 5).to(dev1)

    def forward(self, x):
        x = x.to(self.dev0)
        x = self.relu(self.net1(x))
        x = x.to(self.dev1)  # move activations to the second GPU
        return self.net2(x)

    
def model_parallel_training(rank, world_size):
    print("Running DDP with a model-parallel model")
    setup(rank, world_size)
    # Set up the model and the two devices used by this process.
    dev0 = rank * 2
    dev1 = rank * 2 + 1
    mp_model = Simple2GPUModel(dev0, dev1)
    # Wrap the model in DDP. When a module spans multiple GPUs,
    # device_ids must not be set.
    ddp_mp_model = DDP(mp_model)
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_mp_model.parameters(), lr=0.001)

    for epoch in range(n_epochs):
        for inputs, labels in dataloader:
            # Move the inputs and labels to the appropriate devices.
            inputs = inputs.to(dev0)
            labels = labels.to(dev1)
            optimizer.zero_grad()
            # The output is on dev1
            outputs = ddp_mp_model(inputs)
            loss = loss_fn(outputs, labels)
            loss.backward()
            optimizer.step()

    cleanup()
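
The setup() and cleanup() helpers and the per-process launch are referenced above but not defined. Below is a minimal sketch of what they might look like, assuming a single machine, the nccl backend, and illustrative MASTER_ADDR/MASTER_PORT values; none of this is taken verbatim from the book.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def setup(rank, world_size):
    # Rendezvous address for the process group (illustrative values).
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    # nccl is the usual backend for GPU training.
    dist.init_process_group('nccl', rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

if __name__ == '__main__':
    # Each process drives two GPUs (dev0, dev1), so we spawn half as
    # many processes as there are GPUs; mp.spawn passes the rank as
    # the first argument to model_parallel_training.
    world_size = torch.cuda.device_count() // 2
    mp.spawn(model_parallel_training, args=(world_size,),
             nprocs=world_size, join=True)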
Distributed Training Across Multiple Machines

PyTorch’s distributed subpackage, torch.distributed, provides a rich set of capabilities to suit a variety of training architectures and hardware platforms.
The torch.distributed subpackage consists of three components: DDP, RPC-based distributed training (RPC), and collective communication (c10d).

I have not worked with this part yet, so this is just a note for the record; I will expand it once I actually use it in practice.
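
As a placeholder until then, here is a minimal sketch of how a multi-machine DDP job is typically wired up. The use of torchrun and its RANK/WORLD_SIZE/LOCAL_RANK environment variables, the nccl backend, and the toy model are assumptions for illustration, not taken from the book.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for every process;
    # init_process_group reads MASTER_ADDR/MASTER_PORT from the environment.
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 5).to(local_rank)  # toy single-GPU model
    ddp_model = DDP(model, device_ids=[local_rank])

    # ... build a DataLoader with a DistributedSampler and train as usual ...

    dist.destroy_process_group()

if __name__ == '__main__':
    main()

Each machine would launch this script with something like torchrun --nnodes=<machines> --nproc_per_node=<GPUs per machine> --node_rank=<id> --master_addr=<host> --master_port=<port> train.py. RPC-based training and the lower-level c10d collectives are separate APIs not covered here.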
