Deep Learning Notes (8): CNN-2

1. Experiment Content and Procedure

1.1 Requirements

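1. Drawing on the lecture material, gain a thorough understanding of the structures of DenseNet and ResNeXt and of the working mechanism of FaceNet, and understand how Triplet loss works.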



1.2 Prerequisites


2. Understand the structure of the Residual Block.

3. Triplet loss

1.3 Experiment Content


2. Train a simple FaceNet network with Triplet loss, and use kNN for classification.



2. DenseNet and ResNeXt






2.1 Building the DenseNet network


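Within a DenseBlock, every non-linear transformation H outputs a constant number of channels, Growth_rate, so the input of the i-th layer has k + (i-1) * Growth_rate channels, where k is the number of channels of the Input. For example, with Growth_rate set to 4, H1 in the figure above takes an input of size 8 * 32 * 32 and outputs 4 * 32 * 32; H2 then takes an input of size 12 * 32 * 32 and still outputs 4 * 32 * 32, and likewise for H3 and H4. In experiments, a fairly small Growth_rate is enough to achieve good results.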
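The worked example above can be checked with a few lines of plain Python. This is only a sketch of the channel bookkeeping; the helper name `dense_block_channels` is hypothetical and does not appear in the original code.

```python
def dense_block_channels(k, growth_rate, num_layers):
    """Return (input_channels, output_channels) for layers H_1..H_n of a DenseBlock.

    k is the channel count of the block input. Every H_i emits growth_rate
    channels, and its input is the concatenation of the block input with the
    outputs of all earlier layers.
    """
    shapes = []
    for i in range(1, num_layers + 1):
        in_channels = k + (i - 1) * growth_rate
        shapes.append((in_channels, growth_rate))
    return shapes


# Example from the text: k = 8, Growth_rate = 4, four layers H1..H4.
for i, (c_in, c_out) in enumerate(dense_block_channels(8, 4, 4), start=1):
    print(f"H{i}: input {c_in} channels, output {c_out} channels")
```

The spatial size (32 * 32 in the example) is left out because it stays constant inside a DenseBlock.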

Transition Layer

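Note that within a DenseBlock the feature size does not change, because feature maps from different layers have to be concatenated, which requires them to share the same feature size. Down sampling is therefore applied between adjacent DenseBlocks to enlarge the receptive field, and this is implemented by a Transition Layer. A typical Transition Layer consists of BN, Conv and Avg_pool, and it also reduces the number of channels; the compression rate is usually 0.5, i.e. it halves the channels.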


For example, if block1 outputs c * w * h = 24 * 32 * 32, then after the transition the input of block2 is 12 * 16 * 16.
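The shape change in this example can be reproduced as follows. This is a rough sketch that assumes a compression rate of 0.5 and 2 * 2 average pooling, as in the `Transition` class defined later; the function name `transition_output_shape` is made up for illustration.

```python
import math


def transition_output_shape(c, h, w, reduction=0.5, pool=2):
    """Shape (c, h, w) after a Transition Layer.

    The 1x1 convolution scales the channel count by `reduction`, and average
    pooling with kernel size and stride `pool` shrinks the spatial size.
    """
    out_c = int(math.floor(c * reduction))
    return out_c, h // pool, w // pool


# block1 output 24 * 32 * 32 becomes the block2 input.
print(transition_output_shape(24, 32, 32))
```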



... down to 4 * Growth_rate.

2.2 Defining the network

# Load necessary modules here
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.backends.cudnn as cudnn
import os

class Bottleneck(nn.Module):
    """
    The bottleneck mentioned above, consisting of two conv layers: one with a 1×1 kernel, the other with a 3×3 kernel.

    After the non-linear operations, the input is concatenated to the output.
    """
    def __init__(self, in_planes, growth_rate):
        super(Bottleneck, self).__init__()
        self.bn1 = nn.BatchNorm2d(in_planes)
        self.conv1 = nn.Conv2d(in_planes, 4*growth_rate, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(4*growth_rate)
        self.conv2 = nn.Conv2d(4*growth_rate, growth_rate, kernel_size=3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        # input and output are concatenated here
        out = torch.cat([out,x], 1)
        return out

class Transition(nn.Module):
    """
    The transition layer down-samples the feature maps.
    When the compression rate is 0.5, out_planes is half of in_planes.
    """
    def __init__(self, in_planes, out_planes):
        super(Transition, self).__init__()
        self.bn = nn.BatchNorm2d(in_planes)
        self.conv = nn.Conv2d(in_planes, out_planes, kernel_size=1, bias=False)

    def forward(self, x):
        out = self.conv(F.relu(self.bn(x)))
        # use average pooling to halve the spatial size of the feature map
        out = F.avg_pool2d(out, 2)
        return out

class DenseNet(nn.Module):
    def __init__(self, block, nblocks, growth_rate=12, reduction=0.5, num_classes=10):
        """
        Args:
            block: bottleneck
            nblocks: a list whose elements are the number of bottlenecks in each DenseBlock
            growth_rate: channel size of the bottleneck's output
            reduction: compression rate of the transition layers
        """
        super(DenseNet, self).__init__()
        self.growth_rate = growth_rate

        num_planes = 2*growth_rate
        self.conv1 = nn.Conv2d(3, num_planes, kernel_size=3, padding=1, bias=False)
        # a DenseBlock and a transition layer
        self.dense1 = self._make_dense_layers(block, num_planes, nblocks[0])
        num_planes += nblocks[0]*growth_rate
        # channels accumulate through concatenation; multiply by reduction (the compression rate) to cut them down
        out_planes = int(math.floor(num_planes*reduction))
        self.trans1 = Transition(num_planes, out_planes)
        num_planes = out_planes
        # a DenseBlock and a transition layer
        self.dense2 = self._make_dense_layers(block, num_planes, nblocks[1])
        num_planes += nblocks[1]*growth_rate
        # channels accumulate through concatenation; multiply by reduction (the compression rate) to cut them down
        out_planes = int(math.floor(num_planes*reduction))
        self.trans2 = Transition(num_planes, out_planes)
        num_planes = out_planes

        # a DenseBlock and a transition layer
        self.dense3 = self._make_dense_layers(block, num_planes, nblocks[2])
        num_planes += nblocks[2]*growth_rate
        # channels accumulate through concatenation; multiply by reduction (the compression rate) to cut them down
        out_planes = int(math.floor(num_planes*reduction))
        self.trans3 = Transition(num_planes, out_planes)
        num_planes = out_planes

        # only one DenseBlock 
        self.dense4 = self._make_dense_layers(block, num_planes, nblocks[3])
        num_planes += nblocks[3]*growth_rate

        # the last part is a linear layer as a classifier
        self.bn = nn.BatchNorm2d(num_planes)
        self.linear = nn.Linear(num_planes, num_classes)

    def _make_dense_layers(self, block, in_planes, nblock):
        layers = []
        # number of non-linear transformations in one DenseBlock depends on the parameter you set
        for i in range(nblock):
            layers.append(block(in_planes, self.growth_rate))
            in_planes += self.growth_rate
        return nn.Sequential(*layers)

    def forward(self, x):
        out = self.conv1(x)
        out = self.trans1(self.dense1(out))
        out = self.trans2(self.dense2(out))
        out = self.trans3(self.dense3(out))
        out = self.dense4(out)
        out = F.avg_pool2d(F.relu(self.bn(out)), 4)
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        return out

def densenet():
    return DenseNet(Bottleneck, [2, 5, 4, 6])
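To see how `num_planes` evolves inside `DenseNet.__init__`, the arithmetic can be replayed without building any layers. This is a sketch only; `densenet_channel_trace` is a hypothetical helper that mirrors the constructor's bookkeeping under the default `growth_rate=12` and `reduction=0.5`.

```python
import math


def densenet_channel_trace(nblocks, growth_rate=12, reduction=0.5):
    """Replay the channel arithmetic of DenseNet.__init__ as (name, channels) pairs."""
    num_planes = 2 * growth_rate              # output channels of conv1
    trace = [("conv1", num_planes)]
    for stage, n in enumerate(nblocks, start=1):
        num_planes += n * growth_rate         # each Bottleneck appends growth_rate channels
        trace.append((f"dense{stage}", num_planes))
        if stage < len(nblocks):              # no Transition after the last DenseBlock
            num_planes = int(math.floor(num_planes * reduction))
            trace.append((f"trans{stage}", num_planes))
    return trace


for name, channels in densenet_channel_trace([2, 5, 4, 6]):
    print(f"{name}: {channels} channels")
```

The last entry of the trace is the channel count fed to the final `BatchNorm2d` and `Linear` layers.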

Question 1


Answer 1

Every Bottleneck block contains 2 Conv2d layers, every Transition block contains 1 Conv2d layer, and _make_dense_layers builds nblock Bottleneck blocks. The total number of Conv2d layers is therefore 2 × (2+5+4+6) = 34 in the Bottlenecks, plus 3 in the Transitions, plus 1 for the initial conv1, i.e. 38. Adding the single linear layer gives 39 layers.
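The counting rule in this answer can be written down as a short function. It is a sketch under the same convention (count every Conv2d plus the final Linear layer, ignore BatchNorm and pooling); the name `count_densenet_layers` is hypothetical.

```python
def count_densenet_layers(nblocks):
    """Count Conv2d and Linear layers of the DenseNet defined above."""
    convs_in_bottlenecks = 2 * sum(nblocks)   # a 1x1 and a 3x3 conv per Bottleneck
    convs_in_transitions = len(nblocks) - 1   # one 1x1 conv per Transition
    stem_conv = 1                             # the initial conv1
    linear = 1                                # the classifier
    return convs_in_bottlenecks + convs_in_transitions + stem_conv + linear


print(count_densenet_layers([2, 5, 4, 6]))
```

By the same rule, the configuration [2, 2, 4, 16] has 52 convolution layers plus the linear layer.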

def densenet52():
    return  DenseNet(Bottleneck, [2, 2, 4, 16])

2.3 Training and testing

import torchvision
import torchvision.transforms as transforms
from torch.autograd import Variable

def train(epoch, model, lossFunction, optimizer, device, trainloader):
    """Train the model for one epoch using lossFunction and optimizer.

    Args:
        epoch: index of the current epoch (used for logging)
        model: prediction model
        lossFunction: loss function measuring the distance between targets and outputs
        optimizer: optimizer that updates the model parameters
        device: device on which the computation runs
        trainloader: training data loader
    """
    print('\nEpoch: %d' % epoch)
    model.train()     # enter train mode
    train_loss = 0    # accumulated batch loss over the epoch
    correct = 0       # number of correct predictions on the training set
    total = 0         # total number of predictions on the training set
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = inputs.to(device), targets.to(device) # move the batch to the target device
        optimizer.zero_grad()            # clear gradients of all optimized tensors
        outputs = model(inputs)          # forward propagation returns the logits (CrossEntropyLoss applies log-softmax itself)
        loss = lossFunction(outputs, targets) # compute loss
        loss.backward()                  # compute gradient of loss over parameters
        optimizer.step()                 # update parameters with gradient descent

        train_loss += loss.item()        # accumulate the batch loss over the epoch
        _, predicted = outputs.max(1)    # make prediction according to the outputs
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item() # count how many predictions are correct
        if (batch_idx+1) % 100 == 0:
            # print loss and acc
            print( 'Train loss: %.3f | Train Acc: %.3f%% (%d/%d)'
                % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))
    print( 'Train loss: %.3f | Train Acc: %.3f%% (%d/%d)'
                % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))

def test(model, lossFunction, optimizer, device, testloader):
    """Evaluate the model's prediction performance on testloader.

    Args:
        model: prediction model
        lossFunction: loss function measuring the distance between targets and outputs
        optimizer: unused here; kept so the signature mirrors train()
        device: device on which the computation runs
        testloader: data loader for evaluation
    """
    global best_acc
    model.eval() #enter test mode
    test_loss = 0 # accumulate every batch loss in a epoch
    correct = 0
    total = 0
    with torch.no_grad():
        for batch_idx, (inputs, targets) in enumerate(testloader):
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            loss = lossFunction(outputs, targets) #compute loss

            test_loss += loss.item() # accumulate every batch loss in a epoch
            _, predicted = outputs.max(1) # make prediction according to the outputs
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item() # count how many predictions is correct
        # print loss and acc
        print('Test Loss: %.3f  | Test Acc: %.3f%% (%d/%d)'
            % (test_loss/(batch_idx+1), 100.*correct/total, correct, total))

def data_loader():
    # define the preprocessing applied to the training data
    transform_train = transforms.Compose([
        transforms.RandomCrop(32, padding=4),
        # convert the PIL image to a tensor, as required by Normalize
        transforms.ToTensor(),
        # normalize a tensor image with per-channel mean and standard deviation
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])

    # define the preprocessing applied to the evaluation data
    transform_test = transforms.Compose([
        transforms.ToTensor(),
        # normalize a tensor image with per-channel mean and standard deviation
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
    ])

    # prepare dataset by ImageFolder, data should be classified by directory
    trainset = torchvision.datasets.ImageFolder(root='./mnist/train', transform=transform_train)

    testset = torchvision.datasets.ImageFolder(root='./mnist/test', transform=transform_test)

    # Data loader: combines a dataset and a sampler, and provides an iterable over the dataset.

    trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

    testloader = torch.utils.data.DataLoader(testset, batch_size=100, shuffle=False)
    return trainloader, testloader

def run(model, num_epochs):
    # load the model onto the GPU if one is available, otherwise fall back to the CPU
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    model = model.to(device)
    if device == 'cuda':
        model = torch.nn.DataParallel(model)
        cudnn.benchmark = True

    # define the loss function and optimizer

    lossFunction = nn.CrossEntropyLoss()
    lr = 0.01
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)

    trainloader, testloader = data_loader()
    for epoch in range(num_epochs):
        train(epoch, model, lossFunction, optimizer, device, trainloader)
        test(model, lossFunction, optimizer, device, testloader)
        if (epoch + 1) % 50 == 0 :
            lr = lr / 10
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr


# start training and testing
model = densenet52()
# num_epochs is adjustable
run(model, num_epochs=20)
Epoch: 0
Train loss: 2.238 | Train Acc: 18.300% (366/2000)
Test Loss: 2.308  | Test Acc: 10.000% (100/1000)

Epoch: 1
Train loss: 1.825 | Train Acc: 39.500% (790/2000)
Test Loss: 1.612  | Test Acc: 36.100% (361/1000)

Epoch: 2
Train loss: 1.333 | Train Acc: 58.400% (1168/2000)
Test Loss: 1.724  | Test Acc: 46.900% (469/1000)

Epoch: 3
Train loss: 0.908 | Train Acc: 71.500% (1430/2000)
Test Loss: 0.779  | Test Acc: 75.100% (751/1000)

Epoch: 4
Train loss: 0.667 | Train Acc: 79.250% (1585/2000)
Test Loss: 0.878  | Test Acc: 71.700% (717/1000)

Epoch: 5
Train loss: 0.596 | Train Acc: 78.500% (1570/2000)
Test Loss: 0.765  | Test Acc: 70.000% (700/1000)

Epoch: 6
Train loss: 0.496 | Train Acc: 82.500% (1650/2000)
Test Loss: 0.550  | Test Acc: 78.000% (780/1000)

Epoch: 7
Train loss: 0.444 | Train Acc: 85.250% (1705/2000)
Test Loss: 0.697  | Test Acc: 73.200% (732/1000)

Epoch: 8
Train loss: 0.450 | Train Acc: 85.600% (1712/2000)
Test Loss: 0.574  | Test Acc: 78.000% (780/1000)

Epoch: 9
Train loss: 0.407 | Train Acc: 86.100% (1722/2000)
Test Loss: 0.421  | Test Acc: 87.800% (878/1000)

Epoch: 10
Train loss: 0.358 | Train Acc: 87.650% (1753/2000)
Test Loss: 0.487  | Test Acc: 81.000% (810/1000)

Epoch: 11
Train loss: 0.319 | Train Acc: 89.850% (1797/2000)
Test Loss: 0.467  | Test Acc: 85.000% (850/1000)

Epoch: 12
Train loss: 0.282 | Train Acc: 91.400% (1828/2000)
Test Loss: 0.904  | Test Acc: 70.000% (700/1000)

Epoch: 13
Train loss: 0.258 | Train Acc: 92.150% (1843/2000)
Test Loss: 0.276  | Test Acc: 92.100% (921/1000)

Epoch: 14
Train loss: 0.214 | Train Acc: 94.000% (1880/2000)
Test Loss: 0.411  | Test Acc: 87.400% (874/1000)

Epoch: 15
Train loss: 0.238 | Train Acc: 93.300% (1866/2000)
Test Loss: 0.313  | Test Acc: 90.900% (909/1000)

Epoch: 16
Train loss: 0.234 | Train Acc: 93.200% (1864/2000)
Test Loss: 0.215  | Test Acc: 93.100% (931/1000)

Epoch: 17
Train loss: 0.208 | Train Acc: 94.150% (1883/2000)
Test Loss: 0.208  | Test Acc: 93.500% (935/1000)

Epoch: 18
Train loss: 0.217 | Train Acc: 93.700% (1874/2000)
Test Loss: 0.199  | Test Acc: 94.000% (940/1000)

Epoch: 19
Train loss: 0.183 | Train Acc: 94.150% (1883/2000)
Test Loss: 0.183  | Test Acc: 94.200% (942/1000)

2.4 Building the ResNeXt network


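cardinality refers to the number of repeat layers; in the figure below, the right-hand side has a cardinality of 32. The left side is the basic structure of ResNet with an input channel size of 64, and the right side is the basic structure of ResNeXt with an input channel size of 128, yet the two have a similar number of parameters.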
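The claim about similar parameter counts can be verified with a quick calculation. The sketch below makes several assumptions that are not stated in the text: both blocks take and return 256 channels (as in the block comparison of the ResNeXt paper), the 64 and 128 above are read as the widths of the 3x3 convolution, and biases and BatchNorm parameters are ignored. `conv_params` is a hypothetical helper.

```python
def conv_params(c_in, c_out, kernel, groups=1):
    """Weight count of a bias-free 2D convolution.

    With grouped convolution each output channel only sees c_in / groups
    input channels, so the count is divided by `groups`.
    """
    return (c_in // groups) * c_out * kernel * kernel


# ResNet bottleneck: 256 -> 64 (1x1), 64 -> 64 (3x3), 64 -> 256 (1x1)
resnet_params = (conv_params(256, 64, 1)
                 + conv_params(64, 64, 3)
                 + conv_params(64, 256, 1))

# ResNeXt block with cardinality 32 and 4 channels per group (width 128)
resnext_params = (conv_params(256, 128, 1)
                  + conv_params(128, 128, 3, groups=32)
                  + conv_params(128, 256, 1))

print(resnet_params, resnext_params)
```

Under these assumptions both blocks come out at roughly 70k weights, even though the ResNeXt block is twice as wide in the middle.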

ResNeXt Block

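There are three equivalent forms of the ResNeXt Block, as shown in the figure below. (a) is the basic ResNeXt unit; merging the 1x1 convolutions at the output yields the equivalent network (b), whose structure is similar to Inception-ResNet.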



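The figure below shows the network structure of ResNeXt-50 (32x4d). The convolutional and fully connected layers total 50; 32 is the cardinality, and 4d means that each repeat layer has 4 channels, so the whole block has 32 x 4 = 128 channels.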

class Block(nn.Module):
    """Grouped convolution block (c)."""
    expansion = 2

    def __init__(self, in_planes, cardinality=32, bottleneck_width=4, stride=1):
        """
        Args:
            in_planes: channel size of input
            cardinality: number of groups
            bottleneck_width: channel size of each group
            stride: stride of the 3×3 grouped convolution
        """
        super(Block, self).__init__()
        group_width = cardinality * bottleneck_width
        self.conv1 = nn.Conv2d(in_planes, group_width, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(group_width)
        # split the channels into `cardinality` groups (32 by default)
        self.conv2 = nn.Conv2d(group_width, group_width, kernel_size=3, stride=stride, padding=1, groups=cardinality, bias=False)
        self.bn2 = nn.BatchNorm2d(group_width)
        self.conv3 = nn.Conv2d(group_width, self.expansion*group_width, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(self.expansion*group_width)

        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != self.expansion*group_width:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, self.expansion*group_width, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(self.expansion*group_width)
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out += self.shortcut(x)
        out = F.relu(out)
        return out

class ResNeXt(nn.Module):
    def __init__(self, num_blocks, cardinality, bottleneck_width, num_classes=10):
        """
        Args:
            num_blocks: a list with the number of blocks in each stage
            cardinality: number of groups
            bottleneck_width: channel size of each group in the first stage
            num_classes: number of output classes
        """
        super(ResNeXt, self).__init__()
        self.cardinality = cardinality
        self.bottleneck_width = bottleneck_width
        self.in_planes = 64

        self.conv1 = nn.Conv2d(3, 64, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        # size 32x32
        self.layer1 = self._make_layer(num_blocks[0], 1)
        # size 32x32
        self.layer2 = self._make_layer(num_blocks[1], 2)
        # size 16x16
        self.layer3 = self._make_layer(num_blocks[2], 2)
        # size 8x8
        self.linear = nn.Linear(cardinality*bottleneck_width*8, num_classes)

    def _make_layer(self, num_blocks, stride):
        strides = [stride] + [1]*(num_blocks-1)
        layers = []
        for stride in strides:
            layers.append(Block(self.in_planes, self.cardinality, self.bottleneck_width, stride))
            self.in_planes = Block.expansion * self.cardinality * self.bottleneck_width
        # Double bottleneck_width after each stage.
        self.bottleneck_width *= 2
        return nn.Sequential(*layers)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.layer3(out)
        out = F.avg_pool2d(out, 8)
        out = out.view(out.size(0), -1)
        out = self.linear(out)
        return out
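The input width of the final `nn.Linear` (`cardinality*bottleneck_width*8`) follows from the per-stage doubling in `_make_layer`. The sketch below replays that arithmetic; `resnext_stage_widths` is a hypothetical helper, and it assumes three stages and `expansion = 2` as in the `Block` class above.

```python
def resnext_stage_widths(cardinality, bottleneck_width, num_stages=3, expansion=2):
    """Return (group_width, output_channels) for each stage of the ResNeXt above."""
    widths = []
    for _ in range(num_stages):
        group_width = cardinality * bottleneck_width
        widths.append((group_width, expansion * group_width))
        bottleneck_width *= 2                 # doubled after every stage
    return widths


# Configuration used for ResNeXt([3, 3, 3], 16, 8).
for stage, (group_width, out_channels) in enumerate(resnext_stage_widths(16, 8), start=1):
    print(f"layer{stage}: group width {group_width}, output channels {out_channels}")
```

The output channel count of the last stage equals `cardinality * bottleneck_width * 8`, which is why the classifier is sized that way.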

Question 2


ResNeXt32_16x8d = ResNeXt([3, 3, 3], 16, 8)
run(ResNeXt32_16x8d, num_epochs=20)
Epoch: 0
Train loss: 2.167 | Train Acc: 17.650% (353/2000)
Test Loss: 4.626  | Test Acc: 10.000% (100/1000)

Epoch: 1
Train loss: 1.719 | Train Acc: 37.950% (759/2000)
Test Loss: 1.829  | Test Acc: 36.000% (360/1000)

Epoch: 2
Train loss: 1.139 | Train Acc: 59.450% (1189/2000)
Test Loss: 1.462  | Test Acc: 47.800% (478/1000)

Epoch: 3
Train loss: 0.768 | Train Acc: 73.500% (1470/2000)
Test Loss: 1.732  | Test Acc: 57.800% (578/1000)

Epoch: 4
Train loss: 0.602 | Train Acc: 78.850% (1577/2000)
Test Loss: 1.518  | Test Acc: 60.900% (609/1000)

Epoch: 5
Train loss: 0.522 | Train Acc: 83.050% (1661/2000)
Test Loss: 1.213  | Test Acc: 67.700% (677/1000)

Epoch: 6
Train loss: 0.369 | Train Acc: 88.500% (1770/2000)
Test Loss: 0.349  | Test Acc: 90.400% (904/1000)

Epoch: 7
Train loss: 0.312 | Train Acc: 90.150% (1803/2000)
Test Loss: 0.369  | Test Acc: 89.300% (893/1000)

Epoch: 8
Train loss: 0.272 | Train Acc: 92.050% (1841/2000)
Test Loss: 0.476  | Test Acc: 85.800% (858/1000)

Epoch: 9
Train loss: 0.257 | Train Acc: 92.500% (1850/2000)
Test Loss: 0.286  | Test Acc: 91.600% (916/1000)

Epoch: 10
Train loss: 0.189 | Train Acc: 94.400% (1888/2000)
Test Loss: 0.786  | Test Acc: 80.600% (806/1000)

Epoch: 11
Train loss: 0.171 | Train Acc: 95.050% (1901/2000)
Test Loss: 0.615  | Test Acc: 80.000% (800/1000)

Epoch: 12
Train loss: 0.130 | Train Acc: 96.400% (1928/2000)
Test Loss: 0.387  | Test Acc: 89.300% (893/1000)

Epoch: 13
Train loss: 0.143 | Train Acc: 95.700% (1914/2000)
Test Loss: 0.240  | Test Acc: 93.500% (935/1000)

Epoch: 14
Train loss: 0.129 | Train Acc: 95.900% (1918/2000)
Test Loss: 0.174  | Test Acc: 95.400% (954/1000)

Epoch: 15
Train loss: 0.100 | Train Acc: 97.200% (1944/2000)
Test Loss: 0.273  | Test Acc: 91.800% (918/1000)

Epoch: 16
Train loss: 0.106 | Train Acc: 96.700% (1934/2000)
Test Loss: 0.190  | Test Acc: 94.600% (946/1000)

Epoch: 17
Train loss: 0.106 | Train Acc: 96.700% (1934/2000)
Test Loss: 0.306  | Test Acc: 91.400% (914/1000)

Epoch: 18
Train loss: 0.094 | Train Acc: 96.900% (1938/2000)
Test Loss: 0.155  | Test Acc: 95.200% (952/1000)

Epoch: 19
Train loss: 0.077 | Train Acc: 97.300% (1946/2000)
Test Loss: 0.170  | Test Acc: 95.500% (955/1000)

Answer 2

ResNet and DenseNet have been shown to share essentially the same path topology, a “dense topology”; they differ only in the form of their connections.

However, the two networks behave somewhat differently in this experiment. Their generalization (test) error is approximately the same (94.2% vs. 95.5% test accuracy at epoch 19), while the empirical (training) error of ResNeXt is lower (97.3% vs. 94.15% training accuracy).

DenseNet has some shortcomings:

  • In general, DenseNet uses much more memory than ResNeXt, because the tensors from different layers are concatenated together. Ref. 1
  • Although DenseNet can reuse features, its dense connections (L(L − 1)/2 for L layers) increase computation time. Ref. 2 The excessive connections not only decrease the network's computation efficiency and parameter efficiency, but also make it more prone to overfitting. Ref. 3

In this situation, ResNeXt enjoys more merits:

  • ResNeXt takes over the shortcut idea of ResNet and adds Inception-style multi-branch transformations as an enhancement. ResNeXt thus inherits the strong modularity and high scalability of its predecessor, ResNet. Ref. 4 Ref. 5



