Neural Network Compression Experiments: Deep Compression

Author: 月见樽 | Published 2018-10-05 17:22

    First published on my personal blog; best read together with my notes on the Deep Compression paper.

    Experiment Setup

    Building the Base Network

    To apply deep compression, we first need a trained deep neural network. For simplicity, we build a small one here: two convolutional layers followed by a two-layer MLP.

    import torch as pt

    class net(pt.nn.Module):
        
        def __init__(self):
            super(net,self).__init__()
            # two 3x3 conv layers; each is followed by 2x2 max pooling in forward()
            self.conv1 = pt.nn.Conv2d(in_channels=1,out_channels=64,kernel_size=3,padding=1)
            self.conv2 = pt.nn.Conv2d(in_channels=64,out_channels=256,kernel_size=3,padding=1)
            # after two 2x2 poolings, a 28x28 MNIST image becomes 7x7x256
            self.fc1 = pt.nn.Linear(in_features=7*7*256,out_features=512)
            self.fc2 = pt.nn.Linear(in_features=512,out_features=10)
            self.pool = pt.nn.MaxPool2d(2)
            
        def forward(self,x):
            x = self.pool(pt.nn.functional.relu(self.conv1(x)))
            x = self.pool(pt.nn.functional.relu(self.conv2(x)))
            x = pt.nn.functional.relu(self.fc1(x.view((-1,7*7*256))))
            return self.fc2(x)
    
    model = net().cuda()
    print(model)
    print(model(pt.rand(1,1,28,28).cuda()))
    
    net(
      (conv1): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (conv2): Conv2d(64, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (fc1): Linear(in_features=12544, out_features=512, bias=True)
      (fc2): Linear(in_features=512, out_features=10, bias=True)
      (pool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    tensor(1.00000e-02 *
           [[-7.7157,  3.0435, -6.5732,  6.5343, -4.2159, -2.8651, -0.6792,
              3.9223, -3.7523,  2.4532]], device='cuda:0')
    

    Training the Base Network

    Preparing the Dataset

    import torchvision as ptv

    train_dataset = ptv.datasets.MNIST("./",download=True,transform=ptv.transforms.ToTensor())
    test_dataset = ptv.datasets.MNIST("./",train=False,transform=ptv.transforms.ToTensor())
    trainloader = pt.utils.data.DataLoader(train_dataset,shuffle=True,batch_size=128)
    testloader = pt.utils.data.DataLoader(test_dataset,shuffle=True,batch_size=128)
    

    Loss Function and Optimizer

    lossfunc = pt.nn.CrossEntropyLoss().cuda()
    optimizer = pt.optim.Adam(model.parameters(),1e-4)
    
    def acc(outputs,label):
        # top-1 accuracy over a batch
        _,data = pt.max(outputs,dim=1)
        return pt.mean((data.float()==label.float()).float()).item()
    

    Training the Network

    for _ in range(1):
        for i,(data,label) in enumerate(trainloader):
            data,label = data.cuda(),label.cuda()
            model.zero_grad()
            outputs = model(data)
            loss = lossfunc(outputs,label)
            loss.backward()
            optimizer.step()
            if i % 100 == 0:
                print(i,acc(outputs,label))
    
    0 0.1171875
    100 0.8984375
    200 0.953125
    300 0.984375
    400 0.96875
    

    Testing the Network

    def test_model(model,testloader):
        result = []
        with pt.no_grad():  # no gradients needed for evaluation
            for data,label in testloader:
                data,label = data.cuda(),label.cuda()
                outputs = model(data)
                result.append(acc(outputs,label))
        result = sum(result) / len(result)
        print(result)
        return result
    test_model(model,testloader)
    
    0.96875
    

    Saving the Network

    pt.save(model.state_dict(),"./base.ptb")
    

    Pruning Experiments

    Pruning is the first step of deep compression. Weights whose magnitude falls below a threshold are set to zero, meaning the connection is removed; during the subsequent fine-tuning, the gradients of pruned connections are also zeroed out, so they no longer participate in training.
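
    Before introducing the full helpers, here is a minimal standalone sketch (a toy illustration of my own, not part of the original experiment) of both steps on a small tensor:

    import torch as pt

    w = pt.tensor([0.5, -0.003, 0.02, -0.8], requires_grad=True)
    with pt.no_grad():
        w[w.abs() < 0.01] = 0.0         # prune: small-magnitude weights become 0

    loss = w.sum()
    loss.backward()
    w.grad *= (w != 0).float()          # mask: pruned weights receive zero gradient
    print(w)        # tensor([ 0.5000,  0.0000,  0.0200, -0.8000], requires_grad=True)
    print(w.grad)   # tensor([1., 0., 1., 1.])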

    Preparing Helper Functions

    The pruning experiments need a few helper functions: a pruning function, a gradient-masking function, and a sparsity evaluation function.

    Pruning Function

    The pruning function takes a model and a threshold, and sets every weight whose absolute value is below the threshold to zero.

    def puring(model,threshold):
        # zero every weight whose magnitude is below the threshold
        for i in model.parameters():
            i.data[pt.abs(i.data) < threshold] = 0
        return model
    

    Gradient Masking Function

    The gradient-masking function zeroes the gradient of every pruned (zero-valued) weight, so that pruned connections receive no updates during fine-tuning.

    def grad_puring(model):
        for i in model.parameters():
            # binary mask: 1 where the weight survived pruning, 0 where it was pruned
            mask = i.data.clone()
            mask[mask != 0] = 1
            i.grad.data.mul_(mask)
    

    Sparsity Evaluation Function

    def print_sparse(model):
        result = []
        total_num = 0
        total_sparse = 0
        print("-----------------------------------")
        print("Layer sparse")
        for name,f in model.named_parameters():
            num = f.view(-1).shape[0]          # total number of weights
            total_num += num
            sparse = pt.nonzero(f).shape[0]    # number of nonzero weights
            total_sparse += sparse
            # note: the value reported is the density, i.e. the fraction of nonzero weights
            print("\t",name,sparse/num)
            result.append(sparse/num)
        total = total_sparse/total_num
        print("Total:",total)
        return total
    

    Pruning

    First, check the sparsity of the trained network.

    model = net().cuda()
    model.load_state_dict(pt.load("./base.ptb"))
    _ = test_model(model,testloader)
    
    0.96875
    
    print_sparse(model)
    
    -----------------------------------
    Layer sparse
         conv1.weight 1.0
         conv1.bias 1.0
         conv2.weight 1.0
         conv2.bias 1.0
         fc1.weight 1.0
         fc1.bias 1.0
         fc2.weight 1.0
         fc2.bias 1.0
    Total: 1.0
    

    The trained network has no sparsity at all: every weight is nonzero. Now prune with a threshold of 0.01, removing every connection whose absolute value is below 0.01. The results show that only about 8.4% of the parameters survive, while accuracy is unaffected.

    model1 = puring(model,0.01)
    test_model(model1,testloader)
    print_sparse(model1)
    
    0.9706289556962026
    -----------------------------------
    Layer sparse
         conv1.weight 0.9739583333333334
         conv1.bias 0.90625
         conv2.weight 0.7641262478298612
         conv2.bias 0.71875
         fc1.weight 0.06729390669842156
         fc1.bias 0.025390625
         fc2.weight 0.7837890625
         fc2.bias 0.9
    Total: 0.08358673475128647
    
    0.08358673475128647
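
    As a rough sanity check on what this density buys (a back-of-the-envelope estimate of my own, not from the original post), assume a naive sparse encoding that stores one 32-bit value plus one 32-bit index per surviving weight; real formats such as the relative indexing used in the paper are cheaper still:

    total = sum(p.numel() for p in model1.parameters())
    nonzero = sum((p.data != 0).sum().item() for p in model1.parameters())

    dense_mb = total * 4 / 1e6                 # dense float32 storage
    sparse_mb = nonzero * (4 + 4) / 1e6        # one value + one index per nonzero weight
    print(dense_mb, sparse_mb, dense_mb / sparse_mb)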
    

    Raising the threshold to 0.1 prunes away nearly all parameters, and accuracy collapses to chance level (about 10%).

    model.load_state_dict(pt.load("./base.ptb"))
    model2 = puring(model,0.1)
    test_model(model2,testloader)
    print_sparse(model2)
    
    0.09760680379746836
    -----------------------------------
    Layer sparse
         conv1.weight 0.671875
         conv1.bias 0.6875
         conv2.weight 0.0
         conv2.bias 0.0
         fc1.weight 0.0
         fc1.bias 0.0
         fc2.weight 0.0
         fc2.bias 0.0
    Total: 6.553616029871108e-05
    
    6.553616029871108e-05
    

    Next, sweep the threshold over a grid from 0.01 to 0.1 in steps of 0.01.

    sparse_list = []
    threshold_list = [x*0.01+0.01 for x in range(10)]   # thresholds 0.01 .. 0.10
    acc_list = []
    for i in threshold_list:
        model.load_state_dict(pt.load("./base.ptb"))    # restore the unpruned weights
        model3 = puring(model,i)
        acc_list.append(test_model(model3,testloader))
        sparse_list.append(print_sparse(model3))
    
    0.9706289556962026
    -----------------------------------
    Layer sparse
         conv1.weight 0.9739583333333334
         conv1.bias 0.90625
         conv2.weight 0.7641262478298612
         conv2.bias 0.71875
         fc1.weight 0.06729390669842156
         fc1.bias 0.025390625
         fc2.weight 0.7837890625
         fc2.bias 0.9
    Total: 0.08358673475128647
    0.47735363924050633
    -----------------------------------
    Layer sparse
         conv1.weight 0.9375
         conv1.bias 0.890625
         conv2.weight 0.5333726671006944
         conv2.bias 0.4765625
         fc1.weight 0.0015011222995057398
         fc1.bias 0.0
         fc2.weight 0.5765625
         fc2.bias 0.7
    Total: 0.01398429139292775
    0.09513449367088607
    -----------------------------------
    Layer sparse
         conv1.weight 0.9045138888888888
         conv1.bias 0.890625
         conv2.weight 0.3156263563368056
         conv2.bias 0.2578125
         fc1.weight 1.5414490991709182e-05
         fc1.bias 0.0
         fc2.weight 0.371875
         fc2.bias 0.4
    Total: 0.007479941525322959
    0.09612341772151899
    -----------------------------------
    Layer sparse
         conv1.weight 0.8732638888888888
         conv1.bias 0.875
         conv2.weight 0.13545735677083334
         conv2.bias 0.0546875
         fc1.weight 0.0
         fc1.bias 0.0
         fc2.weight 0.1615234375
         fc2.bias 0.1
    Total: 0.003250198205069488
    0.09691455696202532
    -----------------------------------
    Layer sparse
         conv1.weight 0.8402777777777778
         conv1.bias 0.84375
         conv2.weight 0.03839111328125
         conv2.bias 0.00390625
         fc1.weight 0.0
         fc1.bias 0.0
         fc2.weight 0.016796875
         fc2.bias 0.0
    Total: 0.0009558243703890901
    0.1003757911392405
    -----------------------------------
    Layer sparse
         conv1.weight 0.8142361111111112
         conv1.bias 0.796875
         conv2.weight 0.0084228515625
         conv2.bias 0.0
         fc1.weight 0.0
         fc1.bias 0.0
         fc2.weight 0.0
         fc2.bias 0.0
    Total: 0.00026792277133719006
    0.09760680379746836
    -----------------------------------
    Layer sparse
         conv1.weight 0.7760416666666666
         conv1.bias 0.765625
         conv2.weight 0.0014580620659722222
         conv2.bias 0.0
         fc1.weight 0.0
         fc1.bias 0.0
         fc2.weight 0.0
         fc2.bias 0.0
    Total: 0.00010811185608441666
    0.09760680379746836
    -----------------------------------
    Layer sparse
         conv1.weight 0.7447916666666666
         conv1.bias 0.734375
         conv2.weight 0.00014241536458333334
         conv2.bias 0.0
         fc1.weight 0.0
         fc1.bias 0.0
         fc2.weight 0.0
         fc2.bias 0.0
    Total: 7.55718600196274e-05
    0.09968354430379747
    -----------------------------------
    Layer sparse
         conv1.weight 0.7065972222222222
         conv1.bias 0.71875
         conv2.weight 0.0
         conv2.bias 0.0
         fc1.weight 0.0
         fc1.bias 0.0
         fc2.weight 0.0
         fc2.bias 0.0
    Total: 6.888139353901653e-05
    0.09760680379746836
    -----------------------------------
    Layer sparse
         conv1.weight 0.671875
         conv1.bias 0.6875
         conv2.weight 0.0
         conv2.bias 0.0
         fc1.weight 0.0
         fc1.bias 0.0
         fc2.weight 0.0
         fc2.bias 0.0
    Total: 6.553616029871108e-05
    
    import matplotlib.pyplot as plt
    plt.figure(figsize=(10,3))
    plt.subplot(131)
    plt.plot(threshold_list,acc_list)       # threshold vs. accuracy
    plt.subplot(132)
    plt.plot(threshold_list,sparse_list)    # threshold vs. sparsity
    plt.subplot(133)
    plt.plot(sparse_list,acc_list)          # sparsity vs. accuracy
    plt.show()
    
    [Figure: output_30_0.png]

    From left to right, the plots show threshold vs. accuracy, threshold vs. sparsity, and sparsity vs. accuracy.

    Fine-tuning After Pruning

    As seen above, at a threshold of about 0.02 accuracy drops to roughly 47%, so we fine-tune the pruned network to recover it.

    model = net().cuda()
    model.load_state_dict(pt.load("./base.ptb"))
    model1 = puring(model,0.02)
    test_model(model1,testloader)
    print_sparse(model1)
    
    0.4759691455696203
    -----------------------------------
    Layer sparse
         conv1.weight 0.9375
         conv1.bias 0.890625
         conv2.weight 0.5333726671006944
         conv2.bias 0.4765625
         fc1.weight 0.0015011222995057398
         fc1.bias 0.0
         fc2.weight 0.5765625
         fc2.bias 0.7
    Total: 0.01398429139292775
    
    optimizer = pt.optim.Adam(model1.parameters(),1e-5)
    lossfunc = pt.nn.CrossEntropyLoss().cuda()
    for _ in range(4):
        for i,(data,label) in enumerate(trainloader):
            data,label = data.cuda(),label.cuda()
            model1.zero_grad()              # clear gradients accumulated by the previous step
            outputs = model1(data)
            loss = lossfunc(outputs,label)
            loss.backward()
            grad_puring(model1)             # zero the gradients of pruned weights
            optimizer.step()
            if i % 100 == 0:
                print(i,acc(outputs,label))
    
    
    0 0.4375
    100 0.4375
    200 0.5625
    300 0.6015625
    400 0.6875
    0 0.7265625
    100 0.6953125
    200 0.7890625
    300 0.8046875
    400 0.7734375
    0 0.8125
    100 0.8046875
    200 0.890625
    300 0.8515625
    400 0.875
    0 0.859375
    100 0.8515625
    200 0.9140625
    300 0.890625
    400 0.9296875
    
    test_model(model1,testloader)
    print_sparse(model1)
    pt.save(model1.state_dict(),'./puring.pt')
    
    0.9367088607594937
    -----------------------------------
    Layer sparse
         conv1.weight 0.9375
         conv1.bias 0.890625
         conv2.weight 0.5333726671006944
         conv2.bias 0.4765625
         fc1.weight 0.0015011222995057398
         fc1.bias 0.0
         fc2.weight 0.5765625
         fc2.bias 0.7
    Total: 0.01398429139292775
    

    After fine-tuning, accuracy recovers to over 90% (93.7% here) while the sparsity pattern is unchanged: grad_puring masks the gradients before every optimizer.step(), and since the Adam optimizer was created after pruning, its momentum buffers cannot move pruned weights either.

    Quantization Experiments

    Quantization is more involved and consists of two steps: the quantization itself, followed by fine-tuning. The quantization step uses scikit-learn's k-means; fine-tuning is implemented directly in PyTorch.
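
    The core weight-sharing idea, as a minimal standalone sketch (my own illustration, not the per-layer loop used below): cluster the surviving weights into 2^bit centroids and replace each weight by its centroid, so that only a small index per weight plus a tiny codebook remain:

    import numpy as np
    from sklearn.cluster import KMeans

    w = np.array([0.11, -0.09, 0.52, 0.48, -0.51, 0.10])    # toy surviving weights
    km = KMeans(n_clusters=4).fit(w.reshape(-1, 1))
    idx = km.predict(w.reshape(-1, 1))                      # 2-bit cluster index per weight
    w_shared = km.cluster_centers_[idx].reshape(-1)         # decode: index -> centroid
    print(idx, w_shared)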

    Quantization

    model = net().cuda()
    model.load_state_dict(pt.load("./puring.pt"))
    test_model(model,testloader)
    
    0.9367088607594937
    
    from sklearn.cluster import KMeans
    import numpy as np
    kmean_list = []
    bit = 2                                  # quantize surviving weights to 2 bits (4 clusters)
    for name,i in model.named_parameters():
        data = i.data.clone().view(-1).cpu().detach().numpy().reshape(-1)
        data = data[data != 0]               # only the surviving (nonzero) weights are quantized
        if data.size < 2 ** bit:             # too few values to cluster (e.g. a fully pruned bias)
            kmean_list.append(None)
            continue
        # linearly spaced initial centroids over [min, max] ("linear initialization")
        init = [x*(np.max(data)-np.min(data))/(2 ** bit) + np.min(data) for x in range(2 ** bit)]
        kmn = KMeans(2 ** bit,init=np.array(init).reshape(2 ** bit,1))
        kmn.fit(data.reshape((-1,1)))
        kmean_list.append(kmn)
        print(name,i.shape)
    
    conv1.weight torch.Size([64, 1, 3, 3])
    conv1.bias torch.Size([64])
    conv2.weight torch.Size([256, 64, 3, 3])
    conv2.bias torch.Size([256])
    fc1.weight torch.Size([512, 12544])
    fc2.weight torch.Size([10, 512])
    fc2.bias torch.Size([10])
    
    
    c:\program files\python35\lib\site-packages\sklearn\cluster\k_means_.py:896: RuntimeWarning: Explicit initial center position passed: performing only one init in k-means instead of n_init=10
      return_n_iter=True)
    

    After fitting a quantizer for each layer, apply it to replace that layer's weights with their cluster centroids.

    for i,(name,f) in enumerate(model.named_parameters()):
        data = f.data.clone().view(-1).cpu().detach().numpy().reshape(-1)
        data_nozero = data[data != 0].reshape((-1,1))
        if data_nozero.size == 0 or data.size < 2 ** bit or kmean_list[i] is None:
            # layer was skipped during quantizer training
            f.kmeans_result = None
            f.kmeans_label = None
            continue

        # cluster index per weight; -1 marks pruned (zero) weights
        result = data.copy()
        result[result == 0] = -1

        label = kmean_list[i].predict(data_nozero).reshape(-1)
        # replace each surviving weight with its cluster centroid
        new_data = np.array([kmean_list[i].cluster_centers_[x] for x in label])
        data[data != 0] = new_data.reshape(-1)
        f.data = pt.from_numpy(data).view(f.data.shape).cuda()
        result[result != -1] = label
        f.kmeans_result = pt.from_numpy(result).view(f.data.shape).cuda()      # index map
        f.kmeans_label = pt.from_numpy(kmean_list[i].cluster_centers_).cuda()  # codebook
    
    test_model(model,testloader)
    print_sparse(model)
    
    0.8919106012658228
    -----------------------------------
    Layer sparse
         conv1.weight 0.9375
         conv1.bias 0.890625
         conv2.weight 0.5333726671006944
         conv2.bias 0.4765625
         fc1.weight 0.0015011222995057398
         fc1.bias 0.0
         fc2.weight 0.5765625
         fc2.bias 0.7
    Total: 0.01398429139292775
    
    0.01398429139292775
    

    As shown above, for a toy network like this, 2-bit quantization is already workable: before fine-tuning, it costs about 4.5 percentage points of accuracy (93.7% → 89.2%).
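
    A rough estimate of why 2 bits is attractive (again my own arithmetic, not from the original post): each surviving weight now needs only a 2-bit cluster index, plus a shared codebook of 2^bit float32 centroids per tensor:

    nonzero = sum((p.data != 0).sum().item() for p in model.parameters())
    n_tensors = sum(1 for _ in model.parameters())

    index_kb = nonzero * bit / 8 / 1e3                 # 2-bit index per surviving weight
    codebook_kb = n_tensors * (2 ** bit) * 4 / 1e3     # four float32 centroids per tensor
    float_kb = nonzero * 4 / 1e3                       # unquantized float32 values, for comparison
    print(index_kb + codebook_kb, "kB vs", float_kb, "kB unquantized")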

    Fine-tuning

    Following the paper, only the shared centroids are updated: the gradient of a centroid is the sum of the gradients of all weights assigned to it; each centroid then takes a plain SGD step, and the weights are refreshed from the updated codebook.

    lossfunc = pt.nn.CrossEntropyLoss().cuda()
    lr = 0.001
    for _ in range(1):
        for a,(data,label) in enumerate(trainloader):
            data,label = data.cuda(),label.cuda()
            model.zero_grad()
            outputs = model(data)
            loss = lossfunc(outputs,label)
            loss.backward()

            for name,i in model.named_parameters():
                if i.kmeans_result is None:   # layer was not quantized
                    continue
                for x in range(2 ** bit):
                    # centroid gradient = sum of the gradients of all weights in cluster x
                    grad = pt.sum(i.grad.detach()[i.kmeans_result == x])
                    i.kmeans_label[x] += -lr * grad.item()                    # SGD step on the centroid
                    i.data[i.kmeans_result == x] = i.kmeans_label[x].item()   # refresh weights
            if a % 100 == 0:
                print(a,acc(outputs,label))
    
    0 0.8828125
    100 0.921875
    200 0.9296875
    300 0.9296875
    400 0.9140625
    
    test_model(model,testloader)
    print_sparse(model)
    pt.save(model.state_dict(),"quantization.pt")
    
    0.9384889240506329
    -----------------------------------
    Layer sparse
         conv1.weight 0.9375
         conv1.bias 0.890625
         conv2.weight 0.5333726671006944
         conv2.bias 0.4765625
         fc1.weight 0.0015011222995057398
         fc1.bias 0.0
         fc2.weight 0.5765625
         fc2.bias 0.7
    Total: 0.01398429139292775
    

    After fine-tuning the cluster centers, the 2-bit quantized network (93.8%) matches the accuracy of the unquantized pruned network (93.7%).
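
    One caveat: pt.save(model.state_dict(),...) above still writes dense float32 tensors, so the checkpoint file itself is no smaller. A minimal sketch (my own, relying on the kmeans_result index maps and kmeans_label codebooks attached to the parameters earlier) of how the 2-bit codes could be packed for real storage savings:

    import numpy as np

    def pack_layer(param):
        """Pack one quantized tensor into 2-bit codes plus a float32 codebook."""
        if param.kmeans_result is None:
            return None
        idx = param.kmeans_result.view(-1).cpu().numpy()
        mask = idx != -1                            # -1 marks pruned weights
        codes = idx[mask].astype(np.uint8)          # cluster indices 0..3 fit in 2 bits
        # keep the low 2 bits of each code byte, then pack four codes per byte
        bits = np.unpackbits(codes.reshape(-1, 1), axis=1)[:, -2:]
        packed = np.packbits(bits)
        codebook = param.kmeans_label.cpu().numpy().astype(np.float32)
        return packed, mask, codebook

    packed = {name: pack_layer(p) for name,p in model.named_parameters()}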
