Low-Rank Approximation of Convolution Kernels Using SVD

Author: 白老包 | Published: 2019-08-22 21:53


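Model compression is a broad umbrella, and parts of it, such as model pruning and knowledge distillation, are currently hot research topics. As a beginner I would like to ride that wave, but my skills are limited, so for now I can only start from the simplest part. In earlier studies I came across work that used low-rank and sparse decomposition for video segmentation, so here I am doing a bit of "transfer learning" of my own and applying that knowledge to deep learning.
Early deep learning models, VGG for example, carry a certain amount of redundancy, so compressing an already trained model has become a meaningful research direction. There is a lot of literature on this; I studied two survey papers on model compression: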
Recent Advances in Efficient Computation of Deep Convolutional Neural Networks
A Survey of Model Compression and Acceleration for Deep Neural Networks
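I found that within low-rank approximation, SVD is the most widely used tool, so I looked into it.
The SVD sniper rifle, also known as the Dragunov sniper rifle... ahem, wrong set. SVD here means singular value decomposition. References often compare it to the eigenvalues and eigenvectors of a square matrix. The detailed theory is covered in the reference linked from the original post. What deserves a specific mention here is the form of the decomposition, shown in the figure below. Image from: …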

[Figure: QQ截图20190822204851.jpg, the form of the SVD factorization]
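A two-dimensional matrix can be factored into three matrices: U, Σ and V^T, in that order. Σ is a (rectangular) diagonal matrix holding the singular values, and U and V are square matrices. The larger a singular value, the more information its component carries, so Σ can be used to filter out redundant information and thereby compress the matrix.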
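Written out, the factorization and its truncation look as follows. The symbols p, q and A_n are my own notation for this sketch, not the original post's; n is the number of singular values kept, as in the rest of the text.

```latex
A = U \Sigma V^{T}, \qquad
A \in \mathbb{R}^{p \times q},\;
U \in \mathbb{R}^{p \times p},\;
\Sigma \in \mathbb{R}^{p \times q},\;
V \in \mathbb{R}^{q \times q}

A_n = \sum_{i=1}^{n} \sigma_i \, u_i v_i^{T}, \qquad
\lVert A - A_n \rVert_F = \sqrt{\sum_{i > n} \sigma_i^{2}}
```

Here the σ_i are the singular values in descending order and u_i, v_i are the corresponding columns of U and V. By the Eckart–Young theorem, A_n is the best rank-n approximation of A in the Frobenius norm, which is why discarding the small singular values loses the least information.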
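The tool I use, numpy.linalg.svd, returns the singular values already sorted in descending order, so compression only requires keeping the first n singular values.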
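A minimal sketch of that truncation on a random matrix may help. The matrix, the helper name `truncate` and the chosen values of n are illustrative assumptions, not part of the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((18, 25))  # arbitrary test matrix (assumption)

# full_matrices=False gives the economy-size factors: U is 18x18, Vt is 18x25.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The singular values are returned largest first.
assert np.all(s[:-1] >= s[1:])


def truncate(U, s, Vt, n):
    """Rebuild the matrix from its n largest singular values (hypothetical helper)."""
    return (U[:, :n] * s[:n]) @ Vt[:n, :]


# Frobenius-norm reconstruction error for several values of n.
errors = [np.linalg.norm(A - truncate(U, s, Vt, n)) for n in (1, 6, 12, 18)]
print(errors)
```

The error shrinks as n grows and is numerically zero once all 18 singular values are kept.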
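Using PyTorch, I built a network with two convolutional layers, and I mainly compress the convolution kernels. A convolution kernel is a four-dimensional tensor, so some of its dimensions must be merged before SVD can be applied. Papers usually merge three dimensions together: height, width and input channels. Because my convolutional layers have few channels, I instead merge the input and output channels into one dimension and the height and width into the other. The full code follows.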
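Before the full listing, the two flattening schemes can be compared on a stand-in array. The random kernel and the names `W_post` and `W_paper` are assumptions made for this sketch; only the shapes match the network below.

```python
import numpy as np

# Stand-in kernel with the layout PyTorch uses for Conv2d weights:
# (out_channels, in_channels, kH, kW). Same shape as conv2 in the listing.
W = np.random.default_rng(0).standard_normal((16, 6, 5, 5)).astype(np.float32)
out_ch, in_ch, kh, kw = W.shape

# Scheme used in this post: (out_ch * in_ch) x (kH * kW).
W_post = W.reshape(out_ch * in_ch, kh * kw)

# Scheme described as common in papers: out_ch x (in_ch * kH * kW).
W_paper = W.reshape(out_ch, in_ch * kh * kw)

print(W_post.shape, W_paper.shape)

# Either flattening is lossless: reshaping back recovers the kernel exactly.
assert np.array_equal(W_post.reshape(W.shape), W)
assert np.array_equal(W_paper.reshape(W.shape), W)
```

The choice matters because the rank of a matrix cannot exceed its smaller dimension: at most 25 for the (96, 25) layout and at most 16 for the (16, 150) layout.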

# -*- coding: utf-8 -*-
"""
Created on Thu Aug 22 16:31:35 2019

@author: BHN
"""

import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=0)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=0)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()

import torch.optim as optim

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')

correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))

#print("Model's state_dict:")
#for param_tensor in net.state_dict():
#    print(param_tensor, "\t", net.state_dict()[param_tensor].size())
#    
import numpy as np

# Trained kernels as NumPy arrays of shape (out_channels, in_channels, kH, kW).
conv1_weight = net.state_dict()['conv1.weight'].numpy()
conv2_weight = net.state_dict()['conv2.weight'].numpy()
sval_nums = 10  # number of singular values to keep

# conv1: flatten to (out*in, kH*kW) = (18, 25), keep the top singular values,
# then reshape the low-rank reconstruction back to the 4-D kernel shape.
U, Sigma, VT = np.linalg.svd(np.reshape(
    conv1_weight, (conv1_weight.shape[0] * conv1_weight.shape[1],
                   conv1_weight.shape[2] * conv1_weight.shape[3])))
con_restruct1 = (U[:, 0:sval_nums]).dot(np.diag(Sigma[0:sval_nums])).dot(VT[0:sval_nums, :])
conv1_weight = np.reshape(con_restruct1, conv1_weight.shape)
#net.state_dict()['conv1.weight'] = torch.from_numpy(conv1_weight)

# conv2: same procedure on the flattened (96, 25) matrix.
U, Sigma, VT = np.linalg.svd(np.reshape(
    conv2_weight, (conv2_weight.shape[0] * conv2_weight.shape[1],
                   conv2_weight.shape[2] * conv2_weight.shape[3])))
con_restruct2 = (U[:, 0:sval_nums]).dot(np.diag(Sigma[0:sval_nums])).dot(VT[0:sval_nums, :])
conv2_weight = np.reshape(con_restruct2, conv2_weight.shape)
#net.state_dict()['conv2.weight'] = torch.from_numpy(conv2_weight)
#print(net.state_dict()['conv2.weight'].shape)

torch.save(net.state_dict(), 'old.pth')
new_state_dict = torch.load('old.pth')
new_state_dict['conv1.weight'] = torch.from_numpy(conv1_weight)
new_state_dict['conv2.weight'] = torch.from_numpy(conv2_weight)
torch.save(new_state_dict, 'svd.pth')
net.load_state_dict(torch.load('svd.pth'))
net.eval()


correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))

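We can count the parameters. The two convolution kernels hold 3×6×5×5 + 6×16×5×5 = 2850 parameters in total. If only the factors truncated to the first n singular values are stored, the count is (18n + n + 25n) + (96n + n + 25n) = 166n. In my tests, with n at 6 or above the drop in accuracy stays below 5%, so at roughly 3× compression the accuracy is still acceptable.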
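The arithmetic above can be checked with a few lines of Python. The function names are hypothetical helpers introduced for this sketch; biases are ignored, as in the text.

```python
def kernel_params(out_ch, in_ch, kh, kw):
    """Parameters in a dense conv kernel (bias ignored)."""
    return out_ch * in_ch * kh * kw


def factored_params(out_ch, in_ch, kh, kw, n):
    """Parameters in the rank-n factors of the (out*in) x (kh*kw) matrix."""
    rows, cols = out_ch * in_ch, kh * kw
    return rows * n + n + n * cols  # U columns + singular values + V^T rows


layers = [(6, 3, 5, 5), (16, 6, 5, 5)]  # conv1 and conv2 from the listing
full = sum(kernel_params(*shape) for shape in layers)

for n in (6, 10, 17):
    low = sum(factored_params(*shape, n) for shape in layers)
    print(n, low, round(full / low, 2))
```

With n = 6 the factors take 996 parameters, about 2.86 times fewer than the 2850 of the dense kernels. From n = 18 onward the factored form (2988 parameters) is larger than the dense kernels, so under this counting the scheme only saves space for n up to 17.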
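Compressing the fully connected layers is even simpler. The model structure also shows that the fully connected layers hold more parameters, so they are easier to compress. There is plenty of material on this online, so I will not go into detail here.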
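As a rough illustration of why this case is simpler, here is a NumPy sketch that factors one weight matrix with the shape of fc1 (120×400). The random weights, the rank n = 20 and the names `A` and `B` are assumptions for the sketch; this is not code from the original post.

```python
import numpy as np

rng = np.random.default_rng(0)
out_features, in_features, n = 120, 400, 20  # fc1 shape from the listing; n is arbitrary

# Stand-in weights and bias (assumption: a trained layer would be used instead).
W = rng.standard_normal((out_features, in_features)).astype(np.float32)
b = rng.standard_normal(out_features).astype(np.float32)

# The weight is already a 2-D matrix, so no reshaping is needed before the SVD.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :n]                  # (out_features, n)
B = s[:n, None] * Vt[:n, :]   # (n, in_features), singular values folded in

x = rng.standard_normal(in_features).astype(np.float32)
y_full = W @ x + b
y_low = A @ (B @ x) + b       # two small matrix products instead of one large one

print(W.size, A.size + B.size)  # 48000 vs 10400
```

In PyTorch terms this would correspond to replacing one `nn.Linear(400, 120)` with `nn.Linear(400, n, bias=False)` followed by `nn.Linear(n, 120)`. With random stand-in weights the approximation error is not meaningful; how well a real layer tolerates a given n depends on how quickly its singular values decay.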