
Author: Lornatang | Published 2019-03-29 15:04

    RePr: Improved Training of Convolutional Filters (Translation, Part 2)

    RePr: Improved Training of Convolutional Filters

    Paper: http://static.tongtianta.site/paper_pdf/2e51726c-3f1d-11e9-ba5b-00163e08bb86.pdf

    Abstract

    A well-trained Convolutional Neural Network can easily be pruned without significant loss of performance. This is because of unnecessary overlap in the features captured by the network's filters. Innovations in network architecture such as skip/dense connections and Inception units have mitigated this problem to some extent, but these improvements come with increased computation and memory requirements at run-time. We attempt to address this problem from another angle - not by changing the network structure but by altering the training method. We show that by temporarily pruning and then restoring a subset of the model's filters, and repeating this process cyclically, overlap in the learned features is reduced, producing improved generalization. We show that the existing model-pruning criteria are not optimal for selecting filters to prune in this context and introduce inter-filter orthogonality as the ranking criterion to determine under-expressive filters. Our method is applicable both to vanilla convolutional networks and more complex modern architectures, and improves the performance across a variety of tasks, especially when applied to smaller networks.

    1. Introduction

    Convolutional Neural Networks have achieved state-of-the-art results in various computer vision tasks [1, 2]. Much of this success is due to the innovation of novel, task-specific network architectures [3, 4]. Despite variation in network design, the same core optimization techniques are used across tasks. These techniques consider each individual weight as its own entity and update them independently. Limited progress has been made towards developing a training process specifically designed for convolutional networks, in which filters are the fundamental unit of the network. A filter is not a single weight parameter but a stack of spatial kernels.

    Because models are typically over-parameterized, a trained convolutional network will contain redundant filters [5, 6]. This is evident from the common practice of pruning filters [7, 8, 6, 9, 10, 11], rather than individual parameters [12], to achieve model compression. Most of these pruning methods are able to drop a significant number of filters with only a marginal loss in the performance of the model. However, a model with fewer filters cannot be trained from scratch to achieve the performance of a large model that has been pruned to be roughly the same size [6, 11, 13]. Standard training procedures tend to learn models with extraneous and prunable filters, even for architectures without any excess capacity. This suggests that there is room for improvement in the training of Convolutional Neural Networks (ConvNets).

    Part of the research was done while the author was an intern at MSR

    [Figure 1: Training-set (left) and test-set (right) accuracy for standard training vs. RePr training of the small ConvNet described in Section 4.]

    To this end, we propose a training scheme in which, after some number of iterations of standard training, we select a subset of the model’s filters to be temporarily dropped. After additional training of the reduced network, we reintroduce the previously dropped filters, initialized with new weights, and continue standard training. We observe that following the reintroduction of the dropped filters, the model is able to achieve higher performance than was obtained before the drop. Repeated application of this process obtains models which outperform those obtained by standard training as seen in Figure 1 and discussed in Section 4. We observe this improvement across various tasks and over various types of convolutional networks. This training procedure is able to produce improved performance across a range of possible criteria for choosing which filters to drop, and further gains can be achieved by careful selection of the ranking criterion. According to a recent hypothesis [14], the relative success of over-parameterized networks may largely be due to an abundance of initial sub-networks. Our method aims to preserve successful sub-networks while allowing the re-initialization of less useful filters.

    In addition to our novel training strategy, the second major contribution of our work is an exploration of metrics to guide filter dropping. Our experiments demonstrate that standard techniques for permanent filter pruning are suboptimal in our setting, and we present an alternative metric which can be efficiently computed, and which gives a significant improvement in performance. We propose a metric based on the inter-filter orthogonality within convolutional layers and show that this metric outperforms state-of-the-art filter importance ranking methods used for network pruning in the context of our training strategy. We observe that even small, under-parameterized networks tend to learn redundant filters, which suggests that filter redundancy is not solely a result of over-parameterization, but is also due to ineffective training. Our goal is to reduce the redundancy of the filters and increase the expressive capacity of ConvNets and we achieve this by changing the training scheme rather than the model architecture.

    2. Related Work

    Training Scheme Many changes to the training paradigm have been proposed to reduce over-fitting and improve generalization. Dropout [15] is widely used in training deep nets. By stochastically dropping the neurons it prevents co-adaptation of feature detectors. A similar effect can be achieved by dropping a subset of activations [16]. Wu et al. [15] extend the idea of stochastic dropping to convolutional neural networks by probabilistic pooling of convolution activations. Yet another form of stochastic training recommends randomly dropping entire layers [17], forcing the model to learn similar features across various layers, which prevents extreme overfitting. In contrast, our technique encourages the model to use a linear combination of features instead of duplicating the same feature. Han et al. [18] propose Dense-Sparse-Dense (DSD), a similar training scheme, in which they apply weight regularization mid-training to encourage the development of sparse weights, and subsequently remove the regularization to restore dense weights. While DSD works at the level of individual parameters, our method is specifically designed to apply to convolutional filters.

    Model Compression Knowledge Distillation (KD) [19] is a training scheme which uses soft logits from a larger trained model (teacher) to train a smaller model (student). Soft logits capture hierarchical information about the object and provide a smoother loss function for optimization. This leads to easier training and better convergence for small models. In a surprising result, Born-Again-Network [20] shows that if the student model is of the same capacity as the teacher it can outperform the teacher. A few other variants of KD have been proposed [21] and all of them require training several models. Our training scheme does not depend on an external teacher and requires less training than KD. More importantly, when combined with KD, our method gives better performance than can be achieved by either technique independently (discussed in Section 7).

    Neuron ranking Interest in finding the least salient neurons/weights has a long history. LeCun [22] and Hassibi et al. [23] show that using the Hessian, which contains second-order derivatives, identifies the weak neurons and performs better than using the magnitude of the weights. Computing the Hessian is expensive and thus it is not widely used. Han et al. [12] show that the norm of the weights is still an effective ranking criterion and yields sparse models. The sparse models do not translate to faster inference, but as a neuron ranking criterion they are effective. Hu et al. [24] explore the Average Percentage of Zeros (APoZ) in the activations and use a data-driven threshold to determine the cut-off. Molchanov et al. [9] recommend the second term from the Taylor expansion of the loss function. We provide a detailed comparison and show results of using these metrics with our training scheme in Section 5.

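    As a concrete illustration of two of the ranking criteria mentioned above, the sketch below ranks the filters of a single convolutional layer by the L2 norm of their weights [12] and by the Average Percentage of Zeros (APoZ) of their activations [24]. This is a minimal PyTorch sketch of our own; the layer, batch, and function names are illustrative and not from the paper.

```python
import torch
import torch.nn as nn

def rank_by_weight_norm(conv: nn.Conv2d):
    # L2 norm of each filter's weights; a smaller norm suggests a less salient filter [12].
    w = conv.weight.detach()                      # shape: (out_channels, in_channels, k, k)
    norms = w.flatten(1).norm(dim=1)              # one value per filter
    return norms.argsort()                        # filter indices, least salient first

def rank_by_apoz(conv: nn.Conv2d, inputs: torch.Tensor):
    # Average Percentage of Zeros of post-ReLU activations [24];
    # a higher APoZ suggests a less useful filter.
    with torch.no_grad():
        acts = torch.relu(conv(inputs))           # shape: (batch, out_channels, H, W)
        apoz = (acts == 0).float().mean(dim=(0, 2, 3))
    return apoz.argsort(descending=True)          # filter indices, least salient first

# Usage sketch with a hypothetical layer and batch:
conv = nn.Conv2d(3, 32, kernel_size=3, padding=1)
x = torch.randn(64, 3, 32, 32)
print(rank_by_weight_norm(conv)[:5], rank_by_apoz(conv, x)[:5])
```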

    Architecture Search Neural architecture search [25, 26, 27] is where the architecture is modified during training, and multiple neural network structures are explored in search of the best architecture for a given dataset. Such methods do not have any benefits if the architecture is fixed ahead of time. Our scheme improves training for a given architecture by making better use of the available parameters. It could be used in conjunction with architecture search if there is flexibility around the final architecture, or on its own when the architecture is fixed due to certified model deployment, memory requirements, or other considerations.

    Feature correlation A well-known shortcoming of vanilla convolutional networks is their correlated feature maps [5, 28]. Architectures like Inception-Net [29] are motivated by analyzing the correlation statistics of features across layers. They aim to reduce the correlation between the layers by using concatenated features from various-sized filters, though subsequent research suggests otherwise [30]. More recent architectures like ResNet [1] and DenseNet [31] aim to implicitly reduce feature correlations by summing or concatenating activations from previous layers. That said, these models are computationally expensive and require large memory to store previous activations. Our aim is to induce decorrelated features without changing the architecture of the convolutional network. This benefits all the existing implementations of ConvNets without having to change the infrastructure. While our technique performs best with vanilla ConvNet architectures, it still marginally improves the performance of modern architectures.

    3. Motivation for Orthogonal Features

    A feature for a convolutional filter is defined as the pointwise sum of the activations from the individual kernels of the filter. A feature is considered useful if it helps to improve the generalization of the model. A model that has poor generalization usually has features that, in aggregate, capture limited directions in activation space [32]. On the other hand, if a model's features are orthogonal to one another, they will each capture distinct directions in activation space, leading to improved generalization. For a trivially sized ConvNet, we can compute the maximally expressive filters by analyzing the correlation of features across layers and clustering them into groups [33]. However, this scheme is computationally impractical for the deep ConvNets used in real-world applications. Alternatively, a computationally feasible option is the addition of a regularization term to the loss function used in standard SGD training which encourages the minimization of the covariance of the activations, but this produces only limited improvement in model performance [34, 5]. A similar method, in which the regularization term instead encourages the orthogonality of filter weights, has also produced only marginal improvements [35, 36, 37, 38]. Shang et al. [39] discovered that low-level filters are duplicated with opposite phase. Forcing filters to be orthogonal will minimize this duplication without changing the activation function. In addition to improvements in performance and generalization, Saxe et al. [40] show that the orthogonality of weights also improves the stability of network convergence during training. The authors of [38, 41] further demonstrate the value of orthogonal weights for the efficient training of networks. Orthogonal initialization is common practice for Recurrent Neural Networks due to their increased sensitivity to initial conditions [42], but it has somewhat fallen out of favor for ConvNets. These factors shape our motivation for encouraging orthogonality of features in the ConvNet and form the basis of our ranking criteria. Because features are dependent on the input data, determining their orthogonality requires computing statistics across the entire training set, and is therefore prohibitive. We instead compute the orthogonality of filter weights as a surrogate. Our experiments show that encouraging weight orthogonality through a regularization term is insufficient to promote the development of features which capture the full space of the input data manifold. Our method of dropping overlapping filters acts as an implicit regularization and leads to better orthogonality of filters without hampering model convergence.

    We use Canonical Correlation Analysis (CCA) [43] to study the overlap of features in a single layer. CCA finds the linear combinations of random variables that show maximum correlation with each other. It is a useful tool to determine if the learned features overlap in their representational capacity. Li et al. [44] apply correlation analysis to filter activations to show that most of the well-known ConvNet architectures learn similar representations. Raghu et al. [30] combine CCA with SVD to perform a correlation analysis of the singular values of activations from various layers. They show that increasing the depth of a model does not always lead to a corresponding increase of the model's dimensionality, due to several layers learning representations in correlated directions. We ask an even more elementary question - how correlated are the activations from various filters within a single layer? In an over-parameterized network like VGG-16, which has several convolutional layers with 512 filters each, it is no surprise that most of the filter activations are highly correlated. As a result, VGG-16 has been shown to be easily pruned - more than 50% of the filters can be dropped while maintaining the performance of the full network [9, 44]. Is this also true for significantly smaller convolutional networks, which under-fit the dataset? We will consider a simple network with two convolutional layers of 32 filters each, and a softmax layer at the end. Training this model on CIFAR-10 for 100 epochs with an annealed learning rate results in a test set accuracy of 58.2%, far below the 93.5% achieved by VGG-16. In the case of VGG-16, we might expect that correlation between filters is merely an artifact of the over-parameterization of the model - the dataset simply does not have a dimensionality high enough to require every feature to be orthogonal to every other. On the other hand, our small network has clearly failed to capture the full feature space of the training data, and thus any correlation between its filters is due to inefficiencies in training, rather than over-parameterization.

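    To make this kind of overlap analysis concrete, the following sketch computes a plain pairwise correlation matrix between per-filter activations of the last layer of a small stand-in network; it is a simplified substitute for the CCA analysis used in the paper. The two-layer model, batch size, and tensor shapes are our own assumptions.

```python
import torch
import torch.nn as nn

# A stand-in for the small two-layer, 32-filter ConvNet described above.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
)

def filter_correlation(layer_output: torch.Tensor) -> torch.Tensor:
    # layer_output: (batch, filters, H, W). Flatten each filter's responses
    # across the batch and spatial dims, then correlate filters pairwise.
    f = layer_output.permute(1, 0, 2, 3).flatten(1)   # (filters, batch*H*W)
    f = f - f.mean(dim=1, keepdim=True)
    f = f / (f.norm(dim=1, keepdim=True) + 1e-8)
    return f @ f.t()                                  # (filters, filters) correlation matrix

x = torch.randn(256, 3, 32, 32)                       # stand-in for a CIFAR-10 batch
with torch.no_grad():
    corr = filter_correlation(model(x))
# Large off-diagonal entries indicate filters with overlapping (correlated) features.
print(corr.abs().fill_diagonal_(0).max())
```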

    Figure 2: Left: Canonical Correlation Analysis of activations from two layers of a ConvNet trained on CIFAR-10. Right: Distribution of change in accuracy when the model is evaluated by dropping one filter at a time.

    Given a trained model, we can evaluate the contribution of each filter to the model's performance by removing (zeroing out) that filter and measuring the drop in accuracy on the test set. We will call this metric of filter importance the "greedy Oracle". We perform this evaluation independently for every filter in the model, and plot the distribution of the resulting drops in accuracy in Figure 2 (right). Most of the second-layer filters contribute less than 1% in accuracy, and with the first-layer filters there is a long tail: some filters are important and contribute over 4% of accuracy, but most filters are around 1%. This implies that even a tiny and under-performing network could be filter-pruned without significant performance loss. The model has not efficiently allocated filters to capture wider representations of the necessary features. Figure 2 (left) shows the correlations from linear combinations of the filter activations (CCA) at both layers. It is evident that in both layers there is a significant correlation among filter activations, with several of them close to a near-perfect correlation of 1 (bright yellow spots). The second layer (upper right diagonal) has a lot more overlap of features than the first layer (lower right). For a random orthogonal matrix, any value above 0.3 (lighter than dark blue) is an anomaly. The activations are even more correlated if the linear combinations are extended to kernel functions [45] or singular values [30]. Regardless, it suffices to say that standard training of convolutional filters does not maximize the representational potential of the network.
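
    A minimal sketch of the greedy-Oracle evaluation described above: each filter is zeroed out in turn and the resulting drop in accuracy is recorded. The `evaluate` callable and the layer handle are placeholders for whichever evaluation loop and model are in use.

```python
import torch
import torch.nn as nn

def greedy_oracle(model: nn.Module, conv: nn.Conv2d, evaluate) -> list:
    # evaluate(model) -> accuracy on a held-out set; assumed to be provided by the caller.
    base_acc = evaluate(model)
    drops = []
    for i in range(conv.out_channels):
        saved_w = conv.weight.data[i].clone()
        saved_b = conv.bias.data[i].clone() if conv.bias is not None else None
        conv.weight.data[i].zero_()                  # temporarily remove filter i
        if saved_b is not None:
            conv.bias.data[i].zero_()
        drops.append(base_acc - evaluate(model))     # accuracy lost without filter i
        conv.weight.data[i] = saved_w                # restore the filter
        if saved_b is not None:
            conv.bias.data[i] = saved_b
    return drops  # small drops indicate redundant, prunable filters
```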

    4. Our Training Scheme: RePr

    We modify the training process by cyclically removing redundant filters, retraining the network, re-initializing the removed filters, and repeating. We consider each filter (a 3D tensor) as a single unit, and represent it as a long vector $f$. Let $\mathcal{M}_F$ denote a model with $F$ filters spread across $L$ layers. Let $\hat{F}$ denote a subset of the $F$ filters, such that $\mathcal{M}_F$ denotes the complete network whereas $\mathcal{M}_{F-\hat{F}}$ denotes the sub-network without the $\hat{F}$ filters. Our training scheme alternates between training the complete network ($\mathcal{M}_F$) and the sub-network ($\mathcal{M}_{F-\hat{F}}$). This introduces two hyper-parameters. The first is the number of iterations to train each of the networks before switching over; let this be $S_1$ for the full network and $S_2$ for the sub-network. These have to be non-trivial values so that each of the networks learns to improve upon the results of the previous network. The second hyper-parameter is the total number of times to repeat this alternating scheme; let it be $N$. This value has minimal impact beyond a certain range and does not require tuning.

    The most important part of our algorithm is the metric used to rank the filters. Let R be the metric which associates some numeric value with a filter. This could be a norm of the weights or of their gradients, or our metric - inter-filter orthogonality within a layer. Here we present our algorithm agnostic to the choice of metric. Most sensible choices of filter importance result in an improvement over standard training when applied to our training scheme (see the ablation study in Section 6).

    Our training scheme operates at a macro level and is not a weight-update rule. Thus, it is not a substitute for SGD or other adaptive methods like Adam [46] and RMSprop [47]. Our scheme works with any of the available optimizers and shows improvement across the board. However, if using an optimizer that maintains parameter-specific learning rates (like Adam), it is important to re-initialize the learning-rate state corresponding to the weights that are part of the pruned filters ($\hat{F}$). The corresponding Batch Normalization [48] parameters ($\gamma$ and $\beta$) must also be re-initialized. For this reason, comparisons of our training scheme with standard training are done with a common optimizer.
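
    As an illustration, with PyTorch's Adam and BatchNorm2d the reset might look like the sketch below. The function and `pruned_idx` are our own names; the optimizer state keys are those used by torch.optim.Adam, and $\gamma$/$\beta$ correspond to BatchNorm's `weight`/`bias`.

```python
import torch
import torch.nn as nn

def reset_pruned_state(optimizer: torch.optim.Adam,
                       conv: nn.Conv2d,
                       bn: nn.BatchNorm2d,
                       pruned_idx: torch.Tensor) -> None:
    # Clear Adam's running moments for the pruned filters so the re-initialized
    # weights do not inherit stale per-parameter learning-rate statistics.
    state = optimizer.state.get(conv.weight, {})
    for key in ("exp_avg", "exp_avg_sq"):
        if key in state:
            state[key][pruned_idx] = 0.0
    # Re-initialize the corresponding Batch Normalization parameters (gamma, beta)
    # and running statistics for the pruned output channels.
    with torch.no_grad():
        bn.weight[pruned_idx] = 1.0          # gamma
        bn.bias[pruned_idx] = 0.0            # beta
        bn.running_mean[pruned_idx] = 0.0
        bn.running_var[pruned_idx] = 1.0
```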

    We re-initialize the dropped filters ($\hat{F}$) to be orthogonal to their values before being dropped and to the current values of the non-pruned filters ($F - \hat{F}$). We use the QR decomposition on the weights of the filters from the same layer to find the null-space, and use that to find an orthogonal initialization point.
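
    One possible implementation of this null-space re-initialization is sketched below, assuming the layer has more weight dimensions per filter than filters so that a non-trivial null-space exists; the function name and shapes are ours.

```python
import torch

def orthogonal_reinit(W: torch.Tensor, num_new: int) -> torch.Tensor:
    # W: (num_filters, k*k*c) flattened filters whose span the new filters should avoid,
    # e.g. the pre-drop values of F_hat stacked with the current non-pruned filters.
    n, d = W.shape
    q, _ = torch.linalg.qr(W.t(), mode="complete")   # q: (d, d) orthonormal basis
    null_basis = q[:, n:]                            # columns orthogonal to every row of W
    assert null_basis.shape[1] >= num_new, "not enough orthogonal directions"
    # Random combinations of null-space directions give new filters orthogonal to W.
    mix = torch.randn(null_basis.shape[1], num_new)
    new_filters = (null_basis @ mix).t()             # (num_new, d)
    return new_filters / new_filters.norm(dim=1, keepdim=True)
```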

    Our algorithm is training interposed with Re-initializing and Pruning - RePr (pronounced: reaper). We summarize our training scheme in Algorithm 1.

    [Algorithm 1: RePr training scheme]
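
    A high-level sketch of the loop we read Algorithm 1 as describing; `train`, `rank_filters`, `prune` (zero out and freeze), and `reinit_orthogonal` are placeholders for the steps discussed in the text, and `p`, the fraction of filters dropped each cycle, is the pruning percentage implied by the 70% sub-network mentioned below.

```python
def repr_training(model, S1, S2, N, p, train, rank_filters, prune, reinit_orthogonal):
    """RePr: alternate full-network and sub-network training (sketch)."""
    for _ in range(N):
        train(model, iterations=S1)            # train the complete network M_F
        ranking = rank_filters(model)          # e.g. inter-filter orthogonality (Section 5)
        dropped = prune(model, ranking, p)     # zero out / freeze the least important p% (F_hat)
        train(model, iterations=S2)            # train the sub-network M_{F - F_hat}
        reinit_orthogonal(model, dropped)      # re-introduce F_hat, orthogonal to surviving filters
    train(model, iterations=S1)                # finish with standard training of the full network
```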

    We use a shallow model to analyze the dynamics of our training scheme and its impact on the train/test accuracy. A shallow model makes it feasible to compute the greedy Oracle ranking for each of the filters. This allows us to understand the impact of the training scheme alone, without confounding the results with the impact of the ranking criteria. We provide results on larger and deeper convolutional networks in Section 8 (Results).

    Consider an $n$-layer vanilla ConvNet, without skip or dense connections, with $X$ filters in each layer, as shown below:

    $$\text{Img} \rightarrow [\text{Conv}(X) \rightarrow \text{ReLU}] \times n \rightarrow \text{Softmax}$$

    We will represent this architecture as $C_n(X)$. Thus, a $C_3(32)$ has 96 filters, and when trained with SGD with a learning rate of 0.01, achieves a test accuracy of 73%. Figure 1 shows training plots for accuracy on the training set (left) and test set (right). In this example, we use a RePr training scheme with the hyper-parameters $S_1$, $S_2$, and $N$ described above, and the ranking criterion R is the greedy Oracle. We exclude a separate validation set of 5K images from the training set to compute the Oracle ranking. In the training plot, annotation [A] shows the point at which the filters are first pruned. Annotation [C] marks the test accuracy of the model at this point. The drop in test accuracy at [C] is smaller than the drop in training accuracy at [A], which is not a surprise, as most models overfit the training set. However, the test accuracy at [D] is the same as at [C], even though at this point the model only has 70% of the filters. This is not a surprising result, as research on filter pruning shows that at lower rates of pruning most, if not all, of the performance can be recovered [9].

    What is surprising is that the test accuracy at [E], which is only a couple of epochs after re-introducing the pruned filters, is significantly higher than at point [C]. Both point [C] and point [E] are networks of the same capacity, and the higher accuracy at [E] is not due to model convergence: in the standard training (orange line) the test accuracy does not change during this period. Models that first grow the network and then prune [49, 50] unfortunately stop short of another phase of growth, which would yield improved performance. In their defense, such a phase would defeat the purpose of obtaining a smaller network by pruning. However, if we continue RePr training for another two iterations, we see that point [F], which still has only 70% of the original filters, yields accuracy comparable to point [E] (100% of the model size).

    Another observation we can make from the plots is that the training accuracy of the RePr model is lower, which signifies some form of regularization of the model. This is evident in Figure 4 (right), which shows RePr with a large number of iterations $N$. While the marginal benefit of higher test accuracy diminishes quickly, the generalization gap between train and test accuracy is reduced significantly.

    5. Our Metric: Inter-filter Orthogonality

    The goals of searching for a metric to rank the least important filters are twofold - (1) computing the greedy Oracle is not computationally feasible for large networks, and (2) the greedy Oracle may not be the best criterion. If a filter which captures a unique direction, and is thus not replaceable by a linear combination of other filters, has a lower contribution to accuracy, the Oracle will drop that filter. On a subsequent re-initialization and training, we may not get back the same set of directions.

    The directions captured by the activation patterns express the capacity of a deep network [51]. Making the features orthogonal will maximize the directions captured and thus the expressiveness of the network. In a densely connected layer, orthogonal weights lead to orthogonal features, even in the presence of ReLU [42]. However, it is not clear how to compute the orthogonality of a convolutional layer.

    A convolutional layer is composed of parameters grouped into spatial kernels that sparsely share the incoming activations. Should all the parameters in a single convolutional layer be considered together when accounting for orthogonality? The theory that promotes initializing weights to be orthogonal is based on densely connected layers (FC layers), and popular deep learning libraries follow this guide¹ by treating a convolutional layer as one giant vector, disregarding the sparse connectivity. A recent attempt to study the orthogonality of convolutional filters is described in [41], but their motivation is the convergence of very deep networks (10K layers) and not the orthogonality of the features. Our empirical study suggests a strong preference for requiring orthogonality of individual filters in a layer (inter-filter and intra-layer) rather than of individual kernels.

    A filter of kernel size $k \times k$ is commonly a 3D tensor of shape $k \times k \times c$, where $c$ is the number of channels in the incoming activations. Flatten this tensor to a 1D vector of size $k \ast k \ast c$, and denote it by $f$. Let $J_\ell$ denote the number of filters in layer $\ell$, where $\ell \in \{1, \dots, L\}$ and $L$ is the number of layers in the ConvNet. Let $W_\ell$ be a matrix of size $J_\ell \times (k \ast k \ast c)$, such that the individual rows are the flattened filters ($f$) of layer $\ell$.

    Let $\hat{W}_\ell$ denote the normalized weights (each flattened filter, i.e., each row of $W_\ell$, divided by its norm). Then, the measure of orthogonality for filter $f$ in a layer $\ell$ (denoted by $O^\ell_f$) is computed as shown in the equations below:

    $$P_\ell = \left| \hat{W}_\ell \hat{W}_\ell^{T} - I \right|$$

    $$O^\ell_f = \frac{\sum P_\ell[f]}{J_\ell}$$

    $P_\ell$ is a matrix of size $J_\ell \times J_\ell$, and $P_\ell[i]$ denotes the $i$th row of $P_\ell$. The off-diagonal elements of the row of $P_\ell$ for a filter $f$ denote the projection of all the other filters in the same layer onto $f$. The sum of a row is at its minimum when the other filters are orthogonal to the given filter. We rank a filter as least important (and thus subject to pruning) if this value is the largest among all the filters in the network. While we compute the metric for a filter over a single layer, the ranking is computed over all the filters in the network. We do not enforce a per-layer rank because that would require learning a separate pruning hyper-parameter for every layer, and some layers are more sensitive than others. Our method prunes more filters from the deeper layers compared to the earlier layers. This is in accordance with the distribution of the contribution of each filter in a given network (Figure 2, right).
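
    A sketch of this computation and of the resulting network-wide ranking; the function names are ours, and the implementation follows the $P_\ell$ and $O^\ell_f$ definitions above.

```python
import torch
import torch.nn as nn

def inter_filter_orthogonality(conv: nn.Conv2d) -> torch.Tensor:
    # Rows of W are the flattened filters of the layer (J_l x k*k*c).
    W = conv.weight.detach().flatten(1)
    W_hat = W / (W.norm(dim=1, keepdim=True) + 1e-8)          # normalized filters
    P = (W_hat @ W_hat.t() - torch.eye(W.shape[0])).abs()     # P_l = |W_hat W_hat^T - I|
    return P.sum(dim=1) / W.shape[0]                          # O_f^l: row sums divided by J_l

def global_ranking(convs):
    # Rank all filters in the network; the largest values are the least important.
    scores, index = [], []
    for layer_id, conv in enumerate(convs):
        scores.append(inter_filter_orthogonality(conv))
        index += [(layer_id, f) for f in range(conv.out_channels)]
    order = torch.cat(scores).argsort(descending=True)        # least important first
    return [index[int(i)] for i in order]
```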

    Computation of our metric does not require expensive calculations of the inverse of the Hessian [22] or of second-order derivatives [23], and it is feasible for networks of any size. The most expensive calculations are the $L$ matrix products $\hat{W}_\ell \hat{W}_\ell^{T}$ (of size $J_\ell \times J_\ell$), but GPUs are designed for fast matrix multiplications. Still, our method is more expensive than computing the norm of the weights or of the activations, or the Average Percentage of Zeros (APoZ).

    ¹ tensorflow: ops/init_ops.py#L543 & pytorch: nn/init.py#L350

    Given the choice of orthogonality of filters, an obvious question is whether adding a soft penalty to the loss function would improve this training. A few researchers [35, 36, 37] have reported marginal improvements due to added regularization in ConvNets used for task-specific models. We experimented by adding a soft orthogonality penalty (a weighted sum of the per-layer terms $\left\| \hat{W}_\ell \hat{W}_\ell^{T} - I \right\|$) to the loss function, but we did not see any improvement. Soft regularization penalizes all the filters and changes the loss surface to encourage random orthogonality in the weights without improving expressiveness.
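
    For reference, a sketch of the kind of soft penalty referred to here, added to the task loss; the weighting `lam` and the list of layers are assumptions, and this is the regularized baseline that did not help, not the RePr scheme itself.

```python
import torch
import torch.nn as nn

def orthogonality_penalty(convs, lam: float = 1e-4) -> torch.Tensor:
    # Soft penalty: lam * sum_l || W_hat_l W_hat_l^T - I ||_F, added to the task loss.
    penalty = 0.0
    for conv in convs:
        W = conv.weight.flatten(1)
        W_hat = W / (W.norm(dim=1, keepdim=True) + 1e-8)
        P = W_hat @ W_hat.t() - torch.eye(W.shape[0], device=W.device)
        penalty = penalty + P.norm()
    return lam * penalty

# Usage sketch: loss = criterion(model(x), y) + orthogonality_penalty(conv_layers)
```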

    Source: http://tongtianta.site/paper/13375
    Editor: Lornatang
    Proofreader: Lornatang
