Papers:
https://arxiv.org/pdf/1409.4842v1.pdf
https://arxiv.org/pdf/1502.03167v3.pdf
https://arxiv.org/pdf/1512.00567v3.pdf
https://arxiv.org/pdf/1602.07261.pdf
References:
GoogleNet论文翻译——中英文对照
Inception-V3论文翻译——中英文对照
inception-v1,v2,v3,v4----论文笔记
极简解释inception V1 V2 V3 V4
Inception V1,V2,V3,V4 模型总结
如何解析深度学习 Inception 从 v1 到 v4 的演化
A Simple Guide to the Versions of the Inception Network
从Inception v1到Inception-ResNet,一文概览Inception家族的奋斗史
深度学习——分类之Inception v3——factorized convolution
谷歌Inception网络中的Inception-V3到Inception-V4具体作了哪些优化?
inception-v1: Going deeper with convolutions - 2014, Christian Szegedy, Vincent Vanhoucke
深入理解GoogLeNet结构(原创)
Inception (also known as GoogLeNet) is a new deep learning architecture proposed by Christian Szegedy in 2014. Earlier architectures such as AlexNet and VGG obtained better training results by increasing network depth (the number of layers), but adding layers brings many side effects, such as overfitting, vanishing gradients, and exploding gradients. Inception improves results from a different angle: it uses computational resources more efficiently, so that more features can be extracted for the same amount of computation, which in turn improves the results.
Core idea: the Inception module is the basic building block shown in the paper's figure, and the full Inception network is built by chaining many such modules together. The structure makes two main contributions: first, 1x1 convolutions are used to increase or reduce dimensionality; second, convolutions at several kernel sizes are performed in parallel and their outputs are then aggregated.
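As a reading aid, here is a minimal PyTorch sketch of an Inception-v1-style module with 1x1 reduction branches. It is an illustration only; the channel numbers (loosely modeled on the paper's 3a stage) are assumptions, not a faithful reproduction of GoogLeNet.

```python
# Minimal sketch of an Inception-v1 style module (illustrative channel sizes).
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch, c1x1, c3x3_reduce, c3x3, c5x5_reduce, c5x5, pool_proj):
        super().__init__()
        # Branch 1: plain 1x1 convolution.
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1x1, 1), nn.ReLU(inplace=True))
        # Branch 2: 1x1 reduction before the 3x3 convolution.
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3x3_reduce, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3x3_reduce, c3x3, 3, padding=1), nn.ReLU(inplace=True))
        # Branch 3: 1x1 reduction before the 5x5 convolution.
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5x5_reduce, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5x5_reduce, c5x5, 5, padding=2), nn.ReLU(inplace=True))
        # Branch 4: 3x3 max pooling followed by a 1x1 projection.
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        # All branches preserve the spatial size, so outputs concatenate on channels.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# Example: 192 input channels -> 64 + 128 + 32 + 32 = 256 output channels.
m = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```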
The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing.
In the last three years, mainly due to the advances of deep learning, more concretely convolutional networks [10], the quality of image recognition and object detection has been progressing at a dramatic pace. One encouraging news is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures.
Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithms – especially their power and memory use – gains importance. It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor rather than having a sheer fixation on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up to be a purely academic curiosity, but could be put to real world use, even on large datasets, at a reasonable cost.
In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception. In our case, the word “deep” is used in two different meanings: first of all, in the sense that we introduce a new level of organization in the form of the “Inception module” and also in the more direct sense of increased network depth.
Comment: without giving up efficiency (the computational budget and computational resources stay fixed), the network is made deeper and the accuracy is improved.
- Motivation and High Level Considerations
- inception-v1 Architectural Details
- GoogLeNet
- Training Methodology
- ILSVRC 2014 Classification Challenge Setup and Results
Motivation and High Level Considerations - the original motivation; never forget where you started...
The most straightforward way of improving the performance of deep neural networks is by increasing their size. This includes both increasing the depth – the number of levels – of the network and its width: the number of units at each level. This is an easy and safe way of training higher quality models, especially given the availability of a large amount of labeled training data. However this simple solution comes with two major drawbacks.
Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially if the number of labeled examples in the training set is limited. This can become a major bottleneck, since the creation of high quality training sets can be tricky and expensive.
Another drawback of uniformly increased network size is the dramatically increased use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase of computation. Since in practice the computational budget is always finite, an efficient distribution of computing resources is preferred to an indiscriminate increase of size, even when the main objective is to increase the quality of results.
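A back-of-the-envelope check of the "quadratic increase" claim; this is a hedged sketch, and the feature-map and filter sizes are arbitrary assumptions:

```python
# Rough multiply-add count for two chained 3x3 conv layers on an HxW feature map
# (back-of-the-envelope illustration; padding/stride details are ignored).
def chained_conv_macs(h, w, c_in, c_mid, c_out, k=3):
    layer1 = h * w * c_in * c_mid * k * k
    layer2 = h * w * c_mid * c_out * k * k
    return layer1 + layer2

base = chained_conv_macs(28, 28, 128, 128, 128)
doubled = chained_conv_macs(28, 28, 256, 256, 256)  # uniformly 2x more filters
print(doubled / base)  # 4.0 -> a uniform 2x increase in filters costs ~4x the compute
```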
The fundamental way of solving both issues would be by ultimately moving from fully connected to sparsely connected architectures, even inside the convolutions.
Their main result (referring to Arora et al. [2]) states that if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the activations of the last layer and clustering neurons with highly correlated outputs.
On the downside, today's computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations is reduced by 100×, the overhead of lookups and cache misses is so dominant that switching to sparse matrices would not pay off. The gap is widened even further by the use of steadily improving, highly tuned, numerical libraries that allow for extremely fast dense matrix multiplication, exploiting the minute details of the underlying CPU or GPU hardware [16, 9]. Also, non-uniform sparse models require more sophisticated engineering and computing infrastructure. Most current vision oriented machine learning systems utilize sparsity in the spatial domain just by the virtue of employing convolutions.
However, convolutions are implemented as collections of dense connections to the patches in the earlier layer. ConvNets have traditionally used random and sparse connection tables in the feature dimensions since [11] in order to break the symmetry and improve learning; the trend changed back to full connections with [9] in order to better optimize parallel computing. The uniformity of the structure and a large number of filters and greater batch size allow for utilizing efficient dense computation.
Comment: exploit the natural sparsity of convolutional networks while approximating it with dense matrix computation.
This raises the question of whether there is any hope for a next, intermediate step: an architecture that makes use of filter-level sparsity, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices. The vast literature on sparse matrix computations (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give competitive performance for sparse matrix multiplication. It does not seem far-fetched to think that similar methods would be utilized for the automated construction of non-uniform deep-learning architectures in the near future.
The Inception architecture started out as a case study for assessing the hypothetical output of a sophisticated network topology construction algorithm that tries to approximate a sparse structure implied by [2] for vision networks and covering the hypothesized outcome by dense, readily available components.
Despite being a highly speculative undertaking, modest gains were observed early on when compared with reference networks based on [12 Network in network]. With a bit of tuning the gap widened and Inception proved to be especially useful in the context of localization and object detection as the base network for [6] and [5].
One must be cautious though: although the Inception architecture has become a success for computer vision, it is still questionable whether this can be attributed to the guiding principles that have led to its construction. Making sure of this would require a much more thorough analysis and verification.
At very least, the initial success of the Inception architecture yields firm motivation for exciting future work in this direction.
Why Inception was proposed
The simplest, most brute-force way to improve a network is to increase its depth and width, i.e. add more hidden layers and more neurons per layer. But this brute-force approach has some problems:
- it leads to a larger parameter space and therefore makes overfitting more likely;
- it requires more computational resources;
- the deeper the network, the more easily gradients vanish and the harder optimization becomes (BN had not been proposed yet, so optimizing such networks was extremely difficult).
Given this, the goal is to improve the utilization of the network's computational resources: increase the width and depth of the network while keeping the amount of computation unchanged.
The authors argue that the way to resolve this difficulty is to replace full connections with sparse connections (convolutional layers are themselves sparse connections). However, numerical computation on non-uniform sparse data is inefficient, because hardware is optimized for dense matrices. So the task becomes finding an optimal local sparse structure that a convolutional network can approximate and that can be implemented with existing dense-matrix hardware. The result is Inception.
inception-v1 Architectural Details
The main idea of the Inception architecture is based on finding out how an optimal local sparse structure in a convolutional vision network can be approximated and covered by readily available dense components. Note that assuming translation invariance means that our network will be built from convolutional building blocks. All we need is to find the optimal local construction and to repeat it spatially.
Truth comes from practice
In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid. For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary, simply reflecting some infrastructural inefficiencies in our current implementation. (The negation of the negation...)
One of the main beneficial aspects of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity. The ubiquitous use of dimension reduction allows for shielding the large number of input filters of the last stage to the next layer, first reducing their dimension before convolving over them with a large patch size.
Another practically useful aspect of this design is that it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously.
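To make the "dimension reduction before expensive convolutions" point concrete, here is a small arithmetic sketch; the channel counts are illustrative assumptions, not the actual GoogLeNet configuration:

```python
# Sketch: how a 1x1 "reduction" layer cuts the cost of a 5x5 convolution.
def conv_macs(h, w, c_in, c_out, k):
    return h * w * c_in * c_out * k * k

h = w = 28
direct = conv_macs(h, w, 192, 32, 5)                                  # 5x5 straight on 192 channels
reduced = conv_macs(h, w, 192, 16, 1) + conv_macs(h, w, 16, 32, 5)    # 1x1 down to 16, then 5x5
print(direct, reduced, direct / reduced)  # ~9.7x fewer multiply-adds with the 1x1 bottleneck
```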
GoogLeNet
Training Methodology
ILSVRC 2014 Classification Challenge Setup and Results
Excerpts from the Inception papers:
https://www.jianshu.com/p/cd1b65e8dad1
inception_v1
Since the feature map gets smaller anyway, why not just use 1x1 convolutions — why are 3x3 and 5x5 still needed? The point is to cover more scales, so that the network keeps working even if the input image is rescaled; the parallel kernel sizes effectively act as a pyramid of detectors.
从Inception v1到Inception-ResNet,一文概览Inception家族的奋斗史
VGG-Net generalizes very well and is commonly used for extracting image features, generating object-detection proposals, and so on. VGG's biggest problem is its parameter count: VGG-19 is essentially the convolutional architecture with the most parameters. This problem is exactly what GoogLeNet, the network that first introduced the Inception structure, focused on: unlike VGG-Net it does not rely on large fully connected layers, so its parameter count is very small.
GoogLeNet's defining feature is the Inception module. Its goal is to design a network with a good local topology, i.e. to run several convolution or pooling operations on the input in parallel and concatenate all of their outputs into a very deep feature map. Because 1x1, 3x3 and 5x5 convolutions and pooling operations extract different information from the input image, processing them in parallel and combining all the results yields a better image representation.
As mentioned above, deep neural networks consume a lot of computation. To lower the cost, the authors add an extra 1x1 convolution before the 3x3 and 5x5 convolutions to limit the number of input channels. Although adding an extra convolution may seem counter-intuitive, a 1x1 convolution is much cheaper than a 5x5 one, and reducing the number of input channels also lowers the cost. Note, however, that in the pooling branch the 1x1 convolution comes after the max-pooling layer, not before it.
inception-v2, v3: Rethinking the Inception Architecture for Computer Vision
Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we are exploring ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization.
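Since the abstract highlights factorized convolutions, the following PyTorch sketch shows the two factorizations discussed later in the paper: a 5x5 convolution replaced by two stacked 3x3 convolutions, and a 3x3 convolution replaced by asymmetric 1x3 and 3x1 convolutions. The channel counts and input size are made-up assumptions, not the actual Inception-v3 configuration.

```python
# Sketch of the convolution factorizations used by Inception-v2/v3 (illustrative sizes).
import torch
import torch.nn as nn

c_in, c_out = 64, 64

# (a) A 5x5 convolution replaced by two stacked 3x3 convolutions:
#     same receptive field, ~(9 + 9) / 25 = 72% of the multiply-adds.
five_by_five = nn.Conv2d(c_in, c_out, 5, padding=2)
two_three_by_three = nn.Sequential(
    nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

# (b) A 3x3 convolution factorized into asymmetric 1x3 and 3x1 convolutions:
#     same receptive field, ~(3 + 3) / 9 = 67% of the multiply-adds.
asymmetric = nn.Sequential(
    nn.Conv2d(c_in, c_out, (1, 3), padding=(0, 1)), nn.ReLU(inplace=True),
    nn.Conv2d(c_out, c_out, (3, 1), padding=(1, 0)), nn.ReLU(inplace=True))

x = torch.randn(1, c_in, 17, 17)
for m in (five_by_five, two_three_by_three, asymmetric):
    print(m(x).shape)  # spatial size is preserved in all three cases
```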
Introduction (the authors' writing here is quite charming~)
All sorts of digs at VGG...
- Although VGGNet [18] has the compelling feature of architectural simplicity, this comes at a high cost: evaluating the network requires a lot of computation.
- On the other hand, the Inception architecture of GoogLeNet [20] was also designed to perform well even under strict constraints on memory and computational budget.
- For example, GoogLeNet employed only 5 million parameters, which represented a 12× reduction with respect to its predecessor AlexNet, which used 60 million parameters. Furthermore, VGGNet employed about 3x more parameters than AlexNet.
Self-praise
- The computational cost of Inception is also much lower than VGGNet or its higher performing successors [6].
- This has made it feasible to utilize Inception networks in big-data scenarios [17], [13], where huge amounts of data need to be processed at reasonable cost, or in scenarios where memory or computational capacity is inherently limited, for example in mobile vision settings.
It is certainly possible to mitigate parts of these issues by applying specialized solutions to target memory use [2], [15] or by optimizing the execution of certain operations via computational tricks [10]. However, these methods add extra complexity. Furthermore, these methods could be applied to optimize the Inception architecture as well, widening the efficiency gap again. (Impressive...)
Self-criticism
- Still, the complexity of the Inception architecture makes it more difficult to make changes to the network. If the architecture is scaled up naively, large parts of the computational gains can be immediately lost.
- Also, [20] does not provide a clear description about the contributing factors that lead to the various design decisions of the GoogLeNet architecture. This makes it much harder to adapt it to new use-cases while maintaining its efficiency.
- For example, if it is deemed necessary to increase the capacity of some Inception-style model, the simple transformation of just doubling the number of all filter bank sizes will lead to a 4x increase in both computational cost and number of parameters. This might prove prohibitive or unreasonable in a lot of practical scenarios, especially if the associated gains are modest.
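A tiny sanity check of the quoted 4x figure, using a toy 3x3 convolution rather than an actual Inception filter bank (the layer sizes are assumptions):

```python
# Quick check of the "doubling all filter bank sizes -> ~4x parameters" remark.
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

base = nn.Conv2d(128, 128, 3)
doubled = nn.Conv2d(256, 256, 3)   # both input and output filters doubled
print(n_params(doubled) / n_params(base))  # ~4.0 (bias terms make it slightly less)
```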
In this paper, we start with describing a few general principles and optimization ideas that proved to be useful for scaling up convolution networks in efficient ways. Although our principles are not limited to Inception-type networks, they are easier to observe in that context as the generic structure of the Inception style building blocks is flexible enough to incorporate those constraints naturally. This is enabled by the generous use of dimensional reduction and parallel structures of the Inception modules which allows for mitigating the impact of structural changes on nearby components.
- generous use of dimensional reduction
- parallel structures of the Inception modules which allows for mitigating the impact of structural changes on nearby components.
Still, one needs to be cautious about doing so, as some guiding principles should be observed to maintain high quality of the models. (Continuing the rigorous, cautious style of the GoogLeNet paper.)
General design principles
Here we will describe a few design principles based on large-scale experimentation with various architectural choices with convolutional networks.
At this point, the utility of the principles below is speculative and additional future experimental evidence will be necessary to assess their accuracy and domain of validity. Still, grave deviations from these principles tended to result in deterioration in the quality of the networks and fixing situations where those deviations were detected resulted in improved architectures in general. (First understate, then affirm.)
- A bottleneck reduces dimensionality and computation, but it also throws information away, so reduce dimensions gently rather than compressing too aggressively;
- higher-dimensional feature representations are sparser and less entangled, and therefore easier to train;
- spatial aggregation, e.g. using a 1x1 convolution to lower the number of channels, loses very little information, probably because adjacent channels are strongly correlated;
- the depth and the width of the network should be increased together, in a coordinated way, for the best effect.
The above is only a rough guide; in practice you still have to experiment. Roughly speaking, if the network looks "nice", with an overall shape that is neither too wide nor too deep and no sudden, drastic dimension reduction, it is probably fine.
1. Avoid representational bottlenecks, especially early in the network.
One should avoid bottlenecks with extreme compression. In general the representation size should gently decrease from the inputs to the outputs before reaching the final representation used for the task at hand.
Theoretically, information content cannot be assessed merely by the dimensionality of the representation, as it discards important factors like correlation structure; the dimensionality merely provides a rough estimate of information content.
2. Higher dimensional representations are easier to process locally within a network. Increasing the activations per tile in a convolutional network allows for more disentangled features. The resulting networks will train faster.
3. Spatial aggregation can be done over lower dimensional embeddings without much or any loss in representational power.
For example, before performing a more spread out (e.g. 3 × 3) convolution, one can reduce the dimension of the input representation before the spatial aggregation without expecting serious adverse effects.
We hypothesize that the reason for this is that the strong correlation between adjacent units results in much less loss of information during dimension reduction, if the outputs are used in a spatial aggregation context.
Given that these signals should be easily compressible, the dimension reduction even promotes faster learning.
4. Balance the width and depth of the network: increasing both together, in a coordinated way, gives the best improvement in network quality.
Although these principles might make sense, it is not straightforward to use them to improve the quality of networks out of the box. The idea is to use them judiciously in ambiguous situations only.
Reader comments