Authors: Geoffrey E. Hinton, Sara Sabour, Nicholas Frosst
Translator: 恰恰
Abstract
A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher level capsule becomes active. We show that a discriminatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. To achieve these results we use an iterative routing-by-agreement mechanism: a lower-level capsule prefers to send its output to higher level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule.
1. Introduction
Human vision ignores irrelevant details by using a carefully determined sequence of fixation points to ensure that only a tiny fraction of the optic array is ever processed at the highest resolution. Introspection is a poor guide to understanding how much of our knowledge of a scene comes from the sequence of fixations and how much we glean from a single fixation, but in this paper we will assume that a single fixation gives us much more than just a single identified object and its properties. We assume that our multi-layer visual system creates a parse tree-like structure on each fixation, and we ignore the issue of how these single-fixation parse trees are coordinated over multiple fixations.

Parse trees are generally constructed on the fly by dynamically allocating memory. Following Hinton et al. [2000], however, we shall assume that, for a single fixation, a parse tree is carved out of a fixed multilayer neural network like a sculpture is carved from a rock. Each layer will be divided into many small groups of neurons called “capsules” (Hinton et al. [2011]) and each node in the parse tree will correspond to an active capsule. Using an iterative routing process, each active capsule will choose a capsule in the layer above to be its parent in the tree. For the higher levels of a visual system, this iterative process will be solving the problem of assigning parts to wholes.
The activities of the neurons within an active capsule represent the various properties of a particular entity that is present in the image. These properties can include many different types of instantiation parameter such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc. One very special property is the existence of the instantiated entity in the image. An obvious way to represent existence is by using a separate logistic unit whose output is the probability that the entity exists. In this paper we explore an interesting alternative, which is to use the overall length of the vector of instantiation parameters to represent the existence of the entity and to force the orientation of the vector to represent the properties of the entity. We ensure that the length of the vector output of a capsule cannot exceed 1 by applying a non-linearity that leaves the orientation of the vector unchanged but scales down its magnitude.
The fact that the output of a capsule is a vector makes it possible to use a powerful dynamic routing mechanism to ensure that the output of the capsule gets sent to an appropriate parent in the layer above. Initially, the output is routed to all possible parents but is scaled down by coupling coefficients that sum to 1. For each possible parent, the capsule computes a “prediction vector” by multiplying its own output by a weight matrix. If this prediction vector has a large scalar product with the output of a possible parent, there is top-down feedback which increases the coupling coefficient for that parent and decreases it for other parents. This increases the contribution that the capsule makes to that parent, thus further increasing the scalar product of the capsule’s prediction with the parent’s output. This type of “routing-by-agreement” should be far more effective than the very primitive form of routing implemented by max-pooling, which allows neurons in one layer to ignore all but the most active feature detector in a local pool in the layer below. We demonstrate that our dynamic routing mechanism is an effective way to implement the “explaining away” that is needed for segmenting highly overlapping objects.
Convolutional neural networks (CNNs) use translated replicas of learned feature detectors. This allows them to translate knowledge about good weight values acquired at one position in an image to other positions. This has proven extremely helpful in image interpretation. Even though we are replacing the scalar-output feature detectors of CNNs with vector-output capsules and max-pooling with routing-by-agreement, we would still like to replicate learned knowledge across space. To achieve this, we make all but the last layer of capsules be convolutional. As with CNNs, we make higher-level capsules cover larger regions of the image. Unlike max-pooling, however, we do not throw away information about the precise position of the entity within the region. For low level capsules, location information is “place-coded” by which capsule is active. As we ascend the hierarchy, more and more of the positional information is “rate-coded” in the real-valued components of the output vector of a capsule. This shift from place-coding to rate-coding, combined with the fact that higher-level capsules represent more complex entities with more degrees of freedom, suggests that the dimensionality of capsules should increase as we ascend the hierarchy.
2. How the vector inputs and outputs of a capsule are computed
There are many possible ways to implement the general idea of capsules. The aim of this paper is not to explore this whole space but simply to show that one fairly straightforward implementation works well and that dynamic routing helps.

We want the length of the output vector of a capsule to represent the probability that the entity represented by the capsule is present in the current input. We therefore use a non-linear “squashing” function to ensure that short vectors get shrunk to almost zero length and long vectors get shrunk to a length slightly below 1. We leave it to discriminative learning to make good use of this non-linearity:
vj = (||sj||^2 / (1 + ||sj||^2)) · (sj / ||sj||)    (1)

where vj is the vector output of capsule j and sj is its total input. For all but the first layer of capsules, the total input to a capsule sj is a weighted sum over all “prediction vectors” ûj|i from the capsules in the layer below, where ûj|i is produced by multiplying the output ui of a capsule in the layer below by a weight matrix Wij:

sj = Σi cij ûj|i,    ûj|i = Wij ui    (2)
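As a concrete illustration, here is a minimal NumPy sketch of the squashing function of Eq. 1 and the prediction vectors ûj|i of Eq. 2. The capsule counts and dimensions are illustrative assumptions (they happen to match the MNIST network of Section 4); the actual implementation is in TensorFlow and learns Wij by backpropagation.

```python
import numpy as np

def squash(s, eps=1e-8):
    # Eq. 1: shrinks short vectors toward length 0 and long vectors
    # toward length just below 1, without changing their orientation.
    sq_norm = np.sum(s ** 2, axis=-1, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

# Eq. 2: prediction vectors u_hat[j|i] = W_ij u_i.
# Illustrative sizes: 1152 lower capsules of dim 8, 10 upper capsules of dim 16.
u = np.random.randn(1152, 8)                  # outputs u_i of the layer below
W = np.random.randn(1152, 10, 16, 8) * 0.1    # one transformation matrix per (i, j)
u_hat = np.einsum('ijdk,ik->ijd', W, u)       # shape (1152, 10, 16)
```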
where the cij are coupling coefficients that are determined by the iterative dynamic routing process.
The coupling coefficients between capsule i and all the capsules in the layer above sum to 1 and are determined by a “routing softmax” whose initial logits bij are the log prior probabilities that capsule i should be coupled to capsule j:

cij = exp(bij) / Σk exp(bik)    (3)
The log priors can be learned discriminatively at the same time as all the other weights. They depend on the location and type of the two capsules but not on the current input image.
The initial coupling coefficients are then iteratively refined by measuring the agreement between the current output vj of each capsule, j, in the layer above and the prediction ûj|i made by capsule i. The agreement is simply the scalar product aij = vj · ûj|i. This agreement is treated as if it were a log likelihood and is added to the initial logit bij before computing the new values for all the coupling coefficients linking capsule i to higher level capsules.
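Putting Eqs. 1–3 and the agreement update together, here is a compact sketch of the routing loop, reusing squash and u_hat from the snippet above. Three iterations is the number used in the paper, and initializing all logits bij to zero matches Section 4.

```python
def routing(u_hat, num_iters=3):
    # u_hat: prediction vectors, shape (num_lower, num_upper, dim).
    b = np.zeros(u_hat.shape[:2])               # initial logits b_ij (all zero)
    for _ in range(num_iters):
        # Eq. 3: softmax over the parents j, so each row of c sums to 1.
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = np.einsum('ij,ijd->jd', c, u_hat)   # Eq. 2: s_j = sum_i c_ij u_hat[j|i]
        v = squash(s)                           # Eq. 1
        b = b + np.einsum('ijd,jd->ij', u_hat, v)  # add agreement a_ij = v_j . u_hat[j|i]
    return v, c

v, c = routing(u_hat)   # v: (10, 16) parent outputs, c: final couplings
```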
In convolutional capsule layers, each capsule outputs a local grid of vectors to each type of capsule in the layer above, using different transformation matrices for each member of the grid as well as for each type of capsule.
3. Margin loss for digit existence
We are using the length of the instantiation vector to represent the probability that a capsule’s entity exists. We would like the top-level capsule for digit class k to have a long instantiation vector if and only if that digit is present in the image. To allow for multiple digits, we use a separate margin loss, Lk, for each digit capsule, k:
Lk = Tk max(0, m+ − ||vk||)^2 + λ (1 − Tk) max(0, ||vk|| − m−)^2    (4)

where Tk = 1 iff a digit of class k is present, and m+ = 0.9 and m− = 0.1. The λ down-weighting of the loss for absent digit classes stops the initial learning from shrinking the lengths of the activity vectors of all the digit capsules. We use λ = 0.5. The total loss is simply the sum of the losses of all digit capsules.
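In code, Eq. 4 is only a few lines. This NumPy sketch assumes v_lengths = ||vk|| has already been computed from the DigitCaps outputs (e.g. with np.linalg.norm(v, axis=-1)) and that targets is the multi-hot vector Tk:

```python
def margin_loss(v_lengths, targets, m_pos=0.9, m_neg=0.1, lam=0.5):
    # Eq. 4, summed over the digit capsules k.
    # v_lengths, targets: arrays of shape (batch, num_classes).
    present = targets * np.maximum(0.0, m_pos - v_lengths) ** 2
    absent = lam * (1.0 - targets) * np.maximum(0.0, v_lengths - m_neg) ** 2
    return (present + absent).sum(axis=-1)
```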
4. CapsNet architecture
A simple CapsNet architecture is shown in Fig. 1.
The architecture is shallow, with only two convolutional layers and one fully connected layer. Conv1 has 256 9 × 9 convolution kernels with a stride of 1 and ReLU activation. This layer converts pixel intensities to the activities of local feature detectors that are then used as inputs to the primary capsules.

The primary capsules are the lowest level of multi-dimensional entities and, from an inverse graphics perspective, activating the primary capsules corresponds to inverting the rendering process. This is a very different type of computation than piecing instantiated parts together to make familiar wholes, which is what capsules are designed to be good at.

The second layer (PrimaryCapsules) is a convolutional capsule layer with 32 channels of convolutional 8D capsules (i.e. each primary capsule contains 8 convolutional units with a 9 × 9 kernel and a stride of 2). Each primary capsule output sees the outputs of all 256 × 81 Conv1 units whose receptive fields overlap with the location of the center of the capsule. In total, PrimaryCapsules has [32 × 6 × 6] capsule outputs (each output is an 8D vector), and each capsule in the [6 × 6] grid shares its weights with the others.
One can see PrimaryCapsules as a convolution layer with Eq. 1 as its block non-linearity. The final layer (DigitCaps) has one 16D capsule per digit class, and each of these capsules receives input from all the capsules in the layer below.
We have routing only between two consecutive capsule layers (e.g. PrimaryCapsules and DigitCaps). Since Conv1 output is 1D, there is no orientation in its space to agree on. Therefore, no routing is used between Conv1 and PrimaryCapsules.
All the routing logits (bij) are initialized to zero. Therefore, initially a capsule output (ui) is sent to all parent capsules (v0 … v9) with equal probability (cij). Our implementation is in TensorFlow (Abadi et al. [2016]) and we use the Adam optimizer (Kingma and Ba [2014]) with its TensorFlow default parameters, including the exponentially decaying learning rate, to minimize the sum of the margin losses in Eq. 4.
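To make the layer sizes above concrete, here is a shape walkthrough assuming the standard 28 × 28 MNIST input (Fig. 1 itself is not reproduced here). The 6 × 6 × 32 = 1152 primary capsules are exactly the lower capsules assumed in the routing sketch of Section 2:

```python
# Input image: 28 x 28 x 1 pixels.
# Conv1: 256 kernels, 9x9, stride 1 -> (28 - 9)//1 + 1 = 20, i.e. 20 x 20 x 256.
# PrimaryCapsules: 32 channels of 8D capsules, 9x9, stride 2 ->
#   (20 - 9)//2 + 1 = 6, i.e. a 6 x 6 grid per channel.
# DigitCaps: 10 capsules of dim 16, connected to all primary capsules via routing.
conv1_out = (28 - 9) // 1 + 1                  # 20
primary_out = (conv1_out - 9) // 2 + 1         # 6
num_primary = primary_out * primary_out * 32   # 1152
assert (conv1_out, primary_out, num_primary) == (20, 6, 1152)
```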
To be continued...