2017.11
Paper: https://arxiv.org/abs/1711.10370
This post draws on several online write-ups; thanks again to their authors. Please contact me if anything here infringes your rights.
The paper proposes a partially supervised approach to instance segmentation. In short, it learns a function that transfers detection parameters into segmentation parameters, which makes it possible to train on datasets whose mask annotations are incomplete and, at test time, to predict instance segmentation masks even for categories that had no mask annotations during training.
Paper Translation
Abstract
Most methods for object instance segmentation require all training examples to be labeled with segmentation masks. This requirement makes it expensive to annotate new categories and has restricted instance segmentation models to ∼100 well-annotated classes. The goal of this paper is to propose a new partially supervised training paradigm, together with a novel weight transfer function, that enables training instance segmentation models on a large set of categories all of which have box annotations, but only a small fraction of which have mask annotations. These contributions allow us to train Mask R-CNN to detect and segment 3000 visual concepts using box annotations from the Visual Genome dataset and mask annotations from the 80 classes in the COCO dataset. We evaluate our approach in a controlled study on the COCO dataset. This work is a first step towards instance segmentation models that have broad comprehension of the visual world.
1. Introduction
Object detectors have become significantly more accurate (e.g., [10, 34]) and gained important new capabilities. One of the most exciting is the ability to predict a foreground segmentation mask for each detected object (e.g., [15]), a task called instance segmentation. In practice, typical instance segmentation systems are restricted to a narrow slice of the vast visual world that includes only around 100 object categories.
A principal reason for this limitation is that state-of-the-art instance segmentation algorithms require strong supervision, and such supervision may be limited and expensive to collect for new categories [23]. By comparison, bounding box annotations are more abundant and less expensive [4]. This fact raises a question: Is it possible to train high-quality instance segmentation models without complete instance segmentation annotations for all categories? With this motivation, our paper introduces a new partially supervised instance segmentation task and proposes a novel transfer learning method to address it.
Figure 1. We explore training instance segmentation models with partial supervision: during training, a subset of classes (green boxes) have instance mask annotations; the remaining classes (red boxes) have only bounding box annotations. This image shows the output of a model trained on 3000 classes from Visual Genome, using mask annotations from only the 80 classes in COCO.

We formulate the partially supervised instance segmentation task as follows: (1) given a set of categories of interest, a small subset has instance mask annotations, while the other categories have only bounding box annotations; (2) the instance segmentation algorithm should utilize this data to fit a model that can segment instances of all object categories in the set of interest. Since the training data is a mixture of strongly annotated examples (those with masks) and weakly annotated examples (those with only boxes), we refer to the task as partially supervised.
The main benefit of partially supervised vs. weakly-supervised training (cf. [18]) is that it allows us to build a large-scale instance segmentation model by exploiting both types of existing datasets: those with bounding box annotations over a large number of classes, such as Visual Genome [20], and those with instance mask annotations over a small number of classes, such as COCO [23]. As we will show, this enables us to scale state-of-the-art instance segmentation methods to thousands of categories, a capability that is critical for their deployment in real-world uses.
To address partially supervised instance segmentation, we propose a novel transfer learning approach built on Mask R-CNN [15]. Mask R-CNN is well-suited to our task because it decomposes the instance segmentation problem into the subtasks of bounding box object detection and mask prediction. These subtasks are handled by dedicated network ‘heads’ that are trained jointly. The intuition behind our approach is that once trained, the parameters of the bounding box head encode an embedding of each object category that enables the transfer of visual information for that category to the partially supervised mask head.
We materialize this intuition by designing a parameterized weight transfer function that is trained to predict a category’s instance segmentation parameters as a function of its bounding box detection parameters. The weight transfer function can be trained end-to-end in Mask R-CNN using classes with mask annotations as supervision. At inference time, the weight transfer function is used to predict the instance segmentation parameters for every category, thus enabling the model to segment all object categories, including those without mask annotations at training time.
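To make this mechanism concrete, the following is a minimal PyTorch-style sketch of the inference-time flow (an editor's illustration; the sizes K, D, E and all variable names are assumptions, not the authors' released code):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: K categories, D-dim detection weights per class,
# and E-dim mask weights per class in the mask head's final layer.
K, D, E = 3000, 1024, 256

# Per-class detection weights taken from the last layer of the trained
# bounding box head (one row per category).
w_det = torch.randn(K, D)

# T: the learned weight transfer function (a single linear layer here).
T = nn.Linear(D, E)

# Predict every category's mask weights, including categories that had
# no mask annotations during training.
w_seg = T(w_det)                      # (K, E)

# The predicted weights act as the final per-class mask classifier:
# score each pixel of an RoI's E-dim features against each class.
roi_feats = torch.randn(14 * 14, E)   # one RoI with M*M pixels (M = 14)
mask_logits = roi_feats @ w_seg.t()   # (M*M, K) per-pixel class scores
```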
We explore our approach in two settings. First, we use the COCO dataset [23] to simulate the partially supervised instance segmentation task as a means of establishing quantitative results on a dataset with high-quality annotations and evaluation metrics. Specifically, we split the full set of COCO categories into a subset with mask annotations and a complementary subset for which the system has access to only bounding box annotations. Because the COCO dataset involves only a small number (80) of semantically well-separated classes, quantitative evaluation is precise and reliable. Experimental results show that our method improves results over a strong baseline with up to a 40% relative increase in mask AP on categories without training masks.
In our second setting, we train a large-scale instance segmentation model on 3000 categories using the Visual Genome (VG) dataset [20]. VG contains bounding box annotations for a large number of object categories; however, quantitative evaluation is challenging as many categories are semantically overlapping (e.g., near synonyms) and the annotations are not exhaustive, making precision and recall difficult to measure. Moreover, VG is not annotated with instance masks. Instead, we use VG to provide qualitative output of a large-scale instance segmentation model. Output of our model is illustrated in Figures 1 and 5.
2. Related Work
Instance segmentation. Instance segmentation is a highly active research area [12, 13, 5, 32, 33, 6, 14, 21, 19, 2], with Mask R-CNN [15] representing the current state-of-the-art. These methods assume a fully supervised training scenario in which all categories of interest have instance mask annotations during training. Fully supervised training, however, makes it difficult to scale these systems to thousands of categories. The focus of our work is to relax this assumption and enable training models even when masks are available for only a small subset of categories. To do this, we develop a novel transfer learning approach built on Mask R-CNN.
Weight prediction and task transfer learning. Instead of directly learning model parameters, prior work has explored predicting them from other sources (e.g., [11]). In [8], image classifiers are predicted from the natural language description of a zero-shot category. In [38], a model regression network is used to construct the classifier weights from few-shot examples, and similarly in [27], a small neural network is used to predict the classifier weights of the composition of two concepts from the classifier weights of each individual concept. Here, we design a model that predicts the class-specific instance segmentation weights used in Mask R-CNN, instead of training them directly, which is not possible in our partially supervised training scenario.
Our approach is also a type of transfer learning [28] where knowledge gained from one task helps with another task. Most related to our work, LSDA [17] transforms whole-image classification parameters into object detection parameters through a domain adaptation procedure. LSDA can be seen as transferring knowledge learned on an image classification task to an object detection task, whereas we consider transferring knowledge learned from bounding box detection to instance segmentation.
Weakly supervised instance segmentation is addressed in [18] by training an instance segmentation model over the bottom-up GrabCut [35] foreground segmentation results from the bounding boxes. Unlike [18], we aim to exploit all existing labeled data rather than artificially limiting it. Our work is also complementary in the sense that bottom-up segmentation methods may be used to infer training masks for our weakly-labeled examples. We leave this extension to future work.
[18] A. Khoreva, R. Benenson, J. Hosang, M. Hein, and B. Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR, 2017.
Visual embeddings. Object categories may be modeled by continuous ‘embedding’ vectors in a visual-semantic space, where nearby vectors are often close in appearance or semantic ontology. Class embedding vectors may be obtained via natural language processing techniques (e.g. word2vec [26] and GloVe [31]), from visual appearance information (e.g. [7]), or both (e.g. [37]). In our work, the parameters of Mask R-CNN’s box head contain class-specific appearance information and can be seen as embedding vectors learned by training for the bounding box object detection task. The class embedding vectors enable transfer learning in our model by sharing appearance information between visually related classes. We also compare with the NLP-based GloVe embeddings [31] in our experiments.
Figure 2. Detailed illustration of our MaskX R-CNN method. Instead of directly learning the mask prediction parameters w_seg, MaskX R-CNN predicts a category's segmentation parameters w_seg with a learned weight transfer function T applied to its detection parameters. For training, T only needs mask data for the classes in set A, but it can be applied at test time to all classes in A ∪ B. We also augment the mask head with a complementary fully connected multi-layer perceptron (MLP).

3. Learning to Segment Every Thing
Let C be the set of object categories (i.e., ‘things’ [1]) for which we would like to train an instance segmentation model. Most existing approaches assume that all training examples in C are annotated with instance masks. We relax this requirement and instead assume that C = A ∪ B, where examples from the categories in A have masks, while those in B have only bounding boxes. Since the examples of the B categories are weakly labeled w.r.t. the target task (instance segmentation), we refer to training on the combination of strong and weak labels as a partially supervised learning problem. Noting that one can easily convert instance masks to bounding boxes, we assume that bounding box annotations are also available for classes in A.
Given an instance segmentation model like Mask R-CNN that has a bounding box detection component and a mask prediction component, we propose the MaskX R-CNN method that transfers category-specific information from the model’s bounding box detectors to its instance mask predictors.
3.1. Mask Prediction Using Weight Transfer
Our method is built on Mask R-CNN [15], because it is a simple instance segmentation model that also achieves state-of-the-art results. In brief, Mask R-CNN can be seen as augmenting a Faster R-CNN [34] bounding box detection model with an additional mask branch that is a small fully convolutional network (FCN) [24]. At inference time, the mask branch is applied to each detected object in order to predict an instance-level foreground segmentation mask. During training, the mask branch is trained jointly and in parallel with the standard bounding box head found in Faster R-CNN.
In Mask R-CNN, the last layer in the bounding box branch and the last layer in the mask branch both contain category-specific parameters that are used to perform bounding box classification and instance mask prediction, respectively, for each category. Instead of learning the category-specific bounding box parameters and mask parameters independently, we propose to predict a category’s mask parameters from its bounding box parameters using a generic, category-agnostic weight transfer function that can be jointly trained as part of the whole model.
For a given category c, let w^c_det be the class-specific object detection weights in the last layer of the bounding box head, and w^c_seg be the class-specific mask weights in the mask branch. Instead of treating w^c_seg as model parameters, w^c_seg is parameterized using a generic weight prediction function T(·):

w^c_seg = T(w^c_det; θ),

where θ denotes class-agnostic, learned parameters. The same transfer function T(·) may be applied to any category c and, thus, θ should be set such that T generalizes to classes whose masks are not observed during training. We expect that generalization is possible because the class-specific detection weights w^c_det can be seen as an appearance-based visual embedding of the class.
T(·) can be implemented as a small fully connected neural network. Figure 2 illustrates how the weight transfer function fits into Mask R-CNN to form MaskX R-CNN. As a detail, note that the bounding box head contains two types of detection weights: the RoI classification weights w^c_cls and the bounding box regression weights w^c_box. We experiment with using either only a single type of detection weights (i.e. w^c_det = w^c_cls or w^c_det = w^c_box) or using the concatenation of the two types of weights (i.e. w^c_det = [w^c_cls; w^c_box]).
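A sketch of one possible implementation of T(·), assuming the concatenated input w^c_det = [w^c_cls; w^c_box] and a two-layer MLP (the layer sizes, nonlinearity, and weight dimensions below are illustrative guesses, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class WeightTransfer(nn.Module):
    """Predict per-class mask weights from per-class detection weights."""

    def __init__(self, cls_dim, box_dim, seg_dim, hidden_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(cls_dim + box_dim, hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, seg_dim),
        )

    def forward(self, w_cls, w_box):
        # Concatenate the two types of detection weights per class.
        w_det = torch.cat([w_cls, w_box], dim=1)  # (K, cls_dim + box_dim)
        return self.mlp(w_det)                    # (K, seg_dim)

# Example with hypothetical dimensions: 80 classes, 1024-dim classifier
# weights, and 4*1024-dim box regression weights per class.
transfer = WeightTransfer(cls_dim=1024, box_dim=4096, seg_dim=256)
w_seg = transfer(torch.randn(80, 1024), torch.randn(80, 4096))
```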
3.2. Training
During training, we assume that for the two sets of classes A and B, instance mask annotations are available only for classes in A but not for classes in B, while all classes in A and B have bounding box annotations available. As shown in Figure 2, we train the bounding box head using the standard box detection losses on all classes in A ∪ B, but only train the mask head and the weight transfer function T(·) using a mask loss on the classes in A. Given these losses, we explore two different training procedures: stage-wise training and end-to-end training.
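Before detailing the two procedures, here is a minimal sketch of restricting the mask loss to classes in A while the box losses cover A ∪ B (the helper name and shapes are hypothetical):

```python
import torch
import torch.nn.functional as F

def mask_loss_on_A(mask_logits, mask_targets, roi_classes, classes_A):
    """mask_logits, mask_targets: (N, M, M) per-RoI predictions and labels;
    roi_classes: (N,) category id of each RoI;
    classes_A: the set of category ids that have mask annotations."""
    keep = torch.tensor([int(c) in classes_A for c in roi_classes])
    if keep.sum() == 0:
        # No strongly annotated RoIs in this batch: contribute zero loss.
        return mask_logits.sum() * 0.0
    return F.binary_cross_entropy_with_logits(
        mask_logits[keep], mask_targets[keep])
```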
Stage-wise training. As Mask R-CNN can be seen as augmenting Faster R-CNN with a mask head, a possible training strategy is to separate the training procedure into detection training (first stage) and segmentation training (second stage). In the first stage, we train a Faster R-CNN using only the bounding box annotations of the classes in A ∪ B, and then in the second stage the additional mask head is trained while keeping the convolutional features and the bounding box head fixed. In this way, the class-specific detection weights w^c_det of each class c can be treated as fixed class embedding vectors that do not need to be updated when training the second stage. This approach has the practical benefit of allowing us to train the box detection model once and then rapidly evaluate design choices for the weight transfer function. It also has disadvantages, which we discuss next.
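A hedged sketch of the second stage, where the frozen detection weights serve as fixed class embeddings (all module shapes below are stand-ins, not the paper's architecture):

```python
import torch
import torch.nn as nn

# Stand-ins for the pieces involved (shapes are illustrative).
box_head = nn.Linear(1024, 80)       # stage-1 per-class detection layer
transfer = nn.Linear(1024, 256)      # weight transfer function T
mask_head = nn.Conv2d(256, 256, 3, padding=1)  # shared mask-branch FCN

# Stage 2: freeze everything learned in stage 1 ...
for p in box_head.parameters():
    p.requires_grad = False

# ... and treat its class weights as fixed embedding vectors.
w_det = box_head.weight.detach()     # (80, 1024), never updated here
w_seg = transfer(w_det)              # (80, 256) predicted mask weights

# Only the mask head and T are optimized in the second stage.
optimizer = torch.optim.SGD(
    list(transfer.parameters()) + list(mask_head.parameters()), lr=0.01)
```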
End-to-end joint training. It was shown that for Mask R-CNN, multi-task training can lead to better performance than training on each task separately. The aforementioned stage-wise training mechanism separates detection training and segmentation training, and may result in inferior performance. Therefore, we would also like to jointly train the bounding box head and the mask head in an end-to-end manner. In principle, one can directly train with backpropagation using the box losses on classes in A ∪ B and the mask loss on classes in A. However, this may cause a discrepancy in the class-specific detection weights w^c_det between set A and B, since only w^c_det for c ∈ A will receive gradients from the mask loss through the weight transfer function T(·). We would like w^c_det to be homogeneous between A and B so that the predicted w^c_seg = T(w^c_det; θ) trained on A can better generalize to B.
To address this discrepancy, we take a simple approach: when back-propagating the mask loss, we stop the gradient with respect to w^c_det, that is, we only compute the gradient of the predicted mask weights T(w^c_det; θ) with respect to the transfer function parameters θ, but not the bounding box weights w^c_det. This can be implemented as w^c_seg = T(stop_grad(w^c_det); θ) in most neural network toolkits.
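In PyTorch, for instance, the stop-gradient amounts to a detach() on the detection weights before they enter T; a minimal self-contained illustration (shapes are illustrative):

```python
import torch
import torch.nn as nn

transfer = nn.Linear(1024, 256)      # the transfer function T
w_det = torch.randn(80, 1024, requires_grad=True)   # detection weights

# Only T's parameters receive gradients from the mask loss; w_det does not.
w_seg = transfer(w_det.detach())
loss = w_seg.pow(2).mean()           # placeholder for the actual mask loss
loss.backward()
assert w_det.grad is None            # the gradient was stopped as intended
```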
3.3. Baseline: Class-Agnostic Mask Prediction
DeepMask [32] established that it is possible to train a deep model to perform class-agnostic mask prediction, where an object mask is predicted regardless of the category. A similar result was also shown for Mask R-CNN, with only a small loss in mask quality [15]. In additional experiments, [32] demonstrated that if the class-agnostic model is trained to predict masks on a subset of the COCO categories (specifically the 20 from PASCAL VOC [9]), it can generalize to the other 60 COCO categories at inference time. Based on these results, we use Mask R-CNN with a class-agnostic FCN mask prediction head as a baseline. Evidence from [32] and [15] suggests that this is a strong baseline. Next, we introduce an extension that can improve both the baseline and our proposed weight transfer function. We also compare with a few other baselines for unsupervised or weakly supervised instance segmentation in §4.3.
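For contrast, a minimal sketch of class-specific vs. class-agnostic final mask layers (sizes illustrative): the class-agnostic baseline collapses the K per-class outputs into a single shared foreground channel.

```python
import torch
import torch.nn as nn

roi_feats = torch.randn(4, 256, 14, 14)        # 4 RoIs, M = 14

# Class-specific head: one M×M mask per class (here K = 80).
class_specific = nn.Conv2d(256, 80, kernel_size=1)
per_class_logits = class_specific(roi_feats)   # (4, 80, 14, 14)

# Class-agnostic head: a single foreground mask shared by all classes,
# used for every detected box regardless of its predicted category.
class_agnostic = nn.Conv2d(256, 1, kernel_size=1)
shared_logits = class_agnostic(roi_feats)      # (4, 1, 14, 14)
```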
3.4. Extension: Fused FCN+MLP Mask Heads
Two types of mask heads are considered for Mask R-CNN in [15]: (1) an FCN head, where the M × M mask is predicted with a fully convolutional network, and (2) an MLP head, where the mask is predicted with a multi-layer perceptron consisting of fully connected layers, more similar to DeepMask. In Mask R-CNN, the FCN head yields higher mask AP. However, the two designs may be complementary. Intuitively, the MLP mask predictor may better capture the ‘gist’ of an object while the FCN mask predictor may better capture the details (such as the object boundary). Based on this observation, we propose to improve both the baseline class-agnostic FCN and our weight transfer function (which uses an FCN) by fusing them with predictions from a class-agnostic MLP mask predictor. Our experiments will show that this extension brings improvements to both the baseline and our transfer approach.
When fusing class-agnostic and class-specific mask predictions for K classes, the two scores are added into a final K × M × M output, where the class-agnostic mask scores (with shape 1 × M × M) are tiled K times and added to every class. Then, the K × M × M mask scores are turned into per-class mask probabilities through a sigmoid unit, and resized to the actual bounding box size as the final instance mask for that bounding box. During training, a binary cross-entropy loss is applied on the K × M × M mask probabilities.
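A sketch of the fusion just described, for K classes and M × M masks (the shapes and the BCE target layout are illustrative):

```python
import torch
import torch.nn.functional as F

N, K, M = 4, 80, 28                      # RoIs, classes, mask resolution
fcn_scores = torch.randn(N, K, M, M)     # class-specific FCN scores
mlp_scores = torch.randn(N, 1, M, M)     # class-agnostic MLP scores

# Tile the 1×M×M class-agnostic scores across all K classes and add.
fused = fcn_scores + mlp_scores.expand(-1, K, -1, -1)   # (N, K, M, M)

# Per-class mask probabilities via a sigmoid unit (used at inference).
probs = torch.sigmoid(fused)

# Training: binary cross-entropy over the K×M×M mask outputs
# (the with-logits form is the numerically stable equivalent).
targets = torch.randint(0, 2, (N, K, M, M)).float()
loss = F.binary_cross_entropy_with_logits(fused, targets)
```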