A Roundup of Papers on Visual Relationships
Visual Relationship Problem
The visual relationship recognition/detection task requires not only recognizing the objects in an image and localizing them (detection), but also identifying the relationships between those objects. In the example below, the input is an image and the output is the objects with their bounding boxes, plus the relationships between them, such as <person on motorcycle>.
A relationship is represented as a triplet, Rel: <object1 - predicate - object2>, i.e. two objects and the relation (predicate) between them. The relation can be spatial (above, next to, below), a verb (wear), a preposition (with), a comparative (taller than), and so on.
Visual relationship recognition is a building block of image understanding and can be applied to:
- Object detection: using the relationships between objects and the scene they are in to improve detection accuracy;
- Image captioning: captions often describe relationships between objects, e.g. "a person riding a motorcycle";
- VQA (Visual Question Answering): many questions involve relationships between objects, e.g. "what is on the table?";
- Image retrieval: retrieving related images from a natural-language description or from another image;
- Image generation: generating an image from natural language requires understanding the objects mentioned and the relationships between them in order to produce an image that matches the description;
and other related tasks.
Challenges:
- The number of relationship triplets to learn is huge: with N object categories and R predicate types there are N × R × N possible triplet types (which is why objects and predicates are usually learned separately, drastically reducing what has to be learned; see the quick calculation after this list);
- The visual appearance of the same relation varies widely, e.g. "ride" covers riding a horse as well as riding a bicycle;
- Some relationship triplets are rare, and the space of possible triplets is enormous, so many triplets never appear in the training set at all; the model therefore needs to transfer and generalize to unseen combinations (zero-shot).
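To make the first point concrete, here is a quick back-of-the-envelope calculation in Python (100 object categories and 70 predicates are roughly the VRD statistics; the numbers are only illustrative):

```python
# Why triplets are learned as separate object + predicate models rather than
# as monolithic <subject, predicate, object> classes.
# N = object categories, R = predicate types (roughly the VRD statistics).
N, R = 100, 70

triplet_classes = N * R * N      # one class per possible triplet
separate_outputs = N + R         # detector classes + predicate classes

print(f"monolithic triplet classes: {triplet_classes:,}")        # 700,000
print(f"separate object/predicate outputs: {separate_outputs}")  # 170
```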
This post collects papers related to visual relationships and gives a brief introduction to each. Paper list:
- Visual Relationship Detection with Language Priors, Cewu Lu et al., ECCV 2016
- Visual Translation Embedding Network for Visual Relation Detection, Hanwang Zhang et al., CVPR 2017
- Representation Learning for Scene Graph Completion via Jointly Structural and Visual Embedding, Hai Wan et al., IJCAI 2018
- Scene Graph Generation by Iterative Message Passing, Danfei Xu et al., CVPR 2017
- Tensorize, Factorize and Regularize: Robust Visual Relationship Learning, Seong Jae Hwang et al., CVPR 2018
- Structure Inference Net: Object Detection Using Scene-Level Context and Instance-Level Relationships, Yong Liu et al., CVPR 2018
- Referring Relationships, Ranjay Krishna et al., CVPR 2018
- Image Generation from Scene Graphs, Justin Johnson et al., CVPR 2018
- R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering, Pan Lu et al., arXiv
Visual Relationship Detection with Language Priors, Cewu Lu et al., ECCV 2016
Abstract. Visual relationships capture a wide variety of interactions between pairs of objects in images (e.g. “man riding bicycle” and “man pushing bicycle”). Consequently, the set of possible relationships is extremely large and it is difficult to obtain sufficient training examples for all possible relationships. Because of this limitation, previous work on visual relationship detection has concentrated on predicting only a handful of relationships. Though most relationships are infrequent, their objects (e.g. “man” and “bicycle”) and predicates (e.g. “riding” and “pushing”) independently occur more frequently. We propose a model that uses this insight to train visual models for objects and predicates individually and later combines them together to predict multiple relationships per image. We improve on prior work by leveraging language priors from semantic word embeddings to finetune the likelihood of a predicted relationship. Our model can scale to predict thousands of types of relationships from a few examples. Additionally, we localize the objects in the predicted relationships as bounding boxes in the image. We further demonstrate that understanding relationships can improve content based image retrieval.
The first paper is a classic: it introduces the VRD dataset and a relationship prediction model that combines a language prior.
Dataset:
The paper releases the VRD dataset: 5,000 images, over six thousand relationship types, and on average about 24 predicates per object category. A comparison with the Scene Graph and Visual Phrases datasets is shown in the table below:
Visual Phrases has only 13 relationship types; Scene Graph has over twenty thousand relationship types but only about 2 predicates per object category on average. Besides these three, there is also the well-known large-scale Visual Genome dataset with 99,658 images and 19,237 relationship types, annotated with object categories, locations, attributes, inter-object relationships (scene graphs), captions, and QA pairs. Even at this scale many relationships remain unannotated, since the number of possible combinations is simply too large.
Method:
As the abstract explains, the number of <subject - predicate - object> triplets is huge and each individual triplet appears only a few times, so learning triplets directly is impractical. Visual Phrases does classify whole triplets directly, but only because its data contains just 13 of them. Although triplets are infrequent, the individual objects and predicates are frequent, so this paper predicts objects and relationship predicates separately, a strategy that later work has followed ever since. In addition, the paper uses a language prior based on word embeddings so that similar objects get similar relationship distributions (e.g. a person riding a horse is close to a person riding an elephant or another animal), which improves zero-shot learning. The overall framework is shown below:
The image first goes through an R-CNN object detector to obtain object proposals. For every object pair, a visual module and a language module produce relationship likelihoods V and f respectively; their product gives the final triplet likelihood, and the highest-scoring predicate is taken as the relationship for that pair (a minimal sketch of this scoring step follows the list below).
- Visual module: learns the visual appearance of relationships. A CNN extracts features from the union bounding box of the object pair, and these are combined with the class likelihoods of the two objects to give the visual likelihood of the triplet;
- Language module: its purpose is to relate triplets semantically, which gives the model the ability to transfer and generalize (zero-shot), e.g. from person-ride-horse to person-ride-elephant. Concretely, the word embeddings of the two objects are projected to a k-dimensional vector, where k is the number of predicates and each entry is the language-based likelihood of that predicate for this object pair. This projection is trained with two objectives:
(1) similar relationship triplets should be close and dissimilar ones should keep some distance, measured with word-embedding distances (objective K);
(2) frequently occurring triplets should get high likelihoods and rare ones low likelihoods (objective L);
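Below is a minimal sketch, not the authors' code, of how this final scoring can be expressed: the visual branch gives a likelihood per predicate, the language branch projects the pair of word embeddings to a per-predicate prior, and predicates are ranked by the product V * f. All shapes and names (W_proj, b_proj, etc.) are assumptions made for illustration.

```python
import numpy as np

def rank_predicates(visual_scores, w_subj, w_obj, W_proj, b_proj):
    """Rank predicates for one object pair, in the spirit of Lu et al. 2016.

    visual_scores : (K,) predicate likelihoods V from the visual branch
    w_subj, w_obj : (d,) word embeddings of the subject / object categories
    W_proj, b_proj: projection mapping the pair embedding to K predicate scores
                    (this plays the role of the language module f; shapes are illustrative)
    """
    pair_vec = np.concatenate([w_subj, w_obj])       # (2d,)
    language_prior = W_proj @ pair_vec + b_proj      # (K,) language likelihood f
    scores = visual_scores * language_prior          # V * f, as in the paper
    return np.argsort(-scores)                       # predicate indices ranked by V * f

# toy usage with random numbers standing in for real model outputs
K, d = 70, 300
rng = np.random.default_rng(0)
order = rank_predicates(rng.random(K), rng.random(d), rng.random(d),
                        rng.random((K, 2 * d)), rng.random(K))
print(order[:5])  # top-5 predicate indices for this pair
```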
Experiments:
The evaluation metric is Recall@50/100, i.e. the fraction of ground-truth relationships recovered among the top 50/100 predictions ranked by likelihood (a small sketch of the metric follows the task list). There are three tasks:
(1) Phrase detection: output the relationship triplet together with a single joint bounding box;
(2) Relationship detection: output the relationship triplet with a separate bounding box for each object;
(3) Predicate detection: evaluate predicate prediction in isolation; the ground-truth objects are given and only their relationship has to be predicted;
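A minimal sketch of Recall@K, with detection matching simplified to tuple membership (the real metric also requires the predicted boxes to overlap the ground truth with IoU >= 0.5 and averages over images):

```python
def recall_at_k(predicted_triplets, gt_triplets, k=50):
    """Fraction of ground-truth triplets found among the top-k predictions.

    predicted_triplets: list of (subject, predicate, object) tuples, already
                        sorted by confidence (box matching omitted for brevity).
    gt_triplets       : set of ground-truth (subject, predicate, object) tuples.
    """
    top_k = set(predicted_triplets[:k])
    hits = sum(1 for gt in gt_triplets if gt in top_k)
    return hits / max(len(gt_triplets), 1)

preds = [("person", "ride", "motorcycle"), ("person", "wear", "helmet"),
         ("motorcycle", "has", "wheel")]
gt = {("person", "ride", "motorcycle"), ("person", "on", "street")}
print(recall_at_k(preds, gt, k=50))  # 0.5
```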
The results are as follows:
They show that adding the language prior improves accuracy substantially, especially objective K, which pulls similar triplets together; linguistic priors clearly help relationship detection a great deal.
Examples:
Thoughts: the word-embedding language prior helps prediction a lot, but prior knowledge may also bias predictions toward frequent relationships and drown out the visual evidence. One workaround is to pre-train the visual module first. Still, I doubt that simply multiplying in the prior is the right way to fuse it (the prior can mislead the model), and finding a better fusion is a point worth thinking about.
Visual Translation Embedding Network for Visual Relation Detection, Hanwang Zhang et al., CVPR 2017
Abstract. Visual relations, such as “person ride bike” and “bike next to car”, offer a comprehensive scene understanding of an image, and have already shown their great utility in connecting computer vision and natural language. However, due to the challenging combinatorial complexity of modeling subject-predicate-object relation triplets, very little work has been done to localize and predict visual relations. Inspired by the recent advances in relational representation learning of knowledge bases and convolutional object detection networks, we propose a Visual Translation Embedding network (VTransE) for visual relation detection. VTransE places objects in a low-dimensional relation space where a relation can be modeled as a simple vector translation, i.e., subject + predicate ≈ object. We propose a novel feature extraction layer that enables object-relation knowledge transfer in a fully-convolutional fashion that supports training and inference in a single forward/backward pass. To the best of our knowledge, VTransE is the first end-to-end relation detection network. We demonstrate the effectiveness of VTransE over other state-of-the-art methods on two large-scale datasets: Visual Relationship and Visual Genome. Note that even though VTransE is a purely visual model, it is still competitive to the Lu’s multi-modal model with language priors.
**Motivation:** This paper is inspired by knowledge graphs, where a translation vector is used to represent the relation between entities (see the Trans* family of knowledge representations). For visual relationships, the objects' visual features are mapped into a low-dimensional relation space, and the relation between two objects is represented by a translation vector between them, e.g. person + ride ≈ bike. As illustrated below:
The proposed VTransE is an end-to-end model consisting of an object detection module and a relation prediction module:
The relation prediction module concatenates each object's class likelihoods, location information, and visual features into a feature vector, which is then projected into the relation space (a simplified sketch follows).
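Here is a small sketch of the translation-embedding idea as I understand it (my own simplification, not the released VTransE code; feat_dim, rel_dim and the layer names are assumptions): each object's concatenated feature is projected into a low-dimensional relation space and the predicate is classified from the difference vector, which is the practical form of subject + predicate ≈ object.

```python
import torch
import torch.nn as nn

class VTransEHead(nn.Module):
    """Simplified relation head in the spirit of VTransE (shapes are assumptions)."""

    def __init__(self, feat_dim, rel_dim, num_predicates):
        super().__init__()
        self.proj_s = nn.Linear(feat_dim, rel_dim)            # W_s: subject -> relation space
        self.proj_o = nn.Linear(feat_dim, rel_dim)            # W_o: object  -> relation space
        self.predicate = nn.Linear(rel_dim, num_predicates)   # rows act like translation vectors t_p

    def forward(self, x_s, x_o):
        # x_s, x_o: (B, feat_dim) concatenation of class scores, box geometry, visual feature
        translation = self.proj_o(x_o) - self.proj_s(x_s)     # s + p ~= o  implies  p ~= o - s
        return self.predicate(translation)                    # (B, num_predicates) predicate logits

head = VTransEHead(feat_dim=4200, rel_dim=500, num_predicates=70)
logits = head(torch.randn(2, 4200), torch.randn(2, 4200))
print(logits.shape)  # torch.Size([2, 70])
```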
In the experiments, looking only at predicate prediction on the VRD dataset, there is actually no improvement over Lu et al.'s paper above (44 < 47). The paper does not point this out; I noticed it by comparing the numbers reported in the two papers. On the other two tasks VTransE does better than Lu et al., which I suspect is largely due to using Faster R-CNN.
Besides these three tasks, the paper also reports image retrieval, zero-shot relationship detection (worse than Lu et al.), and a feature-importance analysis. The experiments also show that relationship detection improves object detection accuracy, though only marginally.
See the original paper for more details.
Thoughts: the paper uses TransE to model how objects and predicates relate in the relation space. How to map into such a space so as to better capture the connections between objects, or even between predicates, is worth investigating (for example by also incorporating language priors, since I suspect the purely visual model would not actually beat one that uses them).
Representation Learning for Scene Graph Completion via Jointly Structural and Visual Embedding, Hai Wan et al., IJCAI 2018
Abstract. This paper focuses on scene graph completion which aims at predicting new relations between two entities utilizing existing scene graphs and images. By comparing with the well-known knowledge graph, we first identify that each scene graph is associated with an image and each entity of a visual triple in a scene graph is composed of its entity type with attributes and grounded with a bounding box in its corresponding image. We then propose an end-to-end model named Representation Learning via Jointly Structural and Visual Embedding (RLSV) to take advantages of structural and visual information in scene graphs. In RLSV model, we provide a fully-convolutional module to extract the visual embeddings of a visual triple and apply hierarchical projection to combine the structural and visual embeddings of a visual triple. In experiments, we evaluate our model in two scene graph completion tasks: link prediction and visual triple classification, and further analyze by case studies. Experimental results demonstrate that our model outperforms all baselines in both tasks, which justifies the significance of combining structural and visual information for scene graph completion.
This paper is similar to the previous one: both map the subject and object of a <subject, rel, object> triple into an embedding space and represent the relation between them as a translation in that space (roughly subject + predicate ≈ object). The previous paper builds on the knowledge-graph embedding TransE (NIPS 2013, Translating Embeddings for Modeling Multi-relational Data), while this one builds on TransD (ACL 2015, Knowledge Graph Embedding via Dynamic Mapping Matrix). How to represent objects and relationships well in an embedding space is a research direction in its own right.
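For reference, the scoring functions of the two knowledge-graph models being borrowed look roughly like this (my own summary of the cited papers; h, r, t are the head-entity, relation, and tail-entity embeddings, and the p-subscripted vectors are TransD's projection vectors):

```latex
% TransE: the relation is a plain translation in one shared space
f_{\mathrm{TransE}}(h, r, t) = \lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert

% TransD: entities are first projected by entity- and relation-specific matrices
\mathbf{M}_{rh} = \mathbf{r}_p \mathbf{h}_p^{\top} + \mathbf{I}, \qquad
\mathbf{M}_{rt} = \mathbf{r}_p \mathbf{t}_p^{\top} + \mathbf{I}, \qquad
f_{\mathrm{TransD}}(h, r, t) = \lVert \mathbf{M}_{rh}\mathbf{h} + \mathbf{r} - \mathbf{M}_{rt}\mathbf{t} \rVert
```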
The overall framework of the paper is shown below:
After the image passes through the object detector, each entity pair h, t goes through convolutional layers to obtain visual embeddings, which are then combined with the structural embeddings via the hierarchical projection described in the abstract.
Visualization of the entity and relation embedding spaces:
Thoughts: this is another paper about projecting objects and relations into a different space, although the task is slightly different, and the results are somewhat better than the previous paper's. As above, the embedding itself is a direction worth studying.
Scene Graph Generation by Iterative Message Passing, Danfei Xu et al., CVPR 2017
Abstract. Understanding a visual scene goes beyond recognizing individual objects in isolation. Relationships between objects also constitute rich semantic information about the scene. In this work, we explicitly model the objects and their relationships using scene graphs, a visually-grounded graphical structure of an image. We propose a novel end-to-end model that generates such structured scene representation from an input image. The model solves the scene graph inference problem using standard RNNs and learns to iteratively improves its predictions via message passing. Our joint inference model can take advantage of contextual cues to make better predictions on objects and their relationships. The experiments show that our model significantly outperforms previous methods for generating scene graphs using Visual Genome dataset and inferring support relations with NYU Depth v2 dataset.
This paper models the objects in an image and their relationships with a scene graph; the task is to generate that scene graph:
Most relationship prediction methods predict the relation of each object pair independently (locally), ignoring the other objects and relations in the scene, i.e. the surrounding context that could let the predictions help one another. Based on this observation, the paper proposes an RNN-based method that passes messages between nodes to iteratively improve relationship prediction.
Framework:
The image first goes through the RPN of Faster R-CNN to obtain object proposals and their features.
There are two kinds of GRU: edge GRUs, which correspond to relationships and take the features of the union bounding box as input, and node GRUs, which correspond to objects and take the object features as input.
The GRU state represents the latent features of an object/relationship, and the final states are used to predict the object classes, bounding boxes, and relationships.
Every node/proposal is represented by a node GRU (with shared weights), and likewise every relationship between nodes by an edge GRU. Because inference on the fully connected graph is very expensive, the paper uses the mean-field approximation of a CRF, approximating inference and message passing with subgraphs.
Message passing: a node GRU receives messages from its inbound and outbound edges, while an edge GRU receives messages from the two objects it connects. Messages are aggregated with a message-pooling scheme. Message passing is iterated N times, and the final GRU states are used to predict objects, bounding boxes, and relationships.
This message passing is how scene context gets propagated, and it is the core of the paper (a toy sketch is given below).
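A toy sketch of the node/edge GRU updates with message pooling (heavily simplified from the paper: plain sum pooling instead of learned pooling weights, no dense-graph handling, and all dimensions are illustrative):

```python
import torch
import torch.nn as nn

class IterativeMessagePassing(nn.Module):
    """Toy version of the node/edge GRU updates (pooling and dimensions are simplified)."""

    def __init__(self, dim):
        super().__init__()
        self.node_gru = nn.GRUCell(dim, dim)   # shared across all object proposals
        self.edge_gru = nn.GRUCell(dim, dim)   # shared across all relationship proposals

    def forward(self, node_feats, edge_feats, edges, n_steps=2):
        # node_feats: (N, dim) proposal features; edge_feats: (E, dim) union-box features
        # edges: list of (subject_idx, object_idx), one entry per edge
        h_node, h_edge = node_feats, edge_feats
        for _ in range(n_steps):
            # message to each node: pool the hidden states of its incident edges
            node_msg = torch.zeros_like(h_node)
            for e, (s, o) in enumerate(edges):
                node_msg[s] = node_msg[s] + h_edge[e]
                node_msg[o] = node_msg[o] + h_edge[e]
            # message to each edge: pool the hidden states of its two endpoints
            edge_msg = torch.stack([h_node[s] + h_node[o] for s, o in edges])
            h_node = self.node_gru(node_msg, h_node)
            h_edge = self.edge_gru(edge_msg, h_edge)
        return h_node, h_edge   # final states feed the class / box / predicate heads

mp = IterativeMessagePassing(dim=128)
h_n, h_e = mp(torch.randn(3, 128), torch.randn(2, 128), edges=[(0, 1), (1, 2)])
print(h_n.shape, h_e.shape)  # torch.Size([3, 128]) torch.Size([2, 128])
```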
In the experiments, the method is compared against Lu et al. (the language-prior approach from the first paper above) and shows an improvement.
The highlight of this paper is using contextual information and message passing with iterative updates to predict relationships better. It is a new way of predicting relationships at the scene-graph level, and pieces such as the message-passing scheme can still be improved, or even combined with embedding approaches.
Tensorize, Factorize and Regularize: Robust Visual Relationship Learning, Seong Jae Hwang et al., CVPR 2018
Abstract. Visual relationships provide higher-level information of objects and their relations in an image – this enables a semantic understanding of the scene and helps downstream applications. Given a set of localized objects in some training data, visual relationship detection seeks to detect the most likely “relationship” between objects in a given image. While the specific objects may be well represented in training data, their relationships may still be infrequent. The empirical distribution obtained from seeing these relationships in a dataset does not model the underlying distribution well — a serious issue for most learning methods. In this work, we start from a simple multi-relational learning model, which in principle, offers a rich formalization for deriving a strong prior for learning visual relationships. While the inference problem for deriving the regularizer is challenging, our main technical contribution is to show how adapting recent results in numerical linear algebra lead to efficient algorithms for a factorization scheme that yields highly informative priors. The factorization provides sample size bounds for inference (under mild conditions) for the underlying ⟨object, predicate, object⟩ relationship learning task on its own and surprisingly outperforms (in some cases) existing methods even without utilizing visual features. Then, when integrated with an end-to-end architecture for visual relationship detection leveraging image data, we substantially improve the state-of-the-art.
The main contribution of this paper is a factorization scheme that yields highly informative priors, i.e. a prior distribution over relations: the distribution over predicates for a given pair of objects.
This distribution is obtained by tensor factorization, concretely (a small numpy sketch of these steps is given after them):
(1) Tensorize: build the relation tensor, where i and j index objects and k indexes relations; the tensor is the stack of one matrix per relation k, and each entry counts how many times objects i and j occur with relation k in the dataset. This tensor representation reflects the intrinsic associations between objects, the distribution of relations, and so on.
(2) Factorize: since the dataset cannot possibly contain every relationship combination, this tensor is sparse (roughly 1% non-zero). The paper learns latent representations of objects and relations by factorizing the tensor, and from these infers the unobserved relation distributions (the zero entries of the sparse tensor):
The whole approach builds on the previous paper, Scene Graph Generation by Iterative Message Passing, adding a Relational Learning Module (the tensor construction and factorization of steps (1) and (2)) that learns the latent prior distribution and uses it to regularize the Scene Graph Learning Module:
(3) Regularize: the prior obtained in (2) is then used to adjust, i.e. regularize, the relationship predictions of the scene graph module.
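A small numpy sketch of the tensorize/factorize/regularize pipeline under strong simplifications: a plain low-rank SVD reconstruction of the count tensor stands in for the paper's factorization scheme, and the toy sizes and triplet list are made up for illustration.

```python
import numpy as np

# (1) Tensorize: T[i, j, k] = how often <object_i, predicate_k, object_j> is annotated
N, K = 5, 4                                    # toy sizes: N object classes, K predicates
triplets = [(0, 2, 1), (0, 2, 1), (3, 1, 0)]   # (subject_idx, object_idx, predicate_idx)
T = np.zeros((N, N, K))
for i, j, k in triplets:
    T[i, j, k] += 1

# (2) Factorize: the real tensor is very sparse, so a rank-r reconstruction fills in
#     plausible but unobserved combinations, which can serve as a relational prior.
r = 2
T_unfolded = T.reshape(N, N * K)               # mode-1 unfolding
U, S, Vt = np.linalg.svd(T_unfolded, full_matrices=False)
T_prior = ((U[:, :r] * S[:r]) @ Vt[:r]).reshape(N, N, K).clip(min=0)

# (3) Regularize: normalise over predicates to get a prior distribution per object pair,
#     which can then be used to regularize the scene-graph module's predictions.
prior = T_prior / (T_prior.sum(axis=2, keepdims=True) + 1e-8)
print(prior[0, 2])   # prior over the K predicates for the pair (object_0, object_2)
```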
Structure Inference Net: Object Detection Using Scene-Level Context and Instance-Level Relationships, Yong Liu et al., CVPR 2018
This paper uses scene-level context and instance-level relationships to improve object detection. In (a) of the figure, a boat is misdetected as a car; with scene information we could infer that a car is very unlikely there. In (b), the mouse is not detected; information from the nearby computer and other objects might help (though I doubt people actually look at it this way).
The overall algorithm framework is as follows:
As you can see, instead of each proposal independently yielding a class and a box regression as in ordinary object detection, all proposals are now put into a graph and jointly refined by structure inference, implemented with GRUs and iterative message passing; yes, this is essentially the scheme from Scene Graph Generation by Iterative Message Passing, with some changes to the message passing. Note: the inter-object relation R_{i, j} here does not seem to use semantic predicates such as "ride", only simple spatial relations.
The experimental results show that SIN improves over Faster R-CNN, SSD, and other detectors:
Referring Relationships, Ranjay Krishna et al., CVPR 2018
Abstract. Images are not simply sets of objects: each image represents a web of interconnected relationships. These relationships between entities carry semantic meaning and help a viewer differentiate between instances of an entity. For example, in an image of a soccer match, there may be multiple persons present, but each participates in different relationships: one is kicking the ball, and the other is guarding the goal. In this paper, we formulate the task of utilizing these “referring relationships” to disambiguate between entities of the same category. We introduce an iterative model that localizes the two entities in the referring relationship, conditioned on one another. We formulate the cyclic condition between the entities in a relationship by modelling predicates that connect the entities as shifts in attention from one entity to another. We demonstrate that our model can not only outperform existing approaches on three datasets — CLEVR, VRD and Visual Genome — but also that it produces visually meaningful predicate shifts, as an instance of interpretable neural networks. Finally, we show that by modelling predicates as attention shifts, we can even localize entities in the absence of their category, allowing our model to find completely unseen categories.
The task in this paper is not relationship prediction but rather using a relationship to disambiguate between entities of the same category; in effect, given a relationship triplet, it localizes the entities. For example, in the figure below we need to determine which person in the image is the one kicking the ball, and where that person is.
The model first attends to the subject/object, then uses predicate-specific convolution kernels to shift the attention, with the subject and object estimates conditioning each other (a rough sketch of this attention shift follows).
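A rough sketch of the attention-shift idea as I read it (not the paper's exact module; the kernel size, the sigmoid, and all shapes are my assumptions): the subject's attention map is convolved with a predicate-specific kernel to predict where the object should be, and the same mechanism runs in the reverse direction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredicateShift(nn.Module):
    """One predicate-specific attention shift (illustrative shapes only)."""

    def __init__(self, num_predicates, kernel_size=7):
        super().__init__()
        # one spatial kernel per predicate; convolving an attention map with it
        # "shifts" attention from the subject towards the object (or back)
        self.kernels = nn.Parameter(torch.randn(num_predicates, 1, kernel_size, kernel_size))

    def forward(self, attention_map, predicate_idx):
        # attention_map: (B, 1, H, W) attention over the subject
        kernel = self.kernels[predicate_idx:predicate_idx + 1]                   # (1, 1, k, k)
        shifted = F.conv2d(attention_map, kernel, padding=kernel.shape[-1] // 2)
        return torch.sigmoid(shifted)                                            # attention over the object

shift = PredicateShift(num_predicates=70)
obj_attn = shift(torch.rand(1, 1, 14, 14), predicate_idx=3)
print(obj_attn.shape)  # torch.Size([1, 1, 14, 14])
```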
Image Generation from Scene Graphs, Justin Johnson et al., CVPR 2018
Abstract. To truly understand the visual world our models should be able not only to recognize images but also generate them. To this end, there has been exciting recent progress on generating images from natural language descriptions. These methods give stunning results on limited domains such as descriptions of birds or flowers, but struggle to faithfully reproduce complex sentences with many objects and relationships. To overcome this limitation we propose a method for generating images from scene graphs, enabling explicitly reasoning about objects and their relationships. Our model uses graph convolution to process input graphs, computes a scene layout by predicting bounding boxes and segmentation masks for objects, and converts the layout to an image with a cascaded refinement network. The network is trained adversarially against a pair of discriminators to ensure realistic outputs. We validate our approach on Visual Genome and COCO-Stuff, where qualitative results, ablations, and user studies demonstrate our method’s ability to generate complex images with multiple objects.
This is another piece of work from Fei-Fei Li's group (they have done a lot of relationship-related work: the language-prior paper, the iterative message passing paper, and so on). The task is generating an image from a sentence, using a scene graph to represent the objects in the sentence and the relations between them. It is a very interesting study, and probably the first attempt at image generation from scene graphs.
A sentence usually mentions multiple objects and describes the relations between them, which makes it fairly complex; as the figure above shows, generating directly from the sentence works poorly. But if we first parse the sentence into a scene graph and then generate from the graph, the generated image captures the relations between objects much better.
Roughly, the method predicts a layout from the scene graph (the objects' positions), then combines the layout with noise and generates the image with a generative network (a toy sketch of the graph-convolution step it applies to the scene graph is given below). I won't go into further details here; the final results are shown below.
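A toy sketch of the kind of graph convolution applied to the scene graph (a simplification of the idea, not the released sg2im implementation; the dimensions and the single-layer MLP are assumptions): each (subject, predicate, object) edge is processed by an MLP, and the resulting vectors are averaged back onto the objects.

```python
import torch
import torch.nn as nn

class SceneGraphConv(nn.Module):
    """One graph-convolution layer over (subject, predicate, object) edges
    (a simplified illustration, not the released sg2im code)."""

    def __init__(self, dim):
        super().__init__()
        # each edge is processed as one triple; the MLP outputs updated
        # subject / predicate / object vectors which are then pooled per object
        self.edge_mlp = nn.Sequential(nn.Linear(3 * dim, 3 * dim), nn.ReLU())

    def forward(self, obj_vecs, pred_vecs, edges):
        # obj_vecs: (O, dim), pred_vecs: (E, dim), edges: list of (subj_idx, obj_idx)
        new_obj = torch.zeros_like(obj_vecs)
        counts = torch.zeros(obj_vecs.size(0), 1)
        new_pred = []
        for e, (s, o) in enumerate(edges):
            triple = torch.cat([obj_vecs[s], pred_vecs[e], obj_vecs[o]])
            h_s, h_p, h_o = self.edge_mlp(triple).chunk(3)
            new_obj[s] = new_obj[s] + h_s
            new_obj[o] = new_obj[o] + h_o
            counts[s] += 1
            counts[o] += 1
            new_pred.append(h_p)
        return new_obj / counts.clamp(min=1), torch.stack(new_pred)

layer = SceneGraphConv(dim=64)
new_objs, new_preds = layer(torch.randn(3, 64), torch.randn(2, 64), edges=[(0, 1), (2, 1)])
print(new_objs.shape, new_preds.shape)  # torch.Size([3, 64]) torch.Size([2, 64])
```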
The objects end up roughly in the right places, but the quality of the generated images is not very high, so there is still a lot of room for improvement.
R-VQA: Learning Visual Relation Facts with Semantic Attention for Visual Question Answering, Pan Lu et al., arXiv
Abstract. Recently, Visual Question Answering (VQA) has emerged as one of the most significant tasks in multimodal learning as it requires understanding both visual and textual modalities. Existing methods mainly rely on extracting image and question features to learn their joint feature embedding via multimodal fusion or attention mechanism. Some recent studies utilize external VQA-independent models to detect candidate entities or attributes in images, which serve as semantic knowledge complementary to the VQA task. However, these candidate entities or attributes might be unrelated to the VQA task and have limited semantic capacities. To better utilize semantic knowledge in images, we propose a novel framework to learn visual relation facts for VQA. Specifically, we build up a Relation-VQA (R-VQA) dataset based on the Visual Genome dataset via a semantic similarity module, in which each data consists of an image, a corresponding question, a correct answer and a supporting relation fact. A well-defined relation detector is then adopted to predict visual question-related relation facts. We further propose a multi-step attention model composed of visual attention and semantic attention sequentially to extract related visual knowledge and semantic knowledge. We conduct comprehensive experiments on the two benchmark datasets, demonstrating that our model achieves state-of-the-art performance and verifying the benefit of considering visual relation facts.
This is an arXiv paper from July this year. It uses the relations between objects and the object attributes in the image for the QA task: relation mining takes the image and the question and retrieves a set of relevant facts (relations and object attributes), then attends to the facts that are needed and combines them with the visual features to produce the final answer.
Thoughts: extracting facts like this provides high-level semantic information for QA and matches the way people reason. Compared with the methods I surveyed earlier (in my VQA overview post 一文带你了解VQA), this can be seen as an additional source of knowledge: some earlier methods used only class and attribute information, or external knowledge in text form, while this paper adds relation detection and uses a network to extract high-level semantics for QA, which is more interpretable than plain data augmentation. One thing I think could be improved is that the paper does not use bottom-up attention.
Summary
By now you should have a rough picture of the visual relationship problem and the main approaches to it. Questions and ideas are very welcome; happy to discuss and learn together.