Improving Text-based Person Sear

Author: 风就吹吧 | Published: 2018-12-17 10:47


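The paper is motivated by two problems in person search with natural language.
1. GNA-RNN models the affinity between the text and the global image, so it is insensitive to the spatial distribution of a pedestrian's key attributes.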

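2. GNA-RNN is overly sensitive to the matching degree of individual word-image pairs. If an image fails to match a few key words in the text, it cannot be very similar to the image being sought; yet if it matches the other key words well, it can still end up with a fairly high affinity score.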

The paper proposes one solution for each problem:
1. Patch-word Matching Model
2. Adaptive Threshold Mechanism

Model

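The model takes an image-text pair as input and outputs the pair's affinity score. A higher score means the text and the image are more likely to describe the same person. At retrieval time, a text query is scored against every image in the gallery, and the images with the highest affinity scores are returned.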
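As a minimal sketch of the retrieval step, the snippet below ranks gallery images by their affinity to one text query. The function name `rank_gallery` and the score values are hypothetical and only illustrate the ranking; they do not come from the paper.

```python
def rank_gallery(scores, top_k=5):
    """Return the indices of the top_k gallery images, highest affinity first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]

# Hypothetical affinity scores between one text query and five gallery images.
gallery_scores = [0.12, 0.87, 0.45, 0.91, 0.30]
top = rank_gallery(gallery_scores, top_k=3)
print(top)
```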

Network architecture

Patch-word Matching Model

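To capture how words match local regions of the image, the paper builds a patch-word matching model. It obtains the image-text affinity score in three steps:
1. Compute the affinity score between each local image region (patch) and each word in the text.
2. For each word, take the affinity score of its best-matching patch as the affinity score between that word and the image.
3. Compute a weighted sum of the word-image affinity scores to obtain the text-image affinity score.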
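The three steps above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function name `text_image_affinity`, the toy feature values, and the word weights are all hypothetical, and the weights are passed in directly rather than predicted by a network.

```python
import numpy as np

def text_image_affinity(patch_feats, word_feats, word_weights):
    """patch_feats: (P, m), word_feats: (W, m), word_weights: (W,)."""
    # Step 1: inner product between every word and every patch -> (W, P).
    patch_word = word_feats @ patch_feats.T
    # Step 2: each word keeps the score of its best-matching patch -> (W,).
    word_image = patch_word.max(axis=1)
    # Step 3: weighted sum over words gives the text-image affinity.
    return float(word_weights @ word_image), word_image

# Toy example: 2 patches and 3 words with 2-dimensional features.
patches = np.array([[1.0, 0.0], [0.0, 1.0]])
words = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
weights = np.array([0.5, 0.5, 1.0])
score, word_image = text_image_affinity(patches, words, weights)
```

Taking the maximum over patches is what makes the score depend on where an attribute appears, since each word is matched to a single local region rather than to the whole image.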

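The model consists of four parts: an image encoder, a text encoder, a word attention sub-network, and a computing part that predicts the affinity score.
1. The image encoder is a VGG-16 that is pre-trained on the dataset (CUHK-PEDES). For each image, the output of the last pooling layer is a 7×7×512 tensor; its 49 spatial positions, each a 512-dimensional vector, are used as the image patches. Two fully connected layers with m neurons each follow, with m set to 512.
2. The text encoder is a word-embedding layer followed by an LSTM layer. Each word in the text is first mapped to an m-dimensional word embedding feature, and the LSTM then outputs the corresponding hidden state. This hidden state can be viewed as an enhanced word feature that also carries information from the preceding words. Two fully connected layers with m neurons each follow.
3. The computing part takes the inner product of the feature vectors of the j-th patch and the i-th word. The image-word affinity score of a word is the maximum of its affinity scores over all patches.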
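The patch extraction described in item 1 can be sketched as below. A random array stands in for the real VGG-16 pooling output, the fully connected weights are random, and the ReLU between the two layers is an assumption, since the activation is not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 512

# Stand-in for the last pooling layer output of VGG-16 (7 x 7 x 512).
feature_map = rng.standard_normal((7, 7, 512))

# Flatten the 7 x 7 grid into 49 patches, each a 512-dimensional vector.
patches = feature_map.reshape(-1, 512)

# Two fully connected layers with m neurons each (random weights for illustration).
W1, b1 = rng.standard_normal((512, m)) * 0.01, np.zeros(m)
W2, b2 = rng.standard_normal((m, m)) * 0.01, np.zeros(m)
hidden = np.maximum(patches @ W1 + b1, 0.0)  # ReLU is an assumption
patch_feats = hidden @ W2 + b2               # (49, m) patch features
```

Row `r * 7 + c` of `patches` corresponds to grid position `(r, c)`, so each patch feature stays tied to a fixed image location.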

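Advantage of this design:
Because the word features produced by the LSTM have memory, they can include information from the preceding words. For example, in "yellow shirt", the feature of "shirt" also carries "yellow", so a patch receives the highest affinity score only when both attributes appear in it together.
4. Attention sub-network: the hidden state output by the LSTM is fed to a fully connected layer with a single output neuron, and a sigmoid function maps the output into (0, 1). The parameters of this sub-network are trained jointly with the rest of the network.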
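A minimal sketch of the attention sub-network for a single word is shown below: one linear unit followed by a sigmoid. The function name `word_attention` and the weight values are hypothetical; in the model these parameters would be learned.

```python
import math

def word_attention(hidden_state, weights, bias):
    """Map one LSTM hidden state to an attention weight in (0, 1)."""
    # Single output neuron: dot product plus bias.
    z = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    # Sigmoid squashes the result into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 2-dimensional hidden state and parameters.
weight = word_attention([1.0, -2.0], [0.5, 0.25], 0.1)
```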

The final image-text affinity
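Based on the description above (inner products, a maximum over patches, and an attention-weighted sum), the final affinity can plausibly be written as follows. The symbols are assumptions introduced here: p_j is the feature of the j-th patch, h_i the feature of the i-th word, w_i the attention weight of the i-th word, and n the number of words. Whether the paper additionally normalizes the weights is not stated in these notes.

```latex
A(I, T) = \sum_{i=1}^{n} w_i \, \max_{1 \le j \le 49} \langle p_j, h_i \rangle
```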

Training Scheme

The ratio of positive to negative sample pairs is 1:3.
Loss function:

Optimizer: Adam
Learning rate: 0.0004
Batch size: 128
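To make the numbers concrete, a 1:3 ratio with a batch size of 128 would give 32 positive and 96 negative pairs per batch. The sketch below assumes the ratio is applied per batch, which these notes do not state; the function `sample_batch` and the placeholder pairs are hypothetical.

```python
import random

def sample_batch(positive_pairs, negative_pairs, batch_size=128, neg_per_pos=3, seed=0):
    """Draw a batch with one positive pair for every neg_per_pos negative pairs."""
    rng = random.Random(seed)
    n_pos = batch_size // (1 + neg_per_pos)  # 128 // 4 = 32
    n_neg = batch_size - n_pos               # 96
    batch = [(pair, 1) for pair in rng.sample(positive_pairs, n_pos)]
    batch += [(pair, 0) for pair in rng.sample(negative_pairs, n_neg)]
    rng.shuffle(batch)
    return batch

# Placeholder (image, text) identifiers standing in for real data.
positives = [("img%d" % i, "text%d" % i) for i in range(100)]
negatives = [("img%d" % i, "text%d" % (i + 1)) for i in range(400)]
batch = sample_batch(positives, negatives)
```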

Adaptive Threshold Mechanism

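Texts that contain the same word can have different affinities with an image; in other words, there is no uniform standard that constrains the scores. The paper therefore designs an adaptive threshold mechanism in which every word is given a threshold for deciding whether an image matches it. If the word-image score is below the threshold, it is left out of the affinity computation. If it is above the threshold, it is compressed so that the word-image affinity score stays close to the word's threshold. As a result, images that match the same word receive similar affinities with that word, which keeps the comparison fair to some extent.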
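The thresholding rule can be illustrated with the sketch below. The compression function is not given in these notes, so the linear squeeze toward the threshold is purely an assumed stand-in, as is treating an ignored score as a contribution of zero; `apply_threshold` and `squeeze` are hypothetical names.

```python
def apply_threshold(score, threshold, squeeze=0.1):
    """Ignore scores below the threshold; pull scores above it toward the threshold."""
    if score < threshold:
        # Assumption: an ignored score contributes nothing to the affinity.
        return 0.0
    # Assumption: linear compression of the excess over the threshold.
    return threshold + squeeze * (score - threshold)

# Two images that both match a word, with raw scores 0.9 and 1.5 (threshold 0.5).
a = apply_threshold(0.9, 0.5)
b = apply_threshold(1.5, 0.5)
```

With these assumed settings the raw gap of 0.6 between the two matching images shrinks to 0.06, which is the intended effect of keeping matching images close to the word's threshold.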
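The Adaptive Threshold Mechanism works in two steps:
1. Compute a threshold for each word, then compute the image-word affinity score based on that threshold.
First train the patch-word matching model; then, for each word, compute its threshold as shown in the figure below.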