
Paper | Open-Vocabulary Object Detection Using Captions


    1 Basic

    • github.com/alirezazareian/ovr-cnn
    • the first paper to propose the task of "open-vocabulary object detection"

    2 Introduction

    OD: each category needs thousands of annotated bounding boxes;

    stage 1: use {image, caption} pairs to learn a visual semantic space;
    stage 2: use annotated boxes for several classes to train object detection;
    stage 3: inference, which can detect objects beyond the base classes;

    to summarize, we train a model that takes an image and detects any object within a given target vocabulary V_{T}.

    Task Definition:

    1. test on a target vocabulary V_{T};
    2. train on an image-caption dataset whose vocabulary is V_{C};
    3. train on an annotated object detection dataset whose vocabulary is V_{B} (the base classes);
    4. V_{T} is not known during training and can be any subset of the entire vocabulary V_{\Omega}.
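To make these vocabularies concrete, here is a minimal sketch of their relationships (the class names are purely hypothetical examples, not from the paper's datasets):

```python
# Hypothetical illustration of the OVD vocabularies (example class names only).
V_Omega = {"person", "dog", "cat", "umbrella", "skateboard", "accordion"}  # everything covered by the word embeddings
V_C = {"person", "dog", "cat", "umbrella", "skateboard"}                   # words appearing in the captions
V_B = {"person", "dog", "cat"}                                             # base classes with box annotations
V_T = {"umbrella", "skateboard"}                                           # target classes, unknown at training time

assert V_B <= V_Omega and V_C <= V_Omega
assert V_T <= V_Omega   # V_T can be any subset of V_Omega, possibly disjoint from V_B
```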

    **compare with ZSD and WSD:**

    • ZSD: no V_{C};
    • WSD: no V_{B}, and needs to know V_{T} before training;
    • OVD is a generalization of ZSD and WSD.

    outcome:

    • significantly outperforms the ZSD and WSD methods;

    3 Method

    OVD framework:

    • meaning of open: the words in the captions are not limited, but in practice it is not literally "open", since it is restricted to the pretrained word embeddings. (However, word embeddings are typically trained on very large text corpora such as Wikipedia that cover nearly every word.)
    3.1 Learning visual semantic space
    • resembles PixelBERT;
    • use ResNet-50 (RN50) as the visual encoder and BERT as the text encoder;
    • design a V2L (vision-to-language) module that maps visual patch vectors into the word-embedding space;
    • use the grounding (main) task to train the RN50 & V2L module.

    specifically,

    1. input image --> RN50 --> patch features
    2. each patch feature (vision) --> V2L --> patch feature in the language space e^{I}_{i}
    3. caption --> embedding e^{C}_{j} --> BERT --> word features f^{C}_{j}
    4. patch features (language) + word features --> multimodal transformer --> new features for patches and words m^{I}_{i}, m^{C}_{j}
    5. task: perform weakly supervised grounding using {e^{I}_{i}, e^{C}_{j}}: the paired {image, caption} is the positive while unpaired {image, caption} combinations are negatives, and the similarity between an image and a caption is computed as the average of the dot products between all e^{I}_{i} and e^{C}_{j}.

    the grounding objective results in a learned visual backbone and V2L layer that can map regions in the image to the words that best describe them.
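A minimal PyTorch sketch of this grounding objective, assuming the image-caption similarity is the plain average of region-word dot products as described above (tensor names, shapes, and the temperature are my own assumptions, not the authors' exact implementation):

```python
import torch
import torch.nn.functional as F

def grounding_loss(e_img, e_cap, tau=0.1):
    """
    e_img: (B, N, d) region/patch embeddings after the V2L layer (e^I_i)
    e_cap: (B, M, d) caption word embeddings (e^C_j)
    The paired (image, caption) at the same batch index is the positive;
    all other pairings within the batch serve as negatives.
    """
    B = e_img.shape[0]
    # local similarity between every region of every image and every word of every caption:
    # sim[b, c, i, j] = <e^I_i of image b, e^C_j of caption c>
    sim = torch.einsum("bid,cjd->bcij", e_img, e_cap)
    # global image-caption score = average over all region-word pairs
    scores = sim.mean(dim=(2, 3)) / tau            # (B, B)
    targets = torch.arange(B, device=scores.device)
    # symmetric contrastive loss: match each image to its caption and vice versa
    return 0.5 * (F.cross_entropy(scores, targets) +
                  F.cross_entropy(scores.t(), targets))
```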

    besides, to teach the model to 1) extract all objects that might be described in captions and 2) determine which word completes the caption best, the image-text matching (ITM) and masked language modeling (MLM) subtasks are further introduced.
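A rough sketch of what these two auxiliary heads could look like on top of the multimodal transformer outputs m^{I}_{i}, m^{C}_{j} (the pooling and layer names are assumptions; the real heads are in the linked repo):

```python
import torch
import torch.nn as nn

class AuxiliaryHeads(nn.Module):
    """Sketch of ITM and MLM heads over the multimodal transformer outputs."""
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.itm_head = nn.Linear(d_model, 2)           # matched vs. mismatched image-caption pair
        self.mlm_head = nn.Linear(d_model, vocab_size)  # predict the masked word id

    def forward(self, m_img, m_cap):
        # m_img: (B, N, d) multimodal patch features m^I_i
        # m_cap: (B, M, d) multimodal word features  m^C_j
        pooled = torch.cat([m_img, m_cap], dim=1).mean(dim=1)  # simple pooled representation (assumption)
        itm_logits = self.itm_head(pooled)   # (B, 2)
        mlm_logits = self.mlm_head(m_cap)    # (B, M, vocab_size); loss is taken only at masked positions
        return itm_logits, mlm_logits
```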

    3.2 Learning open-vocabulary detection
    • use Faster R-CNN:
    1. ResNet blocks 1-3 extract features;
    2. RPN --> predict objectness & bounding box coordinates;
    3. non-maximum suppression (NMS);
    4. region-of-interest pooling (ROI pooling) to get a feature map for each potential object, which in the fully supervised setting is typically fed to a classification head;

    However, in the zero-shot (open-vocabulary) setting there is no fixed classifier head: each pooled region feature is instead mapped through the pretrained V2L layer into the word-embedding space and compared with the embeddings of the base-class names (plus a background embedding) to produce classification scores.
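A minimal sketch of this classification step, assuming dot-product similarity against class-name embeddings plus a background embedding (function and variable names are hypothetical, not taken from the paper's code):

```python
import torch

def classify_regions(roi_feats, v2l, class_embeds, bg_embed):
    """
    roi_feats:    (R, d_v) pooled region features from ROI pooling
    v2l:          the pretrained vision-to-language layer, kept from stage 1
    class_embeds: (K, d_l) word embeddings of the base-class names (target classes at test time)
    bg_embed:     (d_l,)   embedding representing "background"
    Returns (R, K+1) classification logits: similarity to each class plus background.
    """
    e = v2l(roi_feats)                                          # (R, d_l): map regions into the language space
    all_embeds = torch.cat([class_embeds, bg_embed[None]], 0)   # (K+1, d_l)
    logits = e @ all_embeds.t()                                 # dot-product similarities
    return logits
```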

    3.3 Testing

    basically the same as training, but in the last step the box features after V2L are compared to the embeddings of the target classes V_{T} instead of the base classes.
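At test time the classification sketch above can be reused by simply swapping in the target-class embeddings; for example (embed_class_names is a hypothetical helper that looks up word embeddings for the class names in V_{T}):

```python
# Inference: reuse the detector, but classify against the target vocabulary V_T.
target_embeds = embed_class_names(V_T)                              # (|V_T|, d_l) embeddings of target-class names
logits = classify_regions(roi_feats, v2l, target_embeds, bg_embed)  # (R, |V_T|+1) similarities incl. background
pred = logits.argmax(dim=1)                                         # per-region predicted class (or background)
```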
