Zero-Shot Detection (ZSD) is a challenging computer vision problem that aims to classify and localize previously unseen objects with the help of auxiliary semantic information. Most existing methods learn a biased visual-semantic mapping function that favors seen classes at test time, and they focus only on regions of interest while ignoring contextual information in the image. To tackle these problems, we propose a novel framework for ZSD named Transformer-based Zero-Shot Detection via Contrastive Learning (TZSDC). TZSDC contains four components: a transformer-based backbone, a Foreground-Background (FB) separation module, an Instance-Instance Contrastive Learning (IICL) module, and a Knowledge-Transfer (KT) module. The transformer backbone encodes long-range contextual information with little inductive bias. The FB module separates foreground from background by scoring the objectness of image regions. The IICL module optimizes the visual structure of the embedding space to make it more discriminative, and the KT module transfers knowledge from seen to unseen classes via category similarity. Together, these modules accurately align contextual visual features with semantic features. Experiments on MSCOCO validate the effectiveness of the proposed method for both ZSD and generalized ZSD.
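The abstract does not give the exact formulations of the IICL and KT modules, so the sketches below only illustrate one common way such components are realized; every function name, signature, and temperature value here is an assumption for illustration, not the authors' implementation. The first sketch is a supervised InfoNCE-style instance-instance contrastive loss that pulls same-class region embeddings together and pushes different-class embeddings apart, which is one plausible reading of "optimizing the visual structure in embedding space".

```python
import torch
import torch.nn.functional as F

def instance_instance_contrastive_loss(embeddings, labels, temperature=0.1):
    """Hypothetical IICL objective: supervised InfoNCE over region embeddings.

    embeddings: (N, d) region features, labels: (N,) class ids.
    """
    z = F.normalize(embeddings, dim=1)            # project onto the unit sphere
    sim = z @ z.t() / temperature                 # pairwise cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))          # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(1)
    valid = pos_counts > 0                        # skip anchors with no positive
    loss = -(log_prob * pos_mask).sum(1)[valid] / pos_counts[valid]
    return loss.mean()
```

Similarly, knowledge transfer "via category similarity" is often implemented by synthesizing unseen-class classifiers as similarity-weighted combinations of seen-class classifiers in the semantic space; the sketch below assumes word-vector class embeddings and is again only an illustration.

```python
def synthesize_unseen_classifiers(seen_weights, seen_sem, unseen_sem,
                                  temperature=0.05):
    """Hypothetical KT step: build unseen-class weights from seen-class weights,
    weighted by softmax-normalized semantic similarity between categories."""
    s = F.normalize(seen_sem, dim=1)              # (S, k) seen class embeddings
    u = F.normalize(unseen_sem, dim=1)            # (U, k) unseen class embeddings
    attn = torch.softmax(u @ s.t() / temperature, dim=1)   # (U, S) similarities
    return attn @ seen_weights                    # (U, d) synthesized classifiers
```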