Neurocomputing, Volume 449, 18 August 2021, Pages 117-135

Rethinking semantic-visual alignment in zero-shot object detection via a softplus margin focal loss

https://doi.org/10.1016/j.neucom.2021.03.073

Abstract

Zero-shot object detection (ZSD) aims to locate and recognize novel objects without additional training samples. Most existing methods map visual features to the semantic space, which gives rise to the hubness problem, and learning an effective feature mapping between the two modalities remains a considerable challenge. In this work, we propose a novel end-to-end framework, the Semantic-Visual Auto-Encoder (SVAE) network, to tackle these issues. Distinct from previous works that use fully-connected layers to learn the feature mapping, we implement a 1-dimensional convolution with shared filters to construct the auto-encoder, which maps semantic features to the visual space to alleviate the hubness problem. Specifically, we design a novel loss function, the Softplus Margin Focal Loss (SMFL), for the object classification channel to align the projected semantic features in the visual space and address the class imbalance problem. The SMFL improves the discrimination between projections on positive and negative categories while retaining the properties of focal loss. In addition, to improve localization of novel objects, we also provide semantic information to the object localization channel and use a trainable matrix to align the semantic-visual mapping, accounting for noise in the semantic representations. We conduct extensive experiments on four challenging benchmarks and achieve competitive performance compared with state-of-the-art approaches. In particular, we obtain 8.39%/6.58% mean average precision (mAP) improvements for ZSD/generalized ZSD on the Microsoft COCO benchmark.

Introduction

Deep learning has achieved significant success in object detection. However, existing detection models [1], [2], [3], [4], [5] usually rely on large-scale object datasets [6], [7], [8] with fully-annotated locations and categories, and are hard to apply in scenarios where labeled data for novel categories is scarce. One solution is to collect a larger dataset covering a wider set of categories, but this is laborious and time-consuming.

Recently, zero-shot object detection (ZSD) [9], [10], [11], [12], [13] has emerged as an elegant way to address this problem. In this task, a ZSD model is typically trained on instances of so-called seen classes and aims to detect instances of unseen classes during testing. Unlike zero-shot recognition (ZSR), the model must not only recognize instances from unseen classes but also localize them. An intuitive and common solution is to transfer knowledge from seen to unseen classes in a shared space through a semantic representation transformation. The representations are usually encoded class attributes [14], textual descriptions [15], word vectors [10], [14], [16], [17], etc.

Though ZSD techniques [9], [13], [14], [16], [18] have made impressive progress in the past few years, several issues remain: (a) Several approaches [12], [17], [18] map feature representations from the visual to the semantic space, which causes the hubness problem [19], [20]. (b) Existing ZSD models directly learn a projection from the visual to the semantic space without any constraints, which may misplace the projections of features from unseen categories during testing. (c) The unconstrained fully connected (FC) layers used in [9], [10], [16], [17], [18] incur high computational complexity; they are hard to optimize, and semantic features end up poorly represented in the visual space.

To address the above issues, we present a novel end-to-end framework named the Semantic-Visual Auto-Encoder network for the ZSD task (SVAE-ZSD). First, the encoder projects semantic representations of class labels into the visual space of regions of interest (ROIs). As pointed out in [20], this mapping direction mitigates the hubness problem. Second, the decoder projects the predicted visual features back to the semantic space, and L2 regularization is applied to the auto-encoder structure; together these constrain the projection and improve the compatibility and robustness of the learned model. Third, we realize the semantic-visual mapping with the 1-dimensional (1D) convolution operation proposed in [21]. Compared with FC layers, which learn separate weights for each class, the convolution's shared filters reduce computational complexity; this design is simple yet effective and helps semantic features become well represented in the visual space. Additionally, to align the semantic-visual mapping in the classification subnet, we propose a softplus margin focal loss that retains the ability of focal loss to handle the class imbalance problem. The loss forces the mapping to maximize the projections of semantic features on positive categories and minimize them on negative categories, enabling the model to distinguish foreground from background and detect unseen objects. Furthermore, semantic information is also fed to the box regression subnet to locate unseen objects. Because the semantic vectors are noisy, we use a trainable matrix rather than an element-wise multiplication to achieve a better synergy between the semantic and visual spaces.
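To make the architecture concrete, the following is a minimal sketch of the auto-encoder idea, assuming a 300-dimensional word vector per class and a 1024-dimensional ROI visual feature; the layer shapes, filter count, and kernel size are illustrative assumptions rather than the paper's exact configuration.

    import torch
    import torch.nn as nn

    class SemanticVisualAutoEncoder(nn.Module):
        def __init__(self, sem_dim=300, vis_dim=1024, n_filters=64, kernel=3):
            super().__init__()
            # Encoder: a 1D convolution whose filters are shared across all
            # classes slides over the semantic dimension (cf. [21]), followed
            # by a projection into the visual space of ROI features.
            self.encoder = nn.Sequential(
                nn.Conv1d(1, n_filters, kernel_size=kernel, padding=kernel // 2),
                nn.ReLU(),
                nn.Flatten(),                              # (B, n_filters * sem_dim)
                nn.Linear(n_filters * sem_dim, vis_dim),   # (B, vis_dim)
            )
            # Decoder: map the predicted visual feature back to the semantic
            # space so a reconstruction loss can constrain the mapping.
            self.decoder = nn.Linear(vis_dim, sem_dim)

        def forward(self, sem):                        # sem: (B, sem_dim)
            vis_pred = self.encoder(sem.unsqueeze(1))  # add a channel axis
            sem_rec = self.decoder(vis_pred)
            return vis_pred, sem_rec

Under this sketch, a batch of class word vectors of shape (8, 300) yields visual predictions of shape (8, 1024) and reconstructions of shape (8, 300), to be compared with ROI features and the original semantics, respectively.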

The main contributions of this work are summarized as follows:

  • We present a novel end-to-end framework, a semantic-visual auto-encoder network based on 1-dimensional convolution, for the ZSD task. It is simple yet effective, gets semantic features well represented in the visual space, and mitigates the hubness problem.

  • We design a softplus margin focal loss to align the semantic and visual features in the classification subnet. It separates the projections of semantic features on positive categories from those on negative categories by margins and relieves the confusion between unseen objects and the background (a hedged sketch of such a loss follows this list).

  • Extensive experiments are carried out on four challenging datasets. Our method outperforms state-of-the-art methods by significant margins; in particular, it achieves over 6% mAP improvement on the Microsoft COCO [7] dataset in the ZSD/GZSD settings.
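As referenced above, the following is one plausible reading of a softplus margin focal loss, written as a sketch rather than the paper's exact formulation (the real SMFL is defined in Section 4). Margins enter through softplus terms, which reduce exactly to the standard sigmoid focal loss when the margin is zero, since softplus(-s) = -log sigmoid(s). The margin placement and the alpha and gamma values below are assumptions.

    import torch
    import torch.nn.functional as F

    def softplus_margin_focal_loss(scores, targets, margin=0.2,
                                   alpha=0.25, gamma=2.0):
        """scores: (N, C) projections of class semantics onto ROI features;
        targets: (N, C) one-hot labels, with all-zero rows for background."""
        p = torch.sigmoid(scores)
        # Positives: push scores above +margin. F.softplus(margin - scores)
        # equals -log(sigmoid(scores - margin)), so margin=0 recovers the
        # standard focal loss term.
        pos = alpha * (1.0 - p).pow(gamma) * F.softplus(margin - scores)
        # Negatives: push scores below -margin.
        neg = (1.0 - alpha) * p.pow(gamma) * F.softplus(scores + margin)
        loss = torch.where(targets > 0, pos, neg)
        # Normalization (e.g., by the number of positive anchors) follows the
        # host detector's convention; we simply sum here.
        return loss.sum()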

The rest of the paper is organized as follows: Section 2 reviews related work, Section 3 describes the proposed approach, Section 4 explains the SMFL and the other loss functions of the model, Section 5 reports the experimental evaluation and analysis, and Section 6 presents the conclusions.


Zero-shot learning (ZSL)

ZSL can be viewed as a process of transferring knowledge learned from seen categories to unseen categories. Existing models can be divided into three groups according to their mapping mechanisms: (a) learning a projection function from the visual to the semantic space. For example, DeViSE [22] mapped visual features to the semantic space via a linear transformation and employed a pairwise ranking objective to learn the trainable matrix, while ConSE [23] built the same mapping mechanism with a convex combination of semantic embeddings.

Problem definition and system overview

In the ZSD task, let $X_s$ and $X_u$ be the training and testing sets, respectively. For the object in the $i$-th ground-truth bounding box $\{t_x^i, t_y^i, t_w^i, t_h^i\}$ of an image, its class label is denoted as $y_k$ ($y_k \in \mathcal{Y}_s$). Given the seen class set $\mathcal{Y}_s = \{y_1, \ldots, y_s\}$ and the unseen class set $\mathcal{Y}_u = \{y_{s+1}, \ldots, y_{s+u}\}$, we assume that $\mathcal{Y}_s \cap \mathcal{Y}_u = \emptyset$ and $\mathcal{Y}_s \cup \mathcal{Y}_u = \mathcal{Y}$, where $\mathcal{Y}$ is the set of all classes. Note that each image used for model training contains at least one seen object and no objects from unseen classes. For each class, we use a $d$-dimensional semantic vector as its representation.
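For concreteness, a toy illustration of these label-space constraints, using the 16/4 Pascal VOC-style seen/unseen split described in Section 5; the class names below are placeholders, and the actual assignment follows [14].

    # Placeholder class names; the real split follows [14].
    seen = {f"seen_{k}" for k in range(16)}      # Y_s: available during training
    unseen = {f"unseen_{k}" for k in range(4)}   # Y_u: appears only at test time
    all_classes = seen | unseen                  # Y = Y_s ∪ Y_u
    assert seen.isdisjoint(unseen)               # Y_s ∩ Y_u = ∅
    assert len(all_classes) == len(seen) + len(unseen)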

Loss functions for zero-shot object detection

To optimize the proposed SVAE-ZSD model, we introduce a multi-task objective function consisting of classification, bounding box regression, and reconstruction losses, along with a regularization term for the filters in the SVAE. In this section, we present them in detail.
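A hedged sketch of how these terms could combine, reusing the softplus_margin_focal_loss sketch given earlier; the smooth L1 regression loss, the MSE reconstruction loss, and the lambda weights are illustrative assumptions, not the paper's exact choices.

    import torch
    import torch.nn.functional as F

    def svae_zsd_objective(cls_scores, cls_targets, box_pred, box_targets,
                           sem_rec, sem_true, svae_filters,
                           lam_box=1.0, lam_rec=1.0, lam_reg=1e-4):
        # Classification: the softplus margin focal loss sketched earlier.
        l_cls = softplus_margin_focal_loss(cls_scores, cls_targets)
        # Box regression: smooth L1 is a common detector choice.
        l_box = F.smooth_l1_loss(box_pred, box_targets)
        # Reconstruction: pull decoded semantics back toward the inputs.
        l_rec = F.mse_loss(sem_rec, sem_true)
        # L2 regularization on the shared SVAE filters.
        l_reg = sum(f.pow(2).sum() for f in svae_filters)
        return l_cls + lam_box * l_box + lam_rec * l_rec + lam_reg * l_reg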

Datasets and data split

The proposed approach is evaluated on four commonly used datasets: Pascal VOC [6], Microsoft COCO (MS-COCO) [7], ILSVRC-2017 object detection (ILSVRC-2017 DET) dataset [8] and Visual Genome (VG) [44].

Pascal VOC is a fundamental object detection benchmark containing 20 object classes, with images collected from photo-sharing websites under varied viewing conditions. Following [14], we split the dataset into 16 seen and 4 unseen classes.

MS-COCO is a collection of common object instances in complex everyday scenes.

Conclusion

In this paper, we propose a Semantic-Visual Auto-Encoder network (SVAE) to address the zero-shot object detection task. By constructing the auto-encoder with a 1-dimensional convolution using shared filters, the SVAE maps semantic features into the visual space to alleviate the hubness problem. For semantic alignment in the classification subnet, we design a softplus margin focal loss to distinguish semantic projections on positive categories from those on negative categories by margins and to address the class imbalance problem.

CRediT authorship contribution statement

Qianzhong Li: Conceptualization, Formal analysis, Software, Validation, Writing - original draft. Yujia Zhang: Methodology, Writing - review & editing. Shiying Sun: Investigation, Data curation. Xiaoguang Zhao: Supervision, Project administration. Kang Li: Visualization. Min Tan: Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Key Research and Development Project of China (Grant No. 2019YFB1310601), the National Key R&D Program of China (Grant No. 2017YFC0820203-03), and the National Natural Science Foundation of China (Grant No. 61673378).


References

  • M. Meng et al., Joint discriminative attributes and similarity embeddings modeling for zero-shot recognition, Neurocomputing, 2020.
  • K. He et al., Mask R-CNN, in: IEEE International Conference on Computer Vision (ICCV), 2017.
  • T. Lin et al., Focal loss for dense object detection, in: IEEE International Conference on Computer Vision (ICCV), 2017.
  • W. Liu et al., SSD: Single shot multibox detector, in: European Conference on Computer Vision (ECCV), 2016.
  • J. Redmon, A. Farhadi, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767, 2018.
  • S. Ren et al., Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • M. Everingham et al., The PASCAL visual object classes (VOC) challenge, International Journal of Computer Vision, 2010.
  • T.-Y. Lin et al., Microsoft COCO: Common objects in context, in: European Conference on Computer Vision (ECCV), 2014.
  • O. Russakovsky et al., ImageNet large scale visual recognition challenge, International Journal of Computer Vision, 2015.
  • D. Gupta et al., A multi-space approach to zero-shot object detection, in: IEEE Winter Conference on Applications of Computer Vision (WACV), 2020.
  • S. Rahman et al., Improved visual-semantic alignment for zero-shot object detection, in: AAAI Conference on Artificial Intelligence, 2020.
  • Y. Shao et al., Zero-shot detection with transferable object proposal mechanism, in: IEEE International Conference on Image Processing (ICIP), 2019.
  • P. Zhu et al., Zero shot detection, IEEE Transactions on Circuits and Systems for Video Technology, 2019.
  • P. Zhu et al., Don't even look once: Synthesizing features for zero-shot detection, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • B. Demirel et al., Zero-shot object detection by hybrid region embedding, in: British Machine Vision Conference (BMVC), 2018.
  • Z. Li, L. Yao, X. Zhang, X. Wang, S. Kanhere, H. Zhang, Zero-shot object detection with textual descriptions, in: AAAI Conference on Artificial Intelligence, 2019.
  • S. Rahman, S. Khan, N. Barnes, Transductive learning for zero-shot object detection, in: IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  • S. Rahman et al., Zero-shot object detection: Learning to simultaneously recognize and localize novel concepts, in: Asian Conference on Computer Vision (ACCV), Springer, 2018.
  • A. Bansal et al., Zero-shot object detection, in: European Conference on Computer Vision (ECCV), Springer, 2018.
  • M. Radovanović et al., Hubs in space: Popular nearest neighbors in high-dimensional data, Journal of Machine Learning Research, 2010.
  • Y. Shigeto et al., Ridge regression, hubness, and zero-shot learning, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2015.
  • Y. Kim, Convolutional neural networks for sentence classification, in: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
  • A. Frome, G.S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, T. Mikolov, DeViSE: A deep visual-semantic embedding model, in: Advances in Neural Information Processing Systems (NeurIPS), 2013.
  • M. Norouzi et al., Zero-shot learning by convex combination of semantic embeddings, in: International Conference on Learning Representations (ICLR), 2014.
  • Y. Xian et al., Latent embeddings for zero-shot classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Qianzhong Li received the B.E. degree in Control Science and Engineering from Central South University, Hunan, China, in 2017. He is currently pursuing the Ph.D. degree in Control Theory and Control Engineering with the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His current research interests include computer vision and intelligent robot systems.

Yujia Zhang received the B.E. degree in computer science from Xi'an Jiaotong University in 2014, and the Ph.D. degree in control theory and control engineering from the Institute of Automation, Chinese Academy of Sciences (CASIA) in 2019. She is currently an Assistant Professor with the State Key Laboratory of Management and Control for Complex Systems, CASIA. Her research interests are computer vision and robotics.

Shiying Sun received the B.E. degree from Central South University, Hunan, China, in 2013, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences (IACAS), Beijing, China, in 2019, both in control science and engineering. He is currently a postdoctoral researcher with the State Key Laboratory of Management and Control for Complex Systems, IACAS. His current research interests include advanced robot control, navigation, and computer vision.

Xiaoguang Zhao received the B.E. degree in control engineering from Shenyang University of Technology, Shenyang, China, in 1992, and the M.E. and Ph.D. degrees in control theory and control engineering from the Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, China, in 1998 and 2001, respectively. She is currently a Professor with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. Her current research interests include advanced robot control, wireless sensor networks, and robot vision.

Kang Li received the B.E. degree from Central South University, Hunan, China, in 2014, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences (IACAS), Beijing, China, in 2019, both in Control Theory and Control Engineering. His research interests include human-machine interaction, computer vision, and cognitive neuroscience.

Min Tan received the B.E. degree from Tsinghua University, Beijing, China, in 1986, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences (IACAS), Beijing, China, in 1990, both in control science and engineering. He is currently a Professor with the State Key Laboratory of Management and Control for Complex Systems, IACAS. He has published more than 200 papers in journals, books, and conference proceedings. His research interests include robotics and intelligent control systems.
