Neurocomputing, Volume 449, 18 August 2021, Pages 117-135

Rethinking semantic-visual alignment in zero-shot object detection via a softplus margin focal loss

https://doi.org/10.1016/j.neucom.2021.03.073

Abstract

Zero-shot object detection (ZSD) aims to locate and recognize novel objects without additional training samples. Most existing methods map visual features to the semantic space, which gives rise to the hubness problem, and learning an effective feature mapping between the two modalities remains a considerable challenge. In this work, we propose a novel end-to-end framework, the Semantic-Visual Auto-Encoder (SVAE) network, to tackle these issues. Distinct from previous works that use fully-connected layers to learn the feature mapping, we implement a 1-dimensional convolution with shared filters to construct the auto-encoder, which maps semantic features to the visual space to alleviate the hubness problem. Specifically, we design a novel loss function, the Softplus Margin Focal Loss (SMFL), for the object classification channel to align the projected semantic features in the visual space and address the class imbalance problem. The SMFL improves the discrimination between projections on positive and negative categories while retaining the properties of focal loss. In addition, to improve localization of novel objects, we also provide semantic information to the object localization channel and use a trainable matrix to align the semantic-visual mapping, accounting for noise in the semantic representations. We conduct extensive experiments on four challenging benchmarks and achieve competitive performance compared with state-of-the-art approaches. In particular, we obtain 8.39%/6.58% mean average precision (mAP) improvements for ZSD/generalized ZSD on the Microsoft COCO benchmark.

Introduction

Deep learning has achieved significant success in object detection. However, existing detection models [1], [2], [3], [4], [5] usually rely on large-scale object datasets [6], [7], [8] with fully-annotated locations and categories, and are hard to apply in scenarios where labeled data for novel categories is scarce. One solution is to collect a larger dataset covering a wider set of categories, but this is laborious and time-consuming.

Recently, zero-shot object detection (ZSD) [9], [10], [11], [12], [13] has emerged as an elegant way to address this problem. In this task, a ZSD model is typically trained on instances of so-called seen classes and aims to detect instances of unseen classes during testing. Unlike zero-shot recognition (ZSR), the model must not only recognize instances from unseen classes but also localize them. An intuitive and common solution is to transfer knowledge from seen to unseen classes in a shared space through a semantic representation transformation. The representations are usually encoded class attributes [14], textual descriptions [15], word vectors [10], [14], [16], [17], etc.

Though ZSD techniques [9], [13], [14], [16], [18] have made impressive progress in the past few years, several issues remain: (a) Several approaches [12], [17], [18] map feature representations from the visual to the semantic space, which causes the hubness problem [19], [20]. (b) Existing ZSD models directly learn a projection from the visual to the semantic space without any constraints, which may misplace the projections of features from unseen categories during testing. (c) The unconstrained fully connected (FC) layers used in [9], [10], [16], [17], [18] incur high computational complexity; they are hard to optimize, and semantic features end up poorly represented in the visual space.

To address the above issues, we present a novel end-to-end framework named the Semantic-Visual Auto-Encoder network for the ZSD task (SVAE-ZSD). First, the encoder projects semantic representations of class labels into the visual space of regions of interest (ROIs). As pointed out in [20], this mapping direction mitigates the hubness problem. Second, the decoder projects the predicted visual features back to the semantic space, and L2 regularization is applied to the auto-encoder structure; together these constrain the projection and improve the compatibility and robustness of the learned model. Third, we realize the semantic-visual mapping with the 1-dimensional (1D) convolution operation proposed in [21]. Compared with FC layers, which learn separate weights for each class, the convolution's shared filters reduce computational complexity; this design is simple yet effective and helps semantic features become well represented in the visual space. Additionally, to align the semantic-visual mapping in the classification subnet, we propose a softplus margin focal loss that retains the ability of focal loss to handle the class imbalance problem. The loss forces the mapping to maximize the projections of semantic features on positive categories and minimize them on negative categories, enabling the model to distinguish foreground from background and detect unseen objects. Furthermore, semantic information is also fed to the box regression subnet to locate unseen objects. Because the semantic vectors are noisy, we use a trainable matrix rather than an element-wise multiplication to achieve a better synergy between the semantic and visual spaces.
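To make the architecture concrete, the following is a minimal sketch of the auto-encoder idea, assuming a 300-dimensional word vector per class and a 1024-dimensional ROI visual feature; the layer shapes, filter count, and kernel size are illustrative assumptions rather than the paper's exact configuration.

    import torch
    import torch.nn as nn

    class SemanticVisualAutoEncoder(nn.Module):
        def __init__(self, sem_dim=300, vis_dim=1024, n_filters=64, kernel=3):
            super().__init__()
            # Encoder: a 1D convolution whose filters are shared across all
            # classes slides over the semantic dimension (cf. [21]), followed
            # by a projection into the visual space of ROI features.
            self.encoder = nn.Sequential(
                nn.Conv1d(1, n_filters, kernel_size=kernel, padding=kernel // 2),
                nn.ReLU(),
                nn.Flatten(),                              # (B, n_filters * sem_dim)
                nn.Linear(n_filters * sem_dim, vis_dim),   # (B, vis_dim)
            )
            # Decoder: map the predicted visual feature back to the semantic
            # space so a reconstruction loss can constrain the mapping.
            self.decoder = nn.Linear(vis_dim, sem_dim)

        def forward(self, sem):                        # sem: (B, sem_dim)
            vis_pred = self.encoder(sem.unsqueeze(1))  # add a channel axis
            sem_rec = self.decoder(vis_pred)
            return vis_pred, sem_rec

Under this sketch, a batch of class word vectors of shape (8, 300) yields visual predictions of shape (8, 1024) and reconstructions of shape (8, 300), to be compared with ROI features and the original semantics, respectively.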

The main contributions of this work are summarized as follows:

  • We present a novel end-to-end framework, a semantic-visual auto-encoder network based on 1-dimensional convolution, for the ZSD task. It is simple yet effective, gets semantic features well represented in the visual space, and mitigates the hubness problem.

  • We design a softplus margin focal loss to align the semantic and visual features in the classification subnet. It separates the projections of semantic features on positive categories from those on negative categories by margins and relieves the confusion between unseen objects and the background (a hedged sketch of such a loss follows this list).

  • Extensive experiments are carried out on four challenging datasets. Our method outperforms state-of-the-art methods by significant margins; in particular, it achieves over 6% mAP improvement on the Microsoft COCO [7] dataset in the ZSD/GZSD settings.
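As referenced above, the following is one plausible reading of a softplus margin focal loss, written as a sketch rather than the paper's exact formulation (the real SMFL is defined in Section 4). Margins enter through softplus terms, which reduce exactly to the standard sigmoid focal loss when the margin is zero, since softplus(-s) = -log sigmoid(s). The margin placement and the alpha and gamma values below are assumptions.

    import torch
    import torch.nn.functional as F

    def softplus_margin_focal_loss(scores, targets, margin=0.2,
                                   alpha=0.25, gamma=2.0):
        """scores: (N, C) projections of class semantics onto ROI features;
        targets: (N, C) one-hot labels, with all-zero rows for background."""
        p = torch.sigmoid(scores)
        # Positives: push scores above +margin. F.softplus(margin - scores)
        # equals -log(sigmoid(scores - margin)), so margin=0 recovers the
        # standard focal loss term.
        pos = alpha * (1.0 - p).pow(gamma) * F.softplus(margin - scores)
        # Negatives: push scores below -margin.
        neg = (1.0 - alpha) * p.pow(gamma) * F.softplus(scores + margin)
        loss = torch.where(targets > 0, pos, neg)
        # Normalization (e.g., by the number of positive anchors) follows the
        # host detector's convention; we simply sum here.
        return loss.sum()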

The rest of the paper is organized as follows: Section 2 reviews related work, Section 3 describes the proposed approach, Section 4 explains the SMFL and the other loss functions of the model, Section 5 reports the experimental evaluation and analysis, and Section 6 presents the conclusions.


Zero-shot learning (ZSL)

ZSL can be viewed as a process of transferring knowledge learned from seen categories to unseen categories. Existing models can be divided into three groups according to their mapping mechanisms: (a) learning a projection function from the visual to the semantic space. For example, DeViSE [22] mapped visual features to the semantic space via a linear transformation and employed a pairwise ranking objective to learn the trainable matrix, while ConSE [23] built the same mapping mechanism with a convex combination of semantic embeddings.

Problem definition and system overview

In the ZSD task, let $X_s$ and $X_u$ be the training and testing sets, respectively. For the object in the $i$-th ground-truth bounding box $\{t_x^i, t_y^i, t_w^i, t_h^i\}$ of an image, its class label is denoted as $y_k$ ($y_k \in \mathcal{Y}_s$). Given the seen class set $\mathcal{Y}_s = \{y_1, \ldots, y_s\}$ and the unseen class set $\mathcal{Y}_u = \{y_{s+1}, \ldots, y_{s+u}\}$, we assume that $\mathcal{Y}_s \cap \mathcal{Y}_u = \emptyset$ and $\mathcal{Y}_s \cup \mathcal{Y}_u = \mathcal{Y}$, where $\mathcal{Y}$ is the set of all classes. Note that each image used for model training contains at least one seen object and no objects from unseen classes. For each class, we use a $d$-dimensional semantic vector as its representation.
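For concreteness, a toy illustration of these label-space constraints, using the 16/4 Pascal VOC-style seen/unseen split described in Section 5; the class names below are placeholders, and the actual assignment follows [14].

    # Placeholder class names; the real split follows [14].
    seen = {f"seen_{k}" for k in range(16)}      # Y_s: available during training
    unseen = {f"unseen_{k}" for k in range(4)}   # Y_u: appears only at test time
    all_classes = seen | unseen                  # Y = Y_s ∪ Y_u
    assert seen.isdisjoint(unseen)               # Y_s ∩ Y_u = ∅
    assert len(all_classes) == len(seen) + len(unseen)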

Loss functions for zero-shot object detection

To optimize the proposed SVAE-ZSD model, we introduce a multi-task objective function consisting of classification, bounding box regression, and reconstruction losses, along with a regularization term for the filters in the SVAE. In this section, we present them in detail.
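A hedged sketch of how these terms could combine, reusing the softplus_margin_focal_loss sketch given earlier; the smooth L1 regression loss, the MSE reconstruction loss, and the lambda weights are illustrative assumptions, not the paper's exact choices.

    import torch
    import torch.nn.functional as F

    def svae_zsd_objective(cls_scores, cls_targets, box_pred, box_targets,
                           sem_rec, sem_true, svae_filters,
                           lam_box=1.0, lam_rec=1.0, lam_reg=1e-4):
        # Classification: the softplus margin focal loss sketched earlier.
        l_cls = softplus_margin_focal_loss(cls_scores, cls_targets)
        # Box regression: smooth L1 is a common detector choice.
        l_box = F.smooth_l1_loss(box_pred, box_targets)
        # Reconstruction: pull decoded semantics back toward the inputs.
        l_rec = F.mse_loss(sem_rec, sem_true)
        # L2 regularization on the shared SVAE filters.
        l_reg = sum(f.pow(2).sum() for f in svae_filters)
        return l_cls + lam_box * l_box + lam_rec * l_rec + lam_reg * l_reg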

Datasets and data split

The proposed approach is evaluated on four commonly used datasets: Pascal VOC [6], Microsoft COCO (MS-COCO) [7], ILSVRC-2017 object detection (ILSVRC-2017 DET) dataset [8] and Visual Genome (VG) [44].

Pascal VOC is a fundamental object detection benchmark containing 20 object classes, with images collected from photo-sharing websites under varied viewing conditions. Following [14], we split the dataset into 16 seen and 4 unseen classes.

MS-COCO is a collection of common object instances in complex everyday scenes.

Conclusion

In this paper, we propose a Semantic-Visual Auto-Encoder network (SVAE) to address the zero-shot object detection task. By constructing the auto-encoder with a 1-dimensional convolution using shared filters, the SVAE maps semantic features into the visual space to alleviate the hubness problem. For semantic alignment in the classification subnet, we design a softplus margin focal loss to distinguish semantic projections on positive categories from those on negative categories by margins and to address the class imbalance problem.

CRediT authorship contribution statement

Qianzhong Li: Conceptualization, Formal analysis, Software, Validation, Writing - original draft. Yujia Zhang: Methodology, Writing - review & editing. Shiying Sun: Investigation, Data curation. Xiaoguang Zhao: Supervision, Project administration. Kang Li: Visualization. Min Tan: Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Key Research and Development Project of China (Grant No. 2019YFB1310601), the National Key R&D Program of China (Grant No. 2017YFC0820203-03), and the National Natural Science Foundation of China (Grant No. 61673378).


References

  • M. Meng et al., Joint discriminative attributes and similarity embeddings modeling for zero-shot recognition, Neurocomputing, 2020.
  • K. He et al., Mask R-CNN, in: IEEE International Conference on Computer Vision (ICCV), 2017.
  • T. Lin et al., Focal loss for dense object detection, in: IEEE International Conference on Computer Vision (ICCV), 2017.
  • W. Liu et al., SSD: Single shot multibox detector, in: European Conference on Computer Vision (ECCV), 2016.
  • J. Redmon, A. Farhadi, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767, 2018.
  • S. Ren et al., Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • M. Everingham et al., The PASCAL visual object classes (VOC) challenge, International Journal of Computer Vision, 2010.
  • T.-Y. Lin et al., Microsoft COCO: Common objects in context, in: European Conference on Computer Vision (ECCV), 2014.
  • O. Russakovsky et al., ImageNet large scale visual recognition challenge, International Journal of Computer Vision, 2015.
  • D. Gupta et al., A multi-space approach to zero-shot object detection, in: IEEE Winter Conference on Applications of Computer Vision (WACV), 2020.
  • S. Rahman et al., Improved visual-semantic alignment for zero-shot object detection, in: AAAI Conference on Artificial Intelligence, 2020.
  • Y. Shao et al., Zero-shot detection with transferable object proposal mechanism, in: IEEE International Conference on Image Processing (ICIP), 2019.
  • P. Zhu et al., Zero shot detection, IEEE Transactions on Circuits and Systems for Video Technology, 2019.
  • P. Zhu et al., Don't even look once: Synthesizing features for zero-shot detection, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • B. Demirel et al., Zero-shot object detection by hybrid region embedding, in: British Machine Vision Conference (BMVC), 2018.
  • Z. Li, L. Yao, X. Zhang, X. Wang, S. Kanhere, H. Zhang, Zero-shot object detection with textual descriptions, in: AAAI Conference on Artificial Intelligence, 2019.
  • S. Rahman, S. Khan, N. Barnes, Transductive learning for zero-shot object detection, in: IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  • S. Rahman et al., Zero-shot object detection: Learning to simultaneously recognize and localize novel concepts, in: Asian Conference on Computer Vision (ACCV), Springer, 2018.
  • A. Bansal et al., Zero-shot object detection, in: European Conference on Computer Vision (ECCV), Springer, 2018.
  • M. Radovanović et al., Hubs in space: Popular nearest neighbors in high-dimensional data, Journal of Machine Learning Research, 2010.
  • Y. Shigeto et al., Ridge regression, hubness, and zero-shot learning, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2015.
  • Y. Kim, Convolutional neural networks for sentence classification, in: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
  • A. Frome, G.S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, T. Mikolov, DeViSE: A deep visual-semantic embedding model, in: Advances in Neural Information Processing Systems (NeurIPS), 2013.
  • M. Norouzi et al., Zero-shot learning by convex combination of semantic embeddings, in: International Conference on Learning Representations (ICLR), 2014.
  • Y. Xian et al., Latent embeddings for zero-shot classification, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

Qianzhong Li received the B.E. degree in Control Science and Engineering from Central South University, Hunan, China, in 2017. He is currently pursuing the Ph.D. degree in Control Theory and Control Engineering with the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His current research interests include computer vision and intelligent robot systems.

Yujia Zhang received the B.E. degree in computer science from Xi'an Jiaotong University in 2014, and the Ph.D. degree in control theory and control engineering from the Institute of Automation, Chinese Academy of Sciences (CASIA) in 2019. She is currently an Assistant Professor with the State Key Laboratory of Management and Control for Complex Systems, CASIA. Her research interests are computer vision and robotics.

Shiying Sun received the B.E. degree from Central South University, Hunan, China, in 2013, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences (IACAS), Beijing, China, in 2019, both in control science and engineering. He is currently a postdoctoral researcher with the State Key Laboratory of Management and Control for Complex Systems, IACAS. His current research interests include advanced robot control, navigation, and computer vision.

Xiaoguang Zhao received the B.E. degree in control engineering from Shenyang University of Technology, Shenyang, China, in 1992, and the M.E. and Ph.D. degrees in control theory and control engineering from the Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, China, in 1998 and 2001, respectively. She is currently a Professor with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. Her current research interests include advanced robot control, wireless sensor networks, and robot vision.

Kang Li received the B.E. degree from Central South University, Hunan, China, in 2014, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences (IACAS), Beijing, China, in 2019, both in Control Theory and Control Engineering. His research interests include human-machine interaction, computer vision, and cognitive neuroscience.

Min Tan received the B.E. degree from Tsinghua University, Beijing, China, in 1986, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences (IACAS), Beijing, China, in 1990, both in control science and engineering. He is currently a Professor with the State Key Laboratory of Management and Control for Complex Systems, IACAS. He has published more than 200 papers in journals, books, and conference proceedings. His research interests include robotics and intelligent control systems.
