Rethinking semantic-visual alignment in zero-shot object detection via a softplus margin focal loss
Introduction
Deep learning has achieved significant success in object detection. However, existing detection models [1], [2], [3], [4], [5] usually rely on large-scale object datasets [6], [7], [8] with fully-annotated locations and categories, which makes them hard to apply in scenarios where labeled data for novel categories is scarce. One solution is to collect a larger dataset covering a wider set of categories, but this is laborious and time-consuming.
Recently, zero-shot object detection (ZSD) [9], [10], [11], [12], [13] has offered an elegant way to address this problem. In this task, a ZSD model is typically trained on instances of so-called seen classes and aims to detect instances of unseen classes during testing. Unlike zero-shot recognition (ZSR), the model must not only recognize instances of unseen classes but also localize them. An intuitive and common solution is to transfer knowledge from seen to unseen classes in a shared space through a semantic representation transformation. The representations are usually encoded class attributes [14], textual descriptions [15], word vectors [10], [14], [16], [17], etc.
Though ZSD techniques [9], [13], [14], [16], [18] have made impressive progress in the past few years, several issues remain: (a) Several approaches [12], [17], [18] map feature representations from the visual to the semantic space, which causes the hubness problem [19], [20]. (b) Existing ZSD models directly learn a projection from the visual to the semantic space without any constraints, which may misplace the projections of features from unseen categories during testing. (c) The unconstrained fully connected (FC) layers used in [9], [10], [16], [17], [18] incur high computational complexity, making the models hard to optimize and preventing semantic features from being well represented in the visual space.
To address the above issues, we present a novel end-to-end framework named Semantic-Visual Auto-Encoder network for the ZSD task (SVAE-ZSD). Firstly, the encoder projects semantic representations of class labels into the visual space of regions of interest (ROIs); as pointed out in [20], this mapping direction mitigates the hubness problem. Secondly, the decoder projects the predicted visual features back to the semantic space, and regularization is applied to this auto-encoder structure. Together they constrain the projection and improve the compatibility and robustness of the learned model. Thirdly, we realize the semantic-visual mapping in the auto-encoder with a 1-dimensional (1D) convolution operation proposed in [21]. Compared with FC layers, which learn separate weights for each class, the convolution shares its filters across classes and thus reduces computational complexity. This design is simple yet effective and helps semantic features to be well represented in the visual space. Additionally, to align the semantic-visual mapping in the classification subnet, we propose a softplus margin focal loss, which retains the ability of focal loss to handle the class-imbalance problem. The loss forces the mapping to maximize the projections of semantic features on positive categories and minimize them on negative categories, enabling the model to distinguish foreground from background and to detect unseen objects. Furthermore, the semantic information is also fed to the box regression subnet to locate unseen objects. Because the semantic vectors are noisy, we implement a trainable matrix rather than an element-wise multiplication to obtain a better synergy between the semantic and visual spaces.
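The exact form of the softplus margin focal loss is defined in Section 4. As a rough, hypothetical sketch only (the margin m, focusing parameter gamma, and balancing factor alpha below are illustrative assumptions, not the paper's actual formulation), one way to combine a focal modulating factor with a softplus-smoothed margin penalty is:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def softplus_margin_focal_loss(score, positive, m=0.5, gamma=2.0, alpha=0.25):
    # score: projection of a semantic feature onto one class direction.
    # The focal factor (1-p)^gamma / p^gamma down-weights easy examples,
    # while the softplus term smoothly penalizes margin violations.
    p = sigmoid(score)
    if positive:
        return alpha * (1.0 - p) ** gamma * softplus(m - score)
    return (1.0 - alpha) * p ** gamma * softplus(m + score)
```

Under this sketch, the loss vanishes for a positive category once its projection grows well past the margin, and for a negative category once the projection drops well below -m, which matches the separation behavior described above.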
The main contributions of this work are summarized as follows:
- We present a novel end-to-end framework, a semantic-visual auto-encoder network based on 1-dimensional convolution, for the ZSD task, which gets semantic features well represented in the visual space and mitigates the hubness problem. It is simple yet effective.
- We design a softplus margin focal loss to align the semantic and visual features in the classification subnet. It separates the projections of semantic features on positive categories from those on negative categories by margins and relieves the confusion between unseen objects and the background.
- Extensive experiments are carried out on four challenging datasets. Our proposed method outperforms state-of-the-art methods by significant margins; in particular, it achieves over 6% mAP improvement on the Microsoft COCO [7] dataset in the ZSD/GZSD settings.
The rest of the paper is organized as follows: Section 2 reviews the related works, followed by Section 3, which describes the proposed approach. The detailed SMFL and other loss functions for the model are explained in Section 4. Experimental evaluation analysis is reported in Section 5, and the conclusions are presented in Section 6.
Zero-shot learning (ZSL)
ZSL can be viewed as a process of transferring knowledge learned from seen categories to unseen categories. Existing models can be divided into three groups by their mapping mechanisms: (a) learning a projection function from the visual to the semantic space. For example, DeViSE [22] mapped visual features to the semantic space via a linear transformation and employed a pairwise ranking objective function to learn the trainable matrix. ConSE [23] built the same mapping mechanism with a convex combination of semantic embeddings.
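A minimal sketch of the first group is the DeViSE-style pairwise ranking objective: the projected visual feature should score higher with its own class embedding than with any other class, by a margin. The margin value and the plain dot-product scoring below are illustrative assumptions, not the exact setup of [22]:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pairwise_rank_loss(projected_visual, true_emb, other_embs, margin=0.1):
    # Hinge penalty for every wrong class whose similarity to the
    # projected visual feature comes within `margin` of the true class.
    s_true = dot(projected_visual, true_emb)
    return sum(max(0.0, margin - s_true + dot(projected_visual, c))
               for c in other_embs)
```

The loss is zero when the true class already wins every pairwise comparison by the margin, so gradients only flow from violating classes.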
Problem definition and system overview
In this ZSD task, let D_tr and D_te be the training and testing sets, respectively. For the object in the i-th ground-truth bounding box of an image, its class label is denoted as y_i (y_i ∈ Y^s during training). Given the seen class set Y^s and the unseen class set Y^u, we assume that Y^s ∩ Y^u = ∅ and Y^s ∪ Y^u = Y, where Y is the set of all classes. Note that each image for model training contains at least one seen object and no objects from unseen classes. For each class, we use a d-dimensional semantic vector as its representation.
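The split constraints in this definition, that the seen and unseen class sets are disjoint and together cover all classes, can be expressed as a small check (the class names in the test are illustrative):

```python
def valid_zsd_split(seen, unseen, all_classes):
    # Seen and unseen class sets must be disjoint and jointly cover Y.
    seen, unseen = set(seen), set(unseen)
    return not (seen & unseen) and (seen | unseen) == set(all_classes)
```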
Loss functions for zero-shot object detection
To optimize the proposed SVAE-ZSD model, we introduce a multi-task objective function, which consists of classification, bounding box regression, and reconstruction losses, as well as the regularization for the filters in SVAE. In this section, we present them in detail.
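Before the per-term definitions, the overall objective can be sketched as a weighted sum of the four terms named above. The weight names and default values here are illustrative assumptions, not the paper's actual settings:

```python
def svae_zsd_objective(cls_loss, box_loss, recon_loss, filter_norm,
                       w_box=1.0, w_rec=1.0, w_reg=1e-4):
    # Classification (softplus margin focal) + box regression +
    # auto-encoder reconstruction + regularization on the SVAE filters.
    return cls_loss + w_box * box_loss + w_rec * recon_loss + w_reg * filter_norm
```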
Datasets and data split
The proposed approach is evaluated on four commonly used datasets: Pascal VOC [6], Microsoft COCO (MS-COCO) [7], ILSVRC-2017 object detection (ILSVRC-2017 DET) dataset [8] and Visual Genome (VG) [44].
Pascal VOC is a very fundamental dataset for object detection and contains 20 object classes collected from photo-sharing websites with different viewing conditions. Following [14], we split the dataset with 16/4 for seen/unseen classes.
MS-COCO is a collection of common object instances in complex everyday scenes.
Conclusion
In this paper, we propose a Semantic-Visual Auto-Encoder network (SVAE) to address the zero-shot object detection task. By integrating a 1-dimensional convolution with various shared filters to construct the auto-encoder, the SVAE maps semantic features into the visual space to alleviate the hubness problem. For the semantic alignment in the classification subnet, we design a softplus margin focal loss to distinguish semantic projections on positive categories from those on negative categories by margins and address the confusion between unseen objects and the background.
CRediT authorship contribution statement
Qianzhong Li: Conceptualization, Formal analysis, Software, Validation, Writing - original draft. Yujia Zhang: Methodology, Writing - review & editing. Shiying Sun: Investigation, Data curation. Xiaoguang Zhao: Supervision, Project administration. Kang Li: Visualization. Min Tan: Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Key Research and Development Project of China (Grants No. 2019YFB1310601), the National Key R&D Program of China (Grants No. 2017YFC0820203-03), and the National Natural Science Foundation of China (Grants No. 61673378).
References (50)
- et al., Joint discriminative attributes and similarity embeddings modeling for zero-shot recognition, Neurocomputing, 2020.
- et al., Mask R-CNN, IEEE International Conference on Computer Vision (ICCV), 2017.
- et al., Focal loss for dense object detection, IEEE International Conference on Computer Vision (ICCV), 2017.
- et al., SSD: Single shot multibox detector, European Conference on Computer Vision, 2016.
- J. Redmon, A. Farhadi, YOLOv3: An Incremental Improvement, arXiv...
- et al., Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE, 2017.
- et al., The PASCAL visual object classes (VOC) challenge, International Journal of Computer Vision, 2010.
- et al., Microsoft COCO: Common objects in context.
- et al., ImageNet large scale visual recognition challenge, International Journal of Computer Vision, 2015.
- et al., A multi-space approach to zero-shot object detection, IEEE Winter Conference on Applications of Computer Vision (WACV), 2020.
- Improved visual-semantic alignment for zero-shot object detection.
- Zero-shot detection with transferable object proposal mechanism, IEEE International Conference on Image Processing (ICIP).
- Zero shot detection, IEEE Transactions on Circuits and Systems for Video Technology.
- Don't even look once: Synthesizing features for zero-shot detection, IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Zero-shot object detection by hybrid region embedding, British Machine Vision Conference.
- Zero-shot object detection: Learning to simultaneously recognize and localize novel concepts, Asian Conference on Computer Vision, Springer.
- Zero-shot object detection, European Conference on Computer Vision, Springer.
- Hubs in space: Popular nearest neighbors in high-dimensional data, Journal of Machine Learning Research.
- Ridge regression, hubness, and zero-shot learning.
- Convolutional neural networks for sentence classification.
- Zero-shot learning by convex combination of semantic embeddings.
- Latent embeddings for zero-shot classification.
Qianzhong Li received the B.E. degree in Control Science and Engineering from Central South University, Hunan, China, in 2017. He is currently pursuing the Ph.D. degree in Control Theory and Control Engineering with the Institute of Automation, Chinese Academy of Sciences, Beijing, China. His current research interests include computer vision and intelligent robot systems.
Yujia Zhang received the B.E. degree in computer science from Xi’an Jiaotong University in 2014, and the Ph.D. degree in control theory and control engineering from the Institute of Automation, Chinese Academy of Sciences (CASIA) in 2019. She is currently an Assistant Professor with the State Key Laboratory of Management and Control for Complex Systems, CASIA. Her research interests are computer vision and robotics.
Shiying Sun received the B.E. degree in Control Science and Engineering from Central South University, Hunan, China, in 2013, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences (IACAS), Beijing, China, in 2019, both in control science and engineering. He is currently a postdoctor with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. His current research interests include advanced robot control, navigation and computer vision.
Xiaoguang Zhao received the B.E. degree in control engineering from Shenyang University of Technology, Shenyang, China, in 1992, and the M.E. and Ph.D. degree in control theory and control engineering at the Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, China, in 1998 and 2001, respectively. She is currently a Professor with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. Her current research interests include advanced robot control, wireless sensor network and robot vision.
Kang Li received the B.E. degree from Central South University, Hunan, China, in 2014, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences (IACAS), Beijing, China, in 2019, both in Control Theory and Control Engineering. His research interests include human-machine interaction, computer vision and cognitive neural science.
Min Tan received the B.E. degree from Tsinghua University, Beijing, China, in 1986, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences (IACAS), Beijing, China, in 1990, both in control science and engineering. He is currently a Professor with the State Key Laboratory of Management and Control for Complex Systems, IACAS. He has published more than 200 papers in journals, books, and conference proceedings. His research interests include robotics and intelligent control systems.