Neurocomputing

Volume 455, 30 September 2021, Pages 431-440

Single-shot weakly-supervised object detection guided by empirical saliency model

https://doi.org/10.1016/j.neucom.2021.03.047

Abstract

Even though weakly-supervised object detection (WSOD) has become an effective way to relieve the heavy burden of labeling, difficult problems remain to be solved. WSOD methods based on Multiple Instance Learning (MIL) share common problems: they run slowly and tend to focus on discriminative parts rather than the whole object, which leads to false detections. To improve both efficiency and accuracy, we propose a single-shot weakly-supervised object detection model guided by an empirical saliency model (SSWOD). As human vision always focuses on the most attractive parts of an image, saliency maps can usually guide our model to locate the most promising object areas. In this way, our model takes the salient areas as pseudo ground-truths to accomplish the WSOD task with only class labels. Moreover, empirical saliency is designed to refine the pseudo ground-truths and improve detection. Our new framework not only realizes one-step detection without region proposals but also reduces computational cost. Experiments on the PASCAL VOC 2007 & 2012 benchmarks demonstrate that SSWOD is 8 times faster and 5 times smaller than previous approaches, surpassing state-of-the-art WSOD methods by 6.1% mean average precision (mAP).

Introduction

Many object detection algorithms have emerged [1], [2], [3], [4], [5] due to the development of Convolutional Neural Networks (CNNs) [6], [7]. Thanks to large-scale datasets with instance-level annotations, these methods achieve high accuracy in tasks such as military reconnaissance, face detection [8], and medical diagnosis [9]. The instances in training images are labeled with both classes and locations. High-accuracy algorithms even explore the relationships between objects to boost performance [10], [11]. Correlation filters are applied in recent advanced methods [12] to combine spatial features with semantic information. However, these advanced methods all rely on refined annotations, which demand intensive labor and financial support. Moreover, it is extremely hard to obtain labels in some special tasks. Thus, Weakly-Supervised Object Detection (WSOD) is becoming an important and promising topic. WSOD refers to training a detection network on images labeled only with classes. In this way, labeling cost is saved and the training data can be easily expanded. Therefore, a growing number of researchers are entering the field of WSOD, that is, training CNNs with only image-level labels.

Most researchers have made progress on WSOD through Multiple Instance Learning (MIL) [13], [14], [15], [16], [17], [18], but a huge gap (about 30% mAP) remains between these methods and fully-supervised algorithms. Moreover, during both training and inference, MIL-based methods need region proposals provided by external algorithms such as Selective Search (SS) [19] or Edge Boxes (EB) [20], which are also time-consuming. In addition, as shown in Fig. 1, MIL-based methods tend to latch onto the most obvious features of the object, which often causes mistakes. Other researchers have tried to use pre-trained models and image-level annotations to generate pseudo ground-truths for training. However, these methods are not trained end-to-end, and their inference speed cannot meet real-time requirements.

In this paper, we focus on WSOD problems. To overcome the shortcomings of MIL-based approaches in computational cost and accuracy, a new framework beyond MIL is needed, so we put forward a single-shot weakly-supervised object detection model guided by an empirical saliency model (SSWOD). CNNs have strong capabilities in feature extraction and generalization, and their inference speed is also outstanding under Graphics Processing Unit (GPU) acceleration. Hence, an end-to-end CNN can outperform classic machine learning approaches in both accuracy and speed. We therefore construct pseudo ground-truths through empirical saliency to avoid MIL, skip the time-consuming preprocessing, and fully exploit CNNs for training and inference.

In a WSOD problem, class information is available from the annotations, so the key question is how to achieve accurate detection without precise location labels. In natural images, the attention mechanism of the human eye means that most targets lie near the center of an image. Based on this prior, we introduce saliency detection to the field. Through pre-trained saliency models, we can obtain the rough locations of the targets. However, human visual attention is not sensitive to size: even though the position of an object obtained from saliency is relatively accurate, its estimated size is often slightly larger than it really is. Thus, directly using saliency information as pseudo ground-truth would harm the performance of the model. We therefore gather empirical information from the statistics of a small amount of data in order to refine the pseudo ground-truths constructed by the pre-trained model.
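As a sketch of this idea, converting a saliency map into a refined pseudo ground-truth box might look as follows. The threshold ratio and shrink factor here are illustrative assumptions for exposition, not the paper's actual parameters; the shrink step mimics the empirical observation that salient regions run slightly larger than the true object extent.

```python
import numpy as np

def saliency_to_pseudo_box(saliency, thresh_ratio=0.5, shrink=0.9):
    """Turn a 2-D saliency map into a pseudo ground-truth box.

    Pixels above `thresh_ratio * max` are treated as salient; the
    tight box around them is then shrunk toward its center by the
    empirical factor `shrink` (both values are illustrative).
    Returns (x0, y0, x1, y1).
    """
    mask = saliency >= thresh_ratio * saliency.max()
    ys, xs = np.nonzero(mask)
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    # Shrink the box around its center to counter size overestimation.
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    w, h = (x1 - x0) * shrink, (y1 - y0) * shrink
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Synthetic saliency map: a bright 20x20 blob inside a 64x64 image.
sal = np.zeros((64, 64))
sal[20:40, 10:30] = 1.0
box = saliency_to_pseudo_box(sal)
```

The returned box keeps the blob's center but is slightly tighter than the raw salient region, which is the behavior the empirical refinement aims for.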

Our contributions can be summarized as follows:

  • We build a weakly-supervised object detection framework based on the human visual mechanism, using salient regions as guidance to locate candidate target areas. Empirical pseudo ground-truths are introduced for training to avoid the tendency of MIL-based methods to focus on only part of the target.

  • We design a lightweight SSWOD network to achieve end-to-end training and inference. The proposed network has two branches for pseudo ground-truth generation and object detection respectively. Therefore, labeling and detection can be done in a single network during training. A fast single-shot detection can be achieved during inference.

  • Our SSWOD network surpasses state-of-the-art WSOD methods by 6.1% mAP on the PASCAL VOC 2007 & 2012 benchmarks. It also improves efficiency by up to 8 times and realizes real-time detection by avoiding the repetitive iterations of MIL-based methods.


Related work

MIL-based deep neural networks are currently the most popular way to solve WSOD problems [13], [14], [15], [18], [16], [21], [17], [22], [23]. MIL was originally proposed in [24] to solve the problem of drug activity prediction. The idea is that, in a classification problem, each sample is considered a bag that contains several instances. If any instance in a bag is positive, the bag is positive; only if all instances in a bag are negative is the bag negative. Only the category of the bag is known during training.
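The standard MIL bag-labeling rule can be sketched in a few lines (an illustrative toy, not the authors' implementation):

```python
def bag_label(instance_labels):
    """Standard MIL rule: a bag is positive if ANY instance is
    positive; it is negative only when EVERY instance is negative."""
    return any(instance_labels)

# In WSOD, an image is a bag and its region proposals are the
# instances: the image-level class label gives only the bag's label.
print(bag_label([False, True, False]))   # True: one positive instance
print(bag_label([False, False, False]))  # False: all instances negative
```

This asymmetry is what lets MIL train from image-level labels alone, but it also explains why such detectors can satisfy the bag label by firing on a single discriminative part instead of the whole object.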

Method

The proposed method generates pseudo ground-truths from saliency detection. After the pseudo ground-truths are refined by an empirical saliency mechanism, a fully supervised network follows. Unlike MIL-based methods, SSWOD does not need pre-computed region proposals and avoids repeated iterative calculations. Compared with other approaches based on pseudo ground-truths, our method obtains them via simple saliency algorithms, which is computationally efficient.

Experiments

In this section, we experimentally verify the effectiveness of the network and method described in Section 3. First, we introduce the experimental parameters and environment. Then, we verify the accuracy and speed of the model by comparing it with state-of-the-art methods. Finally, we demonstrate the effect of each part of the model through an ablation study.

Conclusion

In this paper, we present a single-shot framework (SSWOD) for weakly-supervised object detection. Unlike recent works, our method does not build on MIL, thereby avoiding its time-consuming iterations. We introduce saliency detection to WSOD and propose the empirical saliency model to guide the generation of pseudo ground-truths. Compared with current models, our method achieves real-time detection by combining the advantages of fully-supervised and weakly-supervised learning.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Danpei Zhao: Conceptualization, Methodology, Software, Formal analysis, Investigation, Resources, Visualization, Funding acquisition. Zhichao Yuan: Software, Validation, Formal analysis, Data curation, Writing - original draft, Visualization. Zhenwei Shi: Writing - review & editing, Project administration. Fengying Xie: Writing - review & editing, Supervision.



References (45)

  • Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, in: Proceedings of...
  • Y. Liu et al.

    Structure inference net: Object detection using scene-level context and instance-level relationships

  • J. Dai, Y. Li, K. He, J. Sun, R-fcn: Object detection via region-based fully convolutional networks., in: D.D. Lee, M....
  • J. Zhang, X. Jin, J. Sun, J. Wang, A.K. Sangaiah, Spatial and semantic convolutional features for robust visual object...
  • H. Bilen, A. Vedaldi, Weakly supervised deep detection networks., in: CVPR, IEEE Computer Society, 2016, pp. 2846–2854....
  • P. Tang, X. Wang, X. Bai, W. Liu, Multiple instance detection network with online instance classifier refinement., in:...
  • F. Wan, P. Wei, J. Jiao, Z. Han, Q. Ye, Min-entropy latent model for weakly supervised object detection., in: CVPR,...
  • J. Wang, J. Yao, Y. Zhang, R. Zhang, Collaborative learning for weakly supervised object detection., in: J. Lang (Ed.),...
  • F. Wan, C. Liu, W. Ke, X. Ji, J. Jiao, Q. Ye, C-mil: Continuation multiple instance learning for weakly supervised...
  • P. Tang, X. Wang, S. Bai, W. Shen, X. Bai, W. Liu, A.L. Yuille, Pcl: Proposal cluster learning for weakly supervised...
  • J.R.R. Uijlings, K.E.A. van de Sande, T. Gevers, A.W.M. Smeulders, Selective search for object recognition,...
  • C.L. Zitnick, P. Dollár, Edge boxes: Locating object proposals from edges., in: D.J. Fleet, T. Pajdla, B. Schiele, T....
Danpei Zhao is an associate professor at Beihang University, and has been the Vice Director of the Center of Image Processing at Beihang University. She currently serves as a standing member of the Executive Council of the Beijing Society of Image and Graphics. She received her Ph.D. in optical engineering from the Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, in 2006. From 2006 to 2008, she conducted postdoctoral research at Beihang University. She worked at the Department of Computer Science, Rutgers, The State University of New Jersey, USA, as a visiting scholar from 2014 to 2015. Her research interests include saliency detection, target detection and recognition, image understanding, and their application to remote sensing images.

Zhichao Yuan is an M.S. candidate at Beihang University. He received his B.S. from Beihang University, China, in 2019. His current interests include image processing and computer vision.

Zhenwei Shi (M'13) received his Ph.D. degree in mathematics from Dalian University of Technology, Dalian, China, in 2005. He was a Postdoctoral Researcher in the Department of Automation, Tsinghua University, Beijing, China, from 2005 to 2007. He was a Visiting Scholar in the Department of Electrical Engineering and Computer Science, Northwestern University, USA, from 2013 to 2014. He is currently a professor and the dean of the Image Processing Center, School of Astronautics, Beihang University. His current research interests include remote sensing image processing and analysis, computer vision, pattern recognition, and machine learning.

Dr. Shi serves as an Associate Editor for Infrared Physics and Technology. He has authored or co-authored over 100 scientific papers in refereed journals and proceedings, including the IEEE Transactions on Pattern Analysis and Machine Intelligence, the IEEE Transactions on Neural Networks, the IEEE Transactions on Geoscience and Remote Sensing, the IEEE Geoscience and Remote Sensing Letters, and the IEEE Conference on Computer Vision and Pattern Recognition. His personal website is http://levir.buaa.edu.cn/

Fengying Xie received the Ph.D. degree in pattern recognition and intelligent systems from Beihang University, Beijing, China, in 2009. She was a Visiting Scholar with the Laboratory for Image and Video Engineering, The University of Texas at Austin, from 2010 to 2011. She is currently a Professor with the Image Processing Center, School of Astronautics, Beihang University. Her research interests include biomedical image processing, remote sensing image understanding and application, image quality assessment, and object recognition.
