Context-aware co-supervision for accurate object detection
Introduction
Object detection is a challenging task in computer vision. Recently, with the development of deep learning, especially convolutional neural networks, researchers have designed an effective two-stage framework for object detection [1], [2], [3]. Built on a powerful backbone network, these methods first use a proposal stage to extract a number of regions that probably contain objects, and then apply a separate classifier stage to determine the object class within each region.
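To make the two-stage pipeline concrete, the following minimal sketch runs an off-the-shelf Faster R-CNN with an FPN backbone from torchvision. This is a generic illustration rather than the method proposed in this paper, and it assumes torchvision 0.13 or newer, where the weights="DEFAULT" argument is available.

    import torch
    import torchvision

    # Off-the-shelf two-stage detector: an RPN proposes candidate regions,
    # then a box head classifies and refines each proposal.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    model.eval()

    image = torch.rand(3, 480, 640)  # dummy RGB image with values in [0, 1]
    with torch.no_grad():
        detections = model([image])[0]  # dict with 'boxes', 'labels', 'scores'
    print(detections["boxes"].shape, detections["scores"].shape)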
From another point of view, there are two kinds of approaches to understanding the process of perception. Top-down processing is defined as the development of pattern recognition through the use of contextual information. For instance, when you are presented with a paragraph written in difficult handwriting, it is easier to understand what the writer wants to convey if you read the whole paragraph rather than reading the words in isolation: the brain can perceive the gist of the paragraph from the context supplied by the surrounding words. In bottom-up processing, by contrast, perception starts at the sensory input, the stimulus, so perception can be described as data-driven. For example, when a flower appears at the center of a person's visual field, the sight of the flower and all the information about the stimulus is carried from the retina to the visual cortex. Conventional detection pipelines are mainly bottom-up, computing features directly from each proposed region. We argue that such guidance is mostly context-free and thus not aware of high-level semantic cues in the surrounding areas. When a number of small proposals are extracted from a complex scene, the corresponding features, computed from a small fraction of the feature map, are less capable of accurately determining the object classes. With less confident scores, these small objects can be suppressed by large objects and thus mis-detected. Consequently, we propose a context-aware module, which extracts top-down signals to assist the bottom-up signals for object detection. An example is shown in Fig. 1.
To assist object detection with richer high-level semantic information, we propose an efficient context-aware module added to the detection network. This is achieved by attaching an extra classification path that determines which object classes appear in the entire image. That is to say, the features built up in the backbone are expected to be able to depict both each individual proposal and the entire image, where the latter provides top-down information to refine the former outputs. In practice, we propose a global context encoding (GCE) module to summarize all features, and simultaneously feed these global features back to complement the features pooled within each proposal. There are thus two sources of supervision, i.e., co-supervision: the first lies in the conventional classification and regression within each proposal, and the second is produced by the image-level classification obtained by adding a linear layer on top of the GCE features. The image-level classification is deeply supervised with a modified binary cross-entropy (BCE) loss, which will be introduced in Section 3. As shown in Fig. 2, we first summarize regional features and send them for image-level classification; we then fuse the resulting features with the ROI-pooled features within each proposal to enhance regional classification and regression. Our approach is easy to implement and generalizes well to a wide range of detection frameworks, including popular add-ons such as the feature pyramid network (FPN) [4], which enables multi-scale feature extraction. A detailed flowchart can be found in Fig. 3.
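To illustrate the idea, the following is a minimal PyTorch sketch of what such a module could look like. It is a hypothetical reconstruction, not the paper's released code: the class and function names, the channel sizes, and the use of plain (rather than the paper's modified) BCE are all our assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GlobalContextEncoding(nn.Module):
        # Pools the backbone feature map into one global vector, predicts
        # image-level classes (the co-supervision branch), and returns the
        # vector so it can be fused with per-proposal ROI features.
        def __init__(self, in_channels, num_classes, feat_dim=256):
            super().__init__()
            self.pool = nn.AdaptiveAvgPool2d(1)              # summarize the whole image
            self.encode = nn.Linear(in_channels, feat_dim)
            self.img_cls = nn.Linear(feat_dim, num_classes)  # image-level classifier

        def forward(self, feature_map):
            g = self.pool(feature_map).flatten(1)  # (N, C)
            g = F.relu(self.encode(g))             # (N, feat_dim)
            return g, self.img_cls(g)              # global features, multi-label logits

    # Image-level co-supervision: multi-label BCE against the multi-hot vector of
    # classes present anywhere in the image (plain BCE here; the paper modifies it).
    def image_level_loss(img_logits, img_labels):
        return F.binary_cross_entropy_with_logits(img_logits, img_labels)

    # Fusion: broadcast each image's global vector to its proposals and
    # concatenate with the ROI-pooled features.
    def fuse(roi_feats, g, rois_per_image):
        g_exp = torch.cat([gi.unsqueeze(0).expand(n, -1)
                           for gi, n in zip(g, rois_per_image)], dim=0)
        return torch.cat([roi_feats, g_exp], dim=1)  # (R, roi_dim + feat_dim)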
Experiments are performed on the MS-COCO dataset [5], which contains a large number of small and medium objects in complex image content. Two popular baselines are considered, namely Faster R-CNN [2] and Mask R-CNN [3], both equipped with FPN [4]. Our approach, requiring only minor extra time and memory costs, achieves consistent accuracy gains, e.g., improvements in both AP and AP@0.5 over Mask R-CNN. Diagnostic experiments suggest that our approach is especially effective in finding small and difficult objects. This work paves the way for future research in combining top-down semantic-aware information with bottom-up regional features, which is also applicable to a wide range of vision tasks.
The remainder of this paper is organized as follows. Section 2 briefly reviews prior work, and Section 3 describes our approach. After experiments are shown in Section 4, we conclude this work in Section 5.
Section snippets
Deep learning for object detection
In the past few years, with the blooming development of deep learning, great progress has been achieved in the field of object detection. Compared with traditional object detection, methods based on deep learning have great advantages. In the feature extraction stage, deep learning makes use of convolutional neural networks to learn features instead of relying on handcrafted features. At the stage of candidate box selection, object detection algorithms can be partitioned into two types.
The first type, known as two-stage detectors, first generates a set of region proposals and then classifies and refines each of them; the second type, known as one-stage detectors, directly predicts object classes and locations without an explicit proposal stage.
Object detection
In this paper, we extend the two-stage architectures of Mask R-CNN [3] and Faster R-CNN, both with FPN [4]. FPN is a multi-scale RPN which proposes candidate bounding boxes from each level of the backbone. In FPN, a feature pyramid is constructed to cope with the large variation of object sizes. A head consisting of one convolutional layer followed by two sibling convolutional layers is attached to each level of the feature pyramid to predict the objectness of multiple pre-defined anchors. Formally, FPN is designed to have six stages.
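As an illustration, a per-level head of this form might look as follows in PyTorch. This is a generic sketch of the standard FPN/RPN head rather than the paper's exact configuration; the channel count and number of anchors are assumptions.

    import torch
    import torch.nn as nn

    class RPNHead(nn.Module):
        # One shared 3x3 conv followed by two sibling 1x1 convs that predict
        # anchor objectness scores and box regression deltas, respectively.
        def __init__(self, in_channels=256, num_anchors=3):
            super().__init__()
            self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
            self.objectness = nn.Conv2d(in_channels, num_anchors, 1)
            self.bbox_deltas = nn.Conv2d(in_channels, num_anchors * 4, 1)  # (dx, dy, dw, dh)

        def forward(self, feature):
            x = torch.relu(self.conv(feature))
            return self.objectness(x), self.bbox_deltas(x)

    # The same head is shared across all pyramid levels.
    head = RPNHead()
    for size in (64, 32, 16, 8):
        obj, deltas = head(torch.rand(1, 256, size, size))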
Dataset and settings
We use the MS-COCO dataset [5], which contains separate training and testing images covering 80 object categories. As defined by MS-COCO, objects with an area smaller than 32×32 pixels are small-scale, objects with an area between 32×32 and 96×96 pixels are medium-scale, and objects with an area larger than 96×96 pixels are large-scale. Moreover, the scale distribution of MS-COCO is not balanced, and small objects account for the largest proportion of the dataset.
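For reference, the scale partition above can be expressed as a small helper; the function below is purely illustrative.

    def coco_scale(area):
        # MS-COCO convention: small < 32*32 <= medium < 96*96 <= large (area in pixels)
        if area < 32 ** 2:
            return "small"
        if area < 96 ** 2:
            return "medium"
        return "large"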
Conclusions
In this paper, we present an intuitive approach which introduces image-level co-supervision in order to provide richer contextual cues for object detection. The implementation is simple (a light-weight module consisting of regular operations) and efficient (requiring only minor extra overhead), yet it achieves consistent gains in detection accuracy. Our research reveals the usefulness of combining top-down and bottom-up signals in object detection, and we believe it to be applicable to a wide range of vision tasks.
Declaration of Competing Interest
We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome. We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.
Acknowledgements
This work was supported in part by the National Key R&D Program of China (No. 2018YFB1402605, No. 2018YFB1004602, No. 2019QY1604), the Major Project for New Generation of AI (No. 2018AAA0100400), the National Natural Science Foundation of China (No. 61836014, No. 61773375, No. 62006231, No. 62072457), and in part by the TuSimple Collaborative Research Project.
References (51)
- CAGNet: content-aware guidance for salient object detection, Pattern Recognition, 2020.
- MDFN: multi-scale deep feature learning network for object detection, Pattern Recognition, 2020.
- Robust one-stage object detection with location-aware classifiers, Pattern Recognition, 2020.
- Multi-model ensemble with rich spatial information for object detection, Pattern Recognition, 2020.
- Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
- Faster R-CNN: towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems, 2015.
- Mask R-CNN, Proceedings of the IEEE International Conference on Computer Vision, 2017.
- Feature pyramid networks for object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- Microsoft COCO: common objects in context, Proceedings of the European Conference on Computer Vision, 2014.
- You only look once: unified, real-time object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- YOLO9000: better, faster, stronger, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- SSD: single shot multibox detector, Proceedings of the European Conference on Computer Vision, 2016.
- DSSD: deconvolutional single shot detector, arXiv preprint arXiv:1701.06659, 2017.
- CornerNet: detecting objects as paired keypoints, Proceedings of the European Conference on Computer Vision, 2018.
- Single-shot refinement neural network for object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- NAS-FPN: learning scalable feature pyramid architecture for object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
- EfficientNet: rethinking model scaling for convolutional neural networks, arXiv preprint arXiv:1905.11946, 2019.
- Receptive field block net for accurate and fast object detection, arXiv preprint arXiv:1711.07767, 2017.
- DSOD: learning deeply supervised object detectors from scratch, Proceedings of the IEEE International Conference on Computer Vision, 2017.
- Gated CNN: integrating multi-scale feature layers for object detection, Pattern Recognition.
- Beyond skip connections: top-down modulation for object detection, arXiv preprint arXiv:1612.06851, 2016.
- Scale-transferrable object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- On learning to localize objects with minimal supervision, International Conference on Machine Learning, 2014.
- Deep self-taught learning for weakly supervised object localization, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- Weakly supervised deep detection networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
Cited by (26)
- YOLO*C — Adding context improves YOLO performance, Neurocomputing, 2023.
- Multiscale features integration based multiple-in-single-out network for object detection, Image and Vision Computing, 2023.
- Cycle-object consistency for image-to-image domain adaptation, Pattern Recognition, 2023.
Junran Peng received his B.S. from Tsinghua University and is currently pursuing a Ph.D. at the Chinese Academy of Sciences. He has published several papers at international conferences such as ICCV, NeurIPS and CVPR. His research interests include object detection, AutoML and machine learning.
Haoquan Wang received his B.S. degree from the School of Physics and Electronics, Hunan University in 2019. He is currently pursuing the M.S. degree with the School of Microelectronics, Tianjin University, China. His research interests include computer vision, machine learning and super-resolution reconstruction.
Shaolong Yue received his B.Sc. degree from the School of Electrical and Electronic Engineering, Shandong University of Technology, China, in 2018. He is currently pursuing the M.S. degree with the School of Control and Computer Engineering, North China Electric Power University, China. His research interests include pattern recognition, computer vision and machine learning.
Zhaoxiang Zhang received his bachelor's degree in Circuits and Systems from the University of Science and Technology of China (USTC) in 2004. In 2004, he joined the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, under the supervision of Professor Tieniu Tan, and he received his Ph.D. degree in 2009. In October 2009, he joined the School of Computer Science and Engineering, Beihang University, as an Assistant Professor (2009–2011), an Associate Professor (2012–2015) and the vice-director of the Department of Computer Application Technology (2014–2015). In July 2015, he returned to the Institute of Automation, Chinese Academy of Sciences. He is now a full Professor in the Center for Research on Intelligent Perception and Computing (CRIPAC) and the National Laboratory of Pattern Recognition (NLPR). His research interests include Computer Vision, Pattern Recognition, and Machine Learning. Recently, he has specifically focused on biologically inspired intelligent computing and its applications to human analysis and scene understanding. He has published more than 150 papers in international journals and conferences, including reputable journals such as IEEE TIP, IEEE TCSVT and IEEE TIFS, and top-level conferences such as CVPR, ICCV, ECCV, NeurIPS, AAAI and IJCAI. He has served as a (guest) associate editor of IEEE TCSVT, PR, PRL, and Frontiers of Computer Science, and as an area chair, senior PC member or PC member for international conferences such as CVPR, NeurIPS, ICML, AAAI and IJCAI.