Pattern Recognition

Volume 121, January 2022, 108199

Context-aware co-supervision for accurate object detection

https://doi.org/10.1016/j.patcog.2021.108199

Highlights

  • We advocate the importance of equipping two-stage detectors with top-down signals, which provide high-level contextual cues to complement low-level features. In practice, this is implemented by adding a side path in the detection head to predict all object classes in the image, which is co-supervised by image-level semantics and requires little extra overhead.

  • Our research reveals the usefulness of combining top-down and bottom-up signals in object detection, and we believe it generalizes to other tasks. The simplicity and originality of our approach leave much room for future research, in which we will append more powerful modules to enhance contexts and other cues for visual recognition.

Abstract

State-of-the-art object detection approaches are often composed of two stages, namely, proposing a number of regions on an image and classifying each of them into one class. Both stages share a network backbone which builds visual features in a bottom-up manner. In this paper, we advocate the importance of equipping two-stage detectors with top-down signals, which provide high-level contextual cues to complement low-level features. In practice, this is implemented by adding a side path in the detection head to predict all object classes in the image, which is co-supervised by image-level semantics and requires little extra overhead. Our approach is easily applied to two popular object detection algorithms, and achieves consistent performance gains on the MS-COCO dataset.

Introduction

Object detection is a challenging task in computer vision. Recently, with the development of deep learning especially convolutional neural networks, researchers have designed an effective two-stage framework for object detection [1], [2], [3]. Based on a powerful network backbone, these methods first use a proposal stage to extract a number of regions as probable objects, and then apply an individual classifier stage to distinguish the object class within each region.

From another point of view, there are two approaches to understanding the process of perception. Top-down processing is defined as the development of pattern recognition through the use of contextual information. For instance, suppose you are presented with a paragraph written in difficult handwriting. It is easier to understand what the writer wants to convey if you read the whole paragraph rather than reading the words in isolation: the brain can perceive the gist of the paragraph thanks to the context supplied by the surrounding words. In bottom-up processing, perception starts at the sensory input, the stimulus, and can thus be described as data-driven. For example, when there is a flower at the center of a person’s field of view, the sight of the flower and all the information about the stimulus is carried from the retina to the visual cortex. Two-stage detectors resemble the bottom-up paradigm, since regional features are computed purely from local visual stimuli; we argue that such guidance is mostly context-free and thus not aware of high-level semantic cues in the surrounding areas. When a number of small proposals are extracted from a complex scene, the corresponding features, computed from a small fraction of the feature map, are less capable of accurately determining the object classes. With less confident scores, these small objects can be suppressed by large objects and thus mis-detected. Consequently, we propose a context-aware module, which extracts top-down signals to assist bottom-up signals for object detection. An example is shown in Fig. 1.

To assist object detection with richer high-level semantic information, we propose an efficient module which introduces context awareness into the detection network. This is achieved by adding an extra classification path to determine which object classes appear in the entire image. That is to say, the features built up in the backbone are expected to be able to depict both each individual proposal and the entire image, where the latter provides top-down information to refine the former’s outputs. In practice, we propose a global context encoding (GCE) module to summarize all features, and simultaneously feed these global features to complement the features pooled within each proposal. There are two sources of supervision, i.e., co-supervision. The first lies in the conventional classification and regression within each proposal, and the second is produced by the image-level classification obtained by adding a linear layer on top of the GCE features. The image-level classification is deeply supervised with a modified binary cross-entropy (BCE) loss, which will be introduced in Section 3. As shown in Fig. 2, we first summarize the backbone features and send them for image-level classification; we then fuse these global features with the RoI-pooled features within each proposal to enhance regional classification and regression. Our approach is easy to implement and generalizes well to a wide range of detection frameworks, including popular add-ons such as the feature pyramid network (FPN) [4], which enables multi-scale feature extraction. A detailed flowchart can be found in Fig. 3.
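For concreteness, the following is a minimal PyTorch sketch of what such a global-context side path with image-level co-supervision could look like. The module layout, channel sizes (e.g., context_dim), fusion by concatenation, and the plain multi-label BCE loss are illustrative assumptions; the paper's actual GCE module and its modified BCE loss are detailed in Section 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContextEncoding(nn.Module):
    """Hypothetical global-context side path: pools the shared feature map,
    predicts image-level classes, and returns a global feature vector that
    can be fused with per-proposal RoI features."""

    def __init__(self, in_channels=256, context_dim=1024, num_classes=80):
        super().__init__()
        self.encode = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),           # summarize the entire image
            nn.Flatten(),
            nn.Linear(in_channels, context_dim),
            nn.ReLU(inplace=True),
        )
        # linear layer on top of the GCE features for image-level classification
        self.img_cls = nn.Linear(context_dim, num_classes)

    def forward(self, feat_map):
        ctx = self.encode(feat_map)            # (B, context_dim)
        img_logits = self.img_cls(ctx)         # multi-label logits, one per class
        return ctx, img_logits

def image_level_loss(img_logits, img_labels):
    # plain multi-label BCE as a stand-in for the paper's modified BCE (Section 3)
    return F.binary_cross_entropy_with_logits(img_logits, img_labels.float())

def fuse_with_proposals(roi_feats, ctx, batch_idx):
    # roi_feats: (N, D_roi) RoI-pooled features; ctx: (B, context_dim);
    # batch_idx: (N,) image index of each proposal. Concatenation is one
    # plausible fusion; the paper's exact fusion may differ.
    return torch.cat([roi_feats, ctx[batch_idx]], dim=1)
```

The fused features then feed the usual per-proposal classification and regression heads, so the image-level branch adds only a pooling step and two linear layers of overhead.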

Experiments are performed on the MS-COCO dataset [5], which contains a large number of small and medium objects in complex scenes. Two popular baselines are considered, namely, Faster R-CNN [2] and Mask R-CNN [3], both of which are equipped with FPN [4]. Our approach, requiring less than 10% extra time and memory cost, achieves consistent accuracy gains, e.g., a 1.01% AP improvement and a 1.85% AP@0.5 improvement over Mask R-CNN. Diagnostic experiments suggest that our approach is especially effective in finding small and difficult objects. This work paves the way for future research in combining top-down semantic-aware information with bottom-up regional features, which is also applicable to a wide range of vision tasks.

The remainder of this paper is organized as follows. Section 2 briefly reviews prior work, and Section 3 describes our approach. After experiments are shown in Section 4, we conclude this work in Section 5.


Deep learning for object detection

In the past years, with the blooming development of deep learning, great progress has been achieved in the field of object detection. Compared with traditional object detection, deep-learning-based methods have great advantages. In the feature extraction stage, deep learning makes use of convolutional neural networks to learn features instead of relying on handcrafted features. At the stage of candidate box selection, object detection algorithms can be partitioned into two types.

The first type, known as

Object detection

In this paper, we extend the two-stage architectures of Mask R-CNN [3] and Faster R-CNN with FPN [4]. FPN is a multi-scale RPN which proposes candidate bounding boxes from each level of the backbone. In FPN, a feature pyramid is constructed to cope with large variations in object size. A head consisting of a 3×3 convolution and two sibling 1×1 convolutions is attached to each level of the feature pyramid to predict the objectness of multiple pre-defined anchors. Formally, the feature pyramid is designed to have five stages {P2, P3, P4, P5, P6} in FPN. P
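A minimal PyTorch sketch of such an RPN head is given below; the channel count and number of anchors are typical FPN defaults assumed for illustration, not values taken from the paper.

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the head described above: a shared 3x3 conv followed by two
    sibling 1x1 convs predicting per-anchor objectness and box deltas."""

    def __init__(self, in_channels=256, num_anchors=3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.objectness = nn.Conv2d(in_channels, num_anchors, kernel_size=1)
        self.bbox_deltas = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)

    def forward(self, pyramid_feats):
        # the same head runs on every level P2..P6 of the feature pyramid
        outputs = []
        for feat in pyramid_feats:
            x = self.relu(self.conv(feat))
            outputs.append((self.objectness(x), self.bbox_deltas(x)))
        return outputs
```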

Dataset and settings

We use the MS-COCO dataset [5], which contains 118K training images and 5K testing images, covering 80 object categories. As defined by MS-COCO, targets with an area smaller than 32×32 pixels are small-scale targets, targets with an area between 32×32 and 96×96 pixels are medium-scale targets, and targets with an area larger than 96×96 pixels are large-scale targets. Moreover, the size distribution of MS-COCO is not balanced, and small targets account for the largest share of the dataset. MS-COCO
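These thresholds can be expressed directly; the short Python sketch below assigns the standard MS-COCO size bucket from an object's pixel area.

```python
def coco_size_bucket(area_px):
    """Standard MS-COCO size buckets: small below 32^2 pixels, medium between
    32^2 and 96^2, large above 96^2."""
    if area_px < 32 ** 2:
        return "small"
    if area_px < 96 ** 2:
        return "medium"
    return "large"
```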

Conclusions

In this paper, we present an intuitive approach which introduces image-level co-supervision in order to provide richer contextual cues for object detection. The implementation is simple (with a lightweight module containing regular operations) and efficient (requiring less than 10% extra overhead), yet it achieves consistent gains in detection accuracy. Our research reveals the usefulness of combining top-down and bottom-up signals in object detection, and we believe it generalizes to other tasks.

Declaration of Competing Interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome. We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us. We confirm that

Acknowledgements

This work was supported in part by the National Key R&D Program of China (No. 2018YFB1402605, No. 2018YFB1004602, No. 2019QY1604), the Major Project for New Generation of AI (No. 2018AAA0100400), the National Natural Science Foundation of China (No. 61836014, No. 61773375, No. 62006231, No. 62072457), and in part by the TuSimple Collaborative Research Project.


References (51)

  • J. Redmon et al.

    YOLO9000: better, faster, stronger

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • W. Liu et al.

    SSD: single shot multibox detector

    Proceedings of the European Conference on Computer Vision

    (2016)
  • C. Fu et al.

    DSSD: deconvolutional single shot detector

    arXiv preprint arXiv:1701.06659

    (2017)
  • H. Law et al.

    CornerNet: detecting objects as paired keypoints

    Proceedings of the European Conference on Computer Vision

    (2018)
  • S. Zhang et al.

    Single-shot refinement neural network for object detection

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2018)
  • G. Ghiasi et al.

    NAS-FPN: learning scalable feature pyramid architecture for object detection

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2019)
  • M. Tan et al.

    EfficientNet: rethinking model scaling for convolutional neural networks

    arXiv preprint arXiv:1905.11946

    (2019)
  • S. Liu et al.

    Receptive field block net for accurate and fast object detection

    arXiv preprint arXiv:1711.07767

    (2017)
  • Z. Shen et al.

    DSOD: learning deeply supervised object detectors from scratch

    Proceedings of the IEEE International Conference on Computer Vision

    (2017)
  • J. Yuan et al.

    Gated CNN: integrating multi-scale feature layers for object detection

    Pattern Recognition

    (2019)
  • A. Shrivastava et al.

    Beyond skip connections: top-down modulation for object detection

    arXiv preprint arXiv:1612.06851

    (2016)
  • P. Zhou et al.

    Scale-transferrable object detection

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2018)
  • H.O. Song et al.

    On learning to localize objects with minimal supervision

    International Conference on Machine Learning

    (2014)
  • Z. Jie et al.

    Deep self-taught learning for weakly supervised object localization

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • H. Bilen et al.

    Weakly supervised deep detection networks

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)

    Junran Peng received his B.S. from Tsinghua University and is currently pursuing a Ph.D. at the Chinese Academy of Sciences. He has published several papers in international conferences such as ICCV, NeurIPS and CVPR. His research interests include object detection, AutoML and machine learning.

    Haoquan Wang received his B.S. degree from the School of Physics and Electronics, Hunan University, in 2019. He is currently pursuing the M.S. degree with the School of Microelectronics, Tianjin University, China. His research interests include computer vision, machine learning and super-resolution reconstruction.

    Shaolong Yue received his B.Sc. degree from the School of Electrical and Electronic Engineering, Shandong University of Technology, China, in 2018. He is currently pursuing the M.S. degree with the School of Control and Computer Engineering, North China Electric Power University, China. His research interests include pattern recognition, computer vision and machine learning.

    Zhaoxiang Zhang received his bachelor's degree in Circuits and Systems from the University of Science and Technology of China (USTC) in 2004. In 2004, he joined the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, under the supervision of Professor Tieniu Tan, and he received his Ph.D. degree in 2009. In October 2009, he joined the School of Computer Science and Engineering, Beihang University, as an Assistant Professor (2009–2011), an Associate Professor (2012–2015) and the vice-director of the Department of Computer Application Technology (2014–2015). In July 2015, he returned to the Institute of Automation, Chinese Academy of Sciences. He is now a full Professor in the Center for Research on Intelligent Perception and Computing (CRIPAC) and the National Laboratory of Pattern Recognition (NLPR). His research interests include computer vision, pattern recognition, and machine learning. Recently, he has focused on biologically inspired intelligent computing and its applications to human analysis and scene understanding. He has published more than 150 papers in international journals and conferences, including reputable international journals such as IEEE TIP, IEEE TCSVT and IEEE TIFS, and top-level international conferences such as CVPR, ICCV, ECCV, NeurIPS, AAAI and IJCAI. He has served as a (guest) associate editor of IEEE TCSVT, PR, PRL, and Frontiers of Computer Science, and as an area chair, senior PC member or PC member of international conferences such as CVPR, NeurIPS, ICML, AAAI and IJCAI.
