Pattern Recognition

Volume 121, January 2022, 108199

Context-aware co-supervision for accurate object detection

https://doi.org/10.1016/j.patcog.2021.108199

Highlights

  • We advocate the importance of equipping two-stage detectors with top-down signals, which provide high-level contextual cues to complement low-level features. In practice, this is implemented by adding a side path in the detection head to predict all object classes in the image, which is co-supervised by image-level semantics and requires little extra overhead.

  • Our research reveals the usefulness of combining top-down and bottom-up signals in object detection, and we believe it generalizes to other tasks. The simplicity and originality of our approach leave much room for future research, in which we will append more powerful modules to enhance contexts and other cues for visual recognition.

Abstract

State-of-the-art object detection approaches are often composed of two stages, namely, proposing a number of regions on an image and classifying each of them into one class. Both stages share a network backbone which builds visual features in a bottom-up manner. In this paper, we advocate the importance of equipping two-stage detectors with top-down signals, which provide high-level contextual cues to complement low-level features. In practice, this is implemented by adding a side path in the detection head to predict all object classes in the image, which is co-supervised by image-level semantics and requires little extra overhead. Our approach is easily applied to two popular object detection algorithms, and achieves consistent performance gains on the MS-COCO dataset.

Introduction

Object detection is a challenging task in computer vision. Recently, with the development of deep learning especially convolutional neural networks, researchers have designed an effective two-stage framework for object detection [1], [2], [3]. Based on a powerful network backbone, these methods first use a proposal stage to extract a number of regions as probable objects, and then apply an individual classifier stage to distinguish the object class within each region.

From another point of view, there are two approaches to understanding the process of perception. Top-down processing is defined as the development of pattern recognition through the use of contextual information. For instance, suppose you are presented with a paragraph written in difficult handwriting. It is easier to understand what the writer wants to convey if you read the whole paragraph rather than reading the words in isolation: the brain can perceive the gist of the paragraph thanks to the context supplied by the surrounding words. In bottom-up processing, perception starts at the sensory input, the stimulus, and can thus be described as data-driven. For example, when there is a flower at the center of a person’s field of view, the sight of the flower and all the information about the stimulus is carried from the retina to the visual cortex. Two-stage detectors resemble the bottom-up paradigm, since regional features are computed purely from local visual stimuli; we argue that such guidance is mostly context-free and thus not aware of high-level semantic cues in the surrounding areas. When a number of small proposals are extracted from a complex scene, the corresponding features, computed from a small fraction of the feature map, are less capable of accurately determining the object classes. With less confident scores, these small objects can be suppressed by large objects and thus mis-detected. Consequently, we propose a context-aware module, which extracts top-down signals to assist bottom-up signals for object detection. An example is shown in Fig. 1.

To assist object detection with richer high-level semantic information, we propose an efficient module which introduces context awareness into the detection network. This is achieved by adding an extra classification path to determine which object classes appear in the entire image. That is to say, the features built up in the backbone are expected to be able to depict both each individual proposal and the entire image, where the latter provides top-down information to refine the former’s outputs. In practice, we propose a global context encoding (GCE) module to summarize all features, and simultaneously feed these global features to complement the features pooled within each proposal. There are two sources of supervision, i.e., co-supervision. The first lies in the conventional classification and regression within each proposal, and the second is produced by the image-level classification obtained by adding a linear layer on top of the GCE features. The image-level classification is deeply supervised with a modified binary cross-entropy (BCE) loss, which will be introduced in Section 3. As shown in Fig. 2, we first summarize the backbone features and send them for image-level classification; we then fuse these global features with the RoI-pooled features within each proposal to enhance regional classification and regression. Our approach is easy to implement and generalizes well to a wide range of detection frameworks, including popular add-ons such as the feature pyramid network (FPN) [4], which enables multi-scale feature extraction. A detailed flowchart can be found in Fig. 3.
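For concreteness, the following is a minimal PyTorch sketch of what such a global-context side path with image-level co-supervision could look like. The module layout, channel sizes (e.g., context_dim), fusion by concatenation, and the plain multi-label BCE loss are illustrative assumptions; the paper's actual GCE module and its modified BCE loss are detailed in Section 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContextEncoding(nn.Module):
    """Hypothetical global-context side path: pools the shared feature map,
    predicts image-level classes, and returns a global feature vector that
    can be fused with per-proposal RoI features."""

    def __init__(self, in_channels=256, context_dim=1024, num_classes=80):
        super().__init__()
        self.encode = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),           # summarize the entire image
            nn.Flatten(),
            nn.Linear(in_channels, context_dim),
            nn.ReLU(inplace=True),
        )
        # linear layer on top of the GCE features for image-level classification
        self.img_cls = nn.Linear(context_dim, num_classes)

    def forward(self, feat_map):
        ctx = self.encode(feat_map)            # (B, context_dim)
        img_logits = self.img_cls(ctx)         # multi-label logits, one per class
        return ctx, img_logits

def image_level_loss(img_logits, img_labels):
    # plain multi-label BCE as a stand-in for the paper's modified BCE (Section 3)
    return F.binary_cross_entropy_with_logits(img_logits, img_labels.float())

def fuse_with_proposals(roi_feats, ctx, batch_idx):
    # roi_feats: (N, D_roi) RoI-pooled features; ctx: (B, context_dim);
    # batch_idx: (N,) image index of each proposal. Concatenation is one
    # plausible fusion; the paper's exact fusion may differ.
    return torch.cat([roi_feats, ctx[batch_idx]], dim=1)
```

The fused features then feed the usual per-proposal classification and regression heads, so the image-level branch adds only a pooling step and two linear layers of overhead.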

Experiments are performed on the MS-COCO dataset [5], which contains a large number of small and medium objects in complex scenes. Two popular baselines are considered, namely, Faster R-CNN [2] and Mask R-CNN [3], both of which are equipped with FPN [4]. Our approach, requiring less than 10% extra time and memory cost, achieves consistent accuracy gains, e.g., a 1.01% AP improvement and a 1.85% AP@0.5 improvement over Mask R-CNN. Diagnostic experiments suggest that our approach is especially effective in finding small and difficult objects. This work paves the way for future research in combining top-down semantic-aware information with bottom-up regional features, which is also applicable to a wide range of vision tasks.

The remainder of this paper is organized as follows. Section 2 briefly reviews prior work, and Section 3 describes our approach. After experiments are shown in Section 4, we conclude this work in Section 5.


Deep learning for object detection

In the past years, with the blooming development of deep learning, great progress has been achieved in the field of object detection. Compared with traditional object detection, deep-learning-based methods have great advantages. In the feature extraction stage, deep learning makes use of convolutional neural networks to learn features instead of relying on handcrafted features. At the stage of candidate box selection, object detection algorithms can be partitioned into two types.

The first type, known as

Object detection

In this paper, we extend the two-stage architectures of Mask R-CNN [3] and Faster R-CNN with FPN [4]. FPN is a multi-scale RPN which proposes candidate bounding boxes from each level of the backbone. In FPN, a feature pyramid is constructed to cope with large variations in object size. A head consisting of a 3×3 convolution and two sibling 1×1 convolutions is attached to each level of the feature pyramid to predict the objectness of multiple pre-defined anchors. Formally, the feature pyramid is designed to have five stages {P2, P3, P4, P5, P6} in FPN. P
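A minimal PyTorch sketch of such an RPN head is given below; the channel count and number of anchors are typical FPN defaults assumed for illustration, not values taken from the paper.

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the head described above: a shared 3x3 conv followed by two
    sibling 1x1 convs predicting per-anchor objectness and box deltas."""

    def __init__(self, in_channels=256, num_anchors=3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.objectness = nn.Conv2d(in_channels, num_anchors, kernel_size=1)
        self.bbox_deltas = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)

    def forward(self, pyramid_feats):
        # the same head runs on every level P2..P6 of the feature pyramid
        outputs = []
        for feat in pyramid_feats:
            x = self.relu(self.conv(feat))
            outputs.append((self.objectness(x), self.bbox_deltas(x)))
        return outputs
```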

Dataset and settings

We use the MS-COCO dataset [5], which contains 118K training images and 5K testing images, covering 80 object categories. As defined by MS-COCO, targets with an area smaller than 32×32 pixels are small-scale targets, targets with an area between 32×32 and 96×96 pixels are medium-scale targets, and targets with an area larger than 96×96 pixels are large-scale targets. Moreover, the size distribution of MS-COCO is not balanced, and small targets account for the largest share of the dataset. MS-COCO
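These thresholds can be expressed directly; the short Python sketch below assigns the standard MS-COCO size bucket from an object's pixel area.

```python
def coco_size_bucket(area_px):
    """Standard MS-COCO size buckets: small below 32^2 pixels, medium between
    32^2 and 96^2, large above 96^2."""
    if area_px < 32 ** 2:
        return "small"
    if area_px < 96 ** 2:
        return "medium"
    return "large"
```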

Conclusions

In this paper, we present an intuitive approach which introduces image-level co-supervision in order to provide richer contextual cues for object detection. The implementation is simple (with a lightweight module containing regular operations) and efficient (requiring less than 10% extra overhead), yet it achieves consistent gains in detection accuracy. Our research reveals the usefulness of combining top-down and bottom-up signals in object detection, and we believe it generalizes to other tasks.

Declaration of Competing Interest

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome. We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us. We confirm that

Acknowledgements

This work was supported in part by the National Key R&D Program of China (No. 2018YFB1402605, No. 2018YFB1004602, No. 2019QY1604), the Major Project for New Generation of AI (No. 2018AAA0100400), the National Natural Science Foundation of China (No. 61836014, No. 61773375, No. 62006231, No. 62072457), and in part by the TuSimple Collaborative Research Project.


References (51)

  • J. Redmon et al.

    YOLO9000: better, faster, stronger

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • W. Liu et al.

    SSD: single shot multibox detector

    Proceedings of the European Conference on Computer Vision

    (2016)
  • C. Fu et al.

    DSSD: deconvolutional single shot detector

    arXiv preprint arXiv:1701.06659

    (2017)
  • H. Law et al.

    CornerNet: detecting objects as paired keypoints

    Proceedings of the European Conference on Computer Vision

    (2018)
  • S. Zhang et al.

    Single-shot refinement neural network for object detection

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2018)
  • G. Ghiasi et al.

    NAS-FPN: learning scalable feature pyramid architecture for object detection

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2019)
  • M. Tan et al.

    EfficientNet: rethinking model scaling for convolutional neural networks

    arXiv preprint arXiv:1905.11946

    (2019)
  • S. Liu et al.

    Receptive field block net for accurate and fast object detection

    arXiv preprint arXiv:1711.07767

    (2017)
  • Z. Shen et al.

    DSOD: learning deeply supervised object detectors from scratch

    Proceedings of the IEEE International Conference on Computer Vision

    (2017)
  • J. Yuan et al.

    Gated CNN: integrating multi-scale feature layers for object detection

    Pattern Recognition

    (2019)
  • A. Shrivastava et al.

    Beyond skip connections: top-down modulation for object detection

    arXiv preprint arXiv:1612.06851

    (2016)
  • P. Zhou et al.

    Scale-transferrable object detection

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2018)
  • H.O. Song et al.

    On learning to localize objects with minimal supervision

    International Conference on Machine Learning

    (2014)
  • Z. Jie et al.

    Deep self-taught learning for weakly supervised object localization

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)
  • H. Bilen et al.

    Weakly supervised deep detection networks

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2016)

    Junran Peng received his B.S. from Tsinghua University and is currently pursuing a Ph.D. at the Chinese Academy of Sciences. He has published several papers in international conferences such as ICCV, NeurIPS and CVPR. His research interests include object detection, AutoML and machine learning.

    Haoquan Wang received his B.S. degree from the School of Physics and Electronics, Hunan University, in 2019. He is currently pursuing the M.S. degree with the School of Microelectronics, Tianjin University, China. His research interests include computer vision, machine learning and super-resolution reconstruction.

    Shaolong Yue received his B.Sc. degree from the School of Electrical and Electronic Engineering, Shandong University of Technology, China, in 2018. He is currently pursuing the M.S. degree with the School of Control and Computer Engineering, North China Electric Power University, China. His research interests include pattern recognition, computer vision and machine learning.

    Zhaoxiang Zhang received his bachelor's degree in Circuits and Systems from the University of Science and Technology of China (USTC) in 2004. In 2004, he joined the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, under the supervision of Professor Tieniu Tan, and he received his Ph.D. degree in 2009. In October 2009, he joined the School of Computer Science and Engineering, Beihang University, as an Assistant Professor (2009–2011), an Associate Professor (2012–2015) and the vice-director of the Department of Computer Application Technology (2014–2015). In July 2015, he returned to the Institute of Automation, Chinese Academy of Sciences. He is now a full Professor in the Center for Research on Intelligent Perception and Computing (CRIPAC) and the National Laboratory of Pattern Recognition (NLPR). His research interests include computer vision, pattern recognition, and machine learning. Recently, he has focused on biologically inspired intelligent computing and its applications to human analysis and scene understanding. He has published more than 150 papers in international journals and conferences, including reputable international journals such as IEEE TIP, IEEE TCSVT and IEEE TIFS, and top-level international conferences such as CVPR, ICCV, ECCV, NeurIPS, AAAI and IJCAI. He has served as a (guest) associate editor of IEEE TCSVT, PR, PRL, and Frontiers of Computer Science, and as an area chair, senior PC member or PC member of international conferences such as CVPR, NeurIPS, ICML, AAAI and IJCAI.
