Neurocomputing

Volume 454, 24 September 2021, Pages 474-482

Adaptive attention augmentor for weakly supervised object localization

https://doi.org/10.1016/j.neucom.2021.05.024

Abstract

Weakly Supervised Object Localization (WSOL) obtains object locations from the attention maps of a classification network, without using bounding box annotations. Existing WSOL approaches do not model the correlations between different regions of the target object, so they locate only a few small, sparse discriminative attentions; moreover, they introduce too many background attentions when mining more object parts. In this paper, we propose a novel Adaptive Attention Augmentor (A3) that adaptively augments the target object attentions on class attention maps. It supplements object attentions by discovering the semantic correspondence between different regions, and it dynamically suppresses background attentions through the proposed Focal Dice loss. Extensive experiments demonstrate the effectiveness of our approach. On the ILSVRC dataset, A3 achieves a new state-of-the-art localization performance; on the fine-grained CUB-200-2011 and Cars-196 datasets, it also achieves very competitive results.

Introduction

Recently, Weakly Supervised Object Localization (WSOL) has attracted extensive attention because of its implementation simplicity: it can locate the target object using only image-level labels, without any expensive bounding box annotations. A large number of methods have emerged [12], [14], [36], [39], [41], some of which have achieved breakthrough progress and considerable localization results.

Class Activation Mapping (CAM) [41], one of the most representative and widely used methods, laid the foundation for subsequent studies. CAM generates an attention map for each class through a modified classification network (e.g. VGGnet [29] or GoogLeNet [31]) and obtains the location of the target object from these class-specific maps. Yet CAM has its Achilles' heel: only the most discriminative part is highlighted on the class attention map, so it fails to locate the full extent of the target object. To address this issue, Wei et al. [37] proposed an adversarial erasing approach that mines more object regions by sequentially erasing the most discriminative part of the image; it costs considerable time and computing resources because several classification networks must be trained. The Adversarial Complementary Learning (ACoL) [39] approach was therefore proposed, incorporating an additional classifier to implement end-to-end online attention erasing. Moreover, Choe and Shim [3] introduced the Attention-based Dropout Layer (ADL), a lightweight yet powerful method that uses a drop mask and an importance map to randomly erase the most discriminative attention and improve the classification power of the model (a minimal sketch of this mechanism follows the list below). The core idea of these self-erasing approaches is to remove the most discriminative object part or attention during training, forcing the network to learn less discriminative features and mine more object attentions, so as to capture the full extent of the object. Although they have achieved excellent results on several datasets, they share a serious drawback: they introduce too many background attentions under the following circumstances:

  • When the most discriminative attention covers almost the entire object. The full extent of the target object can sometimes be captured easily on the class attention map, especially when the object is small. When the most discriminative attention is then erased, there are no other object attentions left for the network to mine, only the surrounding background attentions.

  • When specific types of backgrounds co-occur with target objects. In a wide variety of classes, certain backgrounds co-occur with target objects, such as airplane and sky or boat and sea. The discriminativeness of these background attentions can therefore be higher than that of most object regions. Thus, when the most discriminative attention is eliminated, the network tends to perceive background attentions even though many object attentions remain to be mined.
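To make the self-erasing idea concrete, here is a minimal PyTorch sketch of an ADL-style layer, written from the description above rather than from the authors' code; the class name, the threshold ratio gamma, and drop_rate are illustrative assumptions:

```python
import torch
import torch.nn as nn

class AttentionBasedDropout(nn.Module):
    """Sketch of an ADL-style layer (Choe and Shim [3]): during training it
    randomly applies either a drop mask, which erases the most discriminative
    region, or an importance map, which rewards it."""

    def __init__(self, drop_rate: float = 0.75, gamma: float = 0.9):
        super().__init__()
        self.drop_rate = drop_rate  # probability of choosing the drop mask
        self.gamma = gamma          # threshold ratio for erasing

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training:
            return x
        # Self-attention map via channel-wise average pooling, shape (B, 1, H, W).
        attention = x.mean(dim=1, keepdim=True)
        if torch.rand(1).item() < self.drop_rate:
            # Drop mask: zero out positions above gamma * max attention.
            max_val = attention.amax(dim=(2, 3), keepdim=True)
            mask = (attention < self.gamma * max_val).float()
        else:
            # Importance map: re-weight features by normalized attention.
            mask = torch.sigmoid(attention)
        return x * mask
```

Either branch multiplies the feature maps elementwise, so the layer adds no parameters and can be dropped between convolutional blocks of an existing classifier.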

In this paper, we propose the Adaptive Attention Augmentor (A3), which can be easily embedded into the classifier of any classification network to adaptively augment the object attentions on the class attention maps, capturing the full extent of the target object and improving localization accuracy. The framework is shown in Fig. 1. Specifically, A3 transforms the input feature maps into guiding maps and supplementary maps. Through spatial self-attention, the guiding maps yield a correlation matrix that discovers the semantic correspondence between different object regions and guides the supplementary maps to perceive more object attentions, as revealed in Fig. 2. Moreover, we propose the Focal Dice loss, which checks and balances the classification loss, dynamically augmenting object attentions and suppressing background attentions on the supplementary maps. The localization attention map is obtained by fusing the class-specific guiding attention map and the supplementary attention map.
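The exact formulation of the Focal Dice loss is given in the full paper; purely as a hedged illustration of the idea, combining a Dice overlap term with a focal modulation in the spirit of Focal Loss [Lin et al.], a hypothetical variant might look like this (all names and the focusing parameter gamma are assumptions, not the authors' formulation):

```python
import torch

def focal_dice_loss(pred: torch.Tensor, target: torch.Tensor,
                    gamma: float = 2.0, eps: float = 1e-6) -> torch.Tensor:
    """Hypothetical focal-modulated Dice loss.

    pred:   predicted attention map in [0, 1], shape (B, H, W)
    target: pseudo foreground mask, shape (B, H, W)
    gamma:  focusing parameter, analogous to Focal Loss
    """
    pred = pred.flatten(1)
    target = target.flatten(1)
    inter = (pred * target).sum(dim=1)
    dice = (2 * inter + eps) / (pred.sum(dim=1) + target.sum(dim=1) + eps)
    # Raising (1 - dice) to the power gamma down-weights maps that already
    # cover the object well, focusing the gradient on maps still dominated
    # by background attentions.
    return ((1 - dice) ** gamma).mean()
```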

Compared with the vanilla model [41] and recent state-of-the-art WSOL methods [3], [39], [40], [49], [50], our proposed approach effectively augments object attentions while suppressing background attentions on the class attention maps. It significantly improves both the Top-1 Localization Accuracy and the MaxBoxAccV2 [51] on the ILSVRC dataset and achieves very competitive results on several fine-grained datasets.


Related work

Convolutional neural networks (CNNs) have been widely used in common computer vision tasks [1], [7], [8], [10], [15], [16], [20], [22], [25], [28], [54], [55], [56]. Faster R-CNN [24], the best-known two-stage detection method, generates region proposals and predicts highly reliable object locations in an end-to-end network in real time. One-stage detectors such as YOLO [23] and SSD [18] were proposed to improve detection speed. FCN [19] is one of the earliest methods to implement semantic segmentation in an end-to-end fully convolutional network.

Overview of A3

In this section, we present an overview of our proposed Adaptive Attention Augmentor (A3). Formally, we denote the input image as $I \in \mathbb{R}^{H \times W}$, whose label is $y$, where $H$ and $W$ are the height and width, respectively, and $y \in \{0, 1, \ldots, C-1\}$ with $C$ the number of classes. We first obtain the feature maps $F \in \mathbb{R}^{H_1 \times W_1 \times N}$ through the backbone network, where $H_1$ and $W_1$ are the height and width and $N$ is the number of channels. We then feed the feature maps $F$ into A3. As illustrated in Fig. 3, inside A3, $F$ is transformed into guiding maps and supplementary maps.
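As a rough sketch of this data flow, and under the assumption of a non-local-style spatial self-attention (the precise architecture is given in Fig. 3 and the full text), the module could be shaped as follows; A3Sketch, guide, and supp are hypothetical names:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class A3Sketch(nn.Module):
    """Minimal sketch of the A3 data flow described above (not the authors'
    exact architecture): feature maps F are projected into guiding maps G and
    supplementary maps S; a spatial correlation matrix computed from G
    redistributes attention over S along semantic correspondences."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.guide = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        self.supp = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, feats: torch.Tensor):
        b, _, h, w = feats.shape                # feats: (B, N, H1, W1)
        g = self.guide(feats).flatten(2)        # guiding maps (B, C, H1*W1)
        s = self.supp(feats).flatten(2)         # supplementary maps (B, C, H1*W1)
        # Correlation between spatial positions, computed from the guiding
        # maps; softmax turns each row into an affinity distribution.
        corr = F.softmax(torch.bmm(g.transpose(1, 2), g), dim=-1)  # (B, HW, HW)
        # Propagate supplementary attention to semantically related regions.
        s_aug = torch.bmm(s, corr)              # (B, C, H1*W1)
        return g.view(b, -1, h, w), s_aug.view(b, -1, h, w)
```

The class-specific guiding and augmented supplementary maps returned here would then be fused into the final localization attention map, as described in the introduction.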

Experiment setup

Dataset. We evaluate the localization performance of A3 on three datasets: ILSVRC 2016 [26], CUB-200-2011 [35] and Cars-196 [13]. ILSVRC 2016 has 1.2 million training images of 1,000 classes; we report accuracy on the validation set of 50,000 images. CUB-200-2011 contains 11,788 images of 200 classes, with 5,994 for training and 5,794 for testing. Cars-196 contains 16,185 images of 196 car classes, split into 8,144 training images and 8,041 testing images.
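For reference, the standard WSOL Top-1 Localization criterion counts an image as correct only when the predicted class is right and the predicted box overlaps the ground-truth box with IoU of at least 0.5. A minimal sketch of this protocol (not the authors' evaluation code):

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def top1_loc_correct(pred_class, gt_class, pred_box, gt_box):
    """Top-1 Loc: correct only if the class is right AND IoU >= 0.5."""
    return pred_class == gt_class and iou(pred_box, gt_box) >= 0.5
```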

Conclusions

In this paper, we propose a novel Adaptive Attention Augmentor (A3) for Weakly Supervised Object Localization, which is lightweight yet effective and can be easily embedded into various classification networks. It overcomes the problems that existing methods neither model regional correlation nor avoid introducing too much background when mining more object regions. Specifically, A3 supplements object attentions by discovering the semantic correspondence between different regions and dynamically suppresses background attentions through the proposed Focal Dice loss.

CRediT authorship contribution statement

Longhao Zhang: Methodology, Software, Writing - original draft. Huihua Yang: Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Longhao Zhang is a Ph.D. student at the School of Automation, Beijing University of Posts and Telecommunications, majoring in Pattern Recognition. He has worked in the field of computer vision for three years, with research on scene recognition, object detection, instance segmentation, and related topics.

References (56)

  • Y. Ming et al., Deep Learning for Monocular Depth Estimation: A Review, Neurocomputing (2021).
  • X. Yan et al., Deep Multi-view Learning Methods: A Review, Neurocomputing (2021).
  • V. Badrinarayanan, A. Handa, R. Cipolla, SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling, arXiv preprint (2015).
  • L.C. Chen et al., DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell. (2018).
  • J. Choe, H. Shim, Attention-based Dropout Layer for Weakly Supervised Object Localization, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (2019), pp. 2219-2228.
  • A. Diba, V. Sharma, A. Pazandeh, H. Pirsiavash, L. Van Gool, Weakly supervised cascaded convolutional networks, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (2017).
  • X. Wang et al., Non-local Neural Networks, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (2018).
  • H. Zhang et al., Self-Attention Generative Adversarial Networks, arXiv preprint (2018).
  • R. Girshick, Fast R-CNN, Proc. IEEE Int. Conf. Comput. Vis. (2015), pp. 1440-1448.
  • R. Girshick et al., Rich feature hierarchies for accurate object detection and semantic segmentation, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (2014).
  • G. Griffin, A. Holub, P. Perona, Caltech-256 object category dataset, Caltech Technical Report (2007).
  • K. He et al., Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2015).
  • Z. Jie, Y. Wei, X. Jin, J. Feng, W. Liu, Deep self-taught learning for weakly supervised object localization, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (2017).
  • J. Krause et al., 3D object representations for fine-grained categorization, Proc. IEEE Int. Conf. Comput. Vis. (2013).
  • M. Oquab et al., Is object localization for free? Weakly-supervised learning with convolutional neural networks, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (2015).
  • R. Li et al., DeepUNet: A Deep Fully Convolutional Network for Pixel-Level Sea-Land Segmentation, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. (2018).
  • M. Lin et al., Network in Network, arXiv preprint arXiv:1312.4400 (2013).
  • T.Y. Lin et al., Focal Loss for Dense Object Detection, Proc. IEEE Int. Conf. Comput. Vis. (2017).
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, A. Berg, SSD: Single Shot MultiBox Detector, Lect. Notes Comput. Sci. (2016).
  • E. Shelhamer et al., Fully Convolutional Networks for Semantic Segmentation, IEEE Trans. Pattern Anal. Mach. Intell. (2017).
  • M. Mostajabi et al., Feedforward semantic segmentation with zoom-out features, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (2015).
  • S.J. Oh, R. Benenson, A. Khoreva, Z. Akata, M. Fritz, B. Schiele, Exploiting saliency for object segmentation from image level labels, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (2017).
  • P.O. Pinheiro et al., From image-level to pixel-level labeling with Convolutional Networks, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (2015).
  • J. Redmon et al., You only look once: Unified, real-time object detection, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (2016).
  • S. Ren et al., Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, IEEE Trans. Pattern Anal. Mach. Intell. (2017).
  • O. Ronneberger et al., U-Net: Convolutional Networks for Biomedical Image Segmentation, Lect. Notes Comput. Sci. (2015).
  • O. Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge, Int. J. Comput. Vis. (2015).

Huihua Yang is a professor at the School of Automation, Beijing University of Posts and Telecommunications. His research covers machine learning and artificial intelligence; spectroscopy and image analysis; optimization; and high-performance computing. He has published more than 40 papers in important journals and conferences and holds 5 national invention patents.
