Adaptive attention augmentor for weakly supervised object localization
Introduction
Recently, Weakly Supervised Object Localization (WSOL) has attracted extensive attention because of its implementation simplicity: it can locate the target object using only image-level labels, without any expensive bounding-box annotations. A large number of methods have emerged [12], [14], [36], [39], [41], some of which have achieved breakthrough progress and considerable localization results.
Class Activation Mapping (CAM) [41], as one of the most representative and widely-used methods, laid the foundation for subsequent studies. CAM generates an attention map for each class through a modified classification network (e.g. VGGnet [29] and GoogLeNet [31]), and obtains the location of the target object from these class-specific maps. However, CAM has an Achilles' heel: only the most discriminative part is highlighted on the class attention map, so it fails to locate the full extent of the target object. To address this issue, Wei et al. [37] proposed an adversarial erasing approach which mines more object regions by sequentially erasing the most discriminative part of the image, but it costs considerable time and computing resources because several classification networks must be trained. Therefore, the Adversarial Complementary Learning (ACoL) [39] approach was proposed, which incorporates an additional classifier to implement end-to-end online attention erasing. Moreover, Choe and Shim [3] introduced the Attention-based Dropout Layer (ADL), a lightweight yet powerful method which utilizes a drop mask and an importance map to randomly erase the most discriminative attention and improve the classification power of the model. The core idea of these self-erasing approaches is to remove the most discriminative object part or object attention during training, forcing the network to learn less discriminative features and mine more object attentions, so as to capture the full extent of the object. Although they have achieved excellent results on several datasets, they have a serious drawback: they introduce too much background under the following circumstances:
- When the most discriminative attention covers almost the entire object. The full extent of the target object can sometimes be captured easily on the class attention map, especially when the object is small. In that case, when the most discriminative attention is erased, there are no other object attentions left for the network to mine, only the surrounding background attentions.
- When specific types of backgrounds co-occur with target objects. In a wide variety of classes, certain backgrounds co-occur with target objects, like airplane and sky, or boat and sea. Hence the discriminativeness of background attentions might be higher than that of most object regions. Thus, when the most discriminative attention is eliminated, the network tends to perceive background attentions even though there are many object attentions left to be mined.
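All of the self-erasing methods above start from the CAM formulation [41], in which a class attention map is the classifier-weighted sum of the last convolutional feature maps. A minimal NumPy sketch of that published computation (array names are illustrative):

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """Compute a CAM for one class.

    features:   (N, H1, W1) feature maps from the last conv layer,
                which feed an FC classifier via global average pooling.
    fc_weights: (C, N) weights of the final FC layer.
    class_idx:  target class c.

    Returns the (H1, W1) attention map: sum_n w[c, n] * F[n].
    """
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))
    # Normalize to [0, 1] so the map can be thresholded into a bounding box.
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam
```

In practice the map is upsampled to the input resolution and thresholded to extract a box; erasing-based methods then suppress the highest-valued region of such a map during training.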
In this paper, we propose the Adaptive Attention Augmentor (A3), which can be easily embedded in the classifier of any classification network to adaptively augment the object attention on the classification attention maps, so as to capture the full extent of the target object and improve localization accuracy. The framework is shown in Fig. 1. Specifically, A3 transforms the input feature maps into guiding maps and supplementary maps. The guiding maps are used to compute a correlation matrix via spatial self-attention, which discovers the semantic correspondence between different object regions and guides the supplementary maps to perceive more object attentions, as revealed in Fig. 2. Moreover, we propose the Focal Dice loss, which checks and balances the classification loss, dynamically augmenting object attentions and suppressing background attentions on the supplementary maps. The localization attention map is obtained by fusing the class-specific guiding attention map and the supplementary attention map.
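The Focal Dice loss itself is defined in a part of the paper not shown in these snippets, so the following is only a hypothetical sketch of how a focal-style modulating exponent (in the spirit of the Focal Loss cited in the references) could be combined with a soft Dice loss between a supplementary attention map and a pseudo target; the function name, the default gamma, and the exact pairing are assumptions:

```python
import numpy as np

def focal_dice_loss(pred, target, gamma=2.0, eps=1e-6):
    """Hypothetical Focal Dice loss sketch (not the paper's exact form).

    pred:   predicted attention map in [0, 1], any shape.
    target: pseudo target map of the same shape.
    gamma:  focal exponent; gamma > 1 down-weights easy examples.
    """
    p = pred.ravel()
    t = target.ravel()
    # Soft Dice coefficient: 2|P.T| / (|P| + |T|), smoothed by eps.
    dice = (2.0 * np.sum(p * t) + eps) / (np.sum(p) + np.sum(t) + eps)
    # Focal modulation: well-overlapping maps contribute little loss.
    return (1.0 - dice) ** gamma
```

Raising (1 − Dice) to a power γ > 1 concentrates the gradient on attention maps that still overlap the target poorly, which matches the "dynamically augmenting" behavior described above.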
Compared to the vanilla model [41] and recent state-of-the-art WSOL methods [3], [39], [40], [49], [50], our proposed approach can effectively augment object attentions while suppressing background attentions on the class attention maps. It significantly improves both Top-1 Localization Accuracy and MaxBoxAccV2 [51] on the ILSVRC dataset and achieves very competitive results on several fine-grained datasets.
Related work
Convolutional neural networks (CNNs) have been widely used in common computer vision tasks [1], [7], [8], [10], [15], [16], [20], [22], [25], [28], [54], [55], [56]. Faster R-CNN [24], the most famous two-stage detection method, generates region proposals and predicts highly reliable object locations in a real-time, end-to-end network. One-stage detectors such as YOLO [23] and SSD [18] were proposed to improve detection speed. FCN [19] is one of the earliest methods to implement semantic
Overview of A3
In this section, we present an overview of our proposed Adaptive Attention Augmentor (A3). Formally, we denote the input image as I ∈ R^(H×W×3), whose label is y, where H and W are the height and width, respectively, and y ∈ {0, 1, …, C−1}, with C the number of classes. We first obtain the feature maps F ∈ R^(H1×W1×N) through the backbone network, where H1 and W1 are the height and width and N is the number of channels. After that, we send the feature maps F into A3. As illustrated in Fig. 3, inside A3, F are
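The snippet is truncated before the transformation inside A3 is spelled out, so the sketch below is only one plausible reading of the guiding/supplementary design described in the introduction: the guiding maps are flattened over spatial positions, a position-by-position correlation matrix is formed via softmax-normalized self-attention, and that matrix propagates attention across the supplementary maps. The reshaping, normalization choice, and function names are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def augment_attention(guiding, supplementary):
    """guiding: (Ng, H1, W1); supplementary: (Ns, H1, W1).

    Returns supplementary maps re-weighted by the spatial correlation
    derived from the guiding maps.
    """
    ng, h, w = guiding.shape
    ns = supplementary.shape[0]
    g = guiding.reshape(ng, h * w)          # (Ng, HW)
    s = supplementary.reshape(ns, h * w)    # (Ns, HW)
    # Correlation between every pair of spatial positions.
    corr = softmax(g.T @ g, axis=-1)        # (HW, HW), rows sum to 1
    # Each position aggregates attention from semantically similar positions.
    out = s @ corr.T                        # (Ns, HW)
    return out.reshape(ns, h, w)
```

Under this reading, object regions that correlate with already-activated regions receive extra attention, which is how semantic correspondence between regions could supplement the less discriminative parts of the object.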
Experiment setup
Dataset. We evaluate the localization performance of A3 on three datasets: ILSVRC 2016 [26], CUB-200-2011 [35] and Cars-196 [13]. ILSVRC 2016 has 1.2 million training images of 1,000 classes; we report accuracy on its validation set of 50,000 images. CUB-200-2011 contains 11,788 images of 200 classes, with 5,994 images for training and 5,794 for testing. Cars-196 contains 16,185 images of 196 classes of cars, split into 8,144 training images and 8,041 testing images.
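For reference, the Top-1 Localization Accuracy reported on these datasets follows the standard convention that a prediction counts as correct only when the Top-1 class is right and the predicted box overlaps a ground-truth box with IoU of at least 0.5. A minimal sketch of that check (box format and names are illustrative):

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def top1_loc_correct(pred_class, gt_class, pred_box, gt_box, iou_thr=0.5):
    """Top-1 Loc: classification AND localization must both succeed."""
    return pred_class == gt_class and box_iou(pred_box, gt_box) >= iou_thr
```

MaxBoxAccV2 [51] differs in that it ignores the classification result and sweeps the attention-map threshold, reporting the best box accuracy averaged over several IoU thresholds.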
Conclusions
In this paper, we propose a novel Adaptive Attention Augmentor (A3) for Weakly Supervised Object Localization, which is lightweight yet effective and can be easily embedded into various classification networks. It overcomes two problems of existing methods: the lack of regional-correlation modeling, and the introduction of too much background when mining more object regions. Specifically, A3 supplements object attentions by discovering the semantic correspondence between different regions and dynamically
CRediT authorship contribution statement
Longhao Zhang: Methodology, Software, Writing - original draft. Huihua Yang: Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Longhao Zhang is a Ph.D. at the School of Automation, Beijing University of Posts and Telecommunications, majoring in Pattern Recognition. He has worked in the field of computer vision for three years, with research on scene recognition, object detection, instance segmentation, and related topics.
References (56)
- Deep Learning for Monocular Depth Estimation: A Review, Neurocomputing (2021)
- Deep Multi-view Learning Methods: A Review, Neurocomputing (2021)
- V. Badrinarayanan, A. Handa, R. Cipolla, SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust...
- DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell. (2018)
- J. Choe, H. Shim, Attention-based Dropout Layer for Weakly Supervised Object Localization, pp. 2219–2228, ...
- A. Diba, V. Sharma, A. Pazandeh, H. Pirsiavash, L. Van Gool, Weakly supervised cascaded convolutional networks, ...
- Non-local Neural Networks, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (2018)
- Self-Attention Generative Adversarial Networks (2018)
- R. Girshick, Fast R-CNN, Proc. IEEE Int. Conf. Comput. Vis., pp. 1440–1448, ...
- Rich feature hierarchies for accurate object detection and semantic segmentation, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (2014)
- Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition, IEEE Trans. Pattern Anal. Mach. Intell.
- 3D object representations for fine-grained categorization, Proc. IEEE Int. Conf. Comput. Vis.
- Is object localization for free? - Weakly-supervised learning with convolutional neural networks, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.
- DeepUNet: A Deep Fully Convolutional Network for Pixel-Level Sea-Land Segmentation, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.
- Network in network, arXiv preprint
- Focal Loss for Dense Object Detection, Proc. IEEE Int. Conf. Comput. Vis.
- Fully Convolutional Networks for Semantic Segmentation, IEEE Trans. Pattern Anal. Mach. Intell.
- Feedforward semantic segmentation with zoom-out features, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.
- From image-level to pixel-level labeling with Convolutional Networks, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.
- You only look once: Unified, real-time object detection, Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit.
- Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, IEEE Trans. Pattern Anal. Mach. Intell.
- Convolutional networks for biomedical image segmentation, Lect. Notes Comput. Sci.
- ImageNet Large Scale Visual Recognition Challenge, Int. J. Comput. Vis.
Cited by (3)
- AGMG-Net: Leveraging multiscale and fine-grained features for improved cargo recognition, Mathematical Biosciences and Engineering (2023)
- HiCT: Hierarchical Comprehend of Transformer for Weakly Supervised Object Localization, IEEE Transactions on Instrumentation and Measurement (2023)
- RSMNet: A Regional Similar Module Network for Weakly Supervised Object Localization, Neural Processing Letters (2022)
Huihua Yang is a professor at the School of Automation, Beijing University of Posts and Telecommunications. His research areas include 1) machine learning and artificial intelligence; 2) spectroscopy and image analysis technology; 3) optimization; and 4) high-performance computing. He has published more than 40 papers in important journals and conferences, and holds 5 national invention patents.