
Neurocomputing

Volume 445, 20 July 2021, Pages 244-254

Contrastive and consistent feature learning for weakly supervised object localization and semantic segmentation

https://doi.org/10.1016/j.neucom.2021.03.023

Abstract

Weakly supervised learning attempts to construct predictive models from weak supervision alone. In this paper, we concentrate on weakly supervised object localization and semantic segmentation. Existing methods either focus on narrow discriminative parts or overextend activations to less discriminative regions and even to the background. To mitigate these problems, we regard the background as an important cue that guides feature activation to cover the entire object to the right extent, and propose two novel objective functions: 1) a contrastive attention loss and 2) a foreground consistency loss. The contrastive attention loss draws the foreground feature and its dropped version close together and pushes the dropped foreground feature away from the background feature. The foreground consistency loss favors agreement between layers and provides early layers with a sense of objectness. Using both losses yields balanced improvements in localization and segmentation accuracy by boosting activations on less discriminative regions while restraining activations to the extent of the target object. To better optimize the above losses, we replace channel-pooled attention with non-local attention blocks, producing enhanced attention maps that account for spatial similarity. Our method achieves state-of-the-art localization performance on the CUB-200-2011, ImageNet, and OpenImages benchmarks in terms of top-1 localization accuracy, MaxBoxAccV2, and PxAP. We also demonstrate that our method improves segmentation performance, measured by mIoU, on the PASCAL VOC dataset.

Introduction

Weakly supervised learning (WSL) aims to train a convolutional neural network (CNN) for a target task using only weak supervision, e.g., class labels for object localization [8], [27], [46], and class labels [18], [22] or bounding boxes [9], [33] for semantic segmentation. Such annotations are much cheaper to obtain than those required by fully supervised approaches [2], [5], [24], [25], [41]. Consequently, WSL has received significant attention across various computer vision tasks [14], [28], [36], [40].

We focus on weakly supervised object localization (WSOL) and weakly supervised semantic segmentation (WSSS) using image-level class labels. WSOL aims to both classify and localize the target object, predicting a class label and a bounding box. WSSS predicts a class label for each pixel in a given image. In general, WSSS consists of two steps: generating pseudo masks as weak supervision and training a segmentation network with those masks. Accordingly, we extend localization to segmentation, as the two tasks share similar goals. The localization maps extracted from our model can serve as initial weak supervision for the WSSS task.

Zhou et al. [47] generate a class activation map (CAM) from a classification network using global average pooling. A CAM highlights the class-specific discriminative regions of the input image. As shown in Fig. 1, it tends to focus on narrow discriminative parts (e.g., the face of a person) rather than covering the entire object. To alleviate this problem, recent methods apply adversarial erasing [8], [27], [44], which removes the most discriminative parts so that activations extend to less discriminative regions. Consequently, they excessively spread activations even onto the background, which tends to over-estimate bounding boxes. Such under- or over-estimated activation maps harm the prediction of the object location (WSOL) or the pixel-level probability map (WSSS).
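The CAM construction above can be sketched in a few lines. This is a minimal NumPy illustration (the function name and toy shapes are ours), not the paper's implementation:

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """CAM as a weighted sum of feature channels, where the weights are
    the classifier's fully-connected weights for the target class.
    features: (C, H, W) conv feature map; fc_weights: (num_classes, C)."""
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))  # (H, W)
    cam = np.maximum(cam, 0.0)        # keep only positive class evidence
    if cam.max() > 0:
        cam = cam / cam.max()         # normalize to [0, 1] for visualization
    return cam

# toy example: 4 channels, a 3x3 spatial map, 2 classes
rng = np.random.default_rng(0)
features = rng.random((4, 3, 3))
fc_weights = rng.random((2, 4))
cam = class_activation_map(features, fc_weights, class_idx=1)
```

Upsampling the resulting (H, W) map to the input resolution and thresholding it gives the localization map from which a bounding box is estimated.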

In this paper, we propose four ingredients for both the WSOL and WSSS tasks: 1) a contrastive attention loss, 2) a foreground consistency loss, 3) non-local attention blocks, and 4) a dropped foreground mask.

The contrastive attention loss pulls the foreground feature and its erased version close together and pushes the erased foreground feature away from the background feature. It encourages the learned representation to be homogeneous within the object extent and heterogeneous from the background. The foreground consistency loss favors agreement between layers and provides early layers with a sense of objectness. It boosts foreground activations while restraining background regions, since low-level features activate on locally distinctive regions, e.g., edges. Moreover, we employ non-local attention blocks to produce better attention maps that consider the similarity between pixel locations in a feature map. This complements our contrastive attention loss, as it assigns high weights to similar features. Last but not least, we propose a dropped foreground mask, which drops the background region as well as the most discriminative region. It guides the model to avoid spreading attention to the background.
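The two losses can be sketched as follows. This is a hedged NumPy illustration: the triplet-with-margin form of the contrastive term and the L2 form of the consistency term are our assumptions for exposition, not necessarily the paper's exact formulations.

```python
import numpy as np

def _cos(a, b):
    """Cosine similarity between two pooled feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def contrastive_attention_loss(f_fg, f_drop, f_bg, margin=1.0):
    """Pull the foreground feature and its dropped version together and
    push the dropped foreground away from the background feature.
    The margin-based triplet form is an illustrative assumption."""
    pull = 1.0 - _cos(f_fg, f_drop)                       # agreement term
    push = max(0.0, _cos(f_drop, f_bg) + margin - _cos(f_fg, f_drop))
    return pull + push

def foreground_consistency_loss(att_early, att_late):
    """Penalize disagreement between an early-layer attention map and a
    later-layer one, giving early layers a sense of objectness."""
    return float(np.mean((att_early - att_late) ** 2))
```

As a sanity check, a dropped-foreground feature that stays close to the foreground incurs a much smaller contrastive loss than one that collapses onto the background.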

Our method shows competitive performance on three benchmark datasets (CUB-200-2011 [38], ImageNet [30], and OpenImages [20]) for WSOL and on PASCAL VOC 2012 [10] for WSSS.

To verify the effectiveness of the proposed method, we report four evaluation metrics: top-1 localization accuracy, MaxBoxAccV2 [7], PxAP [7], and mIoU.

This manuscript extends the conference paper [16], which focuses only on the WSOL task. We add: (1) an extension of the contrastive attention loss; (2) an extensive comparative study against existing WSOL methods on various datasets and metrics; and (3) validation of our method on the WSSS task.


Weakly supervised object localization

Most weakly supervised object localization (WSOL) methods train a classification network using only image-level class labels and extract a class activation map (CAM) [47] to estimate the object location. A CAM represents, at every location in the feature map, the strength of activation contributing to the corresponding class [1], [8], [27], [44].

Recent methods [8], [27], [32], [44], [45] propose adversarial erasing to spread activation from the most discriminative parts to the less discriminative regions.

Proposed method

We extend the classification network with non-local attention blocks (Section 3.4) and train it with the contrastive attention loss (Section 3.2) and the foreground consistency loss (Section 3.3). Fig. 2 illustrates the overview.
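The overall training objective combines the standard classification loss with the two proposed losses. A minimal sketch follows, where the weighting coefficients `lambda_ca` and `lambda_fc` are illustrative placeholders, not the paper's actual values:

```python
def total_loss(loss_cls, loss_ca, loss_fc, lambda_ca=0.5, lambda_fc=0.5):
    """Weighted sum of the classification loss, the contrastive
    attention loss, and the foreground consistency loss.
    The lambda weights are hypothetical hyperparameters."""
    return loss_cls + lambda_ca * loss_ca + lambda_fc * loss_fc
```

In practice the two lambdas would be tuned on a validation split, trading off classification accuracy against localization quality.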

Experimental setup

Datasets. We evaluate the proposed method on four benchmark datasets, using only image-level labels for training: CUB-200-2011 [38], ImageNet [30], and OpenImages [20] for WSOL; PASCAL VOC 2012 [10] for WSSS. Many weak-supervision methods have used full supervision to some extent, directly or indirectly, for hyperparameter tuning. Since the amount of full supervision used for hyperparameter tuning is not consistent, it has been ambiguous using the previous evaluation metric for

Conclusion

We presented the contrastive attention loss and foreground consistency loss for weakly supervised object localization and semantic segmentation tasks. Previous methods focused on discriminative parts or overextended to the background rather than localizing the entire object. The contrastive attention loss leads the model to expand the attention within the objects. The foreground consistency loss lessens the noisy activations outside the object and promotes activations in the object. Our non-local attention blocks produce enhanced attention maps by considering the spatial similarity between features.

CRediT authorship contribution statement

Minsong Ki: Conceptualization, Methodology, Data curation, Writing - original draft, Visualization. Youngjung Uh: Conceptualization, Methodology, Writing - review & editing. Wonyoung Lee: Visualization, Validation, Formal analysis. Hyeran Byun: Supervision, Writing - review & editing, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by the National Research Foundation of Korea grant funded by the Korean government (No. NRF-2019R1A2C2003760) and by the Artificial Intelligence Graduate School Program (Yonsei University) under Grant 2020-0-01361.

Minsong Ki received the B.S. degree in Computer Science from Duksung Women’s University, Seoul, Korea in 2014. She is currently a Ph.D. student with the Department of Computer Science, Yonsei University, Seoul, Korea. Her research interests include weakly supervised learning, object localization, detection, face recognition and deep learning.

References (47)

  • J. Choe et al.

    Attention-based dropout layer for weakly supervised object localization

  • J. Dai et al.

    Boxsup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation

  • M. Everingham et al.

    The pascal visual object classes (voc) challenge

    Int. J. Comput. Vision

    (2010)
  • R. Hadsell, S. Chopra, Y. LeCun, Dimensionality reduction by learning an invariant mapping, in: 2006 IEEE Computer...
  • K. He et al.

    Momentum contrast for unsupervised visual representation learning

  • K. He et al.

    Deep residual learning for image recognition

  • H. Jiang et al.

    Salient object detection: a discriminative regional feature integration approach

  • M. Ki et al.

    In-sample contrastive learning and consistent attention for weakly supervised object localization

  • D. Kim et al.

    Two-phase learning for weakly supervised object localization

  • A. Kolesnikov, C.H. Lampert, Seed, expand and constrain: Three principles for weakly-supervised image segmentation, in:...
  • P. Krähenbühl et al.

    Efficient inference in fully connected crfs with gaussian edge potentials

    Adv. Neural Inf. Process. Syst.

    (2011)
  • A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig,...
  • K. Li et al.

    Tell me where to look: guided attention inference network

    Youngjung Uh received his B.S. and Ph.D. degree in Computer Science from Yonsei University. He worked at NAVER CLOVA AI Research and is currently an assistant professor at Applied Information Engineering major, Yonsei University. Research interests include weakly-supervised tasks, representation learning and generative models.

Wonyoung Lee received the B.S. degree in Computer Engineering from Inha University. He is currently an M.S. student at the Graduate School of Artificial Intelligence, Yonsei University, Seoul, Korea. His research interests include computer vision, deep learning and machine learning.

Hyeran Byun is currently a professor of Computer Science at Yonsei University. She was an Assistant Professor at Hallym University, Chooncheon, Korea, from 1994 to 1995. She served as a non-executive director of the National IT Industry Promotion Agency (NIPA) from Mar. 2014 to Feb. 2018. She is a member of the National Academy of Engineering of Korea. Her research interests include computer vision, image and video processing, deep learning, artificial intelligence, machine learning, and pattern recognition. She received the B.S. and M.S. degrees in mathematics from Yonsei University, Seoul, Korea, and the Ph.D. degree in computer science from Purdue University, West Lafayette, IN, USA.
