
Pattern Recognition

Volume 128, August 2022, 108663

End-to-end weakly supervised semantic segmentation with reliable region mining

https://doi.org/10.1016/j.patcog.2022.108663

Highlights

  • We extend our previous work and design a more powerful end-to-end network for weakly supervised semantic segmentation.

  • We propose two new loss functions for utilizing the reliable labels, including a new dense energy loss and a batch-based class distance loss. The former relies on shallow features, whilst the latter focuses on distinguishing high-level semantic features for different classes.

  • We design a new attention module to extract comprehensive global information. By using a re-weighting technique, it can suppress dominant or noisy attention values and aggregate sufficient global information.

  • Our approach achieves a new state-of-the-art performance for weakly supervised semantic segmentation.

Abstract

Weakly supervised semantic segmentation is a challenging task that takes only image-level labels as supervision yet produces pixel-level predictions at test time. To address this task, most current approaches first generate pseudo pixel masks, which are then fed into a separate semantic segmentation network. However, these two-step approaches suffer from high complexity and are hard to train as a whole. In this work, we harness the image-level labels to produce reliable pixel-level annotations and design a fully end-to-end network that learns to predict segmentation maps. Concretely, we first leverage an image classification branch to generate class activation maps for the annotated categories, which are further pruned into tiny reliable object/background regions. Such reliable regions then serve directly as ground-truth labels for the segmentation branch, where both global-information and local-information sub-branches are used to generate accurate pixel-level predictions. Furthermore, a new joint loss is proposed that considers both shallow and high-level features. Despite its apparent simplicity, our end-to-end solution achieves competitive mIoU scores (val: 65.4%, test: 65.3%) on Pascal VOC compared with its two-step counterparts. By extending our one-step method to two steps, we obtain a new state-of-the-art performance on the Pascal VOC 2012 dataset (val: 69.3%, test: 69.2%). Code is available at: https://github.com/zbf1991/RRM.

Introduction

Recently, weakly supervised semantic segmentation has received great interest and been extensively studied. Requiring merely low-cost (cheaper or simpler) annotations, including scribbles [1], [2], [3], bounding boxes [4], [5], points [6], [7] and image-level labels [8], [9], [10], for training, weakly supervised semantic segmentation offers a much easier labeling path than its fully supervised counterpart, which relies on pixel-level masks [11]. Among these weakly supervised labels, image-level annotations are the easiest to collect but also the most challenging case, since there is no direct mapping between semantic labels and pixels.

To learn semantic segmentation models using image-level labels as supervision, most existing approaches can be categorized as one-step or two-step. One-step approaches [12] often establish an end-to-end framework that augments multi-instance learning with other constrained strategies for optimization. This family of methods is elegant and easy to implement. However, one significant drawback of these approaches is that their segmentation accuracy is far behind that of fully supervised counterparts. To achieve better segmentation performance, many researchers alternatively leverage two-step approaches [13], [14]. This family of approaches usually takes bottom-up [15] or top-down [16], [17] strategies to first generate high-quality pseudo pixel-level masks with image-level labels as supervision. These pseudo masks then act as ground truth and are fed into off-the-shelf fully convolutional networks such as the Fully Convolutional Network (FCN) [18] and DeepLab [19], [20] to train the semantic segmentation models. The state-of-the-art methods are mainly two-step approaches, with segmentation performance approaching that of their fully supervised counterparts. However, to produce high-quality pseudo masks, these approaches often employ many bells and whistles, such as introducing additional object/background cues from object proposals [21] or saliency maps [22] in an off-line manner. As a result, two-step approaches are usually complicated and hard to re-implement, limiting their application in research areas such as object localization and video object tracking.

In this paper, we extend our previous work [23] and present a simple yet effective one-step approach, called Reliable Region Mining (RRM), which can be easily trained in an end-to-end manner. It includes two branches: one produces pseudo pixel-level masks from image-level annotations, and the other produces the semantic segmentation results. In contrast to previous two-step methods [8], [24], [25], [26] that prefer to mine dense and integral object regions, our RRM leverages only those reliable object/background regions that are usually tiny but have high response scores on the class activation maps. We find that these regions can be further pruned into more reliable ones by applying an additional Conditional Random Field (CRF) operation; the pruned regions are then employed as supervision for the parallel semantic segmentation branch. We design two parallel sub-branches for the segmentation branch: one extracts local information using regular convolution layers, and the other extracts global information with our proposed Re-weighting Feature-Attention Module (R-FAM). More importantly, with limited pixels as supervision, we design a new joint training loss, comprising a pixel-wise cross-entropy loss, a regularized loss named dense energy loss, and a Batch-based Class Distance loss (BCD loss), to optimize the training process. The dense energy loss exploits shallow features such as RGB color and spatial information, while the BCD loss makes the high-level semantic features more discriminative across classes. With the help of the newly designed joint loss and R-FAM, our one-step RRM achieves mIoU scores of 65.4% and 65.3% on the Pascal VOC val and test sets, respectively. These results are state-of-the-art among one-step methods and even competitive with two-step state-of-the-art approaches, which usually adopt complex bells and whistles to produce pseudo masks. We believe that our proposed RRM offers new insight into one-step solutions for weakly supervised semantic segmentation. Furthermore, to demonstrate the effectiveness of our method, we also extend it to a two-step framework and achieve a new state-of-the-art performance, with 69.3% and 69.2% mIoU on the Pascal VOC val and test sets.
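The reliable-label selection described above can be sketched as follows. This is an illustrative snippet, not the released RRM code: the thresholds, the ignore index of 255, and the single-class handling are our assumptions for exposition.

```python
# Hypothetical sketch: selecting reliable regions from one class activation map.
# Pixels with a very high normalized CAM score become object labels, pixels with
# a very low score become background labels, and everything in between is marked
# with an "ignore" index so it contributes no gradient to the segmentation branch.
IGNORE = 255  # assumed ignore index, excluded from the cross-entropy loss

def reliable_labels(cam, fg_thresh=0.8, bg_thresh=0.05, class_id=1):
    """cam: 2-D list of scores in [0, 1] for one annotated class."""
    labels = []
    for row in cam:
        out = []
        for score in row:
            if score >= fg_thresh:
                out.append(class_id)   # confident object pixel
            elif score <= bg_thresh:
                out.append(0)          # confident background pixel
            else:
                out.append(IGNORE)     # unreliable pixel: left unsupervised
            # (in the paper, a CRF step further prunes these regions)
        labels.append(out)
    return labels

cam = [[0.9, 0.5], [0.02, 0.85]]
print(reliable_labels(cam))  # [[1, 255], [0, 1]]
```

Only a tiny fraction of pixels survives both thresholds, which is exactly why the remaining regularization losses over unlabeled pixels matter.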

Our contributions are summarized as:

  • We design an elegant and efficient end-to-end network for weakly supervised semantic segmentation. Relying on tiny reliable pixel-level pseudo labels, our network can be trained in a one-stage manner given image-level labels, without bells and whistles.

  • We propose two new loss functions for utilizing the reliable labels, including a new dense energy loss and a batch-based class distance (BCD) loss. The former relies on shallow features, whilst the latter focuses on distinguishing high-level semantic features for different classes.

  • We design a new attention module (R-FAM) to extract comprehensive global information. By using a re-weighting technique, our R-FAM can suppress dominant or noisy attention values. Thus our semantic segmentation branch can aggregate sufficient global information.

  • Our end-to-end approach achieves competitive performance (val: 65.4%, test: 65.3%) compared to two-step approaches on the PASCAL VOC 2012 dataset. By extending our network to a two-step solution, our approach achieves a new state-of-the-art performance (val: 69.3%, test: 69.2%).
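The joint training objective built from these three losses can be summarized as below; the weighting coefficients are illustrative, and the paper's exact formulation and balancing terms may differ:

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{ce}} + \alpha\,\mathcal{L}_{\mathrm{de}} + \beta\,\mathcal{L}_{\mathrm{bcd}}
```

where \(\mathcal{L}_{\mathrm{ce}}\) is the pixel-wise cross-entropy loss on the reliable regions, \(\mathcal{L}_{\mathrm{de}}\) the dense energy loss over shallow features (RGB color and spatial position), and \(\mathcal{L}_{\mathrm{bcd}}\) the batch-based class distance loss on high-level semantic features.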

Section snippets

Related work

Semantic segmentation is an important task in computer vision [27], [28], [29], which requires pixel-level classification. Long et al. [18] proposed the first Fully Convolutional Network (FCN) for semantic segmentation. Chen et al. [19] proposed a new deep neural network structure named “DeepLab” to conduct pixel-wise prediction using dilated convolution, and a series of new network structures were developed after that [11], [20], [30]. Kim et al. proposed a level set loss for

Overview

Our proposed RRM can be divided into two parallel branches: a classification branch and a semantic segmentation branch. Both branches share the same backbone network, and during training both update the whole network simultaneously. The overall framework of our method is illustrated in Fig. 1. The algorithm flow is illustrated in Algorithm 1.

  • The classification branch is used to generate reliable pixel-level annotations. Original CAMs will be processed to generate tiny

Dataset and implementation details

Dataset. Our RRM model is trained and validated on PASCAL VOC 2012 [45] as well as its augmented data, comprising 10,582 images for training, 1,449 for validation and 1,456 for testing. The Mean Intersection over Union (mIoU) is used as the evaluation criterion.
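For reference, mIoU here follows the standard definition: per-class intersection over union, averaged over classes. The sketch below is an illustrative implementation (the flattened-list interface and the skipping of absent classes are our simplifications), not evaluation code from the paper or the VOC devkit.

```python
# Illustrative mean Intersection-over-Union (mIoU) computation.
def mean_iou(pred, gt, num_classes):
    """pred, gt: flat lists of per-pixel class indices.
    Returns the IoU averaged over classes present in pred or gt."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:  # skip classes absent from both prediction and labels
            ious.append(inter / union)
    return sum(ious) / len(ious)

pred = [0, 1, 1, 0]
gt   = [0, 1, 0, 0]
print(mean_iou(pred, gt, 2))  # class 0: 2/3, class 1: 1/2 -> 7/12 ≈ 0.583
```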

Implementation Details. The backbone network is a ResNet model with 38 convolution layers [46]. We remove all the fully connected layers of the original network and employ dilated convolution for the last three ResNet

Discussion

There are several possible solutions to improve the current approach: (1) Making the classification branch and segmentation branch benefit from each other. In the current framework, there is no feedback from the segmentation branch to the classification branch. The segmentation branch only receives the reliable label from the classification branch and then makes predictions. Since the quality of the predictions from the segmentation branch is high, we can attempt to use them to refine the

Conclusion

In this paper, we proposed the Reliable Region Mining model, an end-to-end network for image-level weakly supervised semantic segmentation. We revisited the drawbacks of state-of-the-art methods, which adopt a two-step approach. We proposed a one-step approach that mines tiny reliable regions and uses them directly as ground-truth labels for training our segmentation branch. With limited pixels as supervision, we designed a dense energy loss and a batch-based class distance loss, which

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by the National Key R&D Program of China (No. 2021ZD0112100), and the National Natural Science Foundation of China (Nos. U1936212, 62120106009, 61972323, 61876155).


References (68)

  • A. Bearman et al.

    What’s the point: semantic segmentation with point supervision

    ECCV

    (2016)
  • J. Ahn et al.

    Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation

    CVPR

    (2018)
  • Q. Hou et al.

    Self-erasing network for integral object attention

    NeurIPS

    (2018)
  • Y. Wei et al.

    Revisiting dilated convolution: a simple approach for weakly-and semi-supervised semantic segmentation

    CVPR

    (2018)
  • L.-C. Chen et al.

    Rethinking atrous convolution for semantic image segmentation

    arXiv preprint arXiv:1706.05587

    (2017)
  • G. Papandreou et al.

    Weakly-and semi-supervised learning of a dcnn for semantic image segmentation

    http://arxiv.org/abs/1502

    (2015)
  • Y. Wei et al.

    Object region mining with adversarial erasing: a simple classification to semantic segmentation approach

    CVPR

    (2017)
  • Z. Huang et al.

    Weakly-supervised semantic segmentation network with deep seeded region growing

    CVPR

    (2018)
  • Q. Hou et al.

    Deeply supervised salient object detection with short connections

    CVPR

    (2017)
  • J. Zhang et al.

    Top-down neural attention by excitation backprop

    IJCV

    (2018)
  • B. Zhou et al.

    Learning deep features for discriminative localization

    CVPR

    (2016)
  • J. Long et al.

    Fully convolutional networks for semantic segmentation

    CVPR

    (2015)
  • L.-C. Chen et al.

    Semantic image segmentation with deep convolutional nets and fully connected crfs

    arXiv preprint arXiv:1412.7062

    (2014)
  • L.-C. Chen et al.

    Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs

    IEEE Transactions on PAMI

    (2017)
  • P.O. Pinheiro et al.

    From image-level to pixel-level labeling with convolutional networks

    CVPR

    (2015)
  • H. Jiang et al.

    Salient object detection: a discriminative regional feature integration approach

    CVPR

    (2013)
  • B. Zhang et al.

    Reliability does matter: An end-to-end weakly supervised semantic segmentation approach

    Proceedings of the AAAI Conference on Artificial Intelligence

    (2020)
  • J. Lee et al.

    Ficklenet: weakly and semi-supervised semantic image segmentation using stochastic inference

    arXiv preprint arXiv:1902.10421

    (2019)
  • W. Shimoda et al.

    Self-supervised difference detection for weakly-supervised semantic segmentation

    ICCV

    (2019)
  • Y. Wang et al.

    Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation

    arXiv preprint arXiv:2004.04581

    (2020)
  • Y. Xie et al.

    Correlation filter selection for visual tracking using reinforcement learning

    arXiv preprint arXiv:1811.03196

    (2018)
  • J. Huang et al.

    Multi-level adversarial network for domain adaptive semantic segmentation

    Pattern Recognit

    (2021)
  • L.-C. Chen et al.

    Encoder-decoder with atrous separable convolution for semantic image segmentation

    ECCV

    (2018)
  • F. Jiang et al.

    Robust visual saliency optimization based on bidirectional Markov chains

    Cognit Comput

    (2021)

    Bingfeng Zhang received the B.S. degree in electronic information engineering from China University of Petroleum (East China), Qingdao, PR China, in 2015, and the M.E. degree in systems, control and signal processing from the University of Southampton, Southampton, U.K., in 2016. He is now a Ph.D. student at the University of Liverpool, Liverpool, U.K., and also a Ph.D. student in the School of Advanced Technology, Xi’an Jiaotong-Liverpool University, Suzhou, PR China. His current research interests are weakly supervised semantic segmentation and few-shot segmentation.

    Jimin Xiao received the B.S. and M.E. degrees in telecommunication engineering from the Nanjing University of Posts and Telecommunications, Nanjing, China, in 2004 and 2007, respectively, and the Ph.D. degree in electrical engineering and electronics from the University of Liverpool, Liverpool, U.K., in 2013. From 2013 to 2014, he was a Senior Researcher with the Department of Signal Processing, Tampere University of Technology, Tampere, Finland, and an External Researcher with the Nokia Research Center, Tampere. Since 2014, he has been a Faculty Member with Xi’an Jiaotong Liverpool University, Suzhou, China. His research interests include image and video processing, computer vision, and deep learning.

    Yunchao Wei is currently a Professor at Beijing Jiaotong University. He received his Ph.D. degree from Beijing Jiaotong University in 2016. Before joining UTS, he was a Postdoc Researcher in Prof. Thomas Huang’s Image Formation and Processing (IFP) group at the Beckman Institute, UIUC, from 2017 to 2019. He has published over 60 papers in top-tier journals and conferences (e.g., T-PAMI, CVPR, ICCV), with 3900+ Google citations. He received the Excellent Doctoral Dissertation Award of CIE in 2016, the ARC Discovery Early Career Researcher Award in 2019, and the 1st Prize in Science and Technology awarded by the China Society of Image and Graphics in 2019. His research interests mainly include deep learning and its applications in computer vision, e.g., image classification, video/image object detection/segmentation, and learning with imperfect data. He has organized multiple workshops and tutorials at CVPR, ICCV, ECCV and ACM MM.

    Kaizhu Huang is currently a Professor at Duke Kunshan University, China. Prof. Huang obtained his Ph.D. degree from the Chinese University of Hong Kong (CUHK) in 2004. He worked at Fujitsu Research Centre, CUHK, the University of Bristol, and the National Laboratory of Pattern Recognition, Chinese Academy of Sciences, from 2004 to 2012. Prof. Huang has been working in machine learning, neural information processing, and pattern recognition. He was the recipient of the 2011 Asia Pacific Neural Network Society Young Researcher Award. He received best paper or book awards five times. As of September 2020, he has published 9 books and over 190 international research papers (70+ in international journals), e.g., in journals (JMLR, Neural Computation, IEEE T-PAMI, IEEE T-NNLS, IEEE T-BME, IEEE T-Cybernetics) and conferences (NeurIPS, IJCAI, SIGIR, UAI, CIKM, ICDM, ICML, ECML, CVPR). He serves as an associate editor/advisory board member for a number of journals and book series. He has been invited as a keynote speaker at more than 20 international conferences or workshops.

    Shan Luo is a Lecturer (Assistant Professor) at the Department of Computer Science, University of Liverpool. Previous to Liverpool, he was a Research Fellow at Harvard University and University of Leeds. He was also a Visiting Scientist at the Computer Science and Artificial Intelligence Laboratory (CSAIL), MIT. He received the B.Eng. degree in Automatic Control from China University of Petroleum, Qingdao, China, in 2012. He was awarded the Ph.D. degree in Robotics from King’s College London, UK, in 2016. His research interests include tactile sensing, object recognition and computer vision.

    Yao Zhao received the B.S. degree from the Radio Engineering Department, Fuzhou University, Fuzhou, China, in 1989, the M.E. degree from the Radio Engineering Department, Southeast University, Nanjing, China, in 1992, and the Ph.D. degree from the Institute of Information Science, Beijing Jiaotong University (BJTU), Beijing, China, in 1996, where he became an Associate Professor and a Professor in 1998 and 2001, respectively. From 2001 to 2002, he was a Senior Research Fellow with the Information and Communication Theory Group, Faculty of Information Technology and Systems, Delft University of Technology, Delft, The Netherlands. In 2015, he visited the Swiss Federal Institute of Technology, Lausanne (EPFL), Switzerland. From 2017 to 2018, he visited the University of Southern California. He is currently the Director of the Institute of Information Science, BJTU. His current research interests include image/video coding, digital watermarking and forensics, video analysis and understanding, and artificial intelligence. Dr. Zhao is a Fellow of the IET. He serves on the Editorial Boards of several international journals, including as an Associate Editor for the IEEE TRANSACTIONS ON CYBERNETICS, a Senior Associate Editor for the IEEE SIGNAL PROCESSING LETTERS, and an Area Editor for Signal Processing: Image Communication. He was named a Distinguished Young Scholar by the National Science Foundation of China in 2010 and was elected as a Chang Jiang Scholar by the Ministry of Education of China in 2013.
