End-to-end weakly supervised semantic segmentation with reliable region mining
Introduction
Recently, weakly supervised semantic segmentation has received great interest and has been extensively studied. Requiring only cheap and simple low-degree annotations, including scribbles [1], [2], [3], bounding boxes [4], [5], points [6], [7] and image-level labels [8], [9], [10], for training, weakly supervised semantic segmentation offers a much easier alternative to its fully supervised counterpart, which relies on pixel-level masks [11]. Among these weakly supervised labels, the image-level annotation is the easiest to collect but also the most challenging case, since there is no direct mapping between semantic labels and pixels.
To learn semantic segmentation models using image-level labels as supervision, existing approaches can be categorized into one-step and two-step approaches. One-step approaches [12] often establish an end-to-end framework that augments multi-instance learning with other constrained strategies for optimization. This family of methods is elegant and easy to implement. However, one significant drawback of these approaches is that their segmentation accuracy falls far behind that of fully supervised counterparts. To achieve better segmentation performance, many researchers alternatively propose two-step approaches [13], [14]. This family of approaches usually takes bottom-up [15] or top-down [16], [17] strategies to first generate high-quality pseudo pixel-level masks with image-level labels as supervision. These pseudo masks then act as ground truth and are fed into off-the-shelf fully convolutional networks, such as the Fully Convolutional Network (FCN) [18] and DeepLab [19], [20], to train the semantic segmentation models. The state-of-the-art methods are mainly two-step approaches, with segmentation performance approaching that of their fully supervised counterparts. However, to produce high-quality pseudo masks, these approaches often employ many bells and whistles, such as introducing additional object/background cues from object proposals [21] or saliency maps [22] in an off-line manner. As a result, two-step approaches are usually complicated and hard to re-implement, limiting their application in research areas such as object localization and video object tracking.
In this paper, we extend our previous work [23] and present a simple yet effective one-step approach, called Reliable Region Mining (RRM), which can be easily trained in an end-to-end manner. It includes two branches: one to produce pseudo pixel-level masks using image-level annotations, and the other to produce the semantic segmentation results. In contrast to previous two-step methods [8], [24], [25], [26] that prefer to mine dense and integral object regions, our RRM only leverages those reliable object/background regions that are usually tiny but have high response scores on the class activation maps. We find that these regions can be further pruned into more reliable ones by an additional Conditional Random Field (CRF) operation, and they are then employed as supervision for the parallel semantic segmentation branch. We design two parallel sub-branches for the segmentation branch: one extracts local information using regular convolution layers, and the other extracts global information with our proposed Re-weighting Feature-Attention Module (R-FAM). More importantly, with limited pixels as supervision, we design a new joint training loss, including a pixel-wise cross entropy loss, a regularized loss named dense energy loss, and a Batch-based Class Distance loss (BCD loss), to optimize the training process. The dense energy loss exploits shallow features such as RGB color and spatial information, while the BCD loss makes the high-level semantic features more discriminative across classes. With the help of the newly designed joint loss and R-FAM, our one-step RRM achieves 65.4% and 65.3% mIoU on the Pascal VOC val and test sets, respectively. These results are state-of-the-art among one-step methods and are even competitive with two-step state-of-the-art methods, which usually adopt complex bells and whistles to produce pseudo masks.
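To make the region-mining idea above concrete, the following is a minimal NumPy sketch of how reliable pixels might be selected from normalised class activation maps, keeping only very high responses as foreground and very low responses as background while marking everything else as ignored. The function name and threshold values are illustrative assumptions, not the paper's exact settings, and the subsequent CRF pruning step is omitted:

```python
import numpy as np

def mine_reliable_regions(cams, fg_thresh=0.7, bg_thresh=0.05, ignore_index=255):
    """Pick only high-confidence pixels from class activation maps (CAMs).

    cams: (C, H, W) array of per-class activations, normalised to [0, 1].
    Returns an (H, W) label map in which unreliable pixels are marked
    `ignore_index`; class 0 is treated as background.
    """
    peak = cams.max(axis=0)            # strongest class response per pixel
    labels = cams.argmax(axis=0) + 1   # foreground class ids start at 1
    out = np.full(peak.shape, ignore_index, dtype=np.int64)
    out[peak > fg_thresh] = labels[peak > fg_thresh]   # reliable foreground
    out[peak < bg_thresh] = 0                          # reliable background
    return out
```

Pixels with intermediate responses receive the ignore label and contribute nothing to the cross-entropy term, which is what makes the mined supervision tiny but reliable.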
We believe that our proposed RRM offers new insight into the one-step solution for weakly supervised semantic segmentation. Furthermore, to demonstrate the effectiveness of our method, we also extend it to a two-step framework and obtain a new state-of-the-art performance, with 69.3% and 69.2% mIoU on the Pascal VOC val and test sets, respectively.
Our contributions are summarized as:
- We design an elegant and efficient end-to-end network for weakly supervised semantic segmentation. Relying on tiny but reliable pixel-level pseudo labels, our network can be trained in a one-stage manner given image-level labels, without bells and whistles.
- We propose two new loss functions for utilizing the reliable labels: a new dense energy loss and a batch-based class distance (BCD) loss. The former relies on shallow features, whilst the latter focuses on distinguishing high-level semantic features of different classes.
- We design a new attention module (R-FAM) to extract comprehensive global information. By using a re-weighting technique, our R-FAM can suppress dominant or noisy attention values, so our semantic segmentation branch can aggregate sufficient global information.
- Our end-to-end approach achieves competitive performance (val: 65.4%, test: 65.3%) compared to two-step approaches on the PASCAL VOC 2012 dataset. By extending our network to a two-step solution, our approach achieves a new state-of-the-art performance (val: 69.3%, test: 69.2%).
Related work
Semantic segmentation is an important task in computer vision [27], [28], [29], requiring pixel-level classification. Long et al. [18] proposed the first Fully Convolutional Network (FCN) for semantic segmentation. Chen et al. [19] proposed a new deep neural network structure named “DeepLab” to conduct pixel-wise prediction using dilated convolution, and a series of new network structures were developed after that [11], [20], [30]. Kim et al. proposed a level set loss for
Overview
Our proposed RRM consists of two parallel branches: a classification branch and a semantic segmentation branch. Both branches share the same backbone network, and during training both update the whole network at the same time. The overall framework of our method is illustrated in Fig. 1. The algorithm flow is illustrated in Algorithm 1.
- The classification branch is used to generate reliable pixel-level annotations. Original CAMs will be processed to generate tiny
Dataset and implementation details
Dataset. Our RRM model is trained and validated on PASCAL VOC 2012 [45] as well as its augmented data, including 10,582 images for training, 1,449 images for validation and 1,456 images for testing. The Mean Intersection over Union (mIoU) is adopted as the evaluation criterion.
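The mIoU criterion used above can be sketched as follows: per-class intersection-over-union, averaged over the classes that appear in either the prediction or the ground truth. This is the standard definition rather than code from the paper:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean Intersection over Union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```

On PASCAL VOC the official evaluation accumulates intersections and unions over the whole dataset before dividing, so this per-image sketch is an approximation of that protocol.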
Implementation Details. The backbone network is a ResNet model with 38 convolution layers [46]. We remove all the fully connected layers of the original network and apply dilated convolution to the last three ResNet
Discussion
There are several possible solutions to improve the current approach: (1) Making the classification branch and segmentation branch benefit from each other. In the current framework, there is no feedback from the segmentation branch to the classification branch. The segmentation branch only receives the reliable label from the classification branch and then makes predictions. Since the quality of the predictions from the segmentation branch is high, we can attempt to use them to refine the
Conclusion
In this paper, we proposed the Reliable Region Mining model, an end-to-end network for image-level weakly supervised semantic segmentation. We revisited drawbacks of the state-of-the-art methods, which adopt the two-step approach. We proposed a one-step approach through mining tiny reliable regions and used them as ground-truth labels directly for our segmentation branch training. With limited pixels as supervision, we designed a dense energy loss and a batch-based class distance loss, which
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work was supported by the National Key R&D Program of China (No. 2021ZD0112100), and the National Natural Science Foundation of China (Nos. U1936212, 62120106009, 61972323, 61876155).
References (68)
- et al., IAN: the individual aggregation network for person search, Pattern Recognit. (2019)
- et al., Wider or deeper: revisiting the ResNet model for visual recognition, Pattern Recognit. (2019)
- et al., BoundaryMix: generating pseudo-training images for improving segmentation with scribble annotations, Pattern Recognit. (2021)
- et al., Weakly-supervised semantic segmentation with saliency and incremental supervision updating, Pattern Recognit. (2021)
- et al., ScribbleSup: scribble-supervised convolutional networks for semantic segmentation, CVPR (2016)
- et al., Learning random-walk label propagation for weakly-supervised semantic segmentation, CVPR (2017)
- et al., On regularized losses for weakly-supervised CNN segmentation, ECCV (2018)
- et al., BoxSup: exploiting bounding boxes to supervise convolutional networks for semantic segmentation, CVPR (2015)
- et al., Simple does it: weakly supervised instance and semantic segmentation, CVPR (2017)
- et al., Deep extreme cut: from extreme points to object segmentation, CVPR (2018)
- What's the point: semantic segmentation with point supervision, ECCV
- Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation, CVPR
- Self-erasing network for integral object attention, NeurIPS
- Revisiting dilated convolution: a simple approach for weakly- and semi-supervised semantic segmentation, CVPR
- Rethinking atrous convolution for semantic image segmentation, arXiv preprint arXiv:1706.05587
- Weakly- and semi-supervised learning of a DCNN for semantic image segmentation, http://arxiv.org/abs/1502
- Object region mining with adversarial erasing: a simple classification to semantic segmentation approach, CVPR
- Weakly-supervised semantic segmentation network with deep seeded region growing, CVPR
- Deeply supervised salient object detection with short connections, CVPR
- Top-down neural attention by excitation backprop, IJCV
- Learning deep features for discriminative localization, CVPR
- Fully convolutional networks for semantic segmentation, CVPR
- Semantic image segmentation with deep convolutional nets and fully connected CRFs, arXiv preprint arXiv:1412.7062
- DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. PAMI
- From image-level to pixel-level labeling with convolutional networks, CVPR
- Salient object detection: a discriminative regional feature integration approach, CVPR
- Reliability does matter: an end-to-end weakly supervised semantic segmentation approach, Proceedings of the AAAI Conference on Artificial Intelligence
- FickleNet: weakly and semi-supervised semantic image segmentation using stochastic inference, arXiv preprint arXiv:1902.10421
- Self-supervised difference detection for weakly-supervised semantic segmentation, ICCV
- Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation, arXiv preprint arXiv:2004.04581
- Correlation filter selection for visual tracking using reinforcement learning, arXiv preprint arXiv:1811.03196
- Multi-level adversarial network for domain adaptive semantic segmentation, Pattern Recognit.
- Encoder-decoder with atrous separable convolution for semantic image segmentation, ECCV
- Robust visual saliency optimization based on bidirectional Markov chains, Cognit. Comput.
Bingfeng Zhang received the B.S. degree in electronic information engineering from China University of Petroleum (East China), Qingdao, PR China, in 2015, and the M.E. degree in systems, control and signal processing from the University of Southampton, Southampton, U.K., in 2016. He is now a Ph.D. student at the University of Liverpool, Liverpool, U.K., and also a Ph.D. student in the School of Advanced Technology of Xi’an Jiaotong-Liverpool University, Suzhou, PR China. His current research interests are weakly supervised semantic segmentation and few-shot segmentation.
Jimin Xiao received the B.S. and M.E. degrees in telecommunication engineering from the Nanjing University of Posts and Telecommunications, Nanjing, China, in 2004 and 2007, respectively, and the Ph.D. degree in electrical engineering and electronics from the University of Liverpool, Liverpool, U.K., in 2013. From 2013 to 2014, he was a Senior Researcher with the Department of Signal Processing, Tampere University of Technology, Tampere, Finland, and an External Researcher with the Nokia Research Center, Tampere. Since 2014, he has been a Faculty Member with Xi’an Jiaotong Liverpool University, Suzhou, China. His research interests include image and video processing, computer vision, and deep learning.
Yunchao Wei is currently a Professor at Beijing Jiaotong University. He received his PhD degree from Beijing Jiaotong University in 2016. Before joining UTS, he was a Postdoc Researcher in Prof. Thomas Huang’s Image Formation and Processing (IFP) group at the Beckman Institute, UIUC, from 2017 to 2019. He has published over 60 papers in top-tier journals and conferences (e.g., TPAMI, CVPR, ICCV, etc.), with 3900+ Google citations. He received the Excellent Doctoral Dissertation Award of CIE in 2016, the ARC Discovery Early Career Researcher Award in 2019, and the 1st Prize in Science and Technology awarded by the China Society of Image and Graphics in 2019. His research interests mainly include deep learning and its applications in computer vision, e.g., image classification, video/image object detection/segmentation, and learning with imperfect data. He has organized multiple Workshops and Tutorials at CVPR, ICCV, ECCV and ACM MM.
Kaizhu Huang is currently a Professor at Duke Kunshan University, China. Prof. Huang obtained his PhD degree from the Chinese University of Hong Kong (CUHK) in 2004. He worked at Fujitsu Research Centre, CUHK, the University of Bristol, and the National Laboratory of Pattern Recognition, Chinese Academy of Sciences from 2004 to 2012. Prof. Huang has been working in machine learning, neural information processing, and pattern recognition. He was the recipient of the 2011 Asia Pacific Neural Network Society Young Researcher Award. He received best paper or book awards five times. As of September 2020, he has published 9 books and over 190 international research papers (70+ in international journals), e.g., in journals (JMLR, Neural Computation, IEEE T PAMI, IEEE T NNLS, IEEE T BME, IEEE T Cybernetics) and conferences (NeurIPS, IJCAI, SIGIR, UAI, CIKM, ICDM, ICML, ECML, CVPR). He serves as associate editor/advisory board member for a number of journals and book series. He has been invited as a keynote speaker at more than 20 international conferences or workshops.
Shan Luo is a Lecturer (Assistant Professor) at the Department of Computer Science, University of Liverpool. Previous to Liverpool, he was a Research Fellow at Harvard University and University of Leeds. He was also a Visiting Scientist at the Computer Science and Artificial Intelligence Laboratory (CSAIL), MIT. He received the B.Eng. degree in Automatic Control from China University of Petroleum, Qingdao, China, in 2012. He was awarded the Ph.D. degree in Robotics from King’s College London, UK, in 2016. His research interests include tactile sensing, object recognition and computer vision.
Yao Zhao received the B.S. degree from the Radio Engineering Department, Fuzhou University, Fuzhou, China, in 1989, the M.E. degree from the Radio Engineering Department, Southeast University, Nanjing, China, in 1992, and the Ph.D. degree from the Institute of Information Science, Beijing Jiaotong University (BJTU), Beijing, China, in 1996, where he became an Associate Professor and a Professor in 1998 and 2001, respectively. From 2001 to 2002, he was a Senior Research Fellow with the Information and Communication Theory Group, Faculty of Information Technology and Systems, Delft University of Technology, Delft, The Netherlands. In 2015, he visited the Swiss Federal Institute of Technology, Lausanne (EPFL), Switzerland. From 2017 to 2018, he visited the University of Southern California. He is currently the Director of the Institute of Information Science, BJTU.

His current research interests include image/video coding, digital watermarking and forensics, video analysis and understanding, and artificial intelligence. Dr. Zhao is a Fellow of the IET. He serves on the Editorial Boards of several international journals, including as an Associate Editor for the IEEE TRANSACTIONS ON CYBERNETICS, a Senior Associate Editor for the IEEE SIGNAL PROCESSING LETTERS, and an Area Editor for Signal Processing: Image Communication. He was named a Distinguished Young Scholar by the National Science Foundation of China in 2010 and was elected a Chang Jiang Scholar of the Ministry of Education of China in 2013.