Adversarial network integrating dual attention and sparse representation for semi-supervised semantic segmentation
Introduction
Semantic segmentation, often referred to as scene understanding, has long been a popular yet challenging topic in computer vision (Hassaballah and Hosny, 2019, Hu et al., 2020). It aims to assign each pixel in an image to a specific semantic category, such as road, sky, or airplane. Despite the great success of Deep Neural Networks (DNNs), semantic segmentation still faces many thorny problems in uncontrolled, realistic scenarios. One of the main obstacles is inadequate training data. Segmentation is a pixel-level task that relies heavily on the image sources, yet existing datasets often suffer from insufficient annotated examples and limited class diversity. In recent years, many semi-supervised or weakly supervised methods (Dai et al., 2015, Hong et al., 2015, Papandreou et al., 2015, Pathak et al., 2015) have been proposed to solve this problem while making models more scalable. They utilize annotations such as bounding boxes (Dai et al., 2015) and scribbles (Lin, Dai, Jia, He, & Sun, 2016), which are weaker than pixel-level labels but easier to obtain due to their low annotation cost. Unlike the above methods, the present study proposes a novel method in which unlabeled images are adopted in semi-supervised learning to generate supervisory signals. Although segmentation models based on deep learning have achieved remarkable success in different tasks, the application of adversarial learning to semantic segmentation has not been fully investigated; it offers an alternative strategy for semantic segmentation.
The successful applications of generative adversarial models in many fields, especially unsupervised learning, have motivated us to apply adversarial learning to semantic image segmentation, with effective results in numerous experiments. A typical Generative Adversarial Network (GAN) comprises two components: a generator and a discriminator. The generator synthesizes images by sampling random noise, and the discriminator aims to differentiate generated samples from true samples. In the present study, the core idea of the GAN is adopted and the segmentation network is treated as the generator of a GAN framework. The goal of the discriminator network is to determine whether its input is the segmented image produced by the segmentation network or the ground truth. Specifically, the designed discriminator identifies the input at pixel level and generates a probability map instead of a single probability value. This confidence map is regarded as the supervisory signal that guides adversarial learning. Meanwhile, the confidence map captures important knowledge: it indicates which pixels need to be focused on. When the prediction of the segmentation network is fed to a strong discriminator, ideally every pixel of the full-resolution output should be close to 0, meaning that the discriminator can distinguish the prediction from the ground truth well. If several pixels in the discriminator's output are close to 1, the model judges that these pixels are derived from the ground truth; in other words, the discriminator has misclassified them. Pixels misclassified by the discriminator often correspond to regions that are difficult to segment, such as contour edges and small targets. Therefore, confidence maps are treated as focal attention maps.
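The way a confidence map can be turned into per-pixel focal weights might be sketched as follows. This is a minimal NumPy illustration, not the authors' exact formulation: the mean normalization and the direct use of the confidence value as the weight are assumptions made for the sketch.

```python
import numpy as np

def focal_attention_weights(confidence_map, eps=1e-6):
    """Derive per-pixel weights from a discriminator confidence map.

    Pixels the discriminator scores close to 1 on a *generated* prediction
    are the ones it misclassifies as ground truth; per the text these tend
    to be hard-to-segment regions (contour edges, small targets), so they
    receive larger weights. Normalizing keeps the mean weight near 1.
    """
    return confidence_map / (confidence_map.mean() + eps)

# A toy 2x2 confidence map: the top-left pixel fools the discriminator.
conf = np.array([[0.9, 0.2],
                 [0.1, 0.3]])
w = focal_attention_weights(conf)
assert w[0, 0] > w[1, 0]            # misclassified pixel weighted most
assert abs(w.mean() - 1.0) < 1e-3   # weights average to ~1
```

The normalization is a design choice for the sketch: it lets the weighted loss keep roughly the same scale as an unweighted one, so the weighting redistributes rather than rescales the gradient signal.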
To obtain a robust model, the discriminator should be carefully designed and trained for a specified number of rounds to generate proper focal attention maps. The focal attention maps are fed back to the segmentation network, forcing the model to pay more attention to hard-to-segment pixels, inducing faster convergence, and making the entire training process more stable. Moreover, semi-supervised learning is applied to the proposed framework: unlabeled images are fed into the segmentation network, which is trained for several iterations to obtain masked pseudo labels that are treated as the ground truth of those images. Although many novel fully convolutional methods have been proposed to tackle semantic segmentation, most of them are unable to take advantage of the interrelationships between objects from a global view. Due to their fixed geometric structures, performance inherently suffers from limited local receptive fields and short-range contextual information.
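The masked pseudo-labeling step described above can be sketched as a small NumPy example. The threshold value and the use of -1 as an ignore label are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def masked_pseudo_labels(seg_logits, confidence_map, threshold=0.2):
    """Build masked pseudo labels for an unlabeled image.

    seg_logits: (C, H, W) class scores from the segmentation network.
    confidence_map: (H, W) discriminator output in [0, 1].
    Pixels whose confidence falls below the threshold are marked -1,
    i.e. excluded from the semi-supervised loss.
    """
    pseudo = seg_logits.argmax(axis=0)   # hard label per pixel
    mask = confidence_map > threshold    # trust only confident pixels
    return np.where(mask, pseudo, -1)

# Two classes on a 2x2 image; one pixel is too uncertain to keep.
logits = np.array([[[2.0, 0.1], [0.3, 0.2]],   # class-0 scores
                   [[0.5, 1.5], [0.1, 2.5]]])  # class-1 scores
conf = np.array([[0.9, 0.05], [0.6, 0.8]])
labels = masked_pseudo_labels(logits, conf)
assert labels.tolist() == [[0, -1], [0, 1]]
```

In a training loop, a cross-entropy loss would then simply skip the -1 entries, so only the confident pseudo labels contribute gradients.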
To address the above issues, a dual attention module that can capture long-range semantic context is introduced into the model. In addition, a sparse representation module is embedded in the segmentation network to enhance the feature representation of edge and position information in the image. Because attention modules can model long-range dependencies, they have been broadly applied in many works (Laenen and Moens, 2020, Shen et al., 2018, Tang et al., 2011, Tang et al., 2015), allowing a model to focus on the most relevant features as needed. In semantic segmentation, the contours and scales of objects and the relationships between pixels have a significant impact on model performance. Some studies enhance a model's ability to detect objects at different scales by integrating features from different receptive fields (Zhao, Shi, Qi, Wang, & Jia, 2017); others decompose large-kernel convolutions to gain a large receptive field and capture long-range semantic information (Peng, Zhang, Yu, Luo, & Sun, 2017). Unlike the aforementioned approaches, a dual attention mechanism is added to the segmentation network to capture interdependencies along two dimensions. It contains two modules: a position attention module and a channel attention module. The first captures the dependency between any two points on the feature map, which reduces intra-class distance and increases inter-class distance. The second, whose structure resembles the first, captures the overall interdependence between channels in the feature map and improves the global feature representation for scene segmentation. Besides, the sparse representation module is introduced to further improve the feature representation of the model. It is noteworthy that the sparse representation of the color image is learned by convolutional sparse dictionary learning. Lastly, the sparse representation is fed into a pre-trained deep convolutional network to capture the edge and position information of objects in the image.
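The position attention computation described above can be illustrated with a stripped-down NumPy sketch. In the full module, queries, keys and values come from separate learned 1x1 convolutions and the residual scale gamma is a learned parameter; here the raw feature map plays all three roles and gamma is fixed, so this only sketches the attention arithmetic itself (the channel attention module is analogous, with a C-by-C energy matrix over channels instead):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(feat, gamma=0.5):
    """Position attention over a (C, H, W) feature map.

    Every spatial location is updated with a similarity-weighted sum of
    all other locations, so the output carries long-range context
    regardless of spatial distance.
    """
    C, H, W = feat.shape
    x = feat.reshape(C, H * W)                  # flatten spatial dims
    energy = x.T @ x                            # (HW, HW) pairwise similarity
    attn = softmax(energy, axis=-1)             # attention over all positions
    out = x @ attn.T                            # aggregate context per position
    return (gamma * out + x).reshape(C, H, W)   # residual connection

feat = np.random.default_rng(0).random((4, 3, 3))
out = position_attention(feat)
assert out.shape == feat.shape
```

Because the (HW, HW) energy matrix relates every position to every other one, positions of the same class reinforce each other even when far apart, which is the mechanism behind the intra-class/inter-class distance effect mentioned above.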
The contributions of this paper are highlighted as follows:
- (1) A semi-supervised GAN framework for semantic segmentation is proposed.
- (2) The focal attention map is introduced to guide the training process, making training more stable and convergence faster.
- (3) The dual attention module is proposed to model the global and local semantic dependencies of features, boosting the feature representation ability for scene segmentation.
- (4) The sparse representation module is proposed to learn a high-level expression of the image's sparse representation and to adaptively fuse it with features extracted from the backbone network, which effectively improves segmentation performance.
- (5) Comprehensive experiments on the PASCAL VOC 2012 and Cityscapes datasets show the effectiveness of the proposed semi-supervised model, which is comparable to fully supervised methods.
The rest of this paper is organized as follows: Section 2 reviews related work, Section 3 discusses the methodology, and Section 4 covers the training scheme. Experimental results and analysis are presented in Section 5. Finally, conclusions are drawn in the last section.
Related work
In this section, works related to our research are reviewed, including semantic segmentation, generative adversarial networks, attention models and sparse representations.
Method
In this section, the proposed model is explained in detail. First, the overall structure of the whole network is described in Section 3.1. Then, the details of our proposed methods are discussed in Section 3.2 (Segmentation network) and Section 3.3 (Feedback module).
Training scheme
Adversarial learning is introduced to improve the fitting and generalization ability of the model. To reduce the cost of obtaining high-quality labeled data, we further apply semi-supervised learning to our model. We combine the merits of the aforementioned methods to present a novel semi-supervised adversarial network.
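Under this scheme, the segmentation network's objective combines a supervised cross-entropy term, weighted per pixel by the focal attention map, with an adversarial term. The following NumPy sketch is hedged: the weight `lam_adv`, the indexing convention, and the toy probabilities are hypothetical illustrations, not the paper's tuned values.

```python
import numpy as np

def weighted_ce(probs, labels, weights, eps=1e-8):
    """Cross-entropy averaged over pixels, scaled per pixel by the focal
    attention weights derived from the discriminator's confidence map.

    probs: (C, H, W) per-class probabilities; labels: (H, W) class ids;
    weights: (H, W) focal attention weights.
    """
    h, w = labels.shape
    picked = probs[labels, np.arange(h)[:, None], np.arange(w)]
    return float((-weights * np.log(picked + eps)).mean())

def total_loss(l_ce, l_adv, lam_adv=0.01):
    # Supervised term plus the adversarial term that pushes predictions
    # toward ground-truth statistics. lam_adv is a hypothetical weight.
    return l_ce + lam_adv * l_adv

# Two classes on a 1x2 image; the second pixel is weighted more heavily.
probs = np.array([[[0.8, 0.4]],    # class-0 probabilities
                  [[0.2, 0.6]]])   # class-1 probabilities
labels = np.array([[0, 1]])
weights = np.array([[1.0, 2.0]])
loss = total_loss(weighted_ce(probs, labels, weights), l_adv=0.5)
assert loss > 0
```

The semi-supervised term on unlabeled images would be a third summand of the same cross-entropy form, computed only over pixels whose pseudo labels survive the confidence mask.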
The dataset is split into two subsets, one for semi-supervised training and the other for fully supervised training. It is worth mentioning that the discriminator network
Experimental results
This section gives the details of our large-scale experiments on the PASCAL VOC 2012 and Cityscapes segmentation benchmarks, including the datasets and implementation details. To verify the effectiveness of the proposed method, comprehensive experiments are conducted on the validation sets, where our model achieves remarkable performance compared with state-of-the-art methods. Moreover, we also perform an ablation study to verify the
Conclusions
In this paper, a GAN-based model has been proposed that includes two sub-networks: a segmentation network and a discriminator network. The segmentation network consists of four parts: the sparse representation module, the backbone network, the feedback module, and the dual attention module. The sparse representation module can enhance the representation of edge and location information by extracting convolutional sparse features, and the dual attention module can capture the
CRediT authorship contribution statement
Ge Jin: Conceptualization, Methodology, Software, Investigation, Writing - original draft. Chuancai Liu: Supervision, Resources, Writing - review & editing, Funding acquisition. Xu Chen: Formal analysis, Writing - review & editing, Validation.
Acknowledgments
This work was supported by the National Natural Science Fund of China [Grant Nos. 61473155, 61872188]; Collaborative Innovation Center of IoT Technology and Intelligent Systems, Minjiang University, China [Grant No. IIC1701]; and Science Research Project of Bengbu University, China (No. 2017ZR12).
References (68)
- Automatic segmentation of intracerebral hemorrhage in CT images using encoder–decoder convolutional neural network. Information Processing & Management (2020).
- Vessel segmentation and microaneurysm detection using discriminative dictionary learning and sparse representation. Computer Methods and Programs in Biomedicine (2017).
- A comparative study of outfit recommendation methods with a focus on attention-based fusion. Information Processing & Management (2020).
- Image caption generation with dual attention mechanism. Information Processing & Management (2020).
- Local adaptive joint sparse representation for hyperspectral image classification. Neurocomputing (2019).
- K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing (2006).
- Ahn, J., & Kwak, S. (2018). Learning pixel-level semantic affinity with image-level supervision for weakly supervised...
- SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
- AdaResU-Net: Multiobjective adaptive convolutional neural network for medical image segmentation. Neurocomputing (2019).
- What's the point: Semantic segmentation with point supervision.
- Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning.
- Atomic decomposition by basis pursuit. SIAM Review.
- DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Rethinking atrous convolution for semantic image segmentation.
- Attention-based models for speech recognition.
- The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision.
- Laplacian reconstruction and refinement for semantic segmentation.
- Generative adversarial nets.
- Semantic contours from inverse detectors.
- Recent advances in computer vision: Theories and applications. Studies in Computational Intelligence.
- CyCADA: Cycle-consistent adversarial domain adaptation.
- Decoupled deep neural network for semi-supervised semantic segmentation.
- Adversarial learning for semi-supervised semantic segmentation.
- Image-to-image translation with conditional adversarial networks.
- Adam: A method for stochastic optimization.
- Ladder-style DenseNets for semantic segmentation of large natural images.