Adversarial network integrating dual attention and sparse representation for semi-supervised semantic segmentation

https://doi.org/10.1016/j.ipm.2021.102680

Highlights

  • A semi-supervised GAN framework for semantic segmentation is proposed.

  • The dual attention is proposed to model the global and local semantic dependencies.

  • The sparse representation module is adopted to further improve the performance.

  • The focal attention map is proposed to enhance the robustness of the model.

  • Extensive experiments are conducted to verify the effectiveness of our method.

Abstract

Semantic segmentation is the task of assigning a semantic label to each pixel. However, semantic segmentation based on deep neural networks usually requires massive amounts of annotation to achieve good performance. To mitigate this problem, algorithms based on weakly-supervised and semi-supervised settings have been proposed in recent years and have gradually improved in performance. In this paper, we propose a novel semi-supervised adversarial network to alleviate the shortage of labeled data, which requires only a few labeled images to achieve competitive performance. The model is composed of two parts: the segmentation network and the discriminator network. The first aims to generate a semantically segmented result of the same size as the input color image. The discriminator network is designed in a fully convolutional manner to distinguish the predicted probability maps from the ground truth distribution. In particular, the probability maps are regarded as focal attention maps, which are fed back to the segmentation network to make the model converge faster; this process induces the model to focus on pixels that are hard to segment. To enhance the representation ability of image features, sparse representation and dual attention are adopted in the segmentation network. The sparse representation module aims to emphasize object edges and locations by learning the convolutional sparse representation of the input color images, and the dual attention module exploits semantic interdependencies along two different dimensions. Moreover, a semi-supervised mechanism is introduced to the network, in which an adaptive parameter T that controls the sensitivity of the self-taught phase is proposed, and the training dataset is split into two parts for fully-supervised and semi-supervised learning.
Specifically, the first part consists of unlabeled data, which is used to provide supervisory signals for semi-supervised training; the labeled data drawn from the other part is used for fully-supervised learning. Our semi-supervised adversarial framework improves the learning ability, achieves higher performance, and provides a novel approach to the semantic segmentation task. Finally, comprehensive experiments on the PASCAL VOC 2012 and Cityscapes datasets verify the effectiveness of the proposed model, which achieves performance comparable to fully-supervised methods.

Introduction

Semantic segmentation, namely scene understanding, has long been an important and difficult topic in computer vision (Hassaballah and Hosny, 2019, Hu et al., 2020). It aims to assign each pixel in an image to a specific semantic category, such as road, sky, or airplane. Despite the great success of Deep Neural Networks (DNNs), semantic segmentation still faces many thorny problems in uncontrolled, realistic scenarios. One of the main obstacles is inadequate training data. Segmentation is a pixel-level task that relies heavily on image sources, yet existing datasets often suffer from insufficient annotated examples and limited class diversity. In recent years, many semi-supervised and weakly-supervised methods (Dai et al., 2015, Hong et al., 2015, Papandreou et al., 2015, Pathak et al., 2015) have been proposed to address this problem while making models more scalable. They utilize annotations such as bounding boxes (Dai et al., 2015) and scribbles (Lin, Dai, Jia, He, & Sun, 2016), which are weaker than pixel-level labels but easier to obtain due to lower annotation costs. Unlike the above methods, the present study proposes a novel method in which unlabeled images are used during semi-supervised learning to generate supervisory signals. Although segmentation models based on deep learning have achieved remarkable success in different tasks, the application of adversarial learning to semantic segmentation has not been fully investigated; it offers an alternative strategy for semantic segmentation.

The successful applications of generative adversarial models in many fields, especially unsupervised learning, have motivated us to apply adversarial learning to semantic image segmentation, with effective results across numerous experiments. A typical Generative Adversarial Network (GAN) comprises two components, i.e., a generator and a discriminator. The generator synthesizes images by sampling random noise, and the discriminator aims to differentiate generated samples from real samples. In the present study, the core idea of the GAN is adopted, and the segmentation network is treated as the generator of a GAN framework. The goal of the discriminator network is to determine whether the input is a segmented image generated by the segmentation network or the ground truth. Specifically, the designed discriminator identifies the input at pixel level and generates a probability map instead of a single probability value. This confidence map is regarded as the supervisory signal that guides adversarial learning. Meanwhile, the confidence maps capture important knowledge: they indicate which pixels need to be focused on. When the prediction of the segmentation network is fed to a strong discriminator, ideally every pixel of the full-resolution output should be close to 0, meaning the discriminator can distinguish the prediction from the ground truth well. If several pixels in the discriminator's output are close to 1, the model judges that these pixels are derived from the ground truth. Furthermore, it is observed that the corresponding pixels in the original image may be misclassified. The pixels misclassified by the discriminator often correspond to pixels that are difficult to segment, such as contour edges and small targets. Therefore, confidence maps are treated as focal attention maps.
To obtain a robust model, the discriminator should be carefully designed and trained for a specified number of rounds to generate proper focal attention maps. The focal attention maps are fed back to the segmentation network, which forces the model to pay more attention to hard-to-segment pixels, induces the model to converge faster, and makes the entire training process more stable. Moreover, semi-supervised learning is applied to the proposed framework. The unlabeled images are fed into the segmentation network, which is trained for several iterations to obtain masked pseudo labels that are treated as the ground truth maps of the input images. Although many novel fully convolutional methods have been proposed to tackle semantic segmentation, most of them cannot take advantage of the interrelationships between objects from a global view. Due to their fixed geometric structures, their performance is inherently limited by local receptive fields and short-range contextual information.
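The masked pseudo-label idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names `masked_pseudo_labels` and `IGNORE` are ours, and a constant threshold stands in for the paper's adaptive parameter T.

```python
# Sketch of masked pseudo-label generation for the semi-supervised phase.
# Assumptions (illustrative, not the authors' code): the discriminator
# emits a per-pixel confidence in [0, 1]; a pixel keeps its predicted
# class as a pseudo label only if its confidence exceeds the threshold T.

IGNORE = 255  # conventional "ignore" index in segmentation losses

def masked_pseudo_labels(pred_classes, confidence_map, T):
    """pred_classes, confidence_map: 2-D lists of the same shape."""
    return [[c if conf > T else IGNORE
             for c, conf in zip(row_cls, row_conf)]
            for row_cls, row_conf in zip(pred_classes, confidence_map)]

pred = [[0, 1], [2, 1]]
conf = [[0.9, 0.1], [0.5, 0.05]]
pseudo = masked_pseudo_labels(pred, conf, T=0.2)
# -> [[0, 255], [2, 255]]: low-confidence pixels are excluded from the loss
```

Only pixels the discriminator deems ground-truth-like contribute a supervisory signal; the rest are masked out of the semi-supervised loss.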

To address the above issues, a dual attention module that can capture long-range semantic context is introduced to the model. Besides, a sparse representation module is embedded into the segmentation network to enhance the feature representation of edge and position information in the image. Because attention modules can model long-range dependencies, they have been broadly applied in many works (Laenen and Moens, 2020, Shen et al., 2018, Tang et al., 2011, Tang et al., 2015), allowing the model to focus on the most relevant features as needed. In the semantic segmentation task, contours, object scales, and the relationships between pixels in the image have a significant impact on model performance. Some studies enhanced the model's ability to detect objects at different scales by integrating the features of different receptive fields (Zhao, Shi, Qi, Wang, & Jia, 2017), while others decompose large-kernel convolutions to gain a large receptive field and capture long-range semantic information (Peng, Zhang, Yu, Luo, & Sun, 2017). Unlike the above-mentioned methods, a dual attention mechanism is added to the segmentation network, which captures interdependencies along two dimensions. It contains two modules: a position attention module and a channel attention module. The first captures the dependency between any two points on the feature map, which can reduce the intra-class distance and increase the inter-class distance. The second has a similar structure and captures the overall interdependence between different channels of the feature map, improving the global feature representation for scene segmentation. Besides, the sparse representation module is introduced to further improve the feature representation of the model. It is noteworthy that the sparse representation of the color image is learned by convolutional sparse dictionary learning.
Lastly, it is fed into a pre-trained deep convolutional network to capture the edge and position information of the objects in the image.
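The position-attention half of the dual attention module can be illustrated with a toy, framework-free sketch. The shapes and names here are illustrative assumptions; the learned convolutional projections and residual scale factor of the paper's module are omitted for brevity.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def position_attention(feats):
    """feats: N spatial positions, each a length-C feature vector.
    Each output position is a similarity-weighted mixture of all
    positions, so distant but similar pixels reinforce each other."""
    out = []
    for q in feats:
        # dot-product similarity of this position to every position
        sims = [sum(a * b for a, b in zip(q, k)) for k in feats]
        weights = softmax(sims)
        # aggregate all positions, weighted by similarity
        out.append([sum(w * v[c] for w, v in zip(weights, feats))
                    for c in range(len(q))])
    return out

# positions 0 and 2 carry identical features, so they attend to the
# same mixture and receive identical attended outputs
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
attended = position_attention(feats)
```

The channel attention module transposes this computation: similarities are taken between channel maps rather than spatial positions, so the same sketch applies with the roles of N and C swapped.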

The contributions of this paper are highlighted as follows:

  • (1)

    A semi-supervised GAN framework for semantic segmentation is proposed.

  • (2)

    The focal attention map is introduced to guide the training process, which can make the training more stable and converge faster.

  • (3)

The dual attention module is proposed to model the global and local semantic dependencies of features, which can boost the representation ability of features for scene segmentation.

  • (4)

    The sparse representation module is proposed to learn a high-level expression of the image’s sparse representation and to adaptively fuse it with features extracted from the backbone network, which can effectively improve the segmentation performance.

  • (5)

Comprehensive experiments on the PASCAL VOC 2012 and Cityscapes datasets show the effectiveness of our proposed semi-supervised model, whose performance is comparable to fully-supervised methods.
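The convolutional sparse coding underlying contribution (4) can be glimpsed through its core sparsifying step. The snippet below is an illustration only, not the authors' full convolutional dictionary learner: it shows the soft-thresholding operator, the proximal map of the L1 penalty used in ISTA/ADMM-style solvers (cf. Boyd et al., 2011 in the references).

```python
import math

# Illustrative sparsifying step only; the paper learns a full
# convolutional sparse dictionary, which is not reproduced here.

def soft_threshold(x, lam):
    """Element-wise soft-thresholding: entries within lam of zero are
    zeroed, the rest are shrunk toward zero by lam."""
    return [math.copysign(max(abs(v) - lam, 0.0), v) for v in x]

codes = soft_threshold([3.0, -0.5, 1.5, 0.0], 1.0)
# small entries are zeroed, large ones shrink by lam, yielding a
# sparse code that emphasizes strong (e.g., edge-like) responses
```

Zeroing weak responses while keeping strong ones is what lets the sparse representation highlight object edges and locations.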

The rest of this paper is organized as follows. Related work is reviewed in Section 2, Section 3 describes the methodology, and Section 4 covers the training scheme. Experimental results and analysis are presented in Section 5. Finally, the conclusion is given in the last section.

Section snippets

Related work

In this section, works related to our research are reviewed, including semantic segmentation, generative adversarial networks, attention models and sparse representations.

Method

In this section, the proposed model is explained in detail. First, the overall structure of the whole network is described in Section 3.1. Then, the details of our proposed methods are discussed in Sections 3.2 (Segmentation network) and 3.3 (Feedback module).

Training scheme

Adversarial learning is introduced to improve the fitting and generalization abilities of the model. To reduce the cost of obtaining high-quality labeled data, we further apply semi-supervised learning to our model. We combine the merits of the aforementioned methods to present a novel semi-supervised adversarial network.

The dataset is split into two subsets, one for semi-supervised training and the other for fully-supervised training. It is worth mentioning that the discriminator network

Experimental results

The details of the method used in our large-scale experiments on the PASCAL VOC 2012 and Cityscapes segmentation benchmarks are given in this section; the datasets and implementation details are also presented. To verify the effectiveness of the proposed method, comprehensive experiments are conducted on the validation sets, and our model achieves remarkable performance compared with state-of-the-art methods. Moreover, we also perform an ablation study to verify the

Conclusions

In this paper, a GAN-based model has been proposed that includes two sub-networks, a segmentation network and a discriminator network, of which the segmentation network consists of four parts: the sparse representation module, the backbone network, the feedback module, and the dual attention module. The sparse representation module can enhance the representation of edge and location information by extracting convolutional sparse features, and the dual attention module can capture the

CRediT authorship contribution statement

Ge Jin: Conceptualization, Methodology, Software, Investigation, Writing - original draft. Chuancai Liu: Supervision, Resources, Writing - review & editing, Funding acquisition. Xu Chen: Formal analysis, Writing - review & editing, Validation.

Acknowledgments

This work was supported by the National Natural Science Fund of China [Grant Nos. 61473155, 61872188]; Collaborative Innovation Center of IoT Technology and Intelligent Systems, Minjiang University, China [Grant No. IIC1701]; and Science Research Project of Bengbu University, China (No. 2017ZR12).

References (68)

  • Boyd, S., et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning (2011).
  • Byeon, W., Breuel, T. M., Raue, F., & Liwicki, M. (2015). Scene labeling with LSTM recurrent neural networks. In...
  • Chen, S. S., et al. Atomic decomposition by basis pursuit. SIAM Review (2001).
  • Chen, L.-C., et al. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence (2017).
  • Chen, L.-C., et al. Rethinking atrous convolution for semantic image segmentation (2017).
  • Chorowski, J. K., et al. Attention-based models for speech recognition.
  • Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., & Benenson, R., et al. (2016). The cityscapes dataset for...
  • Dai, J., He, K., & Sun, J. (2015). BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic...
  • Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., & Hu, H., et al. (2017). Deformable convolutional networks. In...
  • Everingham, M., et al. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision (2010).
  • Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., & Fang, Z., et al. (2019). Dual attention network for scene segmentation....
  • Gao, Y., Beijbom, O., Zhang, N., & Darrell, T. (2016). Compact bilinear pooling. In Proceedings of the IEEE conference...
  • Ghiasi, G., et al. Laplacian reconstruction and refinement for semantic segmentation (2016).
  • Goodfellow, I., et al. Generative adversarial nets.
  • Hariharan, B., et al. Semantic contours from inverse detectors.
  • Hassaballah, M., et al. Recent advances in computer vision: Theories and applications. Studies in Computational Intelligence (2019).
  • Hoffman, J., et al. CyCADA: Cycle-consistent adversarial domain adaptation.
  • Hong, S., et al. Decoupled deep neural network for semi-supervised semantic segmentation.
  • Hong, S., Yang, D., Choi, J., & Lee, H. (2018). Inferring semantic layout for hierarchical text-to-image synthesis. In...
  • Hung, W.-C., et al. Adversarial learning for semi-supervised semantic segmentation (2018).
  • Isola, P., et al. Image-to-image translation with conditional adversarial networks.
  • Kingma, D. P., et al. Adam: A method for stochastic optimization (2014).
  • Krapac, J., et al. Ladder-style DenseNets for semantic segmentation of large natural images.
  • Ledig, C., Theis, L., Huszár, F., Caballero, J., Cunningham, A., & Acosta, A., et al. (2017). Photo-realistic single...