Distinguishing foreground and background alignment for unsupervised domain adaptive semantic segmentation

https://doi.org/10.1016/j.imavis.2022.104513

Highlights

  • We use self-supervised learning to generate pseudo-labels for the target domain.

  • We distinguish and align the foreground and background classes.

  • We use a parallel attention module to capture spatial and channel information.

  • We add focal loss to the overall loss to reduce the impact of class imbalance.

Abstract

Unsupervised domain adaptive semantic segmentation uses knowledge learned from a labeled source-domain dataset to guide segmentation of the target domain. However, the differing feature distributions of the source and target domains cause a large inter-domain gap. We use self-supervised learning to generate pseudo-labels for the target domain, so that the corresponding pixels are aligned directly with the source domain through the segmentation loss. We observe that the spatial distribution of the background classes differs little between the source and target domains, whereas the appearance of the same foreground class can vary considerably; we therefore align the foreground and background classes separately. Capturing the rich spatial and channel information in the feature maps during convolution is essential for fine-grained semantic segmentation. To obtain both the dependencies between feature-map channels and the spatial position information, we use a parallel channel and spatial attention module, which lets the network select and amplify valuable spatial and channel information from the global context while suppressing useless information. In addition, we introduce the focal loss to address class imbalance in the datasets. Experiments show that our method achieves better performance on unsupervised domain adaptive semantic segmentation.

Introduction

Effectively alleviating the domain gap is key to improving the performance of domain adaptive semantic segmentation. Widely used approaches [[1], [2], [3]] employ adversarial learning to align global semantic features between domains; e.g., [4] constructs a multi-level adversarial network that performs output-space domain adaptation at different feature levels, and [5] applies different adversarial weights to different regions to address class-level alignment. Unlike these methods, we align the background and foreground classes separately.
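As a minimal illustration of this foreground/background distinction, the target classes can be partitioned into background ("stuff") and foreground ("thing") groups and the per-pixel adversarial signal averaged per group. The index split below is hypothetical, not the paper's exact grouping:

```python
import numpy as np

# Hypothetical split of the 19 Cityscapes train classes into background
# ("stuff") and foreground ("thing") groups; these indices are
# illustrative only, not the paper's exact grouping.
BACKGROUND = {0, 1, 2, 3, 4, 8, 9, 10}    # e.g. road, sidewalk, building, ...
FOREGROUND = set(range(19)) - BACKGROUND  # e.g. person, car, bus, ...

def group_adversarial_signal(pred_labels, adv_map):
    """Average the per-pixel adversarial signal separately over the
    foreground and background regions, so each group can be aligned
    (and weighted) on its own terms."""
    fg_mask = np.isin(pred_labels, list(FOREGROUND))
    fg = adv_map[fg_mask].mean() if fg_mask.any() else 0.0
    bg = adv_map[~fg_mask].mean() if (~fg_mask).any() else 0.0
    return fg, bg
```

The two group averages can then enter the overall objective with separate weights, rather than a single global adversarial term.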

Moreover, unsupervised domain adaptive semantic segmentation methods based on pseudo-labels use high-confidence predictions as pseudo ground truth for the unlabeled target-domain data and thus fine-tune the model trained on the source domain. In [6], a self-supervised learning method that combines different outputs of the model is proposed to generate pseudo-labels for unlabeled data. CBST [7] achieves domain adaptation by generating class-balanced pseudo-labels from images and introduces a spatial prior to guide the adaptation process. After careful analysis, we adopt the self-supervised learning (SSL) scheme proposed in bidirectional learning (BDL) [8]: the target domain with pseudo-labels is used to update the adaptation network, while low-confidence predicted labels are excluded. This combines the two domains better than existing methods that generate pseudo-labels through only a single round of learning.
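The confidence-based filtering described above can be sketched as follows; the threshold value and the ignore index are assumptions (common defaults), not the paper's exact settings:

```python
import numpy as np

def generate_pseudo_labels(probs, threshold=0.9, ignore_index=255):
    """Keep only high-confidence predictions as pseudo-labels.

    probs: (C, H, W) softmax output of the segmentation network.
    Pixels whose maximum class probability falls below `threshold`
    are marked with `ignore_index` and excluded from the loss.
    """
    labels = probs.argmax(axis=0)          # predicted class per pixel
    confidence = probs.max(axis=0)         # probability of that class
    labels[confidence < threshold] = ignore_index
    return labels

# toy example: 2 classes on a 2x2 image
probs = np.array([[[0.95, 0.60],
                   [0.20, 0.10]],
                  [[0.05, 0.40],
                   [0.80, 0.90]]])
pseudo = generate_pseudo_labels(probs, threshold=0.9)
# high-confidence pixels keep their argmax class; the rest are ignored
```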

We note that spatial context is important to the segmentation task; however, balancing the capture of rich context information against computational cost requires care. Much work has improved the joint encoding of spatial and channel information. The self-attention map in SAGAN [9] strikes a good balance between the ability to model long-range dependencies and computational efficiency: the self-attention module takes the weighted sum of the features at all positions of the feature map as the response at each position, and the attention vector is cheap to compute. The squeeze-and-excitation (SE) module [10] improves the expressive ability of the network by modeling the dependencies between the channels of convolutional features: it selects and amplifies valuable channels from global information and suppresses useless feature channels. SAGAN focuses on the spatial position relationships between pixels, while the channel-attention SE module discards spatial correlation through global average pooling. For semantic segmentation, a dense prediction task, capturing the spatial and the channel dependencies is equally important. We therefore introduce the concurrent spatial and channel squeeze-and-excitation (scSE) module [11] to obtain both the inter-channel dependencies and the spatial position information.
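The gating performed by the two scSE branches can be sketched as follows. This is a simplified NumPy version: the cSE branch of the original module uses a two-layer bottleneck MLP, collapsed here to a single linear map, and combining the branches by element-wise addition is an assumption for this sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scse(feat, w_channel, w_spatial):
    """Simplified concurrent spatial and channel SE (scSE).

    feat:      (C, H, W) feature map
    w_channel: (C, C) weights of the channel-excitation branch (cSE)
    w_spatial: (C,)  weights of the 1x1 conv in the spatial branch (sSE)
    """
    # cSE: global average pool -> per-channel gate, broadcast over space
    z = feat.mean(axis=(1, 2))                       # (C,)
    channel_gate = sigmoid(w_channel @ z)            # (C,)
    cse = feat * channel_gate[:, None, None]
    # sSE: 1x1 conv across channels -> per-pixel gate, broadcast over channels
    spatial_gate = sigmoid(np.tensordot(w_spatial, feat, axes=1))  # (H, W)
    sse = feat * spatial_gate[None, :, :]
    # combine the two recalibrations (element-wise addition here)
    return cse + sse
```

The cSE branch recalibrates whole channels while the sSE branch recalibrates individual spatial positions, so their combination retains both kinds of information.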

The main contributions of this paper are:

  • In view of the difference between the background classes and the foreground classes, we distinguish and align the foreground and background classes separately to improve semantic-level alignment.

  • In the segmentation network of the GAN, a parallel spatial and channel attention module is introduced to capture the spatial position information and the dependencies between channels.

  • We add the focal loss to the overall loss to reduce the impact of class imbalance on the adaptation process, and use spectral normalization (SN) to stabilize GAN training.
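The focal loss named in the last contribution down-weights well-classified pixels so rare classes contribute more to the gradient. A minimal sketch follows; the `gamma` and `alpha` values are the common defaults from Lin et al., not necessarily this paper's settings:

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    """Focal loss: scales the cross-entropy of each sample by
    (1 - p_t)^gamma, so confident (easy) predictions are down-weighted.

    probs:   (N, C) predicted class probabilities
    targets: (N,)   ground-truth class indices
    """
    pt = probs[np.arange(len(targets)), targets]   # probability of true class
    return np.mean(-alpha * (1.0 - pt) ** gamma * np.log(pt + 1e-8))
```

A well-classified pixel (p_t close to 1) thus contributes almost nothing, while hard pixels dominate the loss.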

The proposed method is evaluated on two unsupervised domain adaptation tasks, GTA5 [12] to Cityscapes [13] and SYNTHIA [14] to Cityscapes, and achieves high performance on both.


Related work

The main idea of the domain adaptive task is to align the feature distributions of the source and target domains. Unlike domain adaptation for image classification, domain adaptation for semantic segmentation is a challenging task. When the knowledge learned from virtual images is transferred to real images, the differences between the training and test stages must be corrected so that the model generalizes better during testing [

Overview

Our overall framework is shown in Fig. 1 and mainly comprises a segmentation network G and a discriminator network D. As in AdaptSeg [4], we adopt a two-level adversarial approach: the features of the conv4 and conv5 layers are used to predict segmentation results in the output space, which are then fed to the discriminator. Let Xs and Xt be the datasets of the labeled source domain and the unlabeled target domain, each image xs, xt ∈ ℝH×W×3 in
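The adversarial term that drives G at each of the two levels can be sketched as follows, assuming the standard binary cross-entropy formulation; the exact per-level loss weights are not reproduced here:

```python
import numpy as np

def adv_loss_for_G(d_logits_on_target):
    """Adversarial term for the segmentation network G: push the
    discriminator D to label target-domain output-space predictions
    as source (label 1). In the two-level scheme this is computed
    once per level (conv4 and conv5) and added to the segmentation
    loss with a per-level weight.

    d_logits_on_target: D's raw scores on G's target predictions.
    """
    p_source = 1.0 / (1.0 + np.exp(-d_logits_on_target))  # sigmoid of D's scores
    return float(-np.mean(np.log(p_source + 1e-8)))       # BCE toward label 1
```

When D is fooled (high source-probability on target predictions), this loss is small; G is updated to minimize it, pulling the two output-space distributions together.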

Experimental datasets and setup

The migration from synthetic datasets to a real dataset follows previous work: the real-scene Cityscapes dataset is used as the target domain, and the virtual-scene GTA5 and SYNTHIA datasets are used as the source domains. Cityscapes and GTA5 share 19 common classes, while SYNTHIA contains 16 common urban classes. The images in Cityscapes are 2048 × 1024, and the dataset contains 5000 annotated images. The image

Conclusions

We present an adaptive semantic segmentation method based on fine-grained alignment, built on the original two-level adversarial network. First, we use self-supervised learning to generate pseudo-labels for the target domain and use them to better align the two domains. Then the foreground and background classes are aligned separately, which accounts more precisely for the inter-domain difference in the spatial distribution between

CRediT authorship contribution statement

Jia Zhang: Conceptualization, Methodology, Investigation, Writing – original draft. Wei Li: Data curation, Visualization, Writing – original draft. Zhixin Li: Data curation, Methodology, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Nos. 61966004, 61866004), the Guangxi Natural Science Foundation (No. 2019GXNSFDA245018), the Guangxi "Bagui Scholar" Teams for Innovation and Research Project, the Guangxi Talent Highland Project of Big Data Intelligence and Application, and the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing.

References (38)

  • J. Hoffman et al.

FCNs in the wild: Pixel-level adversarial and constraint-based adaptation

    arXiv

    (2016)
  • E. Tzeng et al.

    Adversarial discriminative domain adaptation

  • Y. Chen et al.

    ROAD: Reality Oriented Adaptation for Semantic Segmentation of Urban Scenes

  • Y.-H. Tsai et al.

    Learning to adapt structured output space for semantic segmentation

  • Y. Luo et al.

    Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation

  • S. Laine et al.

    Temporal ensembling for semi-supervised learning

    arXiv

    (2016)
  • Y. Zou et al.

    Unsupervised domain adaptation for semantic segmentation via class-balanced self-training

  • Y. Li et al.

    Bidirectional learning for domain adaptation of semantic segmentation

  • H. Zhang et al.

    Self-attention generative adversarial networks

  • J. Hu et al.

    Squeeze-and-excitation networks

  • A.G. Roy et al.

    Concurrent spatial and channel 'squeeze & excitation' in fully convolutional networks

  • S.R. Richter et al.

    Playing for data: Ground truth from computer games

  • M. Cordts et al.

    The cityscapes dataset for semantic urban scene understanding

  • G. Ros et al.

    The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes

  • V.M. Patel et al.

    Visual domain adaptation: a survey of recent advances

    IEEE Signal Process. Mag.

    (2015)
  • S. Ben-David et al.

    A theory of learning from different domains

    Mach. Learn.

    (2010)
  • T.-H. Vu et al.

    Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation

  • Q. Zhou et al.

    Deep alignment network based multi-person tracking with occlusion and motion reasoning

    IEEE Trans. Multimedia

    (2018)
  • Q. Zhou et al.

    Fine-grained spatial alignment model for person re-identification with focal triplet loss

    IEEE Trans. Image Process.

    (2020)