Distinguishing foreground and background alignment for unsupervised domain adaptive semantic segmentation
Introduction
Effectively alleviating the domain gap is key to improving the performance of domain adaptive semantic segmentation. At present, widely used approaches [[1], [2], [3]] employ adversarial learning to align global semantic features between domains; e.g., [4] constructs a multi-level adversarial network to perform output-space domain adaptation at different feature levels, and [5] applies different adversarial weights to different regions to address class-level alignment. Unlike these methods, we align the background classes and the foreground classes separately.
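To make the idea of separate alignment concrete, the following is a minimal NumPy sketch (not the paper's exact formulation): the adversarial loss on target-domain predictions is computed separately over foreground and background pixels, so the two groups of classes can be aligned with different strengths. The weights `lam_fg` and `lam_bg` are hypothetical.

```python
import numpy as np

def bce_source_label(d_out, eps=1e-8):
    """Per-pixel binary cross-entropy against the 'source' label (1),
    i.e., the loss that pushes target predictions to fool the discriminator."""
    return -np.log(d_out + eps)

def split_adversarial_loss(d_out_target, fg_mask, lam_fg=1.0, lam_bg=0.5):
    """Illustrative foreground/background-split adversarial loss.
    d_out_target -- H x W discriminator scores in (0, 1) on target predictions
    fg_mask      -- H x W boolean mask of predicted foreground pixels
    lam_fg, lam_bg -- hypothetical per-group alignment weights."""
    loss_map = bce_source_label(d_out_target)
    fg = loss_map[fg_mask].mean() if fg_mask.any() else 0.0
    bg = loss_map[~fg_mask].mean() if (~fg_mask).any() else 0.0
    return lam_fg * fg + lam_bg * bg
```

Weighting the two groups independently lets the rarer, shape-sensitive foreground classes receive stronger alignment pressure than the large, texture-dominated background regions.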
Moreover, pseudo-label-based unsupervised domain adaptive semantic segmentation methods take high-confidence predictions as pseudo ground truth for the unlabeled target domain, and thus fine-tune the model trained on the source domain. In [6], a self-supervised learning method is proposed that combines different outputs of the model to generate pseudo-labels for unlabeled data. CBST [7] achieves domain adaptation by generating class-balanced pseudo-labels from images, and introduces a spatial prior to guide the adaptation process. After comprehensive analysis, we adopt the self-supervised learning (SSL) scheme proposed in bidirectional learning (BDL) [8]: the target domain with pseudo-labels is used to update the adaptation network, while low-confidence predictions are excluded. Compared with existing methods, this scheme aligns the two domains better while generating pseudo-labels in a single learning pass.
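The confidence-thresholded pseudo-label step can be sketched as follows, in the spirit of the SSL step of BDL [8]; the threshold value and the ignore index are illustrative, not the paper's exact settings.

```python
import numpy as np

def generate_pseudo_labels(prob, threshold=0.9, ignore_index=255):
    """Confidence-thresholded pseudo-labels for one target image.
    prob -- C x H x W softmax output of the segmentation network.
    Pixels whose maximum class probability falls below `threshold`
    are marked with `ignore_index` and excluded from fine-tuning."""
    conf = prob.max(axis=0)                  # per-pixel max class probability
    label = prob.argmax(axis=0)              # per-pixel predicted class
    label[conf < threshold] = ignore_index   # drop low-confidence pixels
    return label
```

The resulting label map is then used as ground truth for the target images when updating the adaptation network.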
We note that spatial context is important to segmentation, but the trade-off between capturing rich context information and the resulting computational complexity needs consideration. Much work has improved the joint encoding of spatial and channel information. The self-attention map in SAGAN [9] shows a good balance between the ability to model long-range dependencies and computational efficiency: the self-attention module takes the weighted sum of the features at all positions of the feature map as the response at each position, and the attention vector is cheap to compute. The squeeze-and-excitation (SE) [10] module improves the expressive ability of the network by modeling the dependencies between the channels of convolutional features: it selects and amplifies valuable channels from global information and suppresses useless feature channels. SAGAN focuses on the spatial relationships between pixels, while the channel-attention SE module discards spatial correlation through global average pooling. However, for semantic segmentation, a dense prediction task, capturing both spatial and channel dependencies is equally important. We therefore introduce the concurrent spatial and channel attention module (scSE) [11] to obtain both the inter-channel dependencies and the spatial position information.
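A minimal NumPy sketch of an scSE block in the spirit of [11] is shown below; biases are omitted, the weights are assumed pre-trained, and the two branches are combined by element-wise addition (one common choice).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scse(x, w1, w2, w_s):
    """Concurrent spatial and channel squeeze-and-excitation (sketch).
    x   -- C x H x W feature map
    w1  -- (C//r) x C channel-squeeze weights (r: reduction ratio)
    w2  -- C x (C//r) channel-excitation weights
    w_s -- C spatial-squeeze (1x1 convolution) weights"""
    # cSE branch: global average pool -> bottleneck MLP -> channel gates
    z = x.mean(axis=(1, 2))                       # shape (C,)
    s_c = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))   # shape (C,)
    x_cse = x * s_c[:, None, None]
    # sSE branch: 1x1 convolution over channels -> spatial gates
    s_s = sigmoid(np.tensordot(w_s, x, axes=(0, 0)))  # shape (H, W)
    x_sse = x * s_s[None, :, :]
    # recalibrated maps from both branches are summed element-wise
    return x_cse + x_sse
```

The cSE branch discards spatial layout (global pooling) while the sSE branch discards channel identity (1×1 projection); running them concurrently recovers both kinds of dependency at negligible extra cost.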
The main contributions of this paper are:
- In view of the difference between the background classes and the foreground classes, we distinguish and align the foreground and background classes separately to improve semantic-level alignment.
- In the segmentation network of the GAN, a concurrent spatial and channel attention module is introduced to capture spatial position information and inter-channel dependencies.
- We add Focal Loss to the overall loss to reduce the impact of class imbalance on the adaptation process, and use spectral normalization (SN) to stabilize GAN training.
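The class-imbalance term above can be sketched as the standard focal loss; the exact weighting used in the paper may differ, and the example values are illustrative.

```python
import numpy as np

def focal_loss(prob, target, gamma=2.0, eps=1e-8):
    """Focal loss over N pixels (sketch).
    prob   -- N x C softmax probabilities
    target -- N integer class labels
    The (1 - p_t)^gamma factor down-weights easy, well-classified
    pixels, so rare classes contribute relatively more to the loss;
    gamma = 0 recovers standard cross-entropy."""
    p_t = prob[np.arange(len(target)), target]
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t + eps)))
```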
The proposed method is evaluated on two unsupervised domain adaptation tasks, GTA5 [12] to Cityscapes [13] and SYNTHIA [14] to Cityscapes, and achieves high performance on both.
Related work
The main idea of the domain adaptive task is to align the feature distributions of the source domain and the target domain. Unlike domain adaptation in image classification, domain adaptation in semantic segmentation is a challenging task. When knowledge learned from virtual images is transferred to real images, the differences between the training and test stages must be corrected so that the model generalizes better during testing [
Overview
Our overall framework is shown in Fig. 1; it mainly comprises a segmentation network G and a discriminator network D. As in AdaptSeg [4], we adopt a two-level adversarial approach: the features of the conv4 and conv5 layers are used to predict segmentation results in the output space, which are then fed into the discriminator. Let Xs and Xt be the datasets of the labeled source domain and the unlabeled target domain, with each image xs, xt ∈ RH×W×3 in
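The overall objective for such a two-level setup can be sketched as a weighted sum of the segmentation loss and the two adversarial terms; the weights below are illustrative defaults in the style of AdaptSeg [4], not the paper's reported values.

```python
def two_level_objective(l_seg, l_adv_conv4, l_adv_conv5,
                        lam4=0.0002, lam5=0.001):
    """Combine the supervised segmentation loss with the adversarial
    losses from the conv4-level (auxiliary) and conv5-level (final)
    output-space predictions. The auxiliary level is typically weighted
    more weakly than the final one; lam4 and lam5 are illustrative."""
    return l_seg + lam4 * l_adv_conv4 + lam5 * l_adv_conv5
```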
Experimental datasets and setup
The migration from synthetic datasets to a real dataset follows previous work. The real-scene dataset Cityscapes is used as the target domain, and the virtual-scene datasets GTA5 and SYNTHIA are used as the source domains. The target domain Cityscapes and the source domain GTA5 both contain 19 common classes, while SYNTHIA contains 16 common city classes. The images in the Cityscapes dataset are 2048 × 1024 pixels, and the dataset contains 5000 annotated images. The image
Conclusions
We present an adaptive semantic segmentation method based on fine-grained alignment, built on the original two-level adversarial network. First, we use a self-supervised learning method to generate pseudo-labels for the target domain, and use them to better align the two domains. Then the foreground classes and the background classes are aligned separately, which takes a more detailed account of the inter-domain difference in the spatial distribution between
CRediT authorship contribution statement
Jia Zhang: Conceptualization, Methodology, Investigation, Writing – original draft. Wei Li: Data curation, Visualization, Writing – original draft. Zhixin Li: Data curation, Methodology, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (Nos. 61966004, 61866004), the Guangxi Natural Science Foundation (No. 2019GXNSFDA245018), the Guangxi "Bagui Scholar" Teams for Innovation and Research Project, the Guangxi Talent Highland Project of Big Data Intelligence and Application, and the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing.
References (38)
[1] FCNs in the wild: Pixel-level adversarial and constraint-based adaptation, arXiv, 2022.
[2] Adversarial discriminative domain adaptation.
[3] ROAD: Reality oriented adaptation for semantic segmentation of urban scenes.
[4] Learning to adapt structured output space for semantic segmentation.
[5] Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation.
[6] Temporal ensembling for semi-supervised learning, arXiv, 2022.
[7] Unsupervised domain adaptation for semantic segmentation via class-balanced self-training.
[8] Bidirectional learning for domain adaptation of semantic segmentation.
[9] Self-attention generative adversarial networks.
[10] Squeeze-and-excitation networks.
[11] Concurrent spatial and channel 'squeeze & excitation' in fully convolutional networks.
[12] Playing for data: Ground truth from computer games.
[13] The Cityscapes dataset for semantic urban scene understanding.
[14] The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes.
[15] Visual domain adaptation: A survey of recent advances, IEEE Signal Process. Mag.
[16] A theory of learning from different domains, Mach. Learn.
[17] ADVENT: Adversarial entropy minimization for domain adaptation in semantic segmentation.
[18] Deep alignment network based multi-person tracking with occlusion and motion reasoning, IEEE Trans. Multimedia.
[19] Fine-grained spatial alignment model for person re-identification with focal triplet loss, IEEE Trans. Image Process.