Learning intra-domain style-invariant representation for unsupervised domain adaptation of semantic segmentation
Introduction
Unsupervised domain adaptation (UDA) aims to transfer knowledge from a domain rich in ground truth labels to an unlabeled domain. UDA is especially promising for tasks that suffer from a shortage of ground truth labels, such as semantic segmentation. In recent years, synthetic datasets (e.g., GTA5 [1] and SYNTHIA [2]) have drawn researchers’ interest as appropriate candidates for the source domain in UDA of semantic segmentation. Labels for synthetic data can be produced automatically, so leveraging such data may considerably alleviate the burden of human annotation.
Unlike semi-supervised learning (SSL), in which labeled and unlabeled data typically follow the same distribution, the two domains in UDA have quite different distributions, and their images exhibit major visual differences. Aligning the feature distributions of the two domains is therefore considered the key to transferring knowledge. Researchers have pursued this with various approaches, such as modifying images to make the two domains visually similar [3], [4], [5] and using adversarial learning to make the domain of the features or segmentation outputs indistinguishable [6], [7], [8]. Despite significant achievements, a problem that has not attracted sufficient attention is that aligning the feature distributions cannot ensure the generalization ability of the trained model in the target domain. Owing to the differing intrinsic data distributions and some nontransferable features, the two domains cannot be completely aligned, so a model trained with supervision signals from only the source domain may not generalize well in the target domain. Although pseudo labels can provide supervision signals in the target domain [9], [10], [11], the final performance still depends on the model that generates the pseudo labels.
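To illustrate the output-space adversarial alignment mentioned above (in the spirit of [6]), the following is a minimal PyTorch-style sketch in which a discriminator is trained to distinguish source from target segmentation outputs; the function names, optimizer arguments, and the weight lambda_adv are hypothetical, and exact losses vary across methods.

```python
import torch
import torch.nn.functional as F

def output_space_alignment_step(segmenter, discriminator, src_img, tgt_img,
                                opt_seg, opt_disc, lambda_adv=0.001):
    # 1) Train the discriminator to tell source outputs from target outputs.
    src_prob = F.softmax(segmenter(src_img), dim=1)
    tgt_prob = F.softmax(segmenter(tgt_img), dim=1)
    d_src = discriminator(src_prob.detach())
    d_tgt = discriminator(tgt_prob.detach())
    disc_loss = (F.binary_cross_entropy_with_logits(d_src, torch.ones_like(d_src))
                 + F.binary_cross_entropy_with_logits(d_tgt, torch.zeros_like(d_tgt)))
    opt_disc.zero_grad()
    disc_loss.backward()
    opt_disc.step()

    # 2) Train the segmenter so that target outputs fool the discriminator,
    #    i.e., target predictions become indistinguishable from source ones.
    d_tgt = discriminator(tgt_prob)
    adv_loss = lambda_adv * F.binary_cross_entropy_with_logits(
        d_tgt, torch.ones_like(d_tgt))
    opt_seg.zero_grad()
    adv_loss.backward()
    opt_seg.step()
```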
In this study, to tackle the problem of generalization in the target domain, we focus on learning intra-domain style-invariant representation for UDA of semantic segmentation. The underlying idea is that if the learned representation is invariant to the varied characteristics of the target domain (e.g., brightness, saturation, and texture, which we refer to as intra-domain styles in this paper), the segmentation model may perform well on unseen samples in the target domain. This idea is somewhat similar to data augmentation, which generally helps improve the generalization ability of convolutional neural networks (CNNs) in supervised learning. In our setting, however, the style of an image cannot be appropriately modified with common augmentation techniques. More importantly, the style-invariant representation is learned not only via supervised learning on labeled source domain samples but also via unsupervised learning on unlabeled target domain samples. We therefore propose a self-ensembling method that integrates the supervised and unsupervised learning of intra-domain style-invariant representation and, additionally, construct a multimodal unpaired image-to-image (I2I) translation model to obtain images with diverse intra-domain styles.
The idea of self-ensembling originated in studies of SSL [12], [13]. Self-ensembling was applied to UDA of semantic segmentation in a previous work [14], but only as a standard SSL technique that considers neither the generalization problem nor intra-domain styles. In this study, we use a self-ensembling architecture [13] that consists of a student model trained with style-diversified images and a teacher model updated as the exponential moving average (EMA) of the student model. By training with images of diversified intra-domain styles, learning of the intra-domain style-invariant representation is integrated into a supervised loss on the source domain and a teacher-student consistency loss on the target domain. Pseudo labels are subsequently introduced into the training to further improve the UDA performance.
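As a concrete illustration, here is a minimal PyTorch-style sketch of one training step of the self-ensembling scheme described above. The helper names, the MSE form of the consistency loss, and the weight lambda_cons are our assumptions for illustration; the exact formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(student, teacher, alpha=0.999):
    # Teacher weights are the exponential moving average of student weights.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def training_step(student, teacher, src_img, src_label,
                  tgt_img, tgt_img_styled, optimizer, lambda_cons=1.0):
    optimizer.zero_grad()
    # Supervised loss on (style-diversified) source images with ground truth.
    sup_loss = F.cross_entropy(student(src_img), src_label, ignore_index=255)
    # Consistency loss: the student sees a style-diversified target image,
    # the teacher sees the original one; their predictions should agree,
    # which pushes the representation to be invariant to intra-domain styles.
    with torch.no_grad():
        teacher_prob = F.softmax(teacher(tgt_img), dim=1)
    student_prob = F.softmax(student(tgt_img_styled), dim=1)
    cons_loss = F.mse_loss(student_prob, teacher_prob)
    loss = sup_loss + lambda_cons * cons_loss
    loss.backward()
    optimizer.step()
    update_teacher(student, teacher)
    return loss.item()
```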
As mentioned above, images with diversified intra-domain styles are indispensable for realizing our concept. In our method, we translate the source domain images into different target domain styles and simultaneously diversify the styles of the target domain images. Such a task can in principle be accomplished by an existing multimodal unpaired I2I translation method named multimodal unsupervised image-to-image translation (MUNIT) [15]. However, we found that this existing method cannot meet an essential requirement of our study: consistency of the semantic contents in the translation results. An example is shown in Fig. 1, where the semantic contents of the sky region are inconsistent across the translation results of MUNIT. To overcome this problem, we adapt the MUNIT architecture for content-consistent translation by introducing pixel-level semantic information as additional guidance. As shown in Fig. 1, the consistency of semantic contents in our translation results is enhanced compared to that of MUNIT, making the learning of style-invariant representation realizable.
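To make the style-diversification step concrete, the following is a minimal sketch of how a MUNIT-like model with a content-style decomposition could generate several stylized renderings of one image by resampling the style code. The method names encode_content and decode and the style dimensionality are hypothetical placeholders, and our semantic-aware guidance is omitted here for brevity.

```python
import torch

def diversify_styles(generator, image, num_styles=4, style_dim=8):
    # MUNIT-like translation: keep the content code of the input image and
    # decode it with randomly sampled style codes, yielding several images
    # of the same scene rendered in different intra-domain styles.
    content = generator.encode_content(image)            # scene structure
    styled_images = []
    for _ in range(num_styles):
        style = torch.randn(image.size(0), style_dim, 1, 1,
                            device=image.device)         # random style code
        styled_images.append(generator.decode(content, style))
    return styled_images
```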
In this paper, we make the following contributions.
- We propose the concept of learning intra-domain style-invariant representation for UDA of semantic segmentation, which helps the trained model generalize better to the diverse intra-domain styles of the target domain.
- We propose a self-ensembling method for learning the intra-domain style-invariant representation and construct a semantic-aware version of MUNIT for style diversification.
- We achieve state-of-the-art UDA performance on the GTA5-to-Cityscapes and SYNTHIA-to-Cityscapes benchmarks and conduct extensive experiments for further analysis.
Section snippets
UDA of semantic segmentation
UDA of semantic segmentation is considered a challenging task owing to the complexity of transferring pixel-level semantic knowledge. Methods for UDA of semantic segmentation generally build on three main components: I2I translation, adversarial learning, and semi-supervised learning.
I2I translation methods can modify characteristics of an image (e.g., color and texture), collectively called its “style”, typically to reduce the visual domain gap. Cycle-consistent …
Proposed method
First, we provide the problem setting. Given a source domain dataset $\mathcal{D}_s = \{(x_s, y_s)\}$ of image-label pairs and a target domain dataset $\mathcal{D}_t = \{x_t\}$ of unlabeled images, we aim to transfer semantic knowledge from $\mathcal{D}_s$ to $\mathcal{D}_t$ at the pixel level. Our method for learning intra-domain style-invariant representation is presented in Section 3.1. The I2I translation model that generates the style-diversified images used in that method is described in Section 3.2. …
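For concreteness, the overall training objective sketched above can be written in the following form; the MSE consistency term, the weight $\lambda$, and the EMA rate $\alpha$ are notational assumptions based on the description in the introduction, not the paper's exact formulation. Here $x_s'$ and $x_t'$ denote style-diversified versions of $x_s$ and $x_t$, and $f_\theta$, $f_{\theta'}$ denote the student and teacher models.

```latex
% Supervised loss on style-diversified source images plus a
% teacher-student consistency loss on target images.
\mathcal{L}(\theta)
  = \mathbb{E}_{(x_s, y_s) \sim \mathcal{D}_s}
      \big[ \mathrm{CE}\big( f_\theta(x_s'),\, y_s \big) \big]
  + \lambda \, \mathbb{E}_{x_t \sim \mathcal{D}_t}
      \big\| f_\theta(x_t') - f_{\theta'}(x_t) \big\|^2 ,
\qquad
\theta' \leftarrow \alpha \theta' + (1 - \alpha)\, \theta .
```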
Experiments
We conducted experiments on two benchmarks, GTA5-to-Cityscapes and SYNTHIA-to-Cityscapes, both of which are synthetic-to-real adaptations. The datasets are described in Section 4.1. The main results on the benchmarks are presented and compared with those of state-of-the-art methods in Section 4.2. Finally, extensive supplementary experiments that further analyze and validate the effectiveness of our method are reported in Section 4.3.
Conclusion
In this paper, we have proposed a novel concept of learning intra-domain style-invariant representation for UDA of semantic segmentation and constructed a method based on it. Learning a representation invariant to diversified intra-domain styles contributes to generalization in the target domain. To realize this, we first trained a semantic-aware multimodal I2I translation model to obtain images with diversified intra-domain styles and consistent semantic contents.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This study was partly supported by JSPS KAKENHI Grant Number JP21H03456. This study was conducted on the Data Science Computing System of Education and Research Center for Mathematical and Data Science, Hokkaido University.
References (47)
- et al., Simplified unsupervised image translation for semantic segmentation adaptation, Pattern Recognit. (2020)
- et al., Generative attention adversarial classification network for unsupervised domain adaptation, Pattern Recognit. (2020)
- et al., Deep conditional adaptation networks and label correlation transfer for unsupervised domain adaptation, Pattern Recognit. (2020)
- et al., Exploring uncertainty in pseudo-label guided unsupervised domain adaptation, Pattern Recognit. (2019)
- et al., Multi-level adversarial network for domain adaptive semantic segmentation, Pattern Recognit. (2022)
- et al., Scale variance minimization for unsupervised domain adaptation in image segmentation, Pattern Recognit. (2021)
- S.R. Richter et al., Playing for data: ground truth from computer games, Proceedings of the European Conference on Computer Vision (2016)
- G. Ros et al., The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
- J. Hoffman et al., CyCADA: cycle-consistent adversarial domain adaptation, Proceedings of the International Conference on Machine Learning (2018)
- A. Dundar et al., Domain stylization: a fast covariance matching framework towards domain adaptation, IEEE Trans. Pattern Anal. Mach. Intell. (2021)
- Y.-H. Tsai et al., Learning to adapt structured output space for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
- Z. Zheng et al., Unsupervised scene adaptation with memory regularization in vivo, Proceedings of the International Joint Conference on Artificial Intelligence (2020)
- Z. Zheng et al., Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation, Int. J. Comput. Vis. (2021)
- S. Laine et al., Temporal ensembling for semi-supervised learning, arXiv preprint arXiv:1610.02242 (2016)
- A. Tarvainen et al., Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results, Advances in Neural Information Processing Systems (2017)
- J. Choi et al., Self-ensembling with GAN-based data augmentation for domain adaptation in semantic segmentation, Proceedings of the IEEE International Conference on Computer Vision (2019)
- X. Huang et al., Multimodal unsupervised image-to-image translation, Proceedings of the European Conference on Computer Vision (2018)
- J.-Y. Zhu et al., Unpaired image-to-image translation using cycle-consistent adversarial networks, Proceedings of the IEEE International Conference on Computer Vision (2017)
- Y. Li et al., Bidirectional learning for domain adaptation of semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
- I. Goodfellow et al., Generative adversarial nets, Advances in Neural Information Processing Systems (2014)
- Y. Zhang et al., Fully convolutional adaptation networks for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
- Z. Wu et al., DCAN: dual channel-wise alignment networks for unsupervised scene adaptation, Proceedings of the European Conference on Computer Vision (2018)
- L.A. Gatys et al., Image style transfer using convolutional neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Zongyao Li received the B.S. degree in flight vehicle design and engineering from Zhejiang University, China, in 2016 and the M.S. degree from the Graduate School of Information Science and Technology, Hokkaido University, Japan, in 2020. He is currently pursuing a Ph.D. degree at the Graduate School of Information Science and Technology, Hokkaido University.
Ren Togo received the B.S. degree in health sciences from Hokkaido University, Japan, in 2015 and the M.S. and Ph.D. degrees from the Graduate School of Information Science and Technology, Hokkaido University, in 2017 and 2019, respectively. He is currently a Specially Appointed Assistant Professor with the Faculty of Information Science and Technology, Hokkaido University.
Takahiro Ogawa received the B.S., M.S., and Ph.D. degrees in electronics and information engineering from Hokkaido University, Japan, in 2003, 2005, and 2007, respectively. He joined the Graduate School of Information Science and Technology, Hokkaido University, in 2008. He is currently an Associate Professor with the Faculty of Information Science and Technology, Hokkaido University.
Miki Haseyama received the B.S., M.S., and Ph.D. degrees in electronics from Hokkaido University, Japan, in 1986, 1988, and 1993, respectively. She joined the Graduate School of Information Science and Technology, Hokkaido University, in 1994. She is currently a Professor with the Faculty of Information Science and Technology, Hokkaido University. She is also a Vice-President of Hokkaido University.