Learning intra-domain style-invariant representation for unsupervised domain adaptation of semantic segmentation
Introduction
Unsupervised domain adaptation (UDA) aims to transfer knowledge from a domain rich in ground truth labels to an unlabeled domain. UDA is especially promising for tasks that suffer from a shortage of ground truth labels, such as semantic segmentation. In recent years, synthetic datasets (e.g., GTA5 [1] and SYNTHIA [2]) have drawn researchers’ interest as appropriate candidates for the source domain in UDA of semantic segmentation. Labels for synthetic data can be produced automatically, so leveraging such data may considerably alleviate the burden of human annotation.
Unlike semi-supervised learning (SSL), in which labeled and unlabeled data typically follow the same distribution, the two domains in UDA have quite different distributions, and their images exhibit major visual differences. Aligning the feature distributions of the two domains is therefore considered the key to transferring knowledge. Researchers have pursued this with various approaches, such as modifying images to make the two domains visually similar [3], [4], [5] and using adversarial learning to make the domain of the features or segmentation outputs indistinguishable [6], [7], [8]. Despite significant achievements, a problem that has not attracted sufficient attention is that aligning the feature distributions cannot ensure the generalization ability of the trained model in the target domain. Owing to the differing intrinsic data distributions and some nontransferable features, the two domains cannot be completely aligned, so a model trained with supervision signals from only the source domain may not generalize well in the target domain. Although pseudo labels can provide supervision signals in the target domain [9], [10], [11], the final performance still depends on the model that generates the pseudo labels.
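To illustrate the output-space adversarial alignment mentioned above (in the spirit of [6]), the following is a minimal PyTorch-style sketch in which a discriminator is trained to distinguish source from target segmentation outputs; the function names, optimizer arguments, and the weight lambda_adv are hypothetical, and exact losses vary across methods.

```python
import torch
import torch.nn.functional as F

def output_space_alignment_step(segmenter, discriminator, src_img, tgt_img,
                                opt_seg, opt_disc, lambda_adv=0.001):
    # 1) Train the discriminator to tell source outputs from target outputs.
    src_prob = F.softmax(segmenter(src_img), dim=1)
    tgt_prob = F.softmax(segmenter(tgt_img), dim=1)
    d_src = discriminator(src_prob.detach())
    d_tgt = discriminator(tgt_prob.detach())
    disc_loss = (F.binary_cross_entropy_with_logits(d_src, torch.ones_like(d_src))
                 + F.binary_cross_entropy_with_logits(d_tgt, torch.zeros_like(d_tgt)))
    opt_disc.zero_grad()
    disc_loss.backward()
    opt_disc.step()

    # 2) Train the segmenter so that target outputs fool the discriminator,
    #    i.e., target predictions become indistinguishable from source ones.
    d_tgt = discriminator(tgt_prob)
    adv_loss = lambda_adv * F.binary_cross_entropy_with_logits(
        d_tgt, torch.ones_like(d_tgt))
    opt_seg.zero_grad()
    adv_loss.backward()
    opt_seg.step()
```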
In this study, to tackle the problem of generalization in the target domain, we focus on learning intra-domain style-invariant representation for UDA of semantic segmentation. The underlying idea is that if the learned representation is invariant to the varied characteristics of the target domain (e.g., brightness, saturation, and texture, which we refer to as intra-domain styles in this paper), the segmentation model may perform well on unseen samples in the target domain. This idea is somewhat similar to data augmentation, which generally helps improve the generalization ability of convolutional neural networks (CNNs) in supervised learning. In our setting, however, the style of an image cannot be appropriately modified with common augmentation techniques. More importantly, the style-invariant representation is learned not only via supervised learning on labeled source domain samples but also via unsupervised learning on unlabeled target domain samples. We therefore propose a self-ensembling method that integrates the supervised and unsupervised learning of intra-domain style-invariant representation and, additionally, construct a multimodal unpaired image-to-image (I2I) translation model to obtain images with diverse intra-domain styles.
The idea of self-ensembling originated in studies of SSL [12], [13]. Self-ensembling was applied to UDA of semantic segmentation in a previous work [14], but only as a standard SSL technique that considers neither the generalization problem nor intra-domain styles. In this study, we use a self-ensembling architecture [13] that consists of a student model trained with style-diversified images and a teacher model updated as the exponential moving average (EMA) of the student model. By training with images of diversified intra-domain styles, learning of the intra-domain style-invariant representation is integrated into a supervised loss on the source domain and a teacher-student consistency loss on the target domain. Pseudo labels are subsequently introduced into the training to further improve the UDA performance.
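As a concrete illustration, here is a minimal PyTorch-style sketch of one training step of the self-ensembling scheme described above. The helper names, the MSE form of the consistency loss, and the weight lambda_cons are our assumptions for illustration; the exact formulation in the paper may differ.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(student, teacher, alpha=0.999):
    # Teacher weights are the exponential moving average of student weights.
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def training_step(student, teacher, src_img, src_label,
                  tgt_img, tgt_img_styled, optimizer, lambda_cons=1.0):
    optimizer.zero_grad()
    # Supervised loss on (style-diversified) source images with ground truth.
    sup_loss = F.cross_entropy(student(src_img), src_label, ignore_index=255)
    # Consistency loss: the student sees a style-diversified target image,
    # the teacher sees the original one; their predictions should agree,
    # which pushes the representation to be invariant to intra-domain styles.
    with torch.no_grad():
        teacher_prob = F.softmax(teacher(tgt_img), dim=1)
    student_prob = F.softmax(student(tgt_img_styled), dim=1)
    cons_loss = F.mse_loss(student_prob, teacher_prob)
    loss = sup_loss + lambda_cons * cons_loss
    loss.backward()
    optimizer.step()
    update_teacher(student, teacher)
    return loss.item()
```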
As mentioned above, images with diversified intra-domain styles are indispensable for realizing our concept. In our method, we translate the source domain images into different target domain styles and simultaneously diversify the styles of the target domain images. Such a task can in principle be accomplished by an existing multimodal unpaired I2I translation method named multimodal unsupervised image-to-image translation (MUNIT) [15]. However, we found that this existing method cannot meet an essential requirement of our study: consistency of the semantic contents in the translation results. An example is shown in Fig. 1, where the semantic contents of the sky region are inconsistent across the translation results of MUNIT. To overcome this problem, we adapt the MUNIT architecture for content-consistent translation by introducing pixel-level semantic information as additional guidance. As shown in Fig. 1, the consistency of semantic contents in our translation results is enhanced compared to that of MUNIT, making the learning of style-invariant representation realizable.
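To make the style-diversification step concrete, the following is a minimal sketch of how a MUNIT-like model with a content-style decomposition could generate several stylized renderings of one image by resampling the style code. The method names encode_content and decode and the style dimensionality are hypothetical placeholders, and our semantic-aware guidance is omitted here for brevity.

```python
import torch

def diversify_styles(generator, image, num_styles=4, style_dim=8):
    # MUNIT-like translation: keep the content code of the input image and
    # decode it with randomly sampled style codes, yielding several images
    # of the same scene rendered in different intra-domain styles.
    content = generator.encode_content(image)            # scene structure
    styled_images = []
    for _ in range(num_styles):
        style = torch.randn(image.size(0), style_dim, 1, 1,
                            device=image.device)         # random style code
        styled_images.append(generator.decode(content, style))
    return styled_images
```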
In this paper, we make the following contributions.
- We propose the concept of learning intra-domain style-invariant representation for UDA of semantic segmentation, which helps the trained model generalize better to the diverse intra-domain styles of the target domain.
- We propose a self-ensembling method for learning the intra-domain style-invariant representation and construct a semantic-aware version of MUNIT for style diversification.
- We achieve state-of-the-art UDA performance on the GTA5-to-Cityscapes and SYNTHIA-to-Cityscapes benchmarks and conduct extensive experiments for further analysis.
Section snippets
UDA of semantic segmentation
UDA of semantic segmentation is considered a challenging task owing to the complexity of transferring pixel-level semantic knowledge. Methods for UDA of semantic segmentation generally build on three main components: I2I translation, adversarial learning, and semi-supervised learning.
I2I translation methods can modify characteristics of an image (e.g., color and texture), collectively called its “style”, typically to reduce the visual domain gap. Cycle-consistent …
Proposed method
First, we provide the problem setting. Given a source domain dataset $\mathcal{D}_s = \{(x_s, y_s)\}$ of image-label pairs and a target domain dataset $\mathcal{D}_t = \{x_t\}$ of unlabeled images, we aim to transfer semantic knowledge from $\mathcal{D}_s$ to $\mathcal{D}_t$ at the pixel level. Our method for learning intra-domain style-invariant representation is presented in Section 3.1. The I2I translation model that generates the style-diversified images used in that method is described in Section 3.2. …
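For concreteness, the overall training objective sketched above can be written in the following form; the MSE consistency term, the weight $\lambda$, and the EMA rate $\alpha$ are notational assumptions based on the description in the introduction, not the paper's exact formulation. Here $x_s'$ and $x_t'$ denote style-diversified versions of $x_s$ and $x_t$, and $f_\theta$, $f_{\theta'}$ denote the student and teacher models.

```latex
% Supervised loss on style-diversified source images plus a
% teacher-student consistency loss on target images.
\mathcal{L}(\theta)
  = \mathbb{E}_{(x_s, y_s) \sim \mathcal{D}_s}
      \big[ \mathrm{CE}\big( f_\theta(x_s'),\, y_s \big) \big]
  + \lambda \, \mathbb{E}_{x_t \sim \mathcal{D}_t}
      \big\| f_\theta(x_t') - f_{\theta'}(x_t) \big\|^2 ,
\qquad
\theta' \leftarrow \alpha \theta' + (1 - \alpha)\, \theta .
```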
Experiments
We conducted experiments on two benchmarks, GTA5-to-Cityscapes and SYNTHIA-to-Cityscapes, both of which are synthetic-to-real adaptations. The datasets are described in Section 4.1. The main results on the benchmarks are presented and compared with those of state-of-the-art methods in Section 4.2. Finally, extensive supplementary experiments that further analyze and validate the effectiveness of our method are reported in Section 4.3.
Conclusion
In this paper, we have proposed a novel concept of learning intra-domain style-invariant representation for UDA of semantic segmentation and constructed a method based on it. Learning a representation invariant to diversified intra-domain styles contributes to generalization in the target domain. To realize this, we first trained a semantic-aware multimodal I2I translation model to obtain images with diversified intra-domain styles and consistent semantic contents.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This study was partly supported by JSPS KAKENHI Grant Number JP21H03456. This study was conducted on the Data Science Computing System of Education and Research Center for Mathematical and Data Science, Hokkaido University.
References (47)
- et al., Simplified unsupervised image translation for semantic segmentation adaptation, Pattern Recognit. (2020)
- et al., Generative attention adversarial classification network for unsupervised domain adaptation, Pattern Recognit. (2020)
- et al., Deep conditional adaptation networks and label correlation transfer for unsupervised domain adaptation, Pattern Recognit. (2020)
- et al., Exploring uncertainty in pseudo-label guided unsupervised domain adaptation, Pattern Recognit. (2019)
- et al., Multi-level adversarial network for domain adaptive semantic segmentation, Pattern Recognit. (2022)
- et al., Scale variance minimization for unsupervised domain adaptation in image segmentation, Pattern Recognit. (2021)
- S.R. Richter et al., Playing for data: ground truth from computer games, Proceedings of the European Conference on Computer Vision (2016)
- G. Ros et al., The SYNTHIA dataset: a large collection of synthetic images for semantic segmentation of urban scenes, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
- J. Hoffman et al., CyCADA: cycle-consistent adversarial domain adaptation, Proceedings of the International Conference on Machine Learning (2018)
- A. Dundar et al., Domain stylization: a fast covariance matching framework towards domain adaptation, IEEE Trans. Pattern Anal. Mach. Intell. (2021)
- Y.-H. Tsai et al., Learning to adapt structured output space for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
- Z. Zheng et al., Unsupervised scene adaptation with memory regularization in vivo, Proceedings of the International Joint Conference on Artificial Intelligence (2020)
- Z. Zheng et al., Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation, Int. J. Comput. Vis. (2021)
- S. Laine et al., Temporal ensembling for semi-supervised learning, arXiv preprint arXiv:1610.02242 (2016)
- A. Tarvainen et al., Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results, Advances in Neural Information Processing Systems (2017)
- J. Choi et al., Self-ensembling with GAN-based data augmentation for domain adaptation in semantic segmentation, Proceedings of the IEEE International Conference on Computer Vision (2019)
- X. Huang et al., Multimodal unsupervised image-to-image translation, Proceedings of the European Conference on Computer Vision (2018)
- J.-Y. Zhu et al., Unpaired image-to-image translation using cycle-consistent adversarial networks, Proceedings of the IEEE International Conference on Computer Vision (2017)
- Y. Li et al., Bidirectional learning for domain adaptation of semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)
- I. Goodfellow et al., Generative adversarial nets, Advances in Neural Information Processing Systems (2014)
- Y. Zhang et al., Fully convolutional adaptation networks for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
- Z. Wu et al., DCAN: dual channel-wise alignment networks for unsupervised scene adaptation, Proceedings of the European Conference on Computer Vision (2018)
- L.A. Gatys et al., Image style transfer using convolutional neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
Zongyao Li received the B.S. degree in flight vehicle design and engineering from Zhejiang University, China, in 2016 and the M.S. degree from the Graduate School of Information Science and Technology, Hokkaido University, Japan, in 2020. He is currently pursuing a Ph.D. degree at the Graduate School of Information Science and Technology, Hokkaido University.
Ren Togo received the B.S. degree in health sciences from Hokkaido University, Japan, in 2015 and the M.S. and Ph.D. degrees from the Graduate School of Information Science and Technology, Hokkaido University, in 2017 and 2019, respectively. He is currently a Specially Appointed Assistant Professor with the Faculty of Information Science and Technology, Hokkaido University.
Takahiro Ogawa received the B.S., M.S., and Ph.D. degrees in electronics and information engineering from Hokkaido University, Japan, in 2003, 2005, and 2007, respectively. He joined the Graduate School of Information Science and Technology, Hokkaido University, in 2008. He is currently an Associate Professor with the Faculty of Information Science and Technology, Hokkaido University.
Miki Haseyama received the B.S., M.S., and Ph.D. degrees in electronics from Hokkaido University, Japan, in 1986, 1988, and 1993, respectively. She joined the Graduate School of Information Science and Technology, Hokkaido University, in 1994. She is currently a Professor with the Faculty of Information Science and Technology, Hokkaido University. She is also a Vice-President of Hokkaido University.