Abstract
Domain generalization for semantic segmentation with deep neural networks has made limited progress to date. Most current methods fall into three families: domain randomization, standardization, and whitening. We propose a novel approach to domain generalization for semantic segmentation that uses cross-modal information to supervise model training and improve the generalization ability of the network. We align visual features with textual features in a shared subspace and enhance the contrast between categories. This enables the network to learn rich semantic knowledge from text features and to form clearer category boundaries. Our experiments show that the method effectively improves the generalization ability of the network. To our knowledge, we are the first to exploit multi-modal information for domain-generalized semantic segmentation.
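The alignment idea in the abstract can be illustrated with a minimal sketch. It is not the authors' loss: it assumes a per-pixel visual feature, one text embedding per class, two hypothetical linear projections (`w_vis`, `w_txt`) into the subspace, and a temperature-scaled cross-entropy over cosine similarities, which pulls a pixel toward its own class text embedding and pushes it away from the others.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def project(matrix, v):
    """Apply a linear projection given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in matrix]


def alignment_loss(pixel_feat, text_embeds, label, w_vis, w_txt, temperature=0.1):
    """Cross-entropy over pixel-to-text cosine similarities (illustrative only).

    pixel_feat:  visual feature of one pixel
    text_embeds: one text embedding per class
    label:       index of the pixel's ground-truth class
    w_vis/w_txt: hypothetical projections into the shared subspace
    """
    z = l2_normalize(project(w_vis, pixel_feat))
    texts = [l2_normalize(project(w_txt, e)) for e in text_embeds]
    # Cosine similarity to every class, sharpened by the temperature.
    logits = [sum(a * b for a, b in zip(z, t)) / temperature for t in texts]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[label]


identity = [[1.0, 0.0], [0.0, 1.0]]
class_texts = [[1.0, 0.0], [0.0, 1.0]]  # toy embeddings for two classes
pixel = [1.0, 0.0]                      # feature lying on the class-0 direction

loss_correct = alignment_loss(pixel, class_texts, 0, identity, identity)
loss_wrong = alignment_loss(pixel, class_texts, 1, identity, identity)
```

In this toy setting the loss is close to zero when the pixel is labeled with the class whose text embedding it matches, and large otherwise. A lower temperature sharpens the separation between categories.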
Code Availability
The code is available from the corresponding author by request.
Acknowledgements
Supported by the National Key Research and Development Program of China (2022YFF0607001), Guangdong Basic and Applied Basic Research Foundation (2023A1515010993), Guangdong Provincial Key Laboratory of Human Digital Twin (2022B1212010004), Guangzhou City Science and Technology Research Projects (2023B01J0011), Jiangmen Science and Technology Research Projects (2021080200070009151).
Author information
Contributions
Wanlin Yue: Conceptualization, Methodology, Project administration, Software, Writing - Review & Editing, Investigation
Zhiheng Zhou: Writing - Review & Editing, Supervision, Project administration, Funding acquisition
Yinglie Cao: Formal analysis, Data Curation, Validation, Resources
Weikang Wu: Data Curation, Visualization, Supervision
Ethics declarations
Competing interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical and informed consent for data used
The manuscript has not been submitted to multiple journals for simultaneous consideration. The submitted work is original. The authors declare that data in this article has not been falsified or tampered with.
About this article
Cite this article
Yue, W., Zhou, Z., Cao, Y. et al. Visual representations with texts domain generalization for semantic segmentation. Appl Intell 53, 30069–30079 (2023). https://doi.org/10.1007/s10489-023-05125-y
DOI: https://doi.org/10.1007/s10489-023-05125-y