Visual representations with texts domain generalization for semantic segmentation

Yue, Wanlin; Zhou, Zhiheng; Cao, Yinglie; Wu, Weikang

doi:10.1007/s10489-023-05125-y

Published: 09 November 2023

Volume 53, pages 30069–30079 (2023)

Applied Intelligence

Wanlin Yue¹, Zhiheng Zhou¹ (ORCID: orcid.org/0000-0003-4040-0175), Yinglie Cao² & Weikang Wu³

Abstract

At present, domain generalization for semantic segmentation with deep neural networks has made little progress. Most current methods fall into three groups: domain randomization, standardization, and whitening. We propose a novel approach to domain generalization for semantic segmentation: leveraging cross-modal information to supervise model training and improve the generalization ability of the network. We align visual features with textual features in a subspace and enhance the contrast between categories. This enables the network to learn rich semantic knowledge from text features and clearer category boundaries. Our experiments show that the method effectively improves the generalization ability of the network. To our knowledge, we are the first to exploit multi-modal information for domain-generalized semantic segmentation.
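The abstract gives no formulas, and the code is available only on request, so the following is not the authors' implementation. It is a minimal sketch of one common way to align visual features with class text features in a shared subspace while sharpening the contrast between categories: project both into the same space, compare them by cosine similarity, and apply a temperature-scaled cross-entropy. The function names, the linear projections `w_vis` and `w_txt`, the temperature value, and the toy embeddings are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (epsilon guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in v)) + 1e-12
    return [x / norm for x in v]

def project(matrix, v):
    """Apply a linear map into the shared subspace; matrix is (out_dim x in_dim)."""
    return [sum(w * x for w, x in zip(row, v)) for row in matrix]

def alignment_loss(pixel_feats, labels, text_embeds, w_vis, w_txt, temperature=0.1):
    """Mean cross-entropy over cosine-similarity logits.

    Each pixel feature is pulled toward the text embedding of its own class
    and pushed away from the text embeddings of the other classes.
    """
    text_z = [l2_normalize(project(w_txt, t)) for t in text_embeds]
    total = 0.0
    for feat, label in zip(pixel_feats, labels):
        z = l2_normalize(project(w_vis, feat))
        logits = [sum(a * b for a, b in zip(z, t)) / temperature for t in text_z]
        # Numerically stable log-sum-exp over the class logits.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum - logits[label]
    return total / len(pixel_feats)

# Toy data (hypothetical): two classes, 2-D features, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
text_embeds = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for class-name text embeddings
pixels = [[0.9, 0.1], [0.2, 0.8]]       # stand-ins for per-pixel visual features

aligned = alignment_loss(pixels, [0, 1], text_embeds, identity, identity)
swapped = alignment_loss(pixels, [1, 0], text_embeds, identity, identity)
print(aligned < swapped)
```

With correct labels the loss is small because each pixel feature already points toward its class's text embedding; swapping the labels makes it large. In a real pipeline the text embeddings would presumably come from a pretrained language or vision-language encoder, which this sketch does not model.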

Code Availability

The code is available from the corresponding author on request.


Acknowledgements

Supported by the National Key Research and Development Program of China (2022YFF0607001), Guangdong Basic and Applied Basic Research Foundation (2023A1515010993), Guangdong Provincial Key Laboratory of Human Digital Twin (2022B1212010004), Guangzhou City Science and Technology Research Projects (2023B01J0011), Jiangmen Science and Technology Research Projects (2021080200070009151).

Author information

Authors and Affiliations

School of Electronics and Information, South China University of Technology, Guangzhou, 510640, China
Wanlin Yue & Zhiheng Zhou
School of Electronic and Information Engineering, Guangzhou City University of Technology, Guangzhou, 510850, China
Yinglie Cao
The 54th Research Institute of China Electronics Technology Group Corporation, Shijiazhuang, 050050, China
Weikang Wu

Contributions

Wanlin Yue: Conceptualization, Methodology, Project administration, Software, Writing - Review & Editing, Investigation

Zhiheng Zhou: Writing - Review & Editing, Supervision, Project administration, Funding acquisition

Yinglie Cao: Formal analysis, Data Curation, Validation, Resources

Weikang Wu: Data Curation, Visualization, Supervision.

Corresponding author

Correspondence to Zhiheng Zhou.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical and informed consent for data used

The manuscript has not been submitted to multiple journals for simultaneous consideration. The submitted work is original. The authors declare that data in this article has not been falsified or tampered with.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Yue, W., Zhou, Z., Cao, Y. et al. Visual representations with texts domain generalization for semantic segmentation. Appl Intell 53, 30069–30079 (2023). https://doi.org/10.1007/s10489-023-05125-y

Accepted: 21 October 2023
Published: 09 November 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s10489-023-05125-y
