DOI: 10.1145/3581783.3611879
research-article

Unsupervised Domain Adaptation for Referring Semantic Segmentation

Published: 27 October 2023

Abstract

In this paper, we study the task of referring semantic segmentation in a highly practical setting, in which labeled visual data with corresponding text descriptions are available in the source domain, but only unlabeled visual data (without text descriptions) are available in the target domain. The task poses three main difficulties: (1) how to obtain proper queries for the target domain; (2) how to adapt to visual-text joint distribution shifts; and (3) how to maintain the original segmentation performance. To address them, we propose a cycle-consistent vision-language matching network that narrows the domain gap and eases adaptation. The model has significant practical value because it can generalize to new data sources without requiring corresponding text annotations. First, a pseudo-text selector is devised to handle the missing modality; it uses the pre-trained CLIP model to measure the gap between query features of the source and visual features of the target. Next, a cross-domain segmentation predictor is adopted, which encourages the joint representations to be domain-invariant and minimizes the discrepancy between the two domains. Then, we present a cycle-consistent query matcher to learn discriminative features by reconstructing visual features from masks. Instead of performing a textual comparison, we match the visual features to the pseudo queries. Extensive experiments show the effectiveness of our method.
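
The pseudo-text selection step can be pictured with a small sketch. The abstract does not say how the gap between source query features and target visual features is scored, so the code below assumes cosine similarity with a top-1 pick. It works on precomputed feature vectors given as plain Python lists, standing in for embeddings from a shared vision-language space such as CLIP's. The names `cosine_similarity` and `select_pseudo_queries` are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_pseudo_queries(
    source_query_feats: Dict[str, Sequence[float]],
    target_visual_feats: Sequence[Sequence[float]],
) -> List[str]:
    """For each target visual feature, pick the source query text whose
    feature is most similar (assumed criterion: top-1 cosine similarity)."""
    if not source_query_feats:
        raise ValueError("need at least one source query")
    selected = []
    for visual in target_visual_feats:
        best_text = max(
            source_query_feats,
            key=lambda text: cosine_similarity(source_query_feats[text], visual),
        )
        selected.append(best_text)
    return selected


# Toy 3-d features; real embeddings would come from a pre-trained encoder.
source_queries = {
    "man in red jumping": [0.9, 0.1, 0.0],
    "dog running on the left": [0.0, 0.8, 0.6],
}
target_clips = [[0.1, 0.7, 0.7], [1.0, 0.0, 0.1]]
pseudo_queries = select_pseudo_queries(source_queries, target_clips)
print(pseudo_queries)
```

In this toy setup each unlabeled target sample borrows the nearest source query as its pseudo query. The selection rule used in the paper may differ from this simplification, for example by keeping several candidates or applying a similarity threshold.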

Supplemental Material

MP4 File
We present our paper "Unsupervised Domain Adaptation for Referring Semantic Segmentation" in this video. In this paper, we study the task of referring semantic segmentation in a highly practical setting, in which labeled visual data with corresponding text descriptions are available in the source domain, but only unlabeled visual data (without text descriptions) are available in the target domain. We propose a cycle-consistent vision-language matching network (CVMN) that narrows the domain gap and eases adaptation. First, a pseudo-text selector is devised to handle the missing modality. Next, a cross-domain segmentation predictor is adopted, which encourages the joint representations to be domain-invariant and minimizes the discrepancy between the two domains. Then, we present a cycle-consistent query matcher to learn discriminative features by reconstructing visual features from masks. Extensive experiments show the effectiveness of our method.

Cited By

  • (2024) Pseudo-RIS: Distinctive Pseudo-Supervision Generation for Referring Image Segmentation. Computer Vision – ECCV 2024, pp. 18-36. DOI: 10.1007/978-3-031-73113-6_2. Online publication date: 21-Nov-2024
  • (2024) Referring Atomic Video Action Recognition. Computer Vision – ECCV 2024, pp. 166-185. DOI: 10.1007/978-3-031-72655-2_10. Online publication date: 6-Dec-2024

Information & Contributors

Information

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. domain adaptation
  2. multi-modal learning
  3. referring semantic segmentation
  4. unsupervised learning

Qualifiers

  • Research-article

Conference

MM '23
MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 142
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 05 Mar 2025
