
Mutually-Guided Hierarchical Multi-Modal Feature Learning for Referring Image Segmentation

Published: 25 November 2024

Abstract

Referring image segmentation aims to locate and segment the target region specified by a textual expression. The primary challenge is to understand the semantics of both the visual and textual modalities and to align and match them. Prior works address this challenge by using separately pretrained unimodal models to extract global visual and textual features and then performing a straightforward fusion to establish cross-modal semantic associations. However, these methods concentrate solely on global semantics, disregarding the hierarchical semantics of the expression and the image, and struggle with complex, open real-world scenes, thus failing to capture critical cross-modal information. To address these limitations, this article introduces a mutually guided hierarchical multi-modal feature learning scheme. Guided by the global visual feature, the model mines hierarchical text features from different stages of the text encoder; simultaneously, guided by the global textual feature, it aggregates multi-scale visual features. This mutually guided hierarchical feature learning effectively handles the semantic inaccuracy caused by free-form text and the naturally occurring scale variations of objects. Furthermore, a Segment Detail Refinement (SDR) module is designed to enhance the model's spatial detail awareness through attention mapping between low-level visual features and cross-modal features. Extensive experiments on three widely used referring image segmentation datasets demonstrate that the proposed method accurately locates and segments the referred objects.
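The mutual-guidance idea in the abstract can be sketched in a few lines: the global feature of one modality scores the hierarchical features of the other, and the scores weight a pooled representation. This is a minimal illustrative sketch only — the function names, the dot-product scoring, and the mean-pooled "global" features are assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def guided_aggregate(hier_feats, global_guide):
    """Pool hierarchical features of one modality under the guidance
    of the other modality's global feature.

    hier_feats:   (num_levels, dim) — features from different encoder
                  stages (text) or spatial scales (vision)
    global_guide: (dim,) — global feature from the other modality
    Returns one fused (dim,) feature.
    """
    # Scaled dot-product relevance of each level to the guide
    scores = hier_feats @ global_guide / np.sqrt(hier_feats.shape[-1])
    weights = softmax(scores)          # (num_levels,), sums to 1
    return weights @ hier_feats        # weighted sum over levels

rng = np.random.default_rng(0)
dim = 8
text_stages = rng.normal(size=(4, dim))  # hierarchical text features
vis_scales = rng.normal(size=(3, dim))   # multi-scale visual features
g_text = text_stages.mean(axis=0)        # stand-in global text feature
g_vis = vis_scales.mean(axis=0)          # stand-in global visual feature

# Mutual guidance: each modality's hierarchy is pooled under the
# other modality's global feature
fused_text = guided_aggregate(text_stages, g_vis)
fused_vis = guided_aggregate(vis_scales, g_text)
print(fused_text.shape, fused_vis.shape)  # (8,) (8,)
```

In the actual model the fusion would operate on per-token and per-pixel feature maps with learned projections; the sketch only shows the guidance direction (vision → text hierarchy, text → visual scales).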


Cited By

  • Noise-Resistance Learning via Multi-Granularity Consistency for Unsupervised Domain Adaptive Person Re-Identification. ACM Transactions on Multimedia Computing, Communications, and Applications (2024). DOI: 10.1145/3702328
  • Correlation-aware Cross-modal Attention Network for Fashion Compatibility Modeling in UGC Systems. ACM Transactions on Multimedia Computing, Communications, and Applications (2024). DOI: 10.1145/3698772
  • Efficiently Gluing Pre-Trained Language and Vision Models for Image Captioning. ACM Transactions on Intelligent Systems and Technology 15, 6 (2024), 1–16. DOI: 10.1145/3682067
  • Dual-path Collaborative Generation Network for Emotional Video Captioning. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 496–505. DOI: 10.1145/3664647.3681603
  • Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (2024), 229–239. DOI: 10.1145/3626772.3657727


Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 20, Issue 12
December 2024, 721 pages
EISSN: 1551-6865
DOI: 10.1145/3618076

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 25 November 2024
    Online AM: 05 October 2024
    Accepted: 25 September 2024
    Revised: 29 June 2024
    Received: 28 February 2024
    Published in TOMM Volume 20, Issue 12


    Author Tags

    1. Referring Image Object Segmentation
    2. Hierarchical Feature Representation
    3. Segment Detail Refinement

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Key Research Program of Hubei


