DOI: 10.1145/3581783.3611703

Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching

Published: 27 October 2023

Abstract

Image-text matching, as a fundamental cross-modal task, bridges vision and language. The key challenge lies in accurately learning the semantic similarity of these two heterogeneous modalities. To determine the semantic similarity between visual and textual features, the existing paradigm typically first maps them into a d-dimensional shared representation space, and then aggregates the correspondences of the cross-modal features independently along each dimension, e.g., via the inner product. In this paper, however, we are motivated by the finding that dimensions are not mutually independent: there are intrinsic dependencies among dimensions that jointly represent latent semantics. Ignoring this intrinsic information can lead to suboptimal aggregation of semantic similarity, impairing cross-modal matching. To address this issue, we propose a novel cross-dimensional semantic dependency-aware model (called X-Dim), which explicitly and adaptively mines the semantic dependencies between dimensions in the shared space, enabling dimensions with joint dependencies to be enhanced and exploited. X-Dim (1) designs a generalized framework to learn the semantic dependency degrees between dimensions, and (2) devises adaptive sparse probabilistic learning that lets the model autonomously capture precise dependencies. Theoretical analysis and extensive experiments demonstrate the superiority of X-Dim over state-of-the-art methods, with 5.9%-7.3% rSum improvements on the Flickr30K and MS-COCO benchmarks.
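To make the dimension-aggregation contrast concrete, here is a minimal sketch (not the authors' released code) comparing the standard independent inner-product similarity with a dependency-weighted aggregation. The matrix W, which stands in for X-Dim's learned dimension-dependency degrees, and all variable names are illustrative assumptions.

```python
# Minimal sketch: independent vs. dependency-aware similarity aggregation.
import torch

d = 256             # shared embedding dimensionality (illustrative)
v = torch.randn(d)  # visual feature mapped into the shared space
t = torch.randn(d)  # textual feature mapped into the shared space

# Existing paradigm: each dimension's correspondence contributes
# independently, i.e., sim = sum_i v_i * t_i (the inner product).
sim_independent = torch.dot(v, t)

# Dependency-aware aggregation: every pair of dimensions (i, j)
# contributes according to a dependency degree W[i, j], so dimensions
# that jointly represent a latent semantic can reinforce each other:
#   sim = sum_{i,j} v_i * W[i, j] * t_j
# W is random here; X-Dim instead learns sparse, precise dependencies
# via its adaptive sparse probabilistic learning, which this sketch
# does not reproduce.
W = torch.softmax(torch.randn(d, d), dim=-1)
sim_dependency_aware = v @ W @ t

print(float(sim_independent), float(sim_dependency_aware))
```

Note that when W is the identity matrix, the dependency-aware form reduces to the plain inner product, consistent with the abstract's description of X-Dim as a generalized framework over the existing paradigm.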


Cited By

  • (2025) Generating counterfactual negative samples for image-text matching. Information Processing & Management 62(3), Article 103990. DOI: 10.1016/j.ipm.2024.103990. Online publication date: May 2025.
  • (2024) Bridging the Modality Gap: Dimension Information Alignment and Sparse Spatial Constraint for Image-Text Matching. In Proceedings of the 32nd ACM International Conference on Multimedia, 5074-5082. DOI: 10.1145/3664647.3681424. Online publication date: 28 October 2024.
  • (2024) Semantic similarity on multimodal data: A comprehensive survey with applications. Journal of King Saud University - Computer and Information Sciences, Article 102263. DOI: 10.1016/j.jksuci.2024.102263. Online publication date: December 2024.

Index Terms

  1. Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. image-text matching
    2. learning dimensional semantic dependency

    Qualifiers

    • Research-article

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (last 12 months): 143
    • Downloads (last 6 weeks): 14
    Reflects downloads up to 05 Mar 2025
