DOI: 10.1145/3581783.3611703

Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching

Published: 27 October 2023

Abstract

Image-text matching, as a fundamental cross-modal task, bridges vision and language. The key challenge lies in accurately learning the semantic similarity of these two heterogeneous modalities. To determine the semantic similarity between visual and textual features, the existing paradigm typically first maps them into a d-dimensional shared representation space, and then aggregates the correspondences of the cross-modal features independently along each dimension, e.g., via the inner product. In this paper, however, we are motivated by the finding that dimensions are not mutually independent: there are intrinsic dependencies among dimensions that jointly represent latent semantics. Ignoring this intrinsic information can lead to suboptimal aggregation of semantic similarity, impairing cross-modal matching. To address this issue, we propose a novel cross-dimensional semantic dependency-aware model (called X-Dim), which explicitly and adaptively mines the semantic dependencies between dimensions in the shared space, enabling dimensions with joint dependencies to be enhanced and exploited. X-Dim (1) designs a generalized framework to learn the semantic dependency degrees between dimensions, and (2) devises adaptive sparse probabilistic learning that lets the model autonomously capture precise dependencies. Theoretical analysis and extensive experiments demonstrate the superiority of X-Dim over state-of-the-art methods, with 5.9%-7.3% rSum improvements on the Flickr30K and MS-COCO benchmarks.
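To make the dimension-aggregation contrast concrete, here is a minimal sketch (not the authors' released code) comparing the standard independent inner-product similarity with a dependency-weighted aggregation. The matrix W, which stands in for X-Dim's learned dimension-dependency degrees, and all variable names are illustrative assumptions.

```python
# Minimal sketch: independent vs. dependency-aware similarity aggregation.
import torch

d = 256             # shared embedding dimensionality (illustrative)
v = torch.randn(d)  # visual feature mapped into the shared space
t = torch.randn(d)  # textual feature mapped into the shared space

# Existing paradigm: each dimension's correspondence contributes
# independently, i.e., sim = sum_i v_i * t_i (the inner product).
sim_independent = torch.dot(v, t)

# Dependency-aware aggregation: every pair of dimensions (i, j)
# contributes according to a dependency degree W[i, j], so dimensions
# that jointly represent a latent semantic can reinforce each other:
#   sim = sum_{i,j} v_i * W[i, j] * t_j
# W is random here; X-Dim instead learns sparse, precise dependencies
# via its adaptive sparse probabilistic learning, which this sketch
# does not reproduce.
W = torch.softmax(torch.randn(d, d), dim=-1)
sim_dependency_aware = v @ W @ t

print(float(sim_independent), float(sim_dependency_aware))
```

Note that when W is the identity matrix, the dependency-aware form reduces to the plain inner product, consistent with the abstract's description of X-Dim as a generalized framework over the existing paradigm.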


Cited By

  • (2025) Generating counterfactual negative samples for image-text matching. Information Processing & Management 62(3), Article 103990. DOI: 10.1016/j.ipm.2024.103990. Online publication date: May 2025.
  • (2024) Bridging the Modality Gap: Dimension Information Alignment and Sparse Spatial Constraint for Image-Text Matching. In Proceedings of the 32nd ACM International Conference on Multimedia, 5074-5082. DOI: 10.1145/3664647.3681424. Online publication date: 28 October 2024.
  • (2024) Semantic similarity on multimodal data: A comprehensive survey with applications. Journal of King Saud University - Computer and Information Sciences, Article 102263. DOI: 10.1016/j.jksuci.2024.102263. Online publication date: December 2024.

Index Terms

  1. Unlocking the Power of Cross-Dimensional Semantic Dependency for Image-Text Matching

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023
    9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783

    Publisher

    Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. image-text matching
    2. learning dimensional semantic dependency

    Qualifiers

    • Research-article

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%

    Article Metrics

    • Downloads (last 12 months): 143
    • Downloads (last 6 weeks): 14
    Reflects downloads up to 05 Mar 2025
