DOI: 10.1145/3404835.3463031
short-paper

Cross-Graph Attention Enhanced Multi-Modal Correlation Learning for Fine-Grained Image-Text Retrieval

Published: 11 July 2021

Abstract

Fine-grained image-text retrieval is a challenging but vital technology in the field of multimedia analysis. Existing methods mainly focus on learning a common embedding space for images (or patches) and sentences (or words), in which the similarity of their mapped features can be measured directly. Nevertheless, most existing image-text retrieval works rarely consider the shared semantic concepts that potentially correlate the heterogeneous modalities, which can enhance the discriminative power of the learned embedding space. To this end, we propose a Cross-Graph Attention model (CGAM) to explicitly learn the shared semantic concepts, which can be used to guide the feature learning process of each modality and promote common embedding learning. More specifically, we build a semantic-embedded graph for each modality and smooth the discrepancy between the two modalities via a cross-graph attention model to obtain shared semantic-enhanced features. Meanwhile, we reconstruct image and text features from the shared semantic concepts and the original embedding representations, and leverage a multi-head mechanism for similarity calculation. Accordingly, a semantic-enhanced cross-modal embedding between image and text is obtained that is discriminative enough to benefit fine-grained retrieval. Extensive experiments on benchmark datasets show performance improvements over state-of-the-art methods.
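
The abstract gives no equations, so the following is only a rough illustration of the two ideas it names: attention between the node sets of two graphs, and a multi-head similarity score. It is a minimal NumPy sketch, not the authors' CGAM implementation. Every specific in it is an assumption made for illustration: the function names, the toy feature sizes, scaled dot-product attention, the residual sum of original and attended features, mean pooling, and per-head cosine similarity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attend(queries, keys):
    """Hypothetical helper: each query node attends over the other graph's nodes.

    Returns the attention-weighted mix of `keys` for every query node,
    together with the attention weights (each row sums to 1).
    """
    scores = queries @ keys.T / np.sqrt(queries.shape[1])
    weights = softmax(scores, axis=1)
    return weights @ keys, weights


def multi_head_similarity(img_vec, txt_vec, num_heads):
    """Hypothetical helper: mean cosine similarity over equal-sized feature chunks."""
    img_heads = img_vec.reshape(num_heads, -1)
    txt_heads = txt_vec.reshape(num_heads, -1)
    dots = (img_heads * txt_heads).sum(axis=1)
    norms = np.linalg.norm(img_heads, axis=1) * np.linalg.norm(txt_heads, axis=1)
    return float((dots / (norms + 1e-8)).mean())


rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))  # 5 image-region nodes, 8-dim toy features
words = rng.normal(size=(3, 8))    # 3 word nodes, 8-dim toy features

# Each modality gathers context from the other modality's nodes.
region_context, region_to_word = cross_attend(regions, words)
word_context, word_to_region = cross_attend(words, regions)

# Assumed fusion: original features plus cross-graph context, then mean pooling.
img_vec = (regions + region_context).mean(axis=0)
txt_vec = (words + word_context).mean(axis=0)

score = multi_head_similarity(img_vec, txt_vec, num_heads=4)
print(f"image-text similarity: {score:.3f}")
```

In this sketch the score is an average of cosine similarities, so it always lies in [-1, 1]. A trained model would replace the random features with learned node embeddings and learned projection matrices.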

Supplementary Material

MP4 File (SIGIR2021-presentation.mp4)
Presentation video - short version

    Published In

    SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2021
    2998 pages
    ISBN:9781450380379
    DOI:10.1145/3404835
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cross-graph attention
    2. image-text retrieval
    3. multi-head mechanism
    4. shared semantic concept

    Qualifiers

    • Short-paper

    Funding Sources

    • National Science Foundation of China

    Conference

    SIGIR '21
    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 84
    • Downloads (Last 6 weeks): 7
    Reflects downloads up to 08 Mar 2025

    Citations

    Cited By

    • (2025) GADNet: Improving image–text matching via graph-based aggregation and disentanglement. Pattern Recognition, 157 (110900). DOI: 10.1016/j.patcog.2024.110900. Online publication date: Jan-2025
    • (2025) Integrated Global Semantics and Local Details for Image-Text Retrieval. Web and Big Data. APWeb-WAIM 2024 International Workshops (57-68). DOI: 10.1007/978-981-96-0055-7_5. Online publication date: 31-Jan-2025
    • (2024) UGNCL: Uncertainty-Guided Noisy Correspondence Learning for Efficient Cross-Modal Matching. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (852-861). DOI: 10.1145/3626772.3657806. Online publication date: 10-Jul-2024
    • (2024) Relation-Aggregated Cross-Graph Correlation Learning for Fine-Grained Image–Text Retrieval. IEEE Transactions on Neural Networks and Learning Systems, 35:2 (2194-2207). DOI: 10.1109/TNNLS.2022.3188569. Online publication date: Feb-2024
    • (2024) Feature First: Advancing Image-Text Retrieval Through Improved Visual Features. IEEE Transactions on Multimedia, 26 (3827-3841). DOI: 10.1109/TMM.2023.3316077. Online publication date: 1-Jan-2024
    • (2024) Cross-Lingual Cross-Modal Retrieval With Noise-Robust Fine-Tuning. IEEE Transactions on Knowledge and Data Engineering, 36:11 (5860-5873). DOI: 10.1109/TKDE.2024.3400060. Online publication date: 1-Nov-2024
    • (2024) Common-Memory Bridged Cross-Modal Adaptive Graph Embedding for Image-Text Retrieval. 2024 IEEE International Conference on Multimedia and Expo (ICME) (1-6). DOI: 10.1109/ICME57554.2024.10688103. Online publication date: 15-Jul-2024
    • (2024) Context‐aware relation enhancement and similarity reasoning for image‐text retrieval. IET Computer Vision. DOI: 10.1049/cvi2.12270. Online publication date: 30-Jan-2024
    • (2024) Identifying the reaction centers of molecule based on dual-view representation. Knowledge-Based Systems, 292:C. DOI: 10.1016/j.knosys.2024.111606. Online publication date: 23-May-2024
    • (2024) Multi-level knowledge-driven feature representation and triplet loss optimization network for image–text retrieval. Information Processing and Management: an International Journal, 61:1. DOI: 10.1016/j.ipm.2023.103575. Online publication date: 1-Feb-2024
