DOI: 10.1145/3503161.3548058

Image-Text Matching with Fine-Grained Relational Dependency and Bidirectional Attention-Based Generative Networks

Published: 10 October 2022

Abstract

Most existing cross-modal retrieval methods consider only global or local semantic embeddings and fail to model the fine-grained dependencies between objects. They also usually ignore the fact that the mutual transformation between modalities can itself facilitate modality embedding. To address these problems, we propose BiKA (Bidirectional Knowledge-assisted embedding and Attention-based generation). The model uses a bidirectional graph convolutional network to establish dependencies between objects and a bidirectional attention-based generative network to realize the mutual transformation between modalities. Specifically, a knowledge graph is used for local matching to constrain the local expression of the modalities, while the generative network is used for mutual transformation to constrain their global expression. We also propose a new position relation embedding network to encode positional relationships between objects. Experiments on two public datasets show that our method substantially outperforms many state-of-the-art models.
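The abstract names two architectural ingredients: a graph convolutional network that models dependencies between detected objects, and a position relation embedding over object pairs. As a minimal illustrative sketch (not the authors' implementation, which is not reproduced on this page), the following PyTorch snippet shows one relation-GCN step over region features together with a common pairwise box-geometry feature; all shapes, the soft-adjacency construction, and the geometry convention are assumptions rather than details taken from the paper.

```python
# Hedged sketch, not the authors' code: one relation-GCN step over detected
# region features plus a pairwise box-geometry feature. The shapes (36 regions,
# 2048-D features) and the learned soft-adjacency design are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGCNLayer(nn.Module):
    """One graph-convolution layer over N region features (N x D)."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)   # projects nodes for affinity scores
        self.key = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)  # transforms aggregated neighbors

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # Pairwise affinities act as a learned soft adjacency matrix (N x N).
        adj = torch.softmax(self.query(nodes) @ self.key(nodes).T, dim=-1)
        # Aggregate neighbor features, then apply a residual update.
        return F.relu(nodes + self.update(adj @ nodes))

def box_relation_features(boxes: torch.Tensor) -> torch.Tensor:
    """Pairwise relative geometry (N x N x 4) from boxes given as
    (cx, cy, w, h). Normalized offsets and log size ratios follow a
    common convention; this is an assumption, not the paper's definition."""
    cx, cy, w, h = boxes.unbind(-1)
    dx = (cx[None, :] - cx[:, None]) / w[:, None]   # normalized x offset
    dy = (cy[None, :] - cy[:, None]) / h[:, None]   # normalized y offset
    dw = torch.log(w[None, :] / w[:, None])         # log width ratio
    dh = torch.log(h[None, :] / h[:, None])         # log height ratio
    return torch.stack([dx, dy, dw, dh], dim=-1)

# Usage: 36 region features of dimension 2048, as a bottom-up-attention
# style detector would produce (also an assumption).
regions = torch.randn(36, 2048)
boxes = torch.rand(36, 4) + 0.1  # (cx, cy, w, h) with strictly positive w, h
print(RelationGCNLayer(2048)(regions).shape)   # torch.Size([36, 2048])
print(box_relation_features(boxes).shape)      # torch.Size([36, 36, 4])
```

The residual update keeps each region's own feature while mixing in its neighbors, which is the usual way such relation layers are stacked; the paper's bidirectional variant and its knowledge-graph constraints would be built on top of a step like this.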

Supplementary Material

MP4 File (MM22-fp1289.mp4)
Presentation video




        Published In

        MM '22: Proceedings of the 30th ACM International Conference on Multimedia
        October 2022
        7537 pages
        ISBN:9781450392037
        DOI:10.1145/3503161

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. attentional generative network
        2. cross-attention
        3. cross-modal retrieval
        4. graph convolutional network
        5. knowledge embedding

        Qualifiers

        • Research-article

        Conference

        MM '22

        Acceptance Rates

        Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


        Cited By

        • (2025) GADNet: Improving image–text matching via graph-based aggregation and disentanglement. Pattern Recognition, Vol. 157, 110900. DOI: 10.1016/j.patcog.2024.110900. Online publication date: Jan-2025.
        • (2025) Efficient Parameter-free Adaptive Hashing for Large-Scale Cross-Modal Retrieval. International Journal of Approximate Reasoning, 109383. DOI: 10.1016/j.ijar.2025.109383. Online publication date: Feb-2025.
        • (2024) Distilled Cross-Combination Transformer for Image Captioning with Dual Refined Visual Features. Proceedings of the 32nd ACM International Conference on Multimedia, 4465-4474. DOI: 10.1145/3664647.3681161. Online publication date: 28-Oct-2024.
        • (2024) Team HUGE: Image-Text Matching via Hierarchical and Unified Graph Enhancing. Proceedings of the 2024 International Conference on Multimedia Retrieval, 704-712. DOI: 10.1145/3652583.3658001. Online publication date: 30-May-2024.
        • (2024) Mining Similarity Relationships for Unsupervised Cross-Modal Hashing. 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME57554.2024.10687927. Online publication date: 15-Jul-2024.
        • (2024) Robust VQA via Internal and External Interaction of Modal Information and Question Transformation. 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME57554.2024.10687903. Online publication date: 15-Jul-2024.
        • (2024) A Novel Approach for Image-Text Matching Cross-Modal Space Learning. 2024 14th International Conference on Computer and Knowledge Engineering (ICCKE), 93-98. DOI: 10.1109/ICCKE65377.2024.10874683. Online publication date: 19-Nov-2024.
        • (2024) Robust visual question answering via polarity enhancement and contrast. Neural Networks, Vol. 179, 106560. DOI: 10.1016/j.neunet.2024.106560. Online publication date: Nov-2024.
        • (2024) Large-Scale Cross-Modal Hashing with Unified Learning and Multi-Object Regional Correlation Reasoning. Neural Networks, Vol. 171, 276-292. DOI: 10.1016/j.neunet.2023.12.018. Online publication date: 17-Apr-2024.
        • (2024) Enhancement, integration, expansion. Neural Networks, Vol. 169, 532-541. DOI: 10.1016/j.neunet.2023.11.003. Online publication date: 4-Mar-2024.
