DOI: 10.1145/3503161.3548058

Image-Text Matching with Fine-Grained Relational Dependency and Bidirectional Attention-Based Generative Networks

Published: 10 October 2022

Abstract

Most existing cross-modal retrieval methods consider only global or local semantic embeddings and fail to model the fine-grained dependencies between objects. They also usually ignore the fact that the mutual transformation between modalities can itself facilitate modality embedding. To address these problems, we propose BiKA (Bidirectional Knowledge-assisted embedding and Attention-based generation). The model uses a bidirectional graph convolutional network to establish dependencies between objects and a bidirectional attention-based generative network to realize the mutual transformation between modalities. Specifically, a knowledge graph is used for local matching to constrain the local expression of the modalities, while the generative network is used for mutual transformation to constrain their global expression. We also propose a new position relation embedding network to encode positional relationships between objects. Experiments on two public datasets show that our method substantially outperforms many state-of-the-art models.
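The abstract names two architectural ingredients: a graph convolutional network that models dependencies between detected objects, and a position relation embedding over object pairs. As a minimal illustrative sketch (not the authors' implementation, which is not reproduced on this page), the following PyTorch snippet shows one relation-GCN step over region features together with a common pairwise box-geometry feature; all shapes, the soft-adjacency construction, and the geometry convention are assumptions rather than details taken from the paper.

```python
# Hedged sketch, not the authors' code: one relation-GCN step over detected
# region features plus a pairwise box-geometry feature. The shapes (36 regions,
# 2048-D features) and the learned soft-adjacency design are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationGCNLayer(nn.Module):
    """One graph-convolution layer over N region features (N x D)."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)   # projects nodes for affinity scores
        self.key = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)  # transforms aggregated neighbors

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # Pairwise affinities act as a learned soft adjacency matrix (N x N).
        adj = torch.softmax(self.query(nodes) @ self.key(nodes).T, dim=-1)
        # Aggregate neighbor features, then apply a residual update.
        return F.relu(nodes + self.update(adj @ nodes))

def box_relation_features(boxes: torch.Tensor) -> torch.Tensor:
    """Pairwise relative geometry (N x N x 4) from boxes given as
    (cx, cy, w, h). Normalized offsets and log size ratios follow a
    common convention; this is an assumption, not the paper's definition."""
    cx, cy, w, h = boxes.unbind(-1)
    dx = (cx[None, :] - cx[:, None]) / w[:, None]   # normalized x offset
    dy = (cy[None, :] - cy[:, None]) / h[:, None]   # normalized y offset
    dw = torch.log(w[None, :] / w[:, None])         # log width ratio
    dh = torch.log(h[None, :] / h[:, None])         # log height ratio
    return torch.stack([dx, dy, dw, dh], dim=-1)

# Usage: 36 region features of dimension 2048, as a bottom-up-attention
# style detector would produce (also an assumption).
regions = torch.randn(36, 2048)
boxes = torch.rand(36, 4) + 0.1  # (cx, cy, w, h) with strictly positive w, h
print(RelationGCNLayer(2048)(regions).shape)   # torch.Size([36, 2048])
print(box_relation_features(boxes).shape)      # torch.Size([36, 36, 4])
```

The residual update keeps each region's own feature while mixing in its neighbors, which is the usual way such relation layers are stacked; the paper's bidirectional variant and its knowledge-graph constraints would be built on top of a step like this.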

Supplementary Material

MP4 File (MM22-fp1289.mp4)
Presentation video




        Published In

        MM '22: Proceedings of the 30th ACM International Conference on Multimedia
        October 2022
        7537 pages
        ISBN:9781450392037
        DOI:10.1145/3503161

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. attentional generative network
        2. cross-attention
        3. cross-modal retrieval
        4. graph convolutional network
        5. knowledge embedding

        Qualifiers

        • Research-article

        Conference

        MM '22

        Acceptance Rates

        Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


        Cited By

        • (2025) GADNet: Improving image–text matching via graph-based aggregation and disentanglement. Pattern Recognition, Vol. 157, 110900. DOI: 10.1016/j.patcog.2024.110900. Online publication date: Jan-2025.
        • (2025) Efficient Parameter-free Adaptive Hashing for Large-Scale Cross-Modal Retrieval. International Journal of Approximate Reasoning, 109383. DOI: 10.1016/j.ijar.2025.109383. Online publication date: Feb-2025.
        • (2024) Distilled Cross-Combination Transformer for Image Captioning with Dual Refined Visual Features. Proceedings of the 32nd ACM International Conference on Multimedia, 4465-4474. DOI: 10.1145/3664647.3681161. Online publication date: 28-Oct-2024.
        • (2024) Team HUGE: Image-Text Matching via Hierarchical and Unified Graph Enhancing. Proceedings of the 2024 International Conference on Multimedia Retrieval, 704-712. DOI: 10.1145/3652583.3658001. Online publication date: 30-May-2024.
        • (2024) Mining Similarity Relationships for Unsupervised Cross-Modal Hashing. 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME57554.2024.10687927. Online publication date: 15-Jul-2024.
        • (2024) Robust VQA via Internal and External Interaction of Modal Information and Question Transformation. 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME57554.2024.10687903. Online publication date: 15-Jul-2024.
        • (2024) A Novel Approach for Image-Text Matching Cross-Modal Space Learning. 2024 14th International Conference on Computer and Knowledge Engineering (ICCKE), 93-98. DOI: 10.1109/ICCKE65377.2024.10874683. Online publication date: 19-Nov-2024.
        • (2024) Robust visual question answering via polarity enhancement and contrast. Neural Networks, Vol. 179, 106560. DOI: 10.1016/j.neunet.2024.106560. Online publication date: Nov-2024.
        • (2024) Large-Scale Cross-Modal Hashing with Unified Learning and Multi-Object Regional Correlation Reasoning. Neural Networks, Vol. 171, 276-292. DOI: 10.1016/j.neunet.2023.12.018. Online publication date: 17-Apr-2024.
        • (2024) Enhancement, integration, expansion. Neural Networks, Vol. 169, 532-541. DOI: 10.1016/j.neunet.2023.11.003. Online publication date: 4-Mar-2024.
