
Learning Fragment Self-Attention Embeddings for Image-Text Matching

Published: 15 October 2019

Abstract

In the image-text matching task, the key to good matching quality is capturing the rich contextual dependencies between fragments of an image and a text. Previous works, however, either simply aggregate the similarities of all possible region-word pairs, or apply multi-step cross attention to attend to image regions and words with each other as context, which requires exhaustive similarity computation over all region-word pairs. In this paper, we propose Self-Attention Embeddings (SAEM) to exploit fragment relations within images or texts via a self-attention mechanism, and to aggregate fragment information into visual and textual embeddings. Specifically, SAEM extracts salient image regions with bottom-up attention and takes WordPiece tokens as sentence fragments. Self-attention layers, each consisting of a multi-head self-attention sub-layer and a position-wise feed-forward network sub-layer, are built to model the subtle, fine-grained fragment relations in image and text respectively. The fragment self-attention mechanism can therefore discover fragment relations, identify the semantically salient regions in images and words in sentences, and capture their interactions more accurately. By simultaneously exploiting fine-grained fragment relations in both the visual and textual modalities, our method produces more semantically consistent embeddings for representing images and texts, and achieves promising image-text matching accuracy with high efficiency on the Flickr30K and MSCOCO datasets.
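To make the pipeline above concrete, here is a minimal PyTorch sketch of one fragment self-attention layer (a multi-head self-attention sub-layer plus a position-wise feed-forward sub-layer) applied to image-region and WordPiece-token features, followed by aggregation into joint embeddings and cosine-similarity matching. The dimensions, the residual/LayerNorm wiring, the mean-pooling aggregation, and the names FragmentSelfAttention and aggregate are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FragmentSelfAttention(nn.Module):
    """One self-attention layer of the kind the abstract describes:
    a multi-head self-attention sub-layer followed by a position-wise
    feed-forward sub-layer. Residual connections, LayerNorm, and all
    hyperparameters here are illustrative assumptions."""
    def __init__(self, dim=1024, heads=8, ff_dim=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, frags):
        # frags: (batch, n_fragments, dim) -- bottom-up-attention region
        # features for images, WordPiece token features for sentences,
        # assumed already projected into a shared dim-d space.
        attended, _ = self.attn(frags, frags, frags)
        frags = self.norm1(frags + attended)        # self-attention sub-layer
        return self.norm2(frags + self.ff(frags))   # feed-forward sub-layer

def aggregate(frags):
    # Pool relation-aware fragments into one L2-normalized embedding;
    # mean pooling is an illustrative choice of aggregation.
    return F.normalize(frags.mean(dim=1), dim=-1)

# Toy usage: a batch of 4 images with 36 detected regions each, matched
# against 4 sentences with 12 WordPiece tokens each.
img_encoder, txt_encoder = FragmentSelfAttention(), FragmentSelfAttention()
img_emb = aggregate(img_encoder(torch.randn(4, 36, 1024)))
txt_emb = aggregate(txt_encoder(torch.randn(4, 12, 1024)))
similarity = img_emb @ txt_emb.t()   # (4, 4) cosine similarity matrix

Because each modality attends only to its own fragments, image and text embeddings can be precomputed independently and compared with a single matrix product, which is the source of the efficiency advantage over exhaustive cross attention. In practice such embeddings would be trained with a ranking objective (e.g., a triplet loss) so that matched image-text pairs score higher than mismatched ones.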




Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. fragment embeddings
  2. image-text matching
  3. self-attention

Qualifiers

  • Research-article

Conference

MM '19

Acceptance Rates

MM '19 Paper Acceptance Rate: 252 of 936 submissions, 27%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%


Article Metrics

  • Downloads (Last 12 months): 107
  • Downloads (Last 6 weeks): 7
Reflects downloads up to 28 Feb 2025

Cited By
  • (2025) LuoJiaHOG: A hierarchy oriented geo-aware image caption dataset for remote sensing image–text retrieval. ISPRS Journal of Photogrammetry and Remote Sensing, 222, 130-151. DOI: 10.1016/j.isprsjprs.2025.02.009. Online publication date: Apr-2025.
  • (2025) Progressive semantic aggregation and structured cognitive enhancement for image–text matching. Expert Systems with Applications, 274, 126943. DOI: 10.1016/j.eswa.2025.126943. Online publication date: May-2025.
  • (2025) Flickr30K-CFQ: A Compact and Fragmented Query Dataset for Text-image Retrieval. Database Systems for Advanced Applications, 419-434. DOI: 10.1007/978-981-97-5555-4_30. Online publication date: 12-Jan-2025.
  • (2024) A Precise Framework for Rice Leaf Disease Image–Text Retrieval Using FHTW-Net. Plant Phenomics, 6, 0168. DOI: 10.34133/plantphenomics.0168. Online publication date: 2024.
  • (2024) High-Accuracy Tomato Leaf Disease Image-Text Retrieval Method Utilizing LAFANet. Plants, 13(9), 1176. DOI: 10.3390/plants13091176. Online publication date: 23-Apr-2024.
  • (2024) Learning hierarchical embedding space for image-text matching. Intelligent Data Analysis, 28(3), 647-665. DOI: 10.3233/IDA-230214. Online publication date: 28-May-2024.
  • (2024) CrossFormer: Cross-Modal Representation Learning via Heterogeneous Graph Transformer. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(12), 1-21. DOI: 10.1145/3688801. Online publication date: 20-Sep-2024.
  • (2024) Dynamic Soft Labeling for Visual Semantic Embedding. Proceedings of the 2024 International Conference on Multimedia Retrieval, 220-228. DOI: 10.1145/3652583.3658068. Online publication date: 30-May-2024.
  • (2024) FELGA: Unsupervised Fragment Embedding for Fine-Grained Cross-Modal Association. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 5623-5633. DOI: 10.1109/WACV57701.2024.00554. Online publication date: 3-Jan-2024.
  • (2024) A Mutually Textual and Visual Refinement Network for Image-Text Matching. IEEE Transactions on Multimedia, 26, 7555-7566. DOI: 10.1109/TMM.2024.3369968. Online publication date: 26-Feb-2024.
