
Learning Fragment Self-Attention Embeddings for Image-Text Matching

Published: 15 October 2019

Abstract

In the image-text matching task, the key to good matching quality is capturing the rich contextual dependencies between fragments of an image and a text. Previous works, however, either simply aggregate the similarities of all possible region-word pairs, or apply multi-step cross attention to attend to image regions and words with each other as context, which requires exhaustive similarity computation over all region-word pairs. In this paper, we propose Self-Attention Embeddings (SAEM) to exploit fragment relations within images or texts via a self-attention mechanism, and to aggregate fragment information into visual and textual embeddings. Specifically, SAEM extracts salient image regions with bottom-up attention and takes WordPiece tokens as sentence fragments. Self-attention layers, each consisting of a multi-head self-attention sub-layer and a position-wise feed-forward network sub-layer, are built to model the subtle, fine-grained fragment relations in image and text respectively. The fragment self-attention mechanism can therefore discover fragment relations, identify the semantically salient regions in images and words in sentences, and capture their interactions more accurately. By simultaneously exploiting fine-grained fragment relations in both the visual and textual modalities, our method produces more semantically consistent embeddings for representing images and texts, and achieves promising image-text matching accuracy with high efficiency on the Flickr30K and MSCOCO datasets.
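To make the pipeline above concrete, here is a minimal PyTorch sketch of one fragment self-attention layer (a multi-head self-attention sub-layer plus a position-wise feed-forward sub-layer) applied to image-region and WordPiece-token features, followed by aggregation into joint embeddings and cosine-similarity matching. The dimensions, the residual/LayerNorm wiring, the mean-pooling aggregation, and the names FragmentSelfAttention and aggregate are illustrative assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FragmentSelfAttention(nn.Module):
    """One self-attention layer of the kind the abstract describes:
    a multi-head self-attention sub-layer followed by a position-wise
    feed-forward sub-layer. Residual connections, LayerNorm, and all
    hyperparameters here are illustrative assumptions."""
    def __init__(self, dim=1024, heads=8, ff_dim=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(dim, ff_dim), nn.ReLU(), nn.Linear(ff_dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, frags):
        # frags: (batch, n_fragments, dim) -- bottom-up-attention region
        # features for images, WordPiece token features for sentences,
        # assumed already projected into a shared dim-d space.
        attended, _ = self.attn(frags, frags, frags)
        frags = self.norm1(frags + attended)        # self-attention sub-layer
        return self.norm2(frags + self.ff(frags))   # feed-forward sub-layer

def aggregate(frags):
    # Pool relation-aware fragments into one L2-normalized embedding;
    # mean pooling is an illustrative choice of aggregation.
    return F.normalize(frags.mean(dim=1), dim=-1)

# Toy usage: a batch of 4 images with 36 detected regions each, matched
# against 4 sentences with 12 WordPiece tokens each.
img_encoder, txt_encoder = FragmentSelfAttention(), FragmentSelfAttention()
img_emb = aggregate(img_encoder(torch.randn(4, 36, 1024)))
txt_emb = aggregate(txt_encoder(torch.randn(4, 12, 1024)))
similarity = img_emb @ txt_emb.t()   # (4, 4) cosine similarity matrix

Because each modality attends only to its own fragments, image and text embeddings can be precomputed independently and compared with a single matrix product, which is the source of the efficiency advantage over exhaustive cross attention. In practice such embeddings would be trained with a ranking objective (e.g., a triplet loss) so that matched image-text pairs score higher than mismatched ones.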




Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. fragment embeddings
  2. image-text matching
  3. self-attention

Qualifiers

  • Research-article

Conference

MM '19

Acceptance Rates

MM '19 Paper Acceptance Rate: 252 of 936 submissions, 27%
Overall Acceptance Rate: 2,145 of 8,556 submissions, 25%


Article Metrics

  • Downloads (Last 12 months): 107
  • Downloads (Last 6 weeks): 7
Reflects downloads up to 28 Feb 2025

Cited By
  • (2025) LuoJiaHOG: A hierarchy oriented geo-aware image caption dataset for remote sensing image–text retrieval. ISPRS Journal of Photogrammetry and Remote Sensing, 222, 130-151. DOI: 10.1016/j.isprsjprs.2025.02.009. Online publication date: Apr-2025.
  • (2025) Progressive semantic aggregation and structured cognitive enhancement for image–text matching. Expert Systems with Applications, 274, 126943. DOI: 10.1016/j.eswa.2025.126943. Online publication date: May-2025.
  • (2025) Flickr30K-CFQ: A Compact and Fragmented Query Dataset for Text-image Retrieval. Database Systems for Advanced Applications, 419-434. DOI: 10.1007/978-981-97-5555-4_30. Online publication date: 12-Jan-2025.
  • (2024) A Precise Framework for Rice Leaf Disease Image–Text Retrieval Using FHTW-Net. Plant Phenomics, 6, 0168. DOI: 10.34133/plantphenomics.0168. Online publication date: 2024.
  • (2024) High-Accuracy Tomato Leaf Disease Image-Text Retrieval Method Utilizing LAFANet. Plants, 13(9), 1176. DOI: 10.3390/plants13091176. Online publication date: 23-Apr-2024.
  • (2024) Learning hierarchical embedding space for image-text matching. Intelligent Data Analysis, 28(3), 647-665. DOI: 10.3233/IDA-230214. Online publication date: 28-May-2024.
  • (2024) CrossFormer: Cross-Modal Representation Learning via Heterogeneous Graph Transformer. ACM Transactions on Multimedia Computing, Communications, and Applications, 20(12), 1-21. DOI: 10.1145/3688801. Online publication date: 20-Sep-2024.
  • (2024) Dynamic Soft Labeling for Visual Semantic Embedding. Proceedings of the 2024 International Conference on Multimedia Retrieval, 220-228. DOI: 10.1145/3652583.3658068. Online publication date: 30-May-2024.
  • (2024) FELGA: Unsupervised Fragment Embedding for Fine-Grained Cross-Modal Association. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 5623-5633. DOI: 10.1109/WACV57701.2024.00554. Online publication date: 3-Jan-2024.
  • (2024) A Mutually Textual and Visual Refinement Network for Image-Text Matching. IEEE Transactions on Multimedia, 26, 7555-7566. DOI: 10.1109/TMM.2024.3369968. Online publication date: 26-Feb-2024.
