
Deep Neural Architecture for Multi-Modal Retrieval based on Joint Embedding Space for Text and Images

Published: 02 February 2018

Abstract

Recent advances in deep learning and distributed representations of images and text have resulted in the emergence of several neural architectures for cross-modal retrieval tasks, such as searching collections of images in response to textual queries and assigning textual descriptions to images. However, the multi-modal retrieval scenario, in which a query can be either a text or an image and the goal is to retrieve a combined textual fragment and image that should be treated as an atomic unit, has been studied significantly less. In this paper, we propose a gated neural architecture that projects image and keyword queries, as well as multi-modal retrieval units, into the same low-dimensional embedding space and performs semantic matching in this space. The proposed architecture is trained to minimize a structured hinge loss and can be applied to both cross- and multi-modal retrieval. Experimental results for six different cross- and multi-modal retrieval tasks on publicly available datasets indicate superior retrieval accuracy of the proposed architecture in comparison to state-of-the-art baselines.
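The core idea the abstract describes, projecting two modalities into one shared space and training with a margin-based (structured hinge) loss so that relevant units score higher than irrelevant ones, can be sketched minimally as follows. This is not the paper's gated architecture; the projection matrices here are random stand-ins for learned layers, and all dimensions and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: image features (e.g. a CNN output), text features
# (e.g. averaged word embeddings), and the shared embedding space.
D_IMG, D_TXT, D_EMB = 8, 6, 4

# Random projections stand in for the learned modality-specific layers.
W_img = rng.normal(size=(D_IMG, D_EMB))
W_txt = rng.normal(size=(D_TXT, D_EMB))

def embed(x, W):
    """Project a feature vector into the shared space and L2-normalize it,
    so that a dot product between embeddings is a cosine similarity."""
    z = x @ W
    return z / np.linalg.norm(z)

def structured_hinge_loss(query, positive, negatives, margin=0.2):
    """The relevant (positive) unit should score at least `margin` higher
    than every irrelevant (negative) unit; violations are penalized."""
    pos_sim = query @ positive
    neg_sims = np.array([query @ n for n in negatives])
    return np.maximum(0.0, margin - pos_sim + neg_sims).sum()

# Toy example: one image query, one relevant text, two irrelevant texts.
q = embed(rng.normal(size=D_IMG), W_img)
pos = embed(rng.normal(size=D_TXT), W_txt)
negs = [embed(rng.normal(size=D_TXT), W_txt) for _ in range(2)]
loss = structured_hinge_loss(q, pos, negs)
```

At retrieval time the same embeddings support matching in either direction: rank all units (text, image, or joined text-image pairs) by cosine similarity to the embedded query, which is what makes the single shared space usable for both cross- and multi-modal retrieval.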



    Published In

    WSDM '18: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining
    February 2018
    821 pages
    ISBN:9781450355810
    DOI:10.1145/3159652

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cross-modal ir
    2. deep neural networks
    3. multi-modal ir

    Conference

    WSDM 2018

    Acceptance Rates

WSDM '18 paper acceptance rate: 81 of 514 submissions (16%)
Overall acceptance rate: 498 of 2,863 submissions (17%)

    Cited By

    • (2024) AutoAMS: Automated attention-based multi-modal graph learning architecture search. Neural Networks 179, 106427. DOI: 10.1016/j.neunet.2024.106427
    • (2024) SPERM: sequential pairwise embedding recommendation with MI-FGSM. International Journal of Machine Learning and Cybernetics 16(2), 771–787. DOI: 10.1007/s13042-024-02288-z
    • (2021) Considerations about learning Word2Vec. The Journal of Supercomputing. DOI: 10.1007/s11227-021-03743-2
    • (2020) Characterization and classification of semantic image-text relations. International Journal of Multimedia Information Retrieval. DOI: 10.1007/s13735-019-00187-6
    • (2020) Variational Recurrent Sequence-to-Sequence Retrieval for Stepwise Illustration. Advances in Information Retrieval, 50–64. DOI: 10.1007/978-3-030-45439-5_4
    • (2019) Modal-Dependent Retrieval Based on Mid-Level Semantic Enhancement Space. IEEE Access 7, 49906–49917. DOI: 10.1109/ACCESS.2019.2910198
    • (2019) Integration of Images into the Patent Retrieval Process. Advances in Information Retrieval, 359–363. DOI: 10.1007/978-3-030-15719-7_49
    • (2019) "Is This an Example Image?" – Predicting the Relative Abstractness Level of Image and Text. Advances in Information Retrieval, 711–725. DOI: 10.1007/978-3-030-15712-8_46
    • (2018) Attentive Neural Architecture for Ad-hoc Structured Document Retrieval. Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 1173–1182. DOI: 10.1145/3269206.3271801
    • (2018) Multi-modal Preference Modeling for Product Search. Proceedings of the 26th ACM International Conference on Multimedia, 1865–1873. DOI: 10.1145/3240508.3240541
