
Deep Neural Architecture for Multi-Modal Retrieval based on Joint Embedding Space for Text and Images

Published: 02 February 2018

Abstract

Recent advances in deep learning and distributed representations of images and text have resulted in the emergence of several neural architectures for cross-modal retrieval tasks, such as searching collections of images in response to textual queries and assigning textual descriptions to images. However, the multi-modal retrieval scenario, in which a query can be either a text or an image and the goal is to retrieve a combined textual fragment and image that should be treated as an atomic unit, has been studied significantly less. In this paper, we propose a gated neural architecture that projects image and keyword queries, as well as multi-modal retrieval units, into the same low-dimensional embedding space and performs semantic matching in this space. The proposed architecture is trained to minimize a structured hinge loss and can be applied to both cross- and multi-modal retrieval. Experimental results for six different cross- and multi-modal retrieval tasks on publicly available datasets indicate superior retrieval accuracy of the proposed architecture in comparison to state-of-the-art baselines.
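The core idea the abstract describes, projecting two modalities into one shared space and training with a margin-based (structured hinge) loss so that relevant units score higher than irrelevant ones, can be sketched minimally as follows. This is not the paper's gated architecture; the projection matrices here are random stand-ins for learned layers, and all dimensions and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: image features (e.g. a CNN output), text features
# (e.g. averaged word embeddings), and the shared embedding space.
D_IMG, D_TXT, D_EMB = 8, 6, 4

# Random projections stand in for the learned modality-specific layers.
W_img = rng.normal(size=(D_IMG, D_EMB))
W_txt = rng.normal(size=(D_TXT, D_EMB))

def embed(x, W):
    """Project a feature vector into the shared space and L2-normalize it,
    so that a dot product between embeddings is a cosine similarity."""
    z = x @ W
    return z / np.linalg.norm(z)

def structured_hinge_loss(query, positive, negatives, margin=0.2):
    """The relevant (positive) unit should score at least `margin` higher
    than every irrelevant (negative) unit; violations are penalized."""
    pos_sim = query @ positive
    neg_sims = np.array([query @ n for n in negatives])
    return np.maximum(0.0, margin - pos_sim + neg_sims).sum()

# Toy example: one image query, one relevant text, two irrelevant texts.
q = embed(rng.normal(size=D_IMG), W_img)
pos = embed(rng.normal(size=D_TXT), W_txt)
negs = [embed(rng.normal(size=D_TXT), W_txt) for _ in range(2)]
loss = structured_hinge_loss(q, pos, negs)
```

At retrieval time the same embeddings support matching in either direction: rank all units (text, image, or joined text-image pairs) by cosine similarity to the embedded query, which is what makes the single shared space usable for both cross- and multi-modal retrieval.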



    Published In

    WSDM '18: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining
    February 2018
    821 pages
    ISBN:9781450355810
    DOI:10.1145/3159652

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. cross-modal ir
    2. deep neural networks
    3. multi-modal ir

    Conference

    WSDM 2018

    Acceptance Rates

WSDM '18 paper acceptance rate: 81 of 514 submissions (16%)
Overall acceptance rate: 498 of 2,863 submissions (17%)

    Cited By

    • (2024) AutoAMS: Automated attention-based multi-modal graph learning architecture search. Neural Networks 179, 106427. DOI: 10.1016/j.neunet.2024.106427
    • (2024) SPERM: sequential pairwise embedding recommendation with MI-FGSM. International Journal of Machine Learning and Cybernetics 16(2), 771–787. DOI: 10.1007/s13042-024-02288-z
    • (2021) Considerations about learning Word2Vec. The Journal of Supercomputing. DOI: 10.1007/s11227-021-03743-2
    • (2020) Characterization and classification of semantic image-text relations. International Journal of Multimedia Information Retrieval. DOI: 10.1007/s13735-019-00187-6
    • (2020) Variational Recurrent Sequence-to-Sequence Retrieval for Stepwise Illustration. Advances in Information Retrieval, 50–64. DOI: 10.1007/978-3-030-45439-5_4
    • (2019) Modal-Dependent Retrieval Based on Mid-Level Semantic Enhancement Space. IEEE Access 7, 49906–49917. DOI: 10.1109/ACCESS.2019.2910198
    • (2019) Integration of Images into the Patent Retrieval Process. Advances in Information Retrieval, 359–363. DOI: 10.1007/978-3-030-15719-7_49
    • (2019) "Is This an Example Image?" – Predicting the Relative Abstractness Level of Image and Text. Advances in Information Retrieval, 711–725. DOI: 10.1007/978-3-030-15712-8_46
    • (2018) Attentive Neural Architecture for Ad-hoc Structured Document Retrieval. Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 1173–1182. DOI: 10.1145/3269206.3271801
    • (2018) Multi-modal Preference Modeling for Product Search. Proceedings of the 26th ACM International Conference on Multimedia, 1865–1873. DOI: 10.1145/3240508.3240541
