Multi-modal Retrieval via Deep Textual-Visual Correlation Learning

  • Conference paper: Intelligence Science and Big Data Engineering. Image and Video Data Engineering (IScIDE 2015)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 9242)

Abstract

In this paper, we consider multi-modal retrieval from the perspective of deep textual-visual learning so as to preserve the correlations between multi-modal data. More specifically, we propose a general multi-modal retrieval algorithm, Deep Textual-Visual correlation learning (DTV), that maximizes the canonical correlations between multi-modal data via deep learning. Given pairs of images and the documents describing them, DTV uses a convolutional neural network to learn visual representations of the images and a dependency-tree recursive neural network (DT-RNN) to learn compositional textual representations of the documents. DTV then projects these visual and textual representations into a common embedding space by matrix-vector canonical correlation analysis (CCA), where each pair of multi-modal data is maximally correlated subject to being uncorrelated with other pairs. Experimental results demonstrate the effectiveness of the proposed DTV when applied to multi-modal retrieval.
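To make the final projection step concrete: the paper's representations come from a CNN and a DT-RNN, but the CCA step that maps the two views into a common space can be illustrated with plain linear CCA on generic feature matrices. The sketch below is ours, not the authors' implementation; the function names and the regularization constant `reg` are assumptions added for numerical stability.

```python
import numpy as np

def _inv_sqrt(C):
    """Symmetric inverse square root via eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def cca_projections(X, Y, k=2, reg=1e-3):
    """Linear CCA sketch: given two views X (n x dx, e.g. image features)
    and Y (n x dy, e.g. text features), return projection matrices Wx, Wy
    and the top-k canonical correlations."""
    # Center each view
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance and cross-covariance estimates
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # SVD of the whitened cross-covariance yields the canonical directions;
    # the singular values are the canonical correlations.
    T = _inv_sqrt(Cxx) @ Cxy @ _inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(T)
    Wx = _inv_sqrt(Cxx) @ U[:, :k]
    Wy = _inv_sqrt(Cyy) @ Vt[:k].T
    return Wx, Wy, s[:k]
```

After fitting, `X @ Wx` and `Y @ Wy` live in the shared embedding space, and cross-modal retrieval reduces to nearest-neighbor search there; in DTV the inputs to this step would be the learned CNN and DT-RNN representations rather than raw features.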



Acknowledgments

This work is supported in part by 973 Program (2012CB316400), NSFC (61402401), Zhejiang Provincial Natural Science Foundation of China (LQ14F010004), Chinese Knowledge Center of Engineering Science and Technology (CKCEST).

Author information

Corresponding author

Correspondence to Fei Wu.


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Song, J., Wang, Y., Wu, F., Lu, W., Tang, S., Zhuang, Y. (2015). Multi-modal Retrieval via Deep Textual-Visual Correlation Learning. In: He, X., et al. Intelligence Science and Big Data Engineering. Image and Video Data Engineering. IScIDE 2015. Lecture Notes in Computer Science(), vol 9242. Springer, Cham. https://doi.org/10.1007/978-3-319-23989-7_19

  • DOI: https://doi.org/10.1007/978-3-319-23989-7_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23987-3

  • Online ISBN: 978-3-319-23989-7

  • eBook Packages: Computer Science (R0)
