Multi-modal Retrieval via Deep Textual-Visual Correlation Learning

  • Conference paper: Intelligence Science and Big Data Engineering. Image and Video Data Engineering (IScIDE 2015)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 9242)

Abstract

In this paper, we consider multi-modal retrieval from the perspective of deep textual-visual learning so as to preserve the correlations between multi-modal data. More specifically, we propose a general multi-modal retrieval algorithm, Deep Textual-Visual correlation learning (DTV), that maximizes the canonical correlations between multi-modal data via deep learning. Given pairs of images and the documents describing them, DTV uses a convolutional neural network to learn visual representations of the images and a dependency-tree recursive neural network (DT-RNN) to learn compositional textual representations of the documents. DTV then projects these visual and textual representations into a common embedding space by matrix-vector canonical correlation analysis (CCA), where each pair of multi-modal data is maximally correlated subject to being uncorrelated with other pairs. Experimental results demonstrate the effectiveness of the proposed DTV when applied to multi-modal retrieval.
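To make the final projection step concrete: the paper's representations come from a CNN and a DT-RNN, but the CCA step that maps the two views into a common space can be illustrated with plain linear CCA on generic feature matrices. The sketch below is ours, not the authors' implementation; the function names and the regularization constant `reg` are assumptions added for numerical stability.

```python
import numpy as np

def _inv_sqrt(C):
    """Symmetric inverse square root via eigendecomposition."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def cca_projections(X, Y, k=2, reg=1e-3):
    """Linear CCA sketch: given two views X (n x dx, e.g. image features)
    and Y (n x dy, e.g. text features), return projection matrices Wx, Wy
    and the top-k canonical correlations."""
    # Center each view
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance and cross-covariance estimates
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # SVD of the whitened cross-covariance yields the canonical directions;
    # the singular values are the canonical correlations.
    T = _inv_sqrt(Cxx) @ Cxy @ _inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(T)
    Wx = _inv_sqrt(Cxx) @ U[:, :k]
    Wy = _inv_sqrt(Cyy) @ Vt[:k].T
    return Wx, Wy, s[:k]
```

After fitting, `X @ Wx` and `Y @ Wy` live in the shared embedding space, and cross-modal retrieval reduces to nearest-neighbor search there; in DTV the inputs to this step would be the learned CNN and DT-RNN representations rather than raw features.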



Acknowledgments

This work is supported in part by 973 Program (2012CB316400), NSFC (61402401), Zhejiang Provincial Natural Science Foundation of China (LQ14F010004), Chinese Knowledge Center of Engineering Science and Technology (CKCEST).

Author information

Corresponding author

Correspondence to Fei Wu.


Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Song, J., Wang, Y., Wu, F., Lu, W., Tang, S., Zhuang, Y. (2015). Multi-modal Retrieval via Deep Textual-Visual Correlation Learning. In: He, X., et al. Intelligence Science and Big Data Engineering. Image and Video Data Engineering. IScIDE 2015. Lecture Notes in Computer Science(), vol 9242. Springer, Cham. https://doi.org/10.1007/978-3-319-23989-7_19

  • DOI: https://doi.org/10.1007/978-3-319-23989-7_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23987-3

  • Online ISBN: 978-3-319-23989-7

  • eBook Packages: Computer Science (R0)
