Effective deep learning-based multi-modal retrieval

Special Issue Paper · The VLDB Journal

Abstract

Multi-modal retrieval is emerging as a new search paradigm that enables seamless information retrieval from various types of media. For example, users can simply snap a movie poster to search for relevant reviews and trailers. The mainstream solution is to learn a set of mapping functions that project data from different modalities into a common metric space, in which conventional indexing schemes for high-dimensional spaces can be applied. Since the effectiveness of the mapping functions plays an essential role in search quality, in this paper we exploit deep learning techniques to learn effective mapping functions. In particular, we first propose a general learning objective that effectively captures both intramodal and intermodal semantic relationships of data from heterogeneous sources. We then propose two learning algorithms to realize this objective: (1) an unsupervised approach that uses stacked auto-encoders and requires minimal prior knowledge of the training data and (2) a supervised approach that exploits a deep convolutional neural network and a neural language model. Our training algorithms are memory efficient with respect to the data volume: given a large training dataset, we split it into mini-batches and adjust the mapping functions continuously for each batch. Experimental results on three real datasets demonstrate that our proposed methods achieve significant improvements in search accuracy over state-of-the-art solutions.

Notes

  1. https://blog.twitter.com/2012/twitter-turns-six.

  2. http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/.

  3. The binary value for each dimension indicates whether the corresponding tag appears or not.

  4. We tried both the sigmoid and the ReLU activation function for \(s(\cdot)\); ReLU offers better performance.

  5. Note that in our model, we fix the word vectors learned by the SGM. They could also be fine-tuned by integrating the objective of the SGM (Eq. 11) into Eq. 15.

  6. In our experiments, we use the parameters trained by Caffe [18] to initialize the AlexNet so as to accelerate training. We use Gensim (http://radimrehurek.com/gensim/) to train the skip-gram model with 100-dimensional word vectors.

  7. http://www.comp.nus.edu.sg/~wangwei/code.

  8. The code and parameter configurations for CVH and CMSSH are available online at http://www.cse.ust.hk/~dyyeung/code/mlbe.zip. The code for LCMH is provided by the authors. Parameters are set according to the suggestions provided in the paper.

  9. The last layer with two units is for visualization purposes, so that the latent features can be shown in a 2D space.

  10. Here, recall \(r = \frac{1}{\#\,\text{all relevant results}} \approx 0\).

References

  1. Bengio, Y.: Practical recommendations for gradient-based training of deep architectures. CoRR arXiv:1206.5533 (2012)

  2. Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003)

  3. Bengio, Y., Courville, A.C., Vincent, P.: Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1798–1828 (2013)

  4. Bronstein, M.M., Bronstein, A.M., Michel, F., Paragios, N.: Data fusion through cross-modality metric learning using similarity-sensitive hashing. In: The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, pp. 3594–3601. IEEE Computer Society (2010)

  5. Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.T.: NUS-WIDE: a real-world web image database from National University of Singapore. In: Proceedings of the ACM Conference on Image and Video Retrieval (CIVR '09), Santorini, Greece (2009)

  6. Ciresan, D.C., Meier, U., Gambardella, L.M., Schmidhuber, J.: Deep Big Multilayer Perceptrons for Digit Recognition, vol. 7700. Springer, Berlin (2012)

  7. Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Le, Q.V., Mao, M.Z., Ranzato, M., Senior, A.W., Tucker, P.A., Yang, K., Ng, A.Y.: Large scale distributed deep networks. In: Bartlett, P.L., Pereira, F.C.N., Burges, C.J.C., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, Lake Tahoe, Nevada, United States, pp. 1232–1240 (2012)

  8. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: a deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531 (2013)

  9. Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., Mikolov, T.: DeViSE: a deep visual-semantic embedding model. In: Burges, C.J.C., Bottou, L., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, pp. 2121–2129 (2013)

  10. Girshick, R.B., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR 2014, Columbus, OH, USA, pp. 580–587 (2014)

  11. Gong, Y., Jia, Y., Leung, T., Toshev, A., Ioffe, S.: Deep convolutional ranking for multilabel image annotation. CoRR arXiv:1312.4894 (2013a)

  12. Gong, Y., Lazebnik, S., Gordo, A., Perronnin, F.: Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 35(12), 2916–2929 (2013b)

  13. Goroshin, R., LeCun, Y.: Saturating auto-encoder. CoRR arXiv:1301.3577 (2013)

  14. Hinton, G.: A practical guide to training restricted Boltzmann machines. In: Montavon, G., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade, 2nd edn., Lecture Notes in Computer Science, vol. 7700, pp. 599–619. Springer (2012)

  15. Hinton, G., Salakhutdinov, R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)

  16. Hjaltason, G.R., Samet, H.: Index-driven similarity search in metric spaces. ACM Trans. Database Syst. 28(4), 517–580 (2003)

  17. Huiskes, M.J., Lew, M.S.: The MIR Flickr retrieval evaluation. In: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, MIR '08, Vancouver, British Columbia, Canada, pp. 39–43. ACM, New York, USA (2008)

  18. Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R.B., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: Hua, K.A., Rui, Y., Steinmetz, R., Hanjalic, A., Natsev, A., Zhu, W. (eds.) Proceedings of the ACM International Conference on Multimedia, MM ’14, Orlando, FL, USA, pp. 675–678. ACM (2014)

  19. Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech. rep. (2009)

  20. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1106–1114 (2012)

  21. Kumar, S., Udupa, R.: Learning hash functions for cross-view similarity search. In: Walsh, T. (ed.) Proceedings of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, pp. 1360–1365. IJCAI/AAAI (2011)

  22. LeCun, Y., Bottou, L., Orr, G., Müller, K.: Efficient BackProp. In: Orr, G., Müller, K.R. (eds.) Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science, chap. 2, vol. 1524, pp. 9–50. Springer, Berlin (1998)

  23. Liu, D., Hua, X., Yang, L., Wang, M., Zhang, H.: Tag ranking. In: Proceedings of the 18th International Conference on World Wide Web, WWW 2009, Madrid, Spain, pp. 351–360, (2009). doi:10.1145/1526709.1526757

  24. Liu, W., Wang, J., Kumar, S., Chang, S.F.: Hashing with graphs. In: Getoor, L., Scheffer, T. (eds.) Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, pp. 1–8. Omnipress (2011)

  25. Lu, X., Wu, F., Tang, S., Zhang, Z., He, X., Zhuang, Y.: A low rank structural large margin method for cross-modal ranking. In: SIGIR, pp. 433–442 (2013)

  26. van der Maaten, L.: Accelerating t-SNE using tree-based algorithms. J. Mach. Learn. Res. 15, 3221–3245 (2014)

  27. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008)

  28. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: Getoor, L., Scheffer, T. (eds.) Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, pp. 689–696. Omnipress (2011)

  30. Rasiwasia, N., Pereira, J.C., Coviello, E., Doyle, G., Lanckriet, G.R.G., Levy, R., Vasconcelos, N.: A new approach to cross-modal multimedia retrieval. In: ACM Multimedia, pp. 251–260 (2010)

  31. Rifai, S., Vincent, P., Muller, X., Glorot, X., Bengio, Y.: Contractive auto-encoders: explicit invariance during feature extraction. In: ICML, pp. 833–840 (2011)

  32. Salakhutdinov, R., Hinton, G.E.: Semantic hashing. Int. J. Approx. Reason. 50(7), 969–978 (2009)

  33. Socher, R., Manning, C.D.: Deep learning for NLP (without magic). In: Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, pp. 1–3. Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA (2013)

  34. Socher, R., Pennington, J., Huang, E.H., Ng, A.Y., Manning, C.D.: Semi-supervised recursive autoencoders for predicting sentiment distributions. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, John McIntyre Conference Centre, Edinburgh, UK, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 151–161. ACL (2011)

  35. Song, J., Yang, Y., Huang, Z., Shen, H.T., Hong, R.: Multiple feature hashing for real-time large scale near-duplicate video retrieval. In: ACM Multimedia, pp. 423–432 (2011)

  36. Song, J., Yang, Y., Yang, Y., Huang, Z., Shen, H.T.: Inter-media hashing for large-scale retrieval from heterogeneous data sources. In: SIGMOD Conference, pp. 785–796 (2013)

  37. Srivastava, N., Salakhutdinov, R.: Multimodal learning with deep Boltzmann machines. In: NIPS, pp. 2231–2239 (2012)

  38. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: ICML, pp. 1096–1103 (2008)

  39. Wang, W., Ooi, B.C., Yang, X., Zhang, D., Zhuang, Y.: Effective multi-modal retrieval based on stacked auto-encoders. PVLDB 7(8), 649–660 (2014)

  40. Weber, R., Schek, H.J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: Gupta, A., Shmueli, O., Widom, J. (eds.) Proceedings of the 24th International Conference on Very Large Data Bases, New York, USA, pp. 194–205. Morgan Kaufmann (1998)

  41. Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, pp. 1753–1760. Curran Associates, Inc. (2008)

  42. Zhang, D., Agrawal, D., Chen, G., Tung, A.K.H.: HashFile: an efficient index structure for multimedia data. In: ICDE, pp. 1103–1114. IEEE Computer Society, Hannover, Germany (2011)

  43. Zhen, Y., Yeung, D.Y.: A probabilistic model for multimodal hash function learning. In: Yang, Q., Agarwal, D., Pei, J. (eds.) The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, Beijing, China, pp. 940–948. ACM (2012)

  44. Zhu, X., Huang, Z., Shen, H.T., Zhao, X.: Linear cross-modal hashing for efficient multimedia search. In: ACM Multimedia Conference, MM '13, Barcelona, Spain, pp. 143–152 (2013)

  45. Zhuang, Y., Yang, Y., Wu, F.: Mining semantic correlation of heterogeneous multimedia data for cross-media retrieval. IEEE Trans. Multimed. 10(2), 221–229 (2008)

Acknowledgments

This work is supported by A*STAR Project 1321202073. Xiaoyan Yang is supported by Human-Centered Cyber-physical Systems (HCCS) programme by A*STAR in Singapore.

Author information

Corresponding author

Correspondence to Wei Wang.

Appendix

(Algorithm listings a and b — mini-batch SGD and back-propagation; the line numbers referenced below refer to these listings.)

In this section, we present the mini-batch stochastic gradient descent (mini-batch SGD) algorithm and the back-propagation (BP) algorithm [22], which are used throughout this paper to train MSAE and MDNN.

Mini-batch SGD minimizes the objective loss (e.g., \(\mathcal {L}, \mathcal {L}_I,\mathcal {L}_T\)) by updating the parameters of the mapping function(s) based on the gradients of the objective w.r.t. the parameters. Specifically, it iterates over the whole dataset, extracting one mini-batch at a time (Line 4). For each mini-batch, it averages the gradients computed by BP (Line 5) and updates the parameters (Line 6).
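
To make the update concrete, the following is a minimal sketch of mini-batch SGD in Python with NumPy. The names (`backprop`, `params`) and hyper-parameter values are illustrative assumptions, not the configuration used in our experiments.

```python
import numpy as np

def minibatch_sgd(data, params, backprop, lr=0.01, batch_size=128, epochs=10):
    """Minimal mini-batch SGD sketch (illustrative, not the authors' code).

    `backprop(x, params)` is assumed to return one gradient array per
    parameter for a single training example.
    """
    for epoch in range(epochs):
        np.random.shuffle(data)  # visit the whole dataset in random order
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]        # extract a mini-batch (Line 4)
            grads = [backprop(x, params) for x in batch]  # per-example gradients from BP (Line 5)
            for i, p in enumerate(params):
                avg = sum(g[i] for g in grads) / len(batch)  # average over the batch
                p -= lr * avg                                # update the parameters (Line 6)
    return params
```

Because only one mini-batch is materialized at a time, memory usage is bounded by the batch size rather than the dataset size, which is what makes training memory efficient with respect to the data volume.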

BP calculates the gradients of the objective loss (e.g., \(\mathcal {L}\), \(\mathcal {L}_I\), \(\mathcal {L}_T\)) w.r.t. the parameters of the mapping function (e.g., \(f_I, f_T\)) using the chain rule (Eqs. 19, 20). It first forwards the input feature vector through all layers of the mapping function (Line 2). It then propagates the gradients backward according to the chain rule (Lines 4–6), where \(\theta _i\) denotes the parameters of the i-th layer. The gradients are returned at Line 7; a sketch is given after the equations below.

$$\frac{\partial \mathcal {L}}{\partial \theta _i} = \frac{\partial \mathcal {L}}{\partial x_i}\,\frac{\partial x_i}{\partial \theta _i} \qquad (19)$$

$$\frac{\partial \mathcal {L}}{\partial x_{i-1}} = \frac{\partial \mathcal {L}}{\partial x_i}\,\frac{\partial x_i}{\partial x_{i-1}} \qquad (20)$$
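
The chain rule above maps directly onto code. Below is a minimal BP sketch, assuming for illustration a stack of fully connected ReLU layers (our actual mapping functions differ): the forward pass stores each layer's output \(x_i\), and the backward pass applies Eq. 19 to obtain \(\partial \mathcal {L}/\partial \theta _i\) and Eq. 20 to propagate \(\partial \mathcal {L}/\partial x_{i-1}\) down the stack.

```python
import numpy as np

def backprop(x0, weights, grad_loss):
    """Minimal BP sketch for fully connected ReLU layers (illustrative).

    `grad_loss(x_L)` is assumed to return dL/dx_L for the final output.
    Returns dL/dW_i for every layer.
    """
    # Forward pass: push the input through all layers, keeping each x_i (Line 2).
    xs = [x0]
    for W in weights:
        xs.append(np.maximum(0.0, W @ xs[-1]))
    # Backward pass: propagate the gradients via the chain rule (Lines 4-6).
    grads = [None] * len(weights)
    delta = grad_loss(xs[-1])                 # dL/dx_L at the top layer
    for i in reversed(range(len(weights))):
        delta = delta * (xs[i + 1] > 0)       # through the ReLU nonlinearity
        grads[i] = np.outer(delta, xs[i])     # Eq. 19: dL/dtheta_i
        delta = weights[i].T @ delta          # Eq. 20: dL/dx_{i-1}
    return grads                              # gradients returned (Line 7)
```

These per-example gradients are exactly what the mini-batch SGD sketch above averages before each parameter update.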

About this article

Cite this article

Wang, W., Yang, X., Ooi, B.C. et al. Effective deep learning-based multi-modal retrieval. The VLDB Journal 25, 79–101 (2016). https://doi.org/10.1007/s00778-015-0391-4
