
Learning discriminative representations for semantical crossmodal retrieval

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

The heterogeneity gap among different modalities is one of the critical issues in multimedia retrieval. Unlike traditional unimodal retrieval, where raw features are extracted and compared directly, the heterogeneous nature of crossmodal tasks requires intrinsic semantic representations to be compared within a unified framework. Based on a flexible "feature up-lifting and down-projecting" mechanism, this paper studies the learning of crossmodal semantic features that can be retrieved across different modalities. Two effective methods are proposed to mine semantic correlations: one for traditional handcrafted features, and the other based on deep neural networks. We treat them, respectively, as the normal and deep versions of our proposed shared discriminative semantic representation learning (SDSRL) framework. We evaluate both methods on two public multimodal datasets for crossmodal and unimodal retrieval tasks. The experimental results demonstrate that the proposed methods outperform the compared baselines and achieve state-of-the-art performance in most scenarios.
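The shared-space idea described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' SDSRL method: the projection matrices `W_img` and `W_txt` below are random placeholders, whereas SDSRL learns discriminative projections so that semantically matched image/text pairs land close together in the unified space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy raw features: 5 images (4096-D, e.g. CNN activations) and
# 5 texts (300-D, e.g. averaged word vectors). Dimensions are illustrative.
X_img = rng.normal(size=(5, 4096))
X_txt = rng.normal(size=(5, 300))

d = 64  # dimension of the shared semantic space

# Placeholder projections into the shared space (random for illustration;
# a real system would learn these from labeled image-text pairs).
W_img = rng.normal(size=(4096, d))
W_txt = rng.normal(size=(300, d))

def embed(X, W):
    """Project raw features into the shared space and L2-normalize rows."""
    Z = X @ W
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

Z_img = embed(X_img, W_img)
Z_txt = embed(X_txt, W_txt)

# Crossmodal retrieval: for a text query, rank all images by cosine
# similarity. Rows are unit-norm, so a dot product is the cosine score.
query = Z_txt[0]
scores = Z_img @ query
ranking = np.argsort(-scores)  # best-matching image indices first
print(ranking)
```

Once both modalities live in one space, unimodal and crossmodal retrieval reduce to the same nearest-neighbor search, which is what allows a single framework to handle both tasks.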


Figures 1–4 (available in the full article).


Notes

  1. http://www.svcl.ucsd.edu/projects/crossmodal/.



Author information

Correspondence to Aiwen Jiang.

Additional information

Communicated by L. Zhang.

This work was funded by the China Scholarship Council and supported by the Natural Science Foundation of China (61365002, 61462045), the Natural Science Foundation of Jiangxi (20142BAB217010), the Science and Technology Project funded by the Education Department of Jiangxi Province (GJJ150350), and the Sponsored Program for Cultivating Youths of Outstanding Ability in Jiangxi Normal University. This work was partly revised during a visit to NICTA.


About this article


Cite this article

Jiang, A., Li, H., Li, Y. et al. Learning discriminative representations for semantical crossmodal retrieval. Multimedia Systems 24, 111–121 (2018). https://doi.org/10.1007/s00530-016-0532-7

