
A deep semantic framework for multimodal representation learning

Published in: Multimedia Tools and Applications

Abstract

Multimodal representation learning has gained increasing importance in a variety of real-world multimedia applications. Most previous approaches focused on exploring inter-modal correlation by learning a common or intermediate space in a conventional way, e.g. via Canonical Correlation Analysis (CCA). These works neglected the fusion of multiple modalities at a higher semantic level. In this paper, inspired by the success of deep networks in multimedia computing, we propose a novel unified deep neural framework for multimodal representation learning. To capture the high-level semantic correlations across modalities, we adopt deep learning features as the image representation and topic features as the text representation. For joint model learning, a 5-layer neural network is designed, with supervised pre-training enforced in the first 3 layers for intra-modal regularization. Extensive experiments on the benchmark Wikipedia and MIR Flickr 25K datasets show that our approach achieves state-of-the-art results compared to both shallow and deep models in multimodal and cross-modal retrieval.
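The architecture described above (modality-specific stacks for the first 3 layers, then joint layers over the fused representation) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all layer sizes (`IMG_DIM`, `TXT_DIM`, `HID`, `SEM`), the ReLU activation, and the concatenation-based fusion are assumptions for illustration; the supervised pre-training step is only indicated in comments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a CNN image feature and an LDA topic text feature,
# mapped into a shared semantic space of dimension SEM.
IMG_DIM, TXT_DIM, HID, SEM = 4096, 100, 512, 10

def init_layer(n_in, n_out):
    """Small random weights and zero biases for one dense layer."""
    return rng.normal(0.0, 0.01, (n_in, n_out)), np.zeros(n_out)

# Layers 1-3: modality-specific stacks. In the framework described above,
# these layers would first be pre-trained with supervision (class labels)
# for intra-modal regularization before joint training.
img_stack = [init_layer(IMG_DIM, HID), init_layer(HID, HID), init_layer(HID, SEM)]
txt_stack = [init_layer(TXT_DIM, HID), init_layer(HID, HID), init_layer(HID, SEM)]

# Layers 4-5: joint layers over the fused (concatenated) representation.
joint = [init_layer(2 * SEM, SEM), init_layer(SEM, SEM)]

def relu(x):
    return np.maximum(0.0, x)

def forward(stack, x):
    """Run one dense stack with ReLU activations."""
    for W, b in stack:
        x = relu(x @ W + b)
    return x

def fuse(img_feat, txt_feat):
    """Map both modalities into a shared semantic space."""
    h = np.concatenate([forward(img_stack, img_feat),
                        forward(txt_stack, txt_feat)], axis=-1)
    return forward(joint, h)

# One image/text pair mapped to a shared SEM-dimensional representation,
# which could then be compared across modalities for retrieval.
z = fuse(rng.normal(size=IMG_DIM), rng.normal(size=TXT_DIM))
print(z.shape)
```

In a cross-modal retrieval setting, representations produced this way for images and texts can be ranked against each other with a similarity measure (e.g. cosine similarity) in the shared space.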

[Figs. 1-9 omitted.]

Notes

  1. http://www.svcl.ucsd.edu/project/crossmodal/

  2. http://press.liacs.nl/mirflickr/

  3. https://www.flickr.com/

  4. https://github.com/BVLC/caffe/tree/master/models/

  5. The P-R values are read from the graphs.


Author information

Corresponding author

Correspondence to Cheng Wang.


About this article

Cite this article

Wang, C., Yang, H. & Meinel, C. A deep semantic framework for multimodal representation learning. Multimed Tools Appl 75, 9255–9276 (2016). https://doi.org/10.1007/s11042-016-3380-8

