Cross-Media Retrieval by Multimodal Representation Fusion with Deep Networks

Qi, Jinwei; Huang, Xin; Peng, Yuxin

doi:10.1007/978-981-10-4211-9_22

Part of the book series: Communications in Computer and Information Science (CCIS, volume 685)

Included in the following conference series:

International Forum of Digital TV and Wireless Multimedia Communication

Abstract

With the development of computer networks, multimedia and digital transmission technology in recent years, information dissemination has shifted from a mainly text-based form to a multimedia form that includes text, images, video, audio and so on. To meet users' growing demand for access to multimedia information, cross-media retrieval has become a key problem in both research and application. Given a query of any media type, cross-media retrieval returns semantically similar results of all relevant media types. Measuring the similarity between different media types requires learning a good shared representation for multimedia data. Existing methods mainly extract a single-modal representation for each media type and then learn cross-media correlations under a pairwise similar constraint; they therefore cannot make full use of the rich information within each media type, and they ignore the dissimilar constraints between different media types. To address these problems, this paper proposes a deep multimodal learning method (DML) for cross-media shared representation learning. First, we adopt two different deep networks for each media type with multimodal learning, which obtains a high-level semantic representation of each single media type. Then, a two-pathway network is constructed that jointly models the pairwise similar and dissimilar constraints with a contrastive loss to obtain the shared representation. Experiments conducted on two widely used cross-media datasets show the effectiveness of the proposed method.
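The abstract does not give the exact form of the contrastive loss, so the following is only a minimal sketch of the standard margin-based contrastive loss applied to paired image and text representations. It may differ from the formulation used in the paper. The function name `contrastive_loss`, the `margin` hyperparameter, the Euclidean distance and the NumPy implementation are all assumptions made for illustration.

```python
import numpy as np


def contrastive_loss(img_repr, txt_repr, is_similar, margin=1.0):
    """Margin-based contrastive loss over image/text pairs (illustrative only).

    img_repr, txt_repr: arrays of shape (n_pairs, dim), standing in for the
        outputs of the two pathways of a two-pathway network.
    is_similar: array of shape (n_pairs,), 1 for a similar pair, 0 for a
        dissimilar pair.
    margin: hypothetical hyperparameter; dissimilar pairs that are already
        farther apart than this contribute no loss.
    """
    img_repr = np.asarray(img_repr, dtype=float)
    txt_repr = np.asarray(txt_repr, dtype=float)
    is_similar = np.asarray(is_similar, dtype=float)

    # Euclidean distance between the two representations of each pair.
    dist = np.linalg.norm(img_repr - txt_repr, axis=1)

    # Similar constraint: any distance between a similar pair is penalised.
    pull = is_similar * dist ** 2
    # Dissimilar constraint: penalise only pairs closer than the margin.
    push = (1.0 - is_similar) * np.maximum(margin - dist, 0.0) ** 2

    return 0.5 * float(np.mean(pull + push))
```

Under these assumptions, a similar pair with identical representations and a dissimilar pair separated by more than the margin both contribute zero loss, which is the behaviour the jointly modeled similar and dissimilar constraints are meant to encourage.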

Acknowledgments

This work was supported by the National Hi-Tech Research and Development Program of China (863 Program) under Grant 2014AA015102, and by the National Natural Science Foundation of China under Grants 61371128 and 61532005.

Author information

Authors and Affiliations

Institute of Computer Science and Technology, Peking University, Beijing, 100871, China
Jinwei Qi, Xin Huang & Yuxin Peng

Corresponding author

Correspondence to Yuxin Peng.

Editor information

Editors and Affiliations

Shanghai Jiao Tong University, Shanghai, China
Xiaokang Yang
Shanghai Jiao Tong University, Shanghai, China
Guangtao Zhai

About this paper

Cite this paper

Qi, J., Huang, X., Peng, Y. (2017). Cross-Media Retrieval by Multimodal Representation Fusion with Deep Networks. In: Yang, X., Zhai, G. (eds) Digital TV and Wireless Multimedia Communication. IFTC 2016. Communications in Computer and Information Science, vol 685. Springer, Singapore. https://doi.org/10.1007/978-981-10-4211-9_22

DOI: https://doi.org/10.1007/978-981-10-4211-9_22
Published: 12 March 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-4210-2
Online ISBN: 978-981-10-4211-9
eBook Packages: Computer Science, Computer Science (R0)
