
Learning discriminative representations for semantical crossmodal retrieval

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

The heterogeneity gap among different modalities is one of the critical issues in multimedia retrieval. Unlike traditional unimodal retrieval, where raw features are extracted and compared directly, the heterogeneous nature of crossmodal tasks requires intrinsic semantic representations to be compared within a unified framework. Based on a flexible "feature up-lifting and down-projecting" mechanism, this paper studies the learning of crossmodal semantic features that can be retrieved across different modalities. Two effective methods are proposed to mine semantic correlations: one for traditional handcrafted features, and the other based on deep neural networks. We treat them, respectively, as the normal and deep versions of our proposed shared discriminative semantic representation learning (SDSRL) framework. We evaluate both methods on two public multimodal datasets for crossmodal and unimodal retrieval tasks. The experimental results demonstrate that the proposed methods outperform the compared baselines and achieve state-of-the-art performance in most scenarios.
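The shared-space idea described in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' SDSRL method: the projection matrices `W_img` and `W_txt` below are random placeholders, whereas SDSRL learns discriminative projections so that semantically matched image/text pairs land close together in the unified space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy raw features: 5 images (4096-D, e.g. CNN activations) and
# 5 texts (300-D, e.g. averaged word vectors). Dimensions are illustrative.
X_img = rng.normal(size=(5, 4096))
X_txt = rng.normal(size=(5, 300))

d = 64  # dimension of the shared semantic space

# Placeholder projections into the shared space (random for illustration;
# a real system would learn these from labeled image-text pairs).
W_img = rng.normal(size=(4096, d))
W_txt = rng.normal(size=(300, d))

def embed(X, W):
    """Project raw features into the shared space and L2-normalize rows."""
    Z = X @ W
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

Z_img = embed(X_img, W_img)
Z_txt = embed(X_txt, W_txt)

# Crossmodal retrieval: for a text query, rank all images by cosine
# similarity. Rows are unit-norm, so a dot product is the cosine score.
query = Z_txt[0]
scores = Z_img @ query
ranking = np.argsort(-scores)  # best-matching image indices first
print(ranking)
```

Once both modalities live in one space, unimodal and crossmodal retrieval reduce to the same nearest-neighbor search, which is what allows a single framework to handle both tasks.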


Figures 1–4 (available in the full article).


Notes

  1. http://www.svcl.ucsd.edu/projects/crossmodal/.



Author information

Correspondence to Aiwen Jiang.

Additional information

Communicated by L. Zhang.

This work was funded by the China Scholarship Council and supported by the Natural Science Foundation of China (61365002, 61462045), the Natural Science Foundation of Jiangxi (20142BAB217010), the Science and Technology Project funded by the Education Department of Jiangxi Province (GJJ150350), and the Sponsored Program for Cultivating Youths of Outstanding Ability in Jiangxi Normal University. This work was partly revised during a visit to NICTA.


About this article


Cite this article

Jiang, A., Li, H., Li, Y. et al. Learning discriminative representations for semantical crossmodal retrieval. Multimedia Systems 24, 111–121 (2018). https://doi.org/10.1007/s00530-016-0532-7

