Semantically-enhanced kernel canonical correlation analysis: a multi-label cross-modal retrieval

Jia, Yuhua; Bai, Liang; Liu, Shuang; Wang, Peng; Guo, Jinlin; Xie, Yuxiang

doi:10.1007/s11042-018-5767-1

Published: 27 February 2018

Multimedia Tools and Applications, volume 78, pages 13169–13188 (2019)

Abstract

Aiming at measuring inter-media semantic similarities, cross-modal retrieval tries to align heterogeneous features to an intermediate common subspace in which they can be reasonably compared. This relies on the premise that different modalities represent the same underlying semantics. However, semantics are usually reflected by multiple concepts, since in the real world concepts co-occur rather than occur in isolation. This leads to the more challenging task of multi-label cross-modal retrieval, in which, for example, images are annotated with multiple concepts as labels. More importantly, the co-occurrence patterns of concepts give rise to correlated pairs of labels whose relationships must be considered for accurate cross-modal retrieval. In this paper, we propose multi-label kernel canonical correlation analysis (ml-KCCA), a novel approach for cross-modal retrieval that enhances kernel CCA with high-level semantic information reflected in multi-label annotations. By kernelizing correlation extraction from multi-label information, more complex non-linear correlations between different modalities can be measured in order to learn a discriminative subspace that is better suited to cross-modal retrieval tasks. Extensive evaluations on public datasets validate the improvements of our approach over state-of-the-art cross-modal retrieval approaches, including other CCA extensions.
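The abstract builds on kernel CCA (KCCA). As a rough illustration of that base step only, the Python sketch below implements a standard regularized KCCA as a generalized eigenproblem on centered kernel matrices, then ranks items of one modality against the other by cosine similarity in the learned common subspace. It does not reproduce ml-KCCA's use of multi-label annotations, whose formulation is not given here. The RBF kernel, the values of gamma, kappa and dim, the helper names (rbf_kernel, center_train, center_test, kcca, project, retrieve) and the synthetic "image"/"text" data are all illustrative assumptions, not the authors' settings.

```python
import numpy as np
from scipy.linalg import eigh


def rbf_kernel(A, B, gamma):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def center_train(K):
    """Double-center a square training kernel (i.e. center the data in feature space)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


def center_test(K_test, K_train):
    """Center a (test x train) kernel block using the uncentered training kernel."""
    m, n = K_test.shape
    one_mn = np.ones((m, n)) / n
    one_nn = np.ones((n, n)) / n
    return K_test - one_mn @ K_train - K_test @ one_nn + one_mn @ K_train @ one_nn


def kcca(Kx, Ky, dim, kappa=0.1):
    """Regularized kernel CCA on centered n x n kernels Kx, Ky (generic sketch).

    Solves [[0, KxKy], [KyKx, 0]] w = rho * blockdiag((Kx+kI)^2, (Ky+kI)^2) w
    and returns the dual coefficients (alpha, beta) for the `dim` largest rho.
    """
    n = Kx.shape[0]
    I, Z = np.eye(n), np.zeros((n, n))
    A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
    Rx = (Kx + kappa * I) @ (Kx + kappa * I)
    Ry = (Ky + kappa * I) @ (Ky + kappa * I)
    B = np.block([[Rx, Z], [Z, Ry]])
    rho, W = eigh(A, B)                   # generalized eigenvalues, ascending
    top = np.argsort(rho)[::-1][:dim]     # keep the largest correlations
    return W[:n, top], W[n:, top], rho[top]


def project(K_centered, coef):
    """Map samples (centered kernel rows against the training set) to the common subspace."""
    return K_centered @ coef


def retrieve(query_emb, gallery_emb, k):
    """Indices of the top-k gallery items per query, ranked by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    return np.argsort(-(q @ g.T), axis=1)[:, :k]


# Toy paired data: both modalities depend on a shared hidden factor.
rng = np.random.default_rng(0)
n = 60
latent = rng.normal(size=(n, 2))
X = np.hstack([latent, rng.normal(size=(n, 3))])             # toy "image" features
Y = np.hstack([np.tanh(latent), rng.normal(size=(n, 2))])    # toy "text" features
tr, te = slice(0, 50), slice(50, 60)

Kx_tr = rbf_kernel(X[tr], X[tr], gamma=0.5)
Ky_tr = rbf_kernel(Y[tr], Y[tr], gamma=0.5)
alpha, beta, rho = kcca(center_train(Kx_tr), center_train(Ky_tr), dim=2)

# Embed held-out images and texts through their kernels against the training set.
img_emb = project(center_test(rbf_kernel(X[te], X[tr], 0.5), Kx_tr), alpha)
txt_emb = project(center_test(rbf_kernel(Y[te], Y[tr], 0.5), Ky_tr), beta)
ranked = retrieve(img_emb, txt_emb, k=5)   # image -> text, top-5 per query
```

The regularizer kappa matters in practice: with kappa near zero, KCCA on n training points can find spurious correlations close to 1, so it would normally be tuned on validation data. Because KxKy and KyKx are transposes of each other and the right-hand block matrix is positive definite for kappa > 0, a symmetric generalized eigensolver applies directly.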

Acknowledgments

This work is supported by the Natural Science Foundation of China under Grant Nos. 61571453, 61502264, and 61405252, the Natural Science Foundation of Hunan Province, China, under Grant No. 14JJ3010, and the Research Funding of National University of Defense Technology under Grant No. ZK16-03-37.

Author information

Authors and Affiliations

Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha, 410073, China
Yuhua Jia, Liang Bai, Shuang Liu, Jinlin Guo & Yuxiang Xie
National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Peng Wang

Corresponding author

Correspondence to Peng Wang.

Additional information

Yuhua Jia and Liang Bai are both first authors.

About this article

Cite this article

Jia, Y., Bai, L., Liu, S. et al. Semantically-enhanced kernel canonical correlation analysis: a multi-label cross-modal retrieval. Multimed Tools Appl 78, 13169–13188 (2019). https://doi.org/10.1007/s11042-018-5767-1

Received: 30 July 2017
Revised: 26 January 2018
Accepted: 09 February 2018
Published: 27 February 2018
Issue Date: 30 May 2019
DOI: https://doi.org/10.1007/s11042-018-5767-1
