Abstract
Cross-modal retrieval aims to measure semantic similarity across media by aligning heterogeneous features in an intermediate common subspace where they can be meaningfully compared. This rests on the assumption that different modalities express the same underlying semantics. In practice, however, the semantics of an item are usually reflected by multiple concepts, since concepts co-occur in the real world rather than occurring in isolation. This leads to the more challenging task of multi-label cross-modal retrieval, in which, for example, an image is annotated with several concepts as labels. More importantly, the co-occurrence patterns of concepts give rise to correlated pairs of labels, and these relationships need to be considered for accurate cross-modal retrieval. In this paper, we propose multi-label kernel canonical correlation analysis (ml-KCCA), a novel approach to cross-modal retrieval that enhances kernel CCA with the high-level semantic information reflected in multi-label annotations. By kernelizing the extraction of correlations from multi-label information, ml-KCCA can measure more complex non-linear correlations between modalities and thereby learn a discriminative subspace better suited to cross-modal retrieval. Extensive evaluations on public datasets validate the improvements of our approach over state-of-the-art cross-modal retrieval approaches, including other CCA extensions.
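For readers unfamiliar with the building block that ml-KCCA extends, the following is a minimal sketch of standard regularized kernel CCA between two modalities. It is not the ml-KCCA formulation, which the abstract does not specify, and it omits the multi-label enhancement entirely. The function names (`rbf_kernel`, `center_kernel`, `kcca`), the choice of an RBF kernel, the regularizer `reg`, and the synthetic two-view data are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(X, gamma):
    """Gaussian (RBF) kernel matrix for the rows of X (assumed kernel choice)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def center_kernel(K):
    """Center a kernel matrix in feature space: K_c = H K H."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kcca(Kx, Ky, reg=0.1, n_components=2):
    """Regularized kernel CCA on two centered kernel matrices.

    Maximizes a^T Kx Ky b subject to ||(Kx + reg*I) a|| = ||(Ky + reg*I) b|| = 1.
    Substituting u = (Kx + reg*I) a and v = (Ky + reg*I) b turns this into an
    SVD of M = (Kx + reg*I)^{-1} Kx Ky (Ky + reg*I)^{-1}.
    """
    n = Kx.shape[0]
    Rx = Kx + reg * np.eye(n)
    Ry = Ky + reg * np.eye(n)
    left = np.linalg.solve(Rx, Kx)        # Rx^{-1} Kx
    right = np.linalg.solve(Ry, Ky).T     # Ky Ry^{-1} (both matrices are symmetric)
    U, s, Vt = np.linalg.svd(left @ right)
    alpha = np.linalg.solve(Rx, U[:, :n_components])
    beta = np.linalg.solve(Ry, Vt[:n_components].T)
    return alpha, beta, s[:n_components]

# Synthetic two-view data sharing one latent variable (illustrative only).
rng = np.random.default_rng(0)
z = rng.uniform(-1.0, 1.0, size=(60, 1))
X = np.hstack([np.sin(3 * z), z ** 2]) + 0.05 * rng.normal(size=(60, 2))
Y = np.hstack([np.cos(2 * z), z ** 3]) + 0.05 * rng.normal(size=(60, 2))

Kx = center_kernel(rbf_kernel(X, gamma=1.0))
Ky = center_kernel(rbf_kernel(Y, gamma=1.0))
alpha, beta, rho = kcca(Kx, Ky, reg=0.1, n_components=2)

# Training samples projected into the common subspace, one row per sample.
proj_x = Kx @ alpha
proj_y = Ky @ beta
print("regularized canonical correlations:", rho)
```

In a retrieval setting, a query from one modality would be projected with its own kernel evaluations against the training set and compared with projected items from the other modality, for instance by cosine similarity. How label information enters the objective is specific to ml-KCCA and is not reproduced here.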




Acknowledgments
This work is supported by the Natural Science Foundation of China under Grant Nos. 61571453, 61502264, and 61405252; the Natural Science Foundation of Hunan Province, China, under Grant No. 14JJ3010; and the Research Funding of National University of Defense Technology under Grant No. ZK16-03-37.
Additional information
Yuhua Jia and Liang Bai are both first authors.
Cite this article
Jia, Y., Bai, L., Liu, S. et al. Semantically-enhanced kernel canonical correlation analysis: a multi-label cross-modal retrieval. Multimed Tools Appl 78, 13169–13188 (2019). https://doi.org/10.1007/s11042-018-5767-1
DOI: https://doi.org/10.1007/s11042-018-5767-1