
CAESAR: concept augmentation based semantic representation for cross-modal retrieval

Multimedia Tools and Applications

Abstract

With the growing volume of multimedia data, cross-modal retrieval has attracted increasing attention in multimedia and computer vision. To bridge the semantic gap between multi-modal data and improve retrieval performance, we propose an effective concept augmentation based method, named CAESAR, an end-to-end framework that comprises cross-modal correlation learning and concept augmentation based semantic mapping learning. To enhance representation and correlation learning, we develop a novel multi-modal CNN-based CCA model that captures high-level semantic information during cross-modal feature learning and then captures the maximal nonlinear correlation. In addition, to learn the semantic relationships between multi-modal samples, we propose a concept learning model named CaeNet, which is realized with word2vec and LDA to capture the closer relations between texts and abstract concepts. Reinforced by the abstract concept information, cross-modal semantic mappings are learnt with a semantic alignment strategy. We conduct comprehensive experiments on four benchmark multimedia datasets. The results show that our method performs well on cross-modal retrieval.
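As a rough illustration of the correlation objective that CCA-based models build on, the sketch below implements plain linear CCA with NumPy. It is not the CNN-based nonlinear model described in the abstract; the function name `linear_cca`, the regularization value `reg`, and the synthetic "image" and "text" features are all assumptions made for this example.

```python
import numpy as np

def linear_cca(X, Y, k, reg=1e-4):
    """Linear CCA sketch: returns projections Wx, Wy and the top-k canonical correlations.

    X: (n, dx) features of one modality; Y: (n, dy) features of the other.
    reg is a small ridge term (an assumed value) that keeps the covariances invertible.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Sxx_is @ Sxy @ Syy_is)
    Wx = Sxx_is @ U[:, :k]
    Wy = Syy_is @ Vt.T[:, :k]
    return Wx, Wy, s[:k]

# Synthetic paired data: two "modalities" generated from a shared 2-D latent factor.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))
A = np.array([[1.0, 0.0, 1.0, 0.0, 1.0], [0.0, 1.0, 0.0, 1.0, 0.0]])
B = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
X = z @ A + 0.1 * rng.normal(size=(500, 5))  # stand-in for image features
Y = z @ B + 0.1 * rng.normal(size=(500, 4))  # stand-in for text features

Wx, Wy, corrs = linear_cca(X, Y, k=2)
print(corrs)  # canonical correlations of the two shared directions
```

Projecting both modalities with `Wx` and `Wy` places them in a common space where paired samples are maximally correlated, which is the property a retrieval system can exploit for nearest-neighbour search across modalities.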

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (61702560, 61472450, 61972203), the Key Research Program of Hunan Province (2016JC2018), project (2018JJ3691) of Science and Technology Plan of Hunan Province, and the Research and Innovation Project of Central South University Graduate Students (2018zzts177).

Author information

Corresponding authors

Correspondence to Hao Yu or Jun Long.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Zhu, L., Song, J., Wei, X. et al. CAESAR: concept augmentation based semantic representation for cross-modal retrieval. Multimed Tools Appl 81, 34213–34243 (2022). https://doi.org/10.1007/s11042-020-09983-3

  • DOI: https://doi.org/10.1007/s11042-020-09983-3
