
Semantic consistent adversarial cross-modal retrieval exploiting semantic similarity

Multimedia Tools and Applications

Abstract

Cross-modal retrieval aims to retrieve, given a query from one modality, the semantically similar instances from other modalities. However, differences in distribution and representation across modalities mean that cross-modal similarity cannot be measured directly. To address this problem, we propose a novel semantic consistent adversarial cross-modal retrieval (SC-ACMR) method, which learns semantically consistent representations for different modalities under an adversarial learning framework by considering both intra-modality and inter-modality semantic similarity. Specifically, for intra-modality, we minimize the intra-class distances. For inter-modality, we require the class centers of different modalities with the same semantic label to be as close as possible, and we also minimize the distances between samples and the same-label class centers of the other modalities. Furthermore, we preserve the semantic similarity of the transformed features across modalities through a semantic similarity matrix. Comprehensive experiments on two benchmark datasets show that the proposed method learns more compact semantic representations and achieves better performance than many existing cross-modal retrieval methods.
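To make the loss terms sketched in the abstract concrete, below is a minimal NumPy illustration of the three semantic-consistency terms: intra-modality compactness, inter-modality center alignment, and similarity-matrix preservation. This is a sketch under stated assumptions, not the authors' implementation; all names (img_feat, txt_feat, labels) are hypothetical, and the adversarial modality discriminator and the weighting of the terms are omitted.

```python
# Minimal NumPy sketch of the three semantic-consistency terms described in
# the abstract. Illustrative only -- not the authors' code; the adversarial
# modality classifier and the term weights are omitted.
import numpy as np

def class_centers(feats, labels):
    """Mean feature vector per semantic class, within one modality."""
    return {c: feats[labels == c].mean(axis=0) for c in np.unique(labels)}

def intra_modal_loss(feats, labels):
    """Intra-modality term: pull each sample toward its own class center."""
    centers = class_centers(feats, labels)
    return np.mean([np.sum((f - centers[c]) ** 2)
                    for f, c in zip(feats, labels)])

def inter_modal_loss(img_feat, txt_feat, labels):
    """Inter-modality term: align same-label class centers across modalities
    and pull each sample toward the other modality's same-label center."""
    ci = class_centers(img_feat, labels)
    ct = class_centers(txt_feat, labels)
    center_gap = np.mean([np.sum((ci[c] - ct[c]) ** 2) for c in ci])
    cross_pull = np.mean(
        [np.sum((f - ct[c]) ** 2) for f, c in zip(img_feat, labels)]
        + [np.sum((f - ci[c]) ** 2) for f, c in zip(txt_feat, labels)])
    return center_gap + cross_pull

def similarity_loss(img_feat, txt_feat, labels):
    """Similarity-matrix term: keep cross-modal inner products consistent
    with S, where S[i, j] = 1 iff samples i and j share a label."""
    S = (labels[:, None] == labels[None, :]).astype(float)
    return np.mean((img_feat @ txt_feat.T - S) ** 2)

# Toy usage: two modalities of 8 samples over 4 classes.
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 4))
txt = rng.normal(size=(8, 4))
y = np.array([0, 0, 1, 1, 2, 2, 3, 3])
total = (intra_modal_loss(img, y) + intra_modal_loss(txt, y)
         + inter_modal_loss(img, txt, y) + similarity_loss(img, txt, y))
```

In the paper these terms are optimized jointly with an adversarial discriminator, so that the learned representations are both semantically consistent and hard to distinguish by modality.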



Acknowledgments

Weihua Ou and Quan Zhou are the corresponding authors. This work was supported by the National Natural Science Foundation of China (Nos. 61762021, 61502208, 61876093, 61881240048), the Natural Science Foundation of Guizhou Province (Grant Nos. [2017]1130, [2017]5726-32), the Key Disciplines of Guizhou Province (ZDXK[2016]8), the 2014 Ph.D. Recruitment Program of Guizhou Normal University, the Natural Science Foundation of Jiangsu Province (Grant Nos. BK20150522, BK20181393), the Foundation of Guizhou Educational Department (KY[2016]027), the HIRP Open 2018 Project of Huawei, and the International Postdoctoral Exchange Fellowship Program of the China Postdoctoral Council (No. 20180051).

Author information

Corresponding author

Correspondence to Weihua Ou.

About this article

Cite this article

Ou, W., Xuan, R., Gou, J. et al. Semantic consistent adversarial cross-modal retrieval exploiting semantic similarity. Multimed Tools Appl 79, 14733–14750 (2020). https://doi.org/10.1007/s11042-019-7343-8
