Abstract
Cross-modal retrieval must not only bridge the heterogeneity between modalities but also constrain the order in which retrieval results are returned. Accordingly, in this paper we propose a novel common-representation-space learning method for cross-modal retrieval, called Semantic Ranking Structure Preserving (SRSP). First, the dependency relationships between labels are exploited to minimize the discriminative loss of multi-modal data and to mine latent relationships between samples, yielding richer semantic information in the common space. Second, we constrain the correlation ranking of the representations in the common space, which bridges the modality gap and promotes multi-modal correlation learning. Comprehensive experiments show that our algorithm substantially improves retrieval performance and consistently outperforms recent methods on widely used cross-modal benchmark datasets.
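To make the two components described above concrete, the sketch below pairs a label-driven discriminative term with a pairwise ranking constraint on cross-modal similarities in the common space. It is only an illustrative reconstruction under assumed choices (PyTorch, a two-layer projector, binary cross-entropy as the discriminative term, label overlap as the semantic similarity, and a hypothetical margin), not the authors' implementation; the paper's actual losses and network design are given in its method section.

# Illustrative sketch, not the released SRSP code: all sizes, loss forms, and the
# margin below are assumptions made for the example.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpaceProjector(nn.Module):
    """Projects a modality-specific feature into the shared common space."""
    def __init__(self, in_dim, common_dim=512, num_labels=80):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(),
                                  nn.Linear(1024, common_dim))
        self.classifier = nn.Linear(common_dim, num_labels)  # multi-label prediction head

    def forward(self, x):
        z = F.normalize(self.proj(x), dim=1)  # unit-length common-space representation
        return z, self.classifier(z)

def discriminative_loss(logits, labels):
    """Multi-label classification loss on common-space features (a simple stand-in for
    the label-dependency-aware discriminative term described in the abstract)."""
    return F.binary_cross_entropy_with_logits(logits, labels.float())

def ranking_structure_loss(z_img, z_txt, labels, margin=0.1):
    """Encourage cross-modal similarities to follow the semantic (label-overlap) order:
    pairs sharing more labels should obtain higher similarity scores."""
    sim = z_img @ z_txt.t()                              # cosine similarities (features are normalized)
    lab = labels.float()
    sem = (lab @ lab.t()) / lab.sum(1, keepdim=True).clamp(min=1)  # label-overlap similarity
    # For every triple (i, j, k) with sem[i, j] > sem[i, k], ask sim[i, j] >= sim[i, k] + margin.
    diff_sem = sem.unsqueeze(2) - sem.unsqueeze(1)       # sem[i, j] - sem[i, k]
    diff_sim = sim.unsqueeze(2) - sim.unsqueeze(1)       # sim[i, j] - sim[i, k]
    mask = (diff_sem > 0).float()
    return (mask * F.relu(margin - diff_sim)).sum() / mask.sum().clamp(min=1)

# Toy usage with hypothetical feature sizes (4096-d image features, 512-d text features):
img_net, txt_net = CommonSpaceProjector(4096), CommonSpaceProjector(512)
img_feat, txt_feat = torch.randn(8, 4096), torch.randn(8, 512)
labels = (torch.rand(8, 80) > 0.9).long()
z_i, logits_i = img_net(img_feat)
z_t, logits_t = txt_net(txt_feat)
loss = (discriminative_loss(logits_i, labels) + discriminative_loss(logits_t, labels)
        + ranking_structure_loss(z_i, z_t, labels))

In this toy form the total objective is simply the sum of the two terms; how the real method weights and instantiates them is defined in the paper itself.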






Acknowledgements
This work was supported by the National Key R&D Program of China (No. 2017YFB1402400), the National Natural Science Foundation of China (No. 61762025), the Guangxi Key Laboratory of Trusted Software (No. kx202006), the Guangxi Key Laboratory of Optoelectronic Information Processing (No. GD18202), and the Natural Science Foundation of Guangxi Province, China (No. 2019GXNSFDA185007).
Cite this article
Liu, H., Feng, Y., Zhou, M. et al. Semantic ranking structure preserving for cross-modal retrieval. Appl Intell 51, 1802–1812 (2021). https://doi.org/10.1007/s10489-020-01930-x