
Intra-class low-rank regularization for supervised and semi-supervised cross-modal retrieval

Abstract

Cross-modal retrieval aims to retrieve related items across different modalities, for example, using an image query to retrieve related text. Existing deep methods ignore both the intra-modal and inter-modal intra-class low-rank structures when fusing various modalities, which degrades retrieval performance. In this paper, two deep models (denoted ILCMR and Semi-ILCMR) based on intra-class low-rank regularization are proposed for supervised and semi-supervised cross-modal retrieval, respectively. Specifically, ILCMR integrates an image network and a text network into a unified framework that learns a common feature space by imposing three regularization terms to fuse the cross-modal data. First, to align the two modalities in the label space, we utilize semantic consistency regularization to convert the data representations into probability distributions over the classes. Second, we introduce an intra-modal low-rank regularization, which encourages intra-class samples originating from the same space to be more closely related in the common feature space. Third, an inter-modal low-rank regularization is applied to reduce the cross-modal discrepancy. To enable the low-rank regularization to be optimized with automatic gradients during network back-propagation, we propose a rank-r approximation and specify its explicit gradients for theoretical completeness. Because the three regularization terms in ILCMR rely on label information, we further propose Semi-ILCMR for the semi-supervised regime, which imposes a low-rank constraint on the general representations before projecting them into the common feature space. Extensive experiments on four public cross-modal datasets demonstrate the superiority of ILCMR and Semi-ILCMR over other state-of-the-art methods.
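
To make the rank-r approximation mentioned above concrete, here is a minimal PyTorch-style sketch of the low-rank penalty ρ(A, r) derived in Appendix A; the tensor shapes, the target rank, and all names are illustrative assumptions rather than the authors' released implementation.

```python
import torch

def rank_r_penalty(A: torch.Tensor, r: int) -> torch.Tensor:
    """rho(A, r): fraction of singular-value mass beyond the first r values.

    A differentiable surrogate for rank(A), per Eq. (7) in Appendix A;
    minimising it pushes the samples stacked in A toward a rank-r subspace.
    """
    delta = torch.linalg.svdvals(A)      # singular values, descending order
    return delta[r:].sum() / delta.sum()

# Hypothetical usage: A stacks same-class features row-wise, so this acts
# as an intra-class low-rank regularizer in the common feature space.
A = torch.randn(32, 128, requires_grad=True)
loss = rank_r_penalty(A, r=1)
loss.backward()                          # autograd differentiates through the SVD
```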

Acknowledgements

This work is supported by the National Natural Science Foundation of China (No. 62076073), the Guangdong Basic and Applied Basic Research Foundation (No. 2020A1515010616), and the Guangdong Innovative Research Team Program (No. 2014ZT05G157).

Author information

Corresponding authors

Correspondence to Zhenguo Yang or Wenyin Liu.

Ethics declarations

Competing interests

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Derivation of (13)

We recall the approximation of the rank calculation shown in (7):

$$ \operatorname{rank}(A) \approx \rho(A,r) = \frac{\sum_{i=r+1}^{s} \delta_{i}}{\sum_{i=1}^{s} \delta_{i}} = 1 - \frac{\sum_{i=1}^{r} \delta_{i}}{\sum_{i=1}^{s} \delta_{i}}, $$

and combine it with the singular value decomposition (SVD) of A shown in (21):

$$ A = U \varSigma V^{T} = \delta_{1} u_{1} v_{1}^{T} + \delta_{2} u_{2} v_{2}^{T} + \dots + \delta_{s} u_{s} v_{s}^{T}. $$
(21)
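
As a quick numerical sanity check of the rank-one expansion in (21), one can reconstruct A from its SVD terms; this sketch is illustrative and not part of the original derivation.

```python
import torch

# Hedged sanity check of Eq. (21): A equals the sum of its rank-one
# SVD terms delta_i * u_i * v_i^T (illustrative, not from the paper).
A = torch.randn(5, 7)
U, S, Vh = torch.linalg.svd(A, full_matrices=False)
A_rebuilt = sum(S[i] * torch.outer(U[:, i], Vh[i]) for i in range(S.numel()))
assert torch.allclose(A, A_rebuilt, atol=1e-5)
```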

Applying the chain rule, we obtain the derivative with respect to A as shown in (13):

$$ \frac{\partial \operatorname{rank}(A)}{\partial A} \approx \frac{\partial \rho(A,r)}{\partial A} = \sum_{i=1}^{s} \frac{\partial \rho}{\partial \delta_{i}} \cdot \frac{\partial \delta_{i}}{\partial A}. $$

A.1 \(\frac{\partial \rho}{\partial \delta_{i}}\)

For i ∈ [1,r], we have the following derivation from (7):

$$ \frac{\partial \rho}{\partial \delta_{i}} = \frac{0 \cdot \sum_{j=1}^{s} \delta_{j} - 1 \cdot \sum_{j=r+1}^{s} \delta_{j}}{\left(\sum_{j=1}^{s} \delta_{j}\right)^{2}} = - \frac{\sum_{j=r+1}^{s} \delta_{j}}{\left(\sum_{j=1}^{s} \delta_{j}\right)^{2}}, $$
(22)

and for i ∈ [r + 1,s], we have the following derivation from (7):

$$ \frac{\partial \rho}{\partial \delta_{i}} = \frac{1 \cdot \sum_{j=1}^{s} \delta_{j} - 1 \cdot \sum_{j=r+1}^{s} \delta_{j}}{\left(\sum_{j=1}^{s} \delta_{j}\right)^{2}} = \frac{\sum_{j=1}^{r} \delta_{j}}{\left(\sum_{j=1}^{s} \delta_{j}\right)^{2}}. $$
(23)

Therefore, we can derive (14):

$$ \frac{\partial \rho}{\partial \delta_{i}}=\begin{cases} - \dfrac{\sum_{j=r+1}^{s} \delta_{j}}{\left( \sum_{j=1}^{s} \delta_{j} \right)^{2}}, & i=1, \dots, r \\[2ex] \dfrac{\sum_{j=1}^{r} \delta_{j}}{\left( \sum_{j=1}^{s} \delta_{j} \right)^{2}}, & i=r+1, \dots, s. \end{cases} $$
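
The piecewise gradient above can be verified numerically; the sketch below treats ρ as a function of the singular values alone and checks (14) against automatic differentiation, with the vector δ and rank r chosen arbitrarily for illustration.

```python
import torch

def rho(delta: torch.Tensor, r: int) -> torch.Tensor:
    # rho as a function of the singular values alone, per Eq. (7)
    return delta[r:].sum() / delta.sum()

def grad_rho(delta: torch.Tensor, r: int) -> torch.Tensor:
    # Closed-form gradient from Eq. (14)
    total = delta.sum()
    head, tail = delta[:r].sum(), delta[r:].sum()
    g = torch.empty_like(delta)
    g[:r] = -tail / total**2      # i = 1, ..., r
    g[r:] = head / total**2       # i = r+1, ..., s
    return g

delta = torch.tensor([5.0, 3.0, 1.0, 0.5], requires_grad=True)
rho(delta, r=2).backward()
assert torch.allclose(delta.grad, grad_rho(delta.detach(), r=2))
```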

A.2 \(\frac{\partial \delta_{i}}{\partial A}\)

From (21), we have the following equation:

$$ \begin{array}{ll} & \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1d}\\ a_{21} & a_{22} & \cdots & a_{2d}\\ \vdots & \vdots & \ddots & \vdots\\ a_{n1} & a_{n2} & \cdots & a_{nd} \end{pmatrix} \\ = & \delta_{1} \begin{pmatrix} u_{11}v_{11} & u_{11}v_{12} & \dots & u_{11}v_{1d} \\ u_{12}v_{11} & u_{12}v_{12} & \dots & u_{12}v_{1d} \\ \vdots & \vdots & \ddots & \vdots \\ u_{1n}v_{11} & u_{1n}v_{12} & \dots & u_{1n}v_{1d} \end{pmatrix} + \delta_{2} \begin{pmatrix} u_{21}v_{21} & u_{21}v_{22} & \dots & u_{21}v_{2d} \\ u_{22}v_{21} & u_{22}v_{22} & \dots & u_{22}v_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ u_{2n}v_{21} & u_{2n}v_{22} & \dots & u_{2n}v_{2d} \end{pmatrix} \\ & + \dots + \delta_{s} \begin{pmatrix} u_{s1}v_{s1} & u_{s1}v_{s2} & \dots & u_{s1}v_{sd} \\ u_{s2}v_{s1} & u_{s2}v_{s2} & \dots & u_{s2}v_{sd} \\ \vdots & \vdots & \ddots & \vdots \\ u_{sn}v_{s1} & u_{sn}v_{s2} & \dots & u_{sn}v_{sd} \end{pmatrix}. \end{array} $$
(24)

Equivalently, we have

$$ \left\{ \begin{array}{rcl} a_{11} &=& {\sum}_{i=1}^{s} \delta_{i} u_{i1} v_{i1}\\ \\ a_{12} &=& {\sum}_{i=1}^{s} \delta_{i} u_{i1} v_{i2}\\ &\vdots& \\ a_{nd} &=& {\sum}_{i=1}^{s} \delta_{i} u_{in} v_{id} \end{array} .\right. $$
(25)

Then, we can obtain the derivatives as follows:

$$ \left\{ \begin{array}{rcl} \frac{\partial a_{11}}{\partial \delta_{i}} &=& u_{i1} v_{i1} \rightarrow \frac{\partial \delta_{i}}{\partial a_{11}} = \frac{1}{u_{i1} v_{i1}}\\ \\ \frac{\partial a_{12}}{\partial \delta_{i}} &=& u_{i1} v_{i2} \rightarrow \frac{\partial \delta_{i}}{\partial a_{12}} = \frac{1}{u_{i1} v_{i2}}\\ &\vdots& \\ \frac{\partial a_{nd}}{\partial \delta_{i}} &=& u_{in} v_{id} \rightarrow \frac{\partial \delta_{i}}{\partial a_{nd}} = \frac{1}{u_{in} v_{id}} \end{array} .\right. $$
(26)

Therefore, the derivative with respect to A can be obtained as follows:

$$ \frac{\partial \delta_{i}}{\partial A} = 1./ \begin{pmatrix} u_{i1}v_{i1} & u_{i1}v_{i2} & {\dots} & u_{i1}v_{id} \\ u_{i2}v_{i1} & u_{i2}v_{i2} & {\dots} & u_{i2}v_{id} \\ {\vdots} & {\vdots} & {\ddots} & {\vdots} \\ u_{in}v_{i1} & u_{in}v_{i2} & {\dots} & u_{in}v_{id} \end{pmatrix} =1./u_{i} {v_{i}^{T}}, $$
(27)

where ‘./’ denotes the elementwise division operation.
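
Putting (13), (14), and (27) together, the explicit gradient can be assembled as sketched below; this is a direct transcription of the appendix formulas into code (assuming no zero entries in \(u_{i} v_{i}^{T}\)), offered as an illustration rather than the authors' implementation.

```python
import torch

def rho_grad_explicit(A: torch.Tensor, r: int) -> torch.Tensor:
    # Explicit gradient of rho(A, r) w.r.t. A, transcribing Eqs. (13),
    # (14), and (27); assumes u_i v_i^T has no zero entries.
    U, delta, Vh = torch.linalg.svd(A, full_matrices=False)
    total = delta.sum()
    head, tail = delta[:r].sum(), delta[r:].sum()
    grad = torch.zeros_like(A)
    for i in range(delta.numel()):
        d_rho = -tail / total**2 if i < r else head / total**2  # Eq. (14)
        d_delta = 1.0 / torch.outer(U[:, i], Vh[i])             # Eq. (27)
        grad += d_rho * d_delta                                 # Eq. (13)
    return grad
```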


Cite this article

Kang, P., Lin, Z., Yang, Z. et al. Intra-class low-rank regularization for supervised and semi-supervised cross-modal retrieval. Appl Intell 52, 33–54 (2022). https://doi.org/10.1007/s10489-021-02308-3
