
Hierarchical Visual-Textual Knowledge Distillation for Life-Long Correlation Learning

Published in: International Journal of Computer Vision

Abstract

Correlation learning among different types of multimedia data, such as visual and textual content, faces major challenges from two perspectives: cross modal and cross domain. Cross modal refers to the heterogeneous nature of different media types, whose distributions and representations are inconsistent across modalities; this leads to the first challenge, cross-modal similarity measurement. Cross domain refers to the multisource nature of multimedia data, where data from new domains arrive continually; this leads to the second challenge, model storage and retraining. Correlation learning therefore requires a cross-modal continual learning approach in which only data from new domains are used for training while previously learned correlation capabilities are preserved. To address these issues, we introduce the idea of life-long learning into visual-textual cross-modal correlation modeling and propose a visual-textual life-long knowledge distillation (VLKD) approach. Specifically, we construct a hierarchical recurrent network that leverages knowledge at both the semantic and attention levels through adaptive network expansion to support cross-modal retrieval in life-long scenarios across various domains. Extensive experiments on multiple cross-modal datasets from different domains verify the effectiveness of the proposed VLKD approach for life-long cross-modal retrieval.
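
Since the full text is paywalled on this page, the following is a minimal illustrative sketch, not the authors' implementation: it instantiates the two generic distillation signals the abstract names, soft-target semantic distillation in the style of Hinton et al. (2015) and attention-map transfer in the style of Zagoruyko and Komodakis (2016), between a frozen network trained on earlier domains (teacher) and the expanded network trained on a new domain (student). All function names, tensor shapes, and loss weights below are assumptions.

```python
# Illustrative sketch only (assumed names and shapes); not the VLKD authors' code.
import torch
import torch.nn.functional as F

def semantic_distillation(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          T: float = 2.0) -> torch.Tensor:
    """Semantic-level signal: KL divergence between temperature-softened predictions."""
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    # The T**2 factor keeps gradient magnitudes comparable to a hard-label loss.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T ** 2

def attention_distillation(student_attn: torch.Tensor,
                           teacher_attn: torch.Tensor) -> torch.Tensor:
    """Attention-level signal: L2 distance between normalized, flattened attention maps."""
    s = F.normalize(student_attn.flatten(start_dim=1), dim=1)
    t = F.normalize(teacher_attn.flatten(start_dim=1), dim=1)
    return (s - t).pow(2).sum(dim=1).mean()

# When data from a new domain arrives, only that data is used for training;
# a frozen copy of the previously trained network supervises the new one so
# that earlier correlation knowledge survives without replaying old data.
# Hypothetical combined objective (alpha and beta are assumed trade-off weights):
#   loss = retrieval_loss_on_new_domain \
#        + alpha * semantic_distillation(student_logits, teacher_logits) \
#        + beta * attention_distillation(student_attn, teacher_attn)
```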




Author information

Correspondence to Yuxin Peng.

Additional information

Communicated by Josef Sivic.


This work was supported by the National Natural Science Foundation of China under Grant 61925201 and Grant 61771025.


Cite this article

Peng, Y., Qi, J., Ye, Z. et al. Hierarchical Visual-Textual Knowledge Distillation for Life-Long Correlation Learning. Int J Comput Vis 129, 921–941 (2021). https://doi.org/10.1007/s11263-020-01392-1

