Knowledge distilled pre-training model for vision-language-navigation


Abstract

Vision-language navigation (VLN) is a challenging task that requires a robot to move autonomously to a destination by following a human’s natural-language instructions on the basis of its visual observations. To improve performance and generalization ability, Transformer-based pre-training models have been adopted in place of traditional methods. However, such pre-training models are ill suited to sustainable computing and practical deployment because of their heavy computation and large hardware footprint. We therefore propose a lightweight pre-training model obtained through knowledge distillation: the rich knowledge encoded in a large “teacher” model is transferred to a small “student” model, which greatly reduces the number of parameters and the inference time while largely preserving the original performance. In our experiments, the model size is reduced by 87% and the average inference time by approximately 86%, so the model can be trained and run much faster, while 95% of the original model’s performance is retained, which still surpasses traditional VLN models.
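To make the teacher-student transfer concrete, below is a minimal sketch of a Hinton-style distillation objective of the kind the abstract describes: the student is trained against the teacher's temperature-softened output distribution in addition to the ground-truth labels. The temperature, loss weighting, and function names here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Soft-target KL term (teacher -> student) plus hard-label cross-entropy."""
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps the soft-target gradients on the same scale
    # as the cross-entropy term when the temperature changes.
    kd = F.kl_div(soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Toy example: the frozen teacher's logits supervise a smaller student,
# so only the student's parameters receive gradients.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)   # produced by the teacher under no_grad
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

In practice the teacher is run in evaluation mode under `torch.no_grad()`, so distillation adds only a forward pass of the large model to each training step.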





Acknowledgements

This work is sponsored by the Scientific and Technological Innovation 2030 - Major Project of New Generation Artificial Intelligence (No. 2020AAA0109300), the Shanghai Science and Technology Young Talents Sailing Program (No. 19YF1418400), the National Natural Science Foundation of China (No. 62006150), the Shanghai Science and Technology Innovation Action Plan (Nos. 22S31903700 and 21S31904200), the Songjiang District Science and Technology Research Project (No. 19SJKJGG83), and the Shanghai Local Capacity Enhancement Project (No. 21010501500).

Author information


Corresponding author

Correspondence to Bo Huang.

Ethics declarations

Competing interests

The authors declare that they have no conflicts of interest in this work, and no commercial or associative interests that could represent a conflict of interest in connection with the submitted work.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Visualization

This appendix shows the process of our model step by step, with emphasis on the attention mechanism. The blue dotted box in the panorama marks the visual information that should receive the most attention given the textual information and the historical trajectory, and the red arrow indicates the specific orientation (the next point). The blurred part of the image is to be ignored according to the attention mechanism. The coloured words in the instruction represent the degree of attention: the darker the colour, the more attention is required.
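As an illustration of how such word-level attention maps can be rendered, the sketch below shades each instruction token by its attention weight, with darker shading meaning more attention. The tokens, weights, and matplotlib-based rendering are our own assumptions for illustration; the paper's visualisation code is not shown.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_word_attention(tokens, weights):
    """Shade each instruction word by its normalised attention weight."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.max()   # darkest word = highest attention
    fig, ax = plt.subplots(figsize=(len(tokens), 1.2))
    ax.axis("off")
    for i, (tok, w) in enumerate(zip(tokens, weights)):
        # Map the weight to a colour: higher weight -> darker shade of blue.
        ax.text(i, 0.5, tok, ha="center", va="center",
                bbox=dict(facecolor=plt.cm.Blues(0.2 + 0.8 * w),
                          edgecolor="none"))
    ax.set_xlim(-0.5, len(tokens) - 0.5)
    ax.set_ylim(0, 1)
    plt.show()

# Hypothetical instruction and attention weights, for illustration only.
plot_word_attention(["walk", "past", "the", "sofa", "then", "stop"],
                    [0.9, 0.3, 0.1, 0.8, 0.2, 0.7])
```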



About this article


Cite this article

Huang, B., Zhang, S., Huang, J. et al. Knowledge distilled pre-training model for vision-language-navigation. Appl Intell 53, 5607–5619 (2023). https://doi.org/10.1007/s10489-022-03779-8


Keywords

Navigation