Abstract
Vision-and-language navigation (VLN) is a challenging task in which a robot must autonomously move to a destination based on visual observations, following a human's natural-language instructions. To improve performance and generalization, transformer-based pre-training models are used in place of traditional methods. However, such pre-training models are ill-suited to sustainable computing and practical deployment because of their heavy computation and large hardware footprint. We therefore propose a lightweight pre-training model obtained through knowledge distillation. Through distillation, the rich knowledge encoded in a large "teacher" model is transferred to a small "student" model, greatly reducing the number of parameters and the inference time while largely preserving the original performance. In our experiments, the model size is reduced by 87% and the average inference time by approximately 86%, so the model can be trained and run much faster. At the same time, 95% of the original model's performance is retained, which still surpasses traditional VLN models.
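The teacher-to-student transfer described above is typically realized by matching the student's output distribution to the teacher's temperature-softened distribution. The following is a minimal, framework-free sketch of that loss (function names and toy logits are illustrative, not the paper's implementation):

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature scaling; higher T yields softer targets."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the softened teacher and student distributions.

    Scaled by T^2 so gradient magnitudes stay comparable as T varies,
    following the standard formulation of Hinton et al.
    """
    p = softmax(teacher_logits, temperature)  # soft teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

In practice this soft-target loss is combined with the ordinary hard-label task loss, so the student learns both from the data and from the teacher's inter-class similarities.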
Acknowledgements
This work is sponsored by the Scientific and Technological Innovation 2030 - Major Project of New Generation Artificial Intelligence (No. 2020AAA0109300), the Shanghai Science and Technology Young Talents Sailing Program (No. 19YF1418400), the National Natural Science Foundation of China (No. 62006150), the Shanghai Science and Technology Innovation Action Plan (22S31903700, 21S31904200), the Songjiang District Science and Technology Research Project (No.19SJKJGG83), and Shanghai Local Capacity Enhancement Project (No. 21010501500).
Ethics declarations
Competing interests
The authors declare that they have no conflicts of interest, commercial or associative, in connection with the submitted work.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A: Visualization
This part illustrates the behaviour of our model step by step, with emphasis on the attention mechanism. The blue dotted box in the panorama marks the visual information that should receive the most attention given the textual instruction and the historical trajectory, and the red arrow indicates the selected orientation (the next point). Blurred regions of the image are those the attention mechanism learns to ignore. The coloured words in the instruction indicate the degree of attention: the darker the colour, the more attention the word receives.
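The darker-colour-means-more-attention rendering can be produced by bucketing each word's attention weight into a small number of shade levels. A minimal sketch of that mapping (the function name, token list, and weights are illustrative assumptions, not taken from the paper):

```python
def attention_shades(tokens, weights, levels=4):
    """Map each token's attention weight to a discrete shade level.

    Level 0 corresponds to the lightest colour (least attention) and
    levels-1 to the darkest (most attention). Weights are min-max
    normalized before bucketing.
    """
    lo, hi = min(weights), max(weights)
    span = (hi - lo) or 1.0  # avoid division by zero on uniform weights
    shades = []
    for tok, w in zip(tokens, weights):
        level = int((w - lo) / span * (levels - 1) + 0.5)  # round to nearest bucket
        shades.append((tok, level))
    return shades
```

A renderer would then pick one colour intensity per level, e.g. four shades of the same hue, to produce the highlighted instruction shown in the figures.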
Cite this article
Huang, B., Zhang, S., Huang, J. et al. Knowledge distilled pre-training model for vision-language-navigation. Appl Intell 53, 5607–5619 (2023). https://doi.org/10.1007/s10489-022-03779-8