Abstract
Pretrained vision-language models have extensive world knowledge and are widely used in vision-and-language navigation (VLN). However, they are not sensitive to the indoor scenarios encountered in VLN tasks. Another challenge for VLN is how the agent understands the contextual relations between actions on a path and performs cross-modal alignment sequentially. In this paper, we propose a novel Prompt-bAsed coNtext- and inDoor-Aware (PANDA) pretraining framework to address these problems. It performs prompting in two stages. In the indoor-aware stage, we apply an efficient tuning paradigm to learn deep visual prompts from an indoor dataset, so as to augment the pretrained models with inductive biases towards indoor environments. This enables more sample-efficient adaptation for VLN agents. In the context-aware stage, we design a set of hard context prompts to capture the sequence-level semantics of the instruction; they enable further tuning of the pretrained models via contrastive learning. Experimental results on both R2R and REVERIE demonstrate the superiority of PANDA over existing state-of-the-art methods.
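To make the two prompting stages concrete, the sketch below illustrates the general idea in PyTorch. It is a minimal illustration, not the authors' implementation: the names (DeepVisualPrompts, context_prompt_loss), the per-layer prompt design over a frozen ViT-style backbone, and the InfoNCE-style contrastive objective are all our assumptions about how deep visual prompt tuning and contrastive prompt tuning are typically realized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepVisualPrompts(nn.Module):
    """Deep visual prompt tuning (sketch): learnable prompt tokens are
    prepended to the token sequence at every layer of a frozen visual
    backbone, so only the prompts receive gradients."""

    def __init__(self, num_layers: int, num_prompts: int, dim: int):
        super().__init__()
        # One independent set of prompt tokens per transformer layer.
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
             for _ in range(num_layers)]
        )

    def forward(self, x: torch.Tensor, blocks) -> torch.Tensor:
        # x: (batch, seq_len, dim) patch embeddings from the frozen backbone;
        # blocks: the backbone's frozen transformer layers, one per prompt set.
        for prompt, block in zip(self.prompts, blocks):
            batch_prompts = prompt.unsqueeze(0).expand(x.size(0), -1, -1)
            # Prepend this layer's prompts, run the frozen block, then drop
            # the prompt positions before the next layer inserts fresh ones.
            x = block(torch.cat([batch_prompts, x], dim=1))[:, prompt.size(0):, :]
        return x


def context_prompt_loss(path_feats: torch.Tensor,
                        prompt_feats: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style contrastive objective (assumed): pull each trajectory
    feature towards the encoding of its matching context prompt, with the
    other prompts in the batch serving as negatives."""
    v = F.normalize(path_feats, dim=-1)    # (batch, dim)
    t = F.normalize(prompt_feats, dim=-1)  # (batch, dim)
    logits = v @ t.t() / temperature       # pairwise cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy over both matching directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

In such a setup the backbone parameters would be frozen (requires_grad_(False)) and only the prompt parameters optimized, which is what makes the adaptation sample-efficient.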
Acknowledgement
This research was partially supported by the National Natural Science Fund of China (Grant Nos. 62306329 and 62103425) and the Natural Science Fund of Hunan Province (Grant Nos. 2023JJ40676 and 2022JJ40559).