PANDA: Prompt-Based Context- and Indoor-Aware Pretraining for Vision and Language Navigation

  • Conference paper
MultiMedia Modeling (MMM 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14554)

Abstract

Pretrained visual-language models have extensive world knowledge and are widely used in vision-and-language navigation (VLN). However, they are not sensitive to indoor scenarios for VLN tasks. Another challenge for VLN is how the agent understands the contextual relations between actions on a path and performs cross-modal alignment sequentially. In this paper, we propose a novel Prompt-bAsed coNtext- and inDoor-Aware (PANDA) pretraining framework to address these problems. It performs prompting in two stages. In the indoor-aware stage, we apply an efficient tuning paradigm to learn deep visual prompts from an indoor dataset, augmenting the pretrained model with inductive biases towards indoor environments and enabling more sample-efficient adaptation for VLN agents. In the context-aware stage, we design a set of hard context prompts to capture the sequence-level semantics in the instruction; these enable further tuning of the pretrained model via contrastive learning. Experimental results on both R2R and REVERIE show the superiority of PANDA over existing state-of-the-art methods.
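
To make the two-stage design concrete, the sketches below illustrate how each stage could be realized; they are minimal sketches under stated assumptions, not the authors' released implementation. The first sketch treats the indoor-aware stage as deep visual prompt tuning: a small set of learnable prompt tokens is prepended to the token sequence at every layer of a frozen visual encoder, so only the prompts are updated on the indoor data. Class and parameter names (IndoorPromptedViT, num_prompts) are illustrative assumptions.

    import torch
    import torch.nn as nn

    class IndoorPromptedViT(nn.Module):
        """Frozen transformer blocks with per-layer learnable visual prompts (assumed design)."""
        def __init__(self, backbone_layers, embed_dim=768, num_prompts=8):
            super().__init__()
            self.layers = nn.ModuleList(backbone_layers)   # frozen ViT blocks
            for p in self.layers.parameters():
                p.requires_grad = False                     # tune prompts only
            # one independent set of "deep" prompt tokens per layer
            self.prompts = nn.ParameterList(
                [nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)
                 for _ in range(len(self.layers))]
            )

        def forward(self, tokens):                          # tokens: (B, N, D) patch embeddings
            B = tokens.size(0)
            for layer, prompt in zip(self.layers, self.prompts):
                prompt_tok = prompt.unsqueeze(0).expand(B, -1, -1)
                tokens = torch.cat([prompt_tok, tokens], dim=1)  # prepend prompts
                tokens = layer(tokens)
                tokens = tokens[:, prompt.size(0):, :]      # drop prompt outputs before next layer
            return tokens

The second sketch illustrates the context-aware stage: sub-actions parsed from an instruction are wrapped in a hand-crafted ("hard") context prompt, and pooled instruction and trajectory features are aligned with a symmetric InfoNCE-style contrastive loss. The prompt template, pooling, and temperature are assumptions.

    import torch
    import torch.nn.functional as F

    def context_prompt(sub_actions):
        """Wrap parsed sub-actions in a hard (hand-crafted) context prompt; template is assumed."""
        ordinals = ["first", "then", "next", "finally"]
        parts = [f"{ordinals[min(i, 3)]} {a}" for i, a in enumerate(sub_actions)]
        return ", ".join(parts)

    def sequence_contrastive_loss(text_feat, traj_feat, temperature=0.07):
        """Symmetric InfoNCE loss; matched instruction/path pairs share the same batch index."""
        text_feat = F.normalize(text_feat, dim=-1)          # (B, D)
        traj_feat = F.normalize(traj_feat, dim=-1)          # (B, D)
        logits = text_feat @ traj_feat.t() / temperature    # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, targets)
                + F.cross_entropy(logits.t(), targets)) / 2

    # e.g. context_prompt(["walk past the sofa", "turn left", "stop at the sink"])
    # -> "first walk past the sofa, then turn left, next stop at the sink"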

Acknowledgement

This research was partially supported by the National Natural Science Fund of China (Grant Nos. 62306329 and 62103425) and the Natural Science Fund of Hunan Province (Grant Nos. 2023JJ40676 and 2022JJ40559).

Author information

Corresponding author

Correspondence to Yue Hu.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, T., Hu, Y., Wu, W., Wang, Y., Xu, K., Yin, Q. (2024). PANDA: Prompt-Based Context- and Indoor-Aware Pretraining for Vision and Language Navigation. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14554. Springer, Cham. https://doi.org/10.1007/978-3-031-53305-1_15

  • DOI: https://doi.org/10.1007/978-3-031-53305-1_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53304-4

  • Online ISBN: 978-3-031-53305-1

  • eBook Packages: Computer Science (R0)
