Abstract
Pretrained vision-language models have extensive world knowledge and are widely used in vision-and-language navigation (VLN). However, they are not sensitive to the indoor scenarios encountered in VLN tasks. Another challenge for VLN is how the agent understands the contextual relations between actions on a path and performs cross-modal alignment sequentially. In this paper, we propose a novel Prompt-bAsed coNtext- and inDoor-Aware (PANDA) pretraining framework to address these problems. It performs prompting in two stages. In the indoor-aware stage, we apply an efficient tuning paradigm to learn deep visual prompts from an indoor dataset, so as to augment the pretrained models with inductive biases towards indoor environments. This enables more sample-efficient adaptation for VLN agents. In the context-aware stage, we design a set of hard context prompts to capture the sequence-level semantics of the instruction; they enable further tuning of the pretrained models via contrastive learning. Experimental results on both R2R and REVERIE demonstrate the superiority of PANDA over existing state-of-the-art methods.
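To make the two prompting stages concrete, the sketch below illustrates the general idea in PyTorch. It is a minimal illustration, not the authors' implementation: the names (DeepVisualPrompts, context_prompt_loss), the per-layer prompt design over a frozen ViT-style backbone, and the InfoNCE-style contrastive objective are all our assumptions about how deep visual prompt tuning and contrastive prompt tuning are typically realized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeepVisualPrompts(nn.Module):
    """Deep visual prompt tuning (sketch): learnable prompt tokens are
    prepended to the token sequence at every layer of a frozen visual
    backbone, so only the prompts receive gradients."""

    def __init__(self, num_layers: int, num_prompts: int, dim: int):
        super().__init__()
        # One independent set of prompt tokens per transformer layer.
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
             for _ in range(num_layers)]
        )

    def forward(self, x: torch.Tensor, blocks) -> torch.Tensor:
        # x: (batch, seq_len, dim) patch embeddings from the frozen backbone;
        # blocks: the backbone's frozen transformer layers, one per prompt set.
        for prompt, block in zip(self.prompts, blocks):
            batch_prompts = prompt.unsqueeze(0).expand(x.size(0), -1, -1)
            # Prepend this layer's prompts, run the frozen block, then drop
            # the prompt positions before the next layer inserts fresh ones.
            x = block(torch.cat([batch_prompts, x], dim=1))[:, prompt.size(0):, :]
        return x


def context_prompt_loss(path_feats: torch.Tensor,
                        prompt_feats: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style contrastive objective (assumed): pull each trajectory
    feature towards the encoding of its matching context prompt, with the
    other prompts in the batch serving as negatives."""
    v = F.normalize(path_feats, dim=-1)    # (batch, dim)
    t = F.normalize(prompt_feats, dim=-1)  # (batch, dim)
    logits = v @ t.t() / temperature       # pairwise cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy over both matching directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

In such a setup the backbone parameters would be frozen (requires_grad_(False)) and only the prompt parameters optimized, which is what makes the adaptation sample-efficient.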
Acknowledgement
This research was partially supported by the National Natural Science Fund of China (Grant Nos. 62306329 and 62103425) and the Natural Science Fund of Hunan Province (Grant Nos. 2023JJ40676 and 2022JJ40559).