
Active Perception for Visual-Language Navigation


Abstract

Visual-language navigation (VLN) requires an agent to carry out navigational instructions inside photo-realistic environments. A key challenge in VLN is conducting robust navigation by mitigating the uncertainty caused by ambiguous instructions and insufficient observation of the environment. Agents trained by current approaches typically suffer from this uncertainty and consequently struggle to take navigation actions at every step. In contrast, when humans face such a challenge, they can still navigate robustly by actively exploring their surroundings to gather more information and thus make more confident navigation decisions. This work draws inspiration from human navigation behavior and endows an agent with an active perception ability for more intelligent navigation. To achieve this, we propose an end-to-end framework for learning an exploration policy that decides (i) when and where to explore, (ii) what information is worth gathering during exploration, and (iii) how to adjust the navigation decision after exploration. In this way, the agent is able to turn its past experiences as well as newly explored knowledge into context for more confident navigation decision making. In addition, an external memory explicitly stores the visited visual environments, allowing the agent to adopt a late action-taking strategy that avoids duplicate exploration and navigation movements. Our experimental results on two standard benchmark datasets show that promising exploration strategies emerge from training, leading to a significant boost in navigation performance.
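
To make the three decisions above concrete, the following is a minimal, self-contained sketch of an explore-then-decide loop of this kind. It is written in plain Python with toy heuristics; the names (ExternalMemory, decision_confidence, navigate) and the word-overlap confidence measure are illustrative assumptions, not the authors' actual model or training procedure.

```python
# Hypothetical sketch of an explore-when-uncertain navigation loop.
# All names and heuristics are illustrative assumptions, not the paper's model.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExternalMemory:
    """Explicitly stores visited views so the agent can skip duplicate exploration."""
    visited: set = field(default_factory=set)

    def add(self, view: str) -> None:
        self.visited.add(view)

    def seen(self, view: str) -> bool:
        return view in self.visited


def decision_confidence(instruction: str, views: List[str]) -> float:
    """Toy stand-in for decision confidence: instruction-word coverage by gathered views."""
    words = set(instruction.lower().split())
    seen_words = set(" ".join(views).lower().split())
    return len(words & seen_words) / max(len(words), 1)


def navigate(instruction: str,
             panorama: Dict[str, List[str]],
             start: str,
             threshold: float = 0.5,
             max_steps: int = 10) -> List[str]:
    """Follow the instruction, exploring neighbours only when the decision is uncertain."""
    memory = ExternalMemory()
    current = start
    path = [current]
    for _ in range(max_steps):
        memory.add(current)
        gathered = [current]
        # (i) when/where to explore: only if the current decision is not confident enough
        if decision_confidence(instruction, gathered) < threshold:
            # (ii) what to gather: neighbouring views that are not already in memory
            for view in panorama.get(current, []):
                if not memory.seen(view):
                    gathered.append(view)
                    memory.add(view)
        # (iii) adjust the navigation decision using the enlarged context
        candidates = panorama.get(current, [])
        if not candidates:
            break
        current = max(candidates,
                      key=lambda v: decision_confidence(instruction, gathered + [v]))
        path.append(current)
    return path


if __name__ == "__main__":
    graph = {"hall": ["kitchen", "bedroom"], "kitchen": ["hall"], "bedroom": ["hall"]}
    print(navigate("walk to the kitchen", graph, start="hall"))
```

The design choice to gate exploration on a confidence threshold mirrors the "when to explore" decision, and the external memory keeps the agent from gathering the same views twice.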


References

  • Anderson, P., Chang, A., Chaplot, D.S., Dosovitskiy, A., Gupta, S., Koltun, V., Kosecka, J., Malik, J., Mottaghi, R., & Savva, M., et al.: (2018) On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757

  • Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., & van den Hengel, A.: (2018) Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition pp. 3674–3683.

  • Andreas, J., Klein, D.: (2015) Alignment-based compositional semantics for instruction following. In Conference on empirical methods in natural language processing pp. 1165–1174.

  • Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh, D.: (2015) VQA: Visual question answering. In Proceedings of the IEEE international conference on computer vision pp. 2425–2433.

  • Bengio, Y., Louradour, J., Collobert, R., Weston, J.: (2009) Curriculum learning. In International conference on machine learning pp. 41–48.

  • Chang, A., Dai, A., Funkhouser, T., Halber, M., Niessner, M., Savva, M., Song, S., Zeng, A., Zhang, Y.: (2017) Matterport3D: Learning from RGB-D data in indoor environments. 3DV pp. 667–676.

  • Chen, D.L., Mooney, R.J.: (2011) Learning to interpret natural language navigation instructions from observations. In The AAAI conference on artificial intelligence pp. 859–865.

  • Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J.M., Parikh, D., Batra, D.: (2017) Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition pp. 326–335.

  • Deng, Z., Narasimhan, K., Russakovsky, O.: (2020) Evolving graphical planner: Contextual global planning for vision-and-language navigation. In Proceedings of the advances in neural information processing systems.

  • Fried, D., Hu, R., Cirik, V., Rohrbach, A., Andreas, J., Morency, L.P., Berg-Kirkpatrick, T., Saenko, K., Klein, D., Darrell, T.: (2018) Speaker-follower models for vision-and-language navigation. In Proceedings of the advances in neural information processing systems pp. 3314–3325.

  • He, K., Zhang, X., Ren, S., Sun, J.: (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition pp. 770–778.

  • Hu, R., Fried, D., Rohrbach, A., Klein, D., Darrell, T., Saenko, K.: (2019) Are you looking? grounding to multiple modalities in vision-and-language navigation. In ACL pp. 6551–6557.

  • Huang, H., Jain, V., Mehta, H., Ku, A., Magalhaes, G., Baldridge, J., Ie, E.: (2019) Transferable representation learning in vision-and-language navigation. In Proceedings of the IEEE/CVF international conference on computer vision pp. 7404–7413.

  • Jain, V., Magalhaes, G., Ku, A., Vaswani, A., Ie, E., Baldridge, J.: (2019) Stay on the path: Instruction fidelity in vision-and-language navigation. In Annual meeting of the association for computational linguistics.

  • Ke, L., Li, X., Bisk, Y., Holtzman, A., Gan, Z., Liu, J., Gao, J., Choi, Y., Srinivasa, S.: (2019) Tactical rewind: Self-correction via backtracking in vision-and-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition pp. 6741–6749.

  • Landi, F., Baraldi, L., Cornia, M., Corsini, M., Cucchiara, R.: (2020) Perceive, transform, and act: Multi-modal attention networks for vision-and-language navigation. arXiv preprint arXiv:1911.12377

  • Liang, C., Wang, W., Zhou, T., Miao, J., Luo, Y., Yang, Y.: (2022) Local-global context aware transformer for language-guided video segmentation. arXiv preprint arXiv:2203.09773

  • Ma, C.Y., Lu, J., Wu, Z., AlRegib, G., Kira, Z., Socher, R., Xiong, C.: (2019) Self-monitoring navigation agent via auxiliary progress estimation. In Proceedings of the international conference on learning representations.

  • Ma, C.Y., Wu, Z., AlRegib, G., Xiong, C., Kira, Z.: (2019) The regretful agent: Heuristic-aided navigation through progress estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition pp. 6732–6740.

  • MacMahon, M., Stankiewicz, B., Kuipers, B.: (2006) Walk the talk: connecting language, knowledge, and action in route instructions. In The AAAI conference on artificial intelligence pp. 1475–1482.

  • Mei, H., Bansal, M., Walter, M.R.: (2016) Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In The AAAI conference on artificial intelligence pp. 2772–2778.

  • Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A.J., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., et al.: (2017) Learning to navigate in complex environments. In Proceedings of the international conference learning representations.

  • Misra, D., Bennett, A., Blukis, V., Niklasson, E., Shatkhin, M., Artzi, Y.: (2018) Mapping instructions to actions in 3d environments with visual goal prediction. In Conference on empirical methods in natural language processing pp. 2667–2678.

  • Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K.: (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning pp. 1928–1937.

  • Qi, Y., Pan, Z., Zhang, S., Hengel, A.v.d., Wu, Q.: (2020) Object-and-action aware model for visual language navigation. In Proceedings of the European conference on computer vision.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.


  • Tan, H., Yu, L., Bansal, M. (2019). Learning to navigate unseen environments: Back translation with environmental dropout. In Proceedings of the conference on North American chapter of the ACL pp. 2610–2621.

  • Tellex, S., Kollar, T., Dickerson, S., Walter, M.R., Banerjee, A.G., Teller, S., Roy, N. (2011). Understanding natural language commands for robotic navigation and mobile manipulation. In The AAAI conference on artificial intelligence pp. 1507–1514.

  • Thomason, J., Gordon, D., Bisk, Y. (2019). Shifting the baseline: Single modality performance on visual navigation & qa. In Proceedings of the conference on north american chapter of the ACL.

  • Wang, H., Liang, W., Shen, J., Van Gool, L., Wang, W. (2022). Counterfactual cycle-consistent learning for instruction following and generation in vision-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.

  • Wang, H., Wang, W., Liang, W., Xiong, C., Shen, J. (2021). Structured scene memory for vision-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition pp. 8455–8464.

  • Wang, H., Wang, W., Shu, T., Liang, W., Shen, J. (2020). Active visual information gathering for vision-language navigation. In European conference on computer vision pp. 307–322.

  • Wang, H., Wang, W., Zhu, X., Dai, J., Wang, L. (2021). Collaborative visual navigation. arXiv preprint arXiv:2107.01151.

  • Wang, X., Huang, Q., Celikyilmaz, A., Gao, J., Shen, D., Wang, Y.F., Wang, W.Y., Zhang, L. (2019). Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition pp. 6629–6638.

  • Wang, X., Xiong, W., Wang, H., Yang Wang, W. (2018). Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In Proceedings of the european conference on computer vision (ECCV) pp. 37–53.

  • Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning pp. 2048–2057.

  • Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L. (2016). Modeling context in referring expressions. In European conference on computer vision pp. 69–85.

  • Zheng, Z., Wang, W., Qi, S., Zhu, S.C. (2019). Reasoning visual dialogs with structural and partial observations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition pp. 6669–6678.

  • Zhu, F., Zhu, Y., Chang, X., Liang, X. (2020). Vision-language navigation with self-supervised auxiliary reasoning tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition pp. 10012–10022.

  • Zhu, Y., Mottaghi, R., Kolve, E., Lim, J.J., Gupta, A., Fei-Fei, L., Farhadi, A. (2017). Target-driven visual navigation in indoor scenes using deep reinforcement learning. In International conference on robotics and automation pp. 3357–3364.


Acknowledgements

Wei Liang acknowledges partial support from the China National Key R&D Program (2021YFB3101900) and the National Natural Science Foundation of China (NSFC) under Grant No. 62172043. Wenguan Wang acknowledges partial support from the Australian Research Council (ARC), DECRA DE220101390. Jianbing Shen acknowledges partial support from the grant SKL-IOTSC(UM)-2021-2023 and the Start-up Research Grant (SRG) of the University of Macau (SRG2022-00023-IOTSC).

Author information

Corresponding author

Correspondence to Jianbing Shen.

Additional information

Communicated by Wenguan Wang and Wei Liang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A preliminary version of this work appeared in ECCV 2020 (Wang et al. 2020b). Our algorithm implementations are available at https://github.com/HanqingWangAI/Active_VLN.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, H., Wang, W., Liang, W. et al. Active Perception for Visual-Language Navigation. Int J Comput Vis 131, 607–625 (2023). https://doi.org/10.1007/s11263-022-01721-6


Keywords

Navigation