Abstract
Visual-language navigation (VLN) is the task of enabling an agent to carry out navigational instructions inside photo-realistic environments. A key challenge in VLN is conducting robust navigation despite the uncertainty caused by ambiguous instructions and insufficient observation of the environment. Agents trained with current approaches typically suffer from this uncertainty and consequently struggle to select navigation actions at every step. In contrast, when humans face such a challenge, we maintain robust navigation by actively exploring the surroundings to gather more information and thereby make more confident navigation decisions. This work draws inspiration from human navigation behavior and endows an agent with an active perception ability for more intelligent navigation. To achieve this, we propose an end-to-end framework for learning an exploration policy that decides (i) when and where to explore, (ii) what information is worth gathering during exploration, and (iii) how to adjust the navigation decision after the exploration. In this way, the agent can turn both its past experiences and newly explored knowledge into context for more confident navigation decision making. In addition, an external memory explicitly stores the visited visual environments, allowing the agent to adopt a late action-taking strategy that avoids duplicate exploration and navigation movements. Our experimental results on two standard benchmark datasets show that promising exploration strategies emerge from training, leading to a significant boost in navigation performance.
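The explore-then-decide loop summarized above can be sketched as a toy agent. This is a heavily simplified illustration under stated assumptions — the confidence model, the `explore_threshold` value, and all names here are illustrative placeholders, not the paper's actual learned architecture:

```python
class ActiveVLNAgent:
    """Toy sketch of an active-exploration navigation loop.

    A stand-in scoring function replaces the learned policy: a candidate
    viewpoint whose features were already gathered into the external
    memory receives a higher (more confident) score.
    """

    def __init__(self, explore_threshold=0.6):
        self.explore_threshold = explore_threshold
        self.memory = {}  # external memory: viewpoint id -> gathered features

    def action_scores(self, candidates):
        # Hypothetical confidence model: explored candidates score 1.0,
        # unexplored ones 0.5.
        return {c: 0.5 + 0.5 * (c in self.memory) for c in candidates}

    def step(self, candidates, environment):
        scores = self.action_scores(candidates)
        best = max(scores, key=scores.get)
        if scores[best] < self.explore_threshold:
            # (i) when/where: explore only while the decision is uncertain,
            # starting from the least confident candidates.
            for c in sorted(candidates, key=scores.get):
                if c not in self.memory:  # skip duplicate exploration
                    # (ii) what: gather features at the explored viewpoint.
                    self.memory[c] = environment[c]
            # (iii) how: re-score and adjust the navigation decision,
            # i.e., act late, after exploration has finished.
            scores = self.action_scores(candidates)
            best = max(scores, key=scores.get)
        return best, scores[best]
```

Because the memory persists across steps, revisiting a viewpoint never triggers a second exploration — a minimal analogue of the late action-taking strategy described in the abstract.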
References
Anderson, P., Chang, A., Chaplot, D.S., Dosovitskiy, A., Gupta, S., Koltun, V., Kosecka, J., Malik, J., Mottaghi, R., Savva, M., et al.: (2018) On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757
Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., & van den Hengel, A.: (2018) Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE conference on computer vision and pattern recognition pp. 3674–3683.
Andreas, J., Klein, D.: (2015) Alignment-based compositional semantics for instruction following. In Conference on empirical methods in natural language processing pp. 1165–1174.
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh, D.: (2015) VQA: Visual question answering. In Proceedings of the IEEE international conference on computer vision pp. 2425–2433.
Bengio, Y., Louradour, J., Collobert, R., Weston, J.: (2009) Curriculum learning. In International conference on machine learning pp. 41–48.
Chang, A., Dai, A., Funkhouser, T., Halber, M., Niessner, M., Savva, M., Song, S., Zeng, A., Zhang, Y.: (2017) Matterport3D: Learning from RGB-D data in indoor environments. In International conference on 3D vision (3DV) pp. 667–676.
Chen, D.L., Mooney, R.J.: (2011) Learning to interpret natural language navigation instructions from observations. In The AAAI conference on artificial intelligence pp. 859–865.
Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J.M., Parikh, D., Batra, D.: (2017) Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition pp. 326–335.
Deng, Z., Narasimhan, K., Russakovsky, O.: (2020) Evolving graphical planner: Contextual global planning for vision-and-language navigation. In Proceedings of the advances in neural information processing systems.
Fried, D., Hu, R., Cirik, V., Rohrbach, A., Andreas, J., Morency, L.P., Berg-Kirkpatrick, T., Saenko, K., Klein, D., Darrell, T.: (2018) Speaker-follower models for vision-and-language navigation. In Proceedings of the advances in neural information processing systems pp. 3314–3325.
He, K., Zhang, X., Ren, S., Sun, J.: (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition pp. 770–778.
Hu, R., Fried, D., Rohrbach, A., Klein, D., Darrell, T., Saenko, K.: (2019) Are you looking? Grounding to multiple modalities in vision-and-language navigation. In Annual meeting of the association for computational linguistics pp. 6551–6557.
Huang, H., Jain, V., Mehta, H., Ku, A., Magalhaes, G., Baldridge, J., Ie, E.: (2019) Transferable representation learning in vision-and-language navigation. In Proceedings of the IEEE/CVF international conference on computer vision pp. 7404–7413.
Jain, V., Magalhaes, G., Ku, A., Vaswani, A., Ie, E., Baldridge, J.: (2019) Stay on the path: Instruction fidelity in vision-and-language navigation. In Annual meeting of the association for computational linguistics.
Ke, L., Li, X., Bisk, Y., Holtzman, A., Gan, Z., Liu, J., Gao, J., Choi, Y., Srinivasa, S.: (2019) Tactical rewind: Self-correction via backtracking in vision-and-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition pp. 6741–6749.
Landi, F., Baraldi, L., Cornia, M., Corsini, M., Cucchiara, R.: (2020) Perceive, transform, and act: Multi-modal attention networks for vision-and-language navigation. arXiv preprint arXiv:1911.12377
Liang, C., Wang, W., Zhou, T., Miao, J., Luo, Y., Yang, Y.: (2022) Local-global context aware transformer for language-guided video segmentation. arXiv preprint arXiv:2203.09773
Ma, C.Y., Lu, J., Wu, Z., AlRegib, G., Kira, Z., Socher, R., Xiong, C.: (2019) Self-monitoring navigation agent via auxiliary progress estimation. In Proceedings of the international conference on learning representations.
Ma, C.Y., Wu, Z., AlRegib, G., Xiong, C., Kira, Z.: (2019) The regretful agent: Heuristic-aided navigation through progress estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition pp. 6732–6740.
MacMahon, M., Stankiewicz, B., Kuipers, B.: (2006) Walk the talk: connecting language, knowledge, and action in route instructions. In The AAAI conference on artificial intelligence pp. 1475–1482.
Mei, H., Bansal, M., Walter, M.R.: (2016) Listen, attend, and walk: Neural mapping of navigational instructions to action sequences. In The AAAI conference on artificial intelligence pp. 2772–2778.
Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A.J., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., et al.: (2017) Learning to navigate in complex environments. In Proceedings of the international conference learning representations.
Misra, D., Bennett, A., Blukis, V., Niklasson, E., Shatkhin, M., Artzi, Y.: (2018) Mapping instructions to actions in 3d environments with visual goal prediction. In Conference on empirical methods in natural language processing pp. 2667–2678.
Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K.: (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning pp. 1928–1937.
Qi, Y., Pan, Z., Zhang, S., Hengel, A.v.d., Wu, Q.: (2020) Object-and-action aware model for visual language navigation. In Proceedings of the european conference on computer vision.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Tan, H., Yu, L., Bansal, M. (2019). Learning to navigate unseen environments: Back translation with environmental dropout. In Proceedings of the conference of the North American chapter of the ACL pp. 2610–2621.
Tellex, S., Kollar, T., Dickerson, S., Walter, M.R., Banerjee, A.G., Teller, S., Roy, N. (2011). Understanding natural language commands for robotic navigation and mobile manipulation. In The AAAI conference on artificial intelligence pp. 1507–1514.
Thomason, J., Gordon, D., Bisk, Y. (2019). Shifting the baseline: Single modality performance on visual navigation & QA. In Proceedings of the conference of the North American chapter of the ACL.
Wang, H., Liang, W., Shen, J., Van Gool, L., Wang, W. (2022). Counterfactual cycle-consistent learning for instruction following and generation in vision-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.
Wang, H., Wang, W., Liang, W., Xiong, C., Shen, J. (2021). Structured scene memory for vision-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition pp. 8455–8464.
Wang, H., Wang, W., Shu, T., Liang, W., Shen, J. (2020). Active visual information gathering for vision-language navigation. In European conference on computer vision pp. 307–322.
Wang, H., Wang, W., Zhu, X., Dai, J., Wang, L. (2021). Collaborative visual navigation. arXiv preprint arXiv:2107.01151.
Wang, X., Huang, Q., Celikyilmaz, A., Gao, J., Shen, D., Wang, Y.F., Wang, W.Y., Zhang, L. (2019). Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition pp. 6629–6638.
Wang, X., Xiong, W., Wang, H., Wang, W.Y. (2018). Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In Proceedings of the european conference on computer vision (ECCV) pp. 37–53.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning pp. 2048–2057.
Yu, L., Poirson, P., Yang, S., Berg, A.C., Berg, T.L. (2016). Modeling context in referring expressions. In European conference on computer vision pp. 69–85.
Zheng, Z., Wang, W., Qi, S., Zhu, S.C. (2019). Reasoning visual dialogs with structural and partial observations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition pp. 6669–6678.
Zhu, F., Zhu, Y., Chang, X., Liang, X. (2020). Vision-language navigation with self-supervised auxiliary reasoning tasks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition pp. 10012–10022.
Zhu, Y., Mottaghi, R., Kolve, E., Lim, J.J., Gupta, A., Fei-Fei, L., Farhadi, A. (2017). Target-driven visual navigation in indoor scenes using deep reinforcement learning. In International conference on robotics and automation pp. 3357–3364.
Acknowledgements
Wei Liang acknowledges partial support from the China National Key R&D Program (2021YFB3101900) and the National Natural Science Foundation of China (NSFC) under Grant No. 62172043. Wenguan Wang acknowledges partial support from the Australian Research Council (ARC), DECRA DE220101390. Jianbing Shen acknowledges partial support from the grant SKL-IOTSC(UM)-2021-2023 and the Start-up Research Grant (SRG) of the University of Macau (SRG2022-00023-IOTSC).
Additional information
Communicated by Wenguan Wang and Wei Liang.
A preliminary version of this work appeared in ECCV 2020 (Wang et al. 2020b). Our algorithm implementation is available at https://github.com/HanqingWangAI/Active_VLN.
Cite this article
Wang, H., Wang, W., Liang, W. et al. Active Perception for Visual-Language Navigation. Int J Comput Vis 131, 607–625 (2023). https://doi.org/10.1007/s11263-022-01721-6