Abstract
In object-goal navigation, an agent is steered toward a target object based on its observations. The solution pipeline is usually composed of scene representation learning and navigation policy learning: the former encodes the agent's observations, while the latter determines the navigation actions. This article proposes a unified visual navigation framework, dubbed VTP, which can employ either RGB or RGB-D observations for object-goal visual navigation. Using a unified Visual Transformer Navigation network (VTN), the agent relates image regions to specific objects, producing visual representations that capture both instance-to-instance and instance-to-region relationships. Meanwhile, we utilize depth maps to exploit the spatial relationships between instances and the agent. We also develop a pre-training scheme to associate visual representations with navigation signals. Furthermore, we adopt Tentative Policy Learning (TPL) to guide the agent out of deadlocks. When the agent is detected to be in a deadlock state during training, Tentative Imitation Learning (TIL) provides it with expert demonstrations for escaping, and these demonstrations are learned by a separate Tentative Policy Network (TPN). At test time, when a deadlock occurs, the TPN supplies estimated expert demonstrations to the policy network to find an escape action. Our framework outperforms previous methods in both the iTHOR and RoboTHOR environments.
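The pipeline described above can be summarized in a short sketch. The following PyTorch code is a minimal, hypothetical illustration of the components the abstract names (a VTN-style encoder, a recurrent policy head, and a TPN fallback under deadlock); all module sizes, layer choices, and the deadlock heuristic are our own assumptions, not the authors' released implementation.

```python
# Minimal sketch of the VTP pipeline from the abstract; illustrative only.
import torch
import torch.nn as nn

class VTN(nn.Module):
    """Visual Transformer Navigation network (sketch): relates detected
    instances to each other and to region (grid) features of the image."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.inst2inst = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.inst2region = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, inst, region):
        # inst:   (B, N_instances, dim) detector features; the RGB-D variant
        #         would add depth-derived spatial encodings to these.
        # region: (B, N_regions, dim) grid features of the full observation.
        x = self.inst2inst(inst)                    # instance-to-instance
        x, _ = self.inst2region(x, region, region)  # instance-to-region
        return x.mean(dim=1)                        # pooled visual state

class Agent(nn.Module):
    def __init__(self, dim=256, n_actions=6):
        super().__init__()
        self.vtn = VTN(dim)
        self.rnn = nn.LSTMCell(dim, dim)        # recurrent navigation policy
        self.actor = nn.Linear(dim, n_actions)  # action logits (A3C-style)
        self.critic = nn.Linear(dim, 1)         # state value
        self.tpn = nn.Linear(dim, n_actions)    # Tentative Policy Network

    def step(self, inst, region, hc, deadlocked):
        s = self.vtn(inst, region)
        h, c = self.rnn(s, hc)
        # Under a detected deadlock, act on the TPN's estimated expert
        # demonstration instead of the main policy (test-time behavior).
        logits = self.tpn(h) if deadlocked else self.actor(h)
        return logits, self.critic(h), (h, c)

def is_deadlocked(recent_states, window=4):
    # Illustrative heuristic: the agent keeps revisiting the same
    # discretized pose (states must be hashable, e.g. tuples).
    return len(recent_states) >= window and \
        len(set(recent_states[-window:])) == 1

# Usage (single step with random features):
agent = Agent()
inst, region = torch.randn(1, 8, 256), torch.randn(1, 49, 256)
hc = (torch.zeros(1, 256), torch.zeros(1, 256))
logits, value, hc = agent.step(inst, region, hc, deadlocked=False)
```

Routing actions through a separate TPN head only when deadlock is detected keeps the main policy unchanged in the common case while still giving the agent an escape mechanism, which matches the train-with-TIL, consult-at-test behavior the abstract describes.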
Acknowledgements
This research was funded in part by an ARC Discovery grant (DP220100800 to XY) and an ARC DECRA grant (DE230100477 to XY). We thank all anonymous reviewers and ACs for their constructive suggestions.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Du, H., Huang, Z., Chapman, S., Yu, X. (2024). Toward a Unified Framework for RGB and RGB-D Visual Navigation. In: Liu, T., Webb, G., Yue, L., Wang, D. (eds) AI 2023: Advances in Artificial Intelligence. AI 2023. Lecture Notes in Computer Science(), vol 14472. Springer, Singapore. https://doi.org/10.1007/978-981-99-8391-9_29
DOI: https://doi.org/10.1007/978-981-99-8391-9_29
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8390-2
Online ISBN: 978-981-99-8391-9