
Toward a Unified Framework for RGB and RGB-D Visual Navigation

  • Conference paper
  • First Online:
AI 2023: Advances in Artificial Intelligence (AI 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14472)


Abstract

In object-goal navigation, an agent is steered toward a target object based on its observations. The solution pipeline typically comprises scene representation learning and navigation policy learning: the former encodes the agent's observations, while the latter determines the navigation actions. To this end, this article proposes a unified visual navigation framework, dubbed VTP, which can employ either RGB or RGB-D observations for object-goal visual navigation. Using a unified Visual Transformer Navigation network (VTN), the agent analyzes image regions in relation to specific objects, producing visual representations that capture both instance-to-instance and instance-to-region relationships. Meanwhile, we utilize depth maps to model the spatial relationship between instances and the agent. Additionally, we develop a pre-training scheme that associates visual representations with navigation signals. Furthermore, we adopt Tentative Policy Learning (TPL) to guide the agent out of deadlocks. When the agent is detected to be in a deadlock state during training, Tentative Imitation Learning (TIL) provides expert demonstrations for deadlock escape, and these demonstrations are learned by a separate Tentative Policy Network (TPN). At test time, when a deadlock occurs, the TPN supplies estimated expert demonstrations to the policy network to find an escape action. Our system outperforms other methods on both iTHOR and RoboTHOR.
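The deadlock-escape behavior described above can be illustrated with a minimal sketch. This is not the paper's implementation: the deadlock test (revisiting a recent state) is an illustrative proxy for whatever detection criterion the method actually uses, and `base_policy` and `tentative_policy` are hypothetical callables standing in for the navigation policy network and the Tentative Policy Network (TPN).

```python
from collections import deque

class TentativePolicyAgent:
    """Sketch of test-time Tentative Policy Learning: under a detected
    deadlock, the TPN's estimated expert action overrides the base policy."""

    def __init__(self, base_policy, tentative_policy, window=4):
        self.base_policy = base_policy
        self.tentative_policy = tentative_policy
        # Short memory of recent states used by the illustrative deadlock test.
        self.recent_states = deque(maxlen=window)

    def in_deadlock(self, state):
        # Illustrative proxy: the agent revisits a state seen within the window.
        return state in self.recent_states

    def act(self, state):
        deadlocked = self.in_deadlock(state)
        self.recent_states.append(state)
        if deadlocked:
            # Escape action estimated by the (hypothetical) TPN.
            return self.tentative_policy(state)
        return self.base_policy(state)
```

For example, with `base_policy=lambda s: "MoveAhead"` and `tentative_policy=lambda s: "RotateLeft"`, the agent returns "MoveAhead" on fresh states and switches to "RotateLeft" once a state recurs within the window.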



Acknowledgements

This research is funded in part by ARC-Discovery grant (DP220100800 to XY) and ARC-DECRA grant (DE230100477 to XY). We thank all anonymous reviewers and ACs for their constructive suggestions.

Author information


Corresponding author

Correspondence to Xin Yu.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Du, H., Huang, Z., Chapman, S., Yu, X. (2024). Toward a Unified Framework for RGB and RGB-D Visual Navigation. In: Liu, T., Webb, G., Yue, L., Wang, D. (eds) AI 2023: Advances in Artificial Intelligence. AI 2023. Lecture Notes in Computer Science, vol 14472. Springer, Singapore. https://doi.org/10.1007/978-981-99-8391-9_29


  • DOI: https://doi.org/10.1007/978-981-99-8391-9_29

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8390-2

  • Online ISBN: 978-981-99-8391-9

  • eBook Packages: Computer Science, Computer Science (R0)
