
DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-Level Control

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Building a general-purpose intelligent home-assistant agent skilled in diverse tasks specified by human commands is a long-term goal of embodied AI research, posing requirements on task planning, environment modeling, and object interaction. In this work, we study primitive mobile manipulations for embodied agents, i.e., how to navigate and interact based on an instructed verb-noun pair. We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient control. In particular, DISCO incorporates differentiable scene representations with rich object and affordance semantics, which are learned dynamically on the fly and facilitate navigation planning. In addition, we propose dual-level coarse-to-fine action controls that leverage both global and local cues to accomplish mobile manipulation tasks efficiently. DISCO integrates easily into embodied tasks such as embodied instruction following. To validate our approach, we take the ALFRED benchmark of large-scale, long-horizon vision-language navigation and interaction tasks as a test bed. In extensive experiments, we conduct comprehensive evaluations and demonstrate that DISCO outperforms the state of the art by a sizable +8.6% success-rate margin in unseen scenes, even without step-by-step instructions. Our code is publicly released at https://github.com/AllenXuuu/DISCO.
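
The abstract highlights two components: a differentiable scene representation that accumulates object and affordance semantics on the fly, and a dual-level coarse-to-fine controller that plans navigation from global map cues and refines interaction from local cues. The minimal PyTorch sketch below illustrates how such a pipeline could be wired together; it is our own illustrative assumption, not the authors' implementation, and every name in it (SemanticMap, DualLevelController, near_dist, and so on) is hypothetical rather than taken from the paper or its released code.

import torch


class SemanticMap:
    """Top-down grid of per-class scores, accumulated differentiably on the fly (illustrative only)."""

    def __init__(self, num_classes, size=64):
        self.size = size
        self.grid = torch.zeros(num_classes, size, size)  # (C, H, W)

    def update(self, cell_xy, class_probs):
        """Scatter per-point class probabilities into map cells.

        cell_xy:     (N, 2) integer cell indices of egocentric points projected top-down
        class_probs: (N, C) per-point class probabilities from a perception network
        """
        flat_idx = cell_xy[:, 1] * self.size + cell_xy[:, 0]   # (N,) flattened cell index
        flat = self.grid.view(self.grid.shape[0], -1)          # (C, H*W)
        # Out-of-place index_add keeps the autograd graph, so a loss defined on the
        # map can back-propagate into class_probs (the "differentiable" part).
        flat = flat.index_add(1, flat_idx, class_probs.t())
        self.grid = flat.view(self.grid.shape[0], self.size, self.size)

    def best_cell(self, class_id):
        """Coarse cue: the map cell with the highest score for the target class."""
        idx = int(torch.argmax(self.grid[class_id]))
        return idx % self.size, idx // self.size               # (x, y)


class DualLevelController:
    """Coarse level: navigate toward the best global cell.
    Fine level: pick an interaction point from local affordance scores."""

    def __init__(self, semantic_map, near_dist=1.5):
        self.map = semantic_map
        self.near_dist = near_dist  # hypothetical distance threshold for switching levels

    def step(self, agent_xy, target_class, local_affordance):
        goal = torch.tensor(self.map.best_cell(target_class), dtype=torch.float)
        if torch.dist(agent_xy, goal) > self.near_dist:
            direction = goal - agent_xy
            return "navigate", direction / (direction.norm() + 1e-6)   # coarse move toward goal
        w = local_affordance.shape[1]
        idx = int(torch.argmax(local_affordance))
        return "interact", (idx % w, idx // w)                         # fine interaction point


if __name__ == "__main__":
    smap = SemanticMap(num_classes=10)
    ctrl = DualLevelController(smap)
    probs = torch.rand(100, 10, requires_grad=True).softmax(dim=1)     # stand-in perception output
    cells = torch.randint(0, 64, (100, 2))                             # stand-in projected map cells
    smap.update(cells, probs)
    action, arg = ctrl.step(torch.tensor([0.0, 0.0]), 3, torch.rand(32, 32))
    print(action, arg)

The property the sketch tries to capture is that the map update is an ordinary tensor operation, so supervision applied at the map level could train the perception module end to end; the paper itself should be consulted for the actual representation and control design.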

Acknowledgments

This work was supported by the National Key Research and Development Project of China (Nos. 2022ZD0160102 and 2021ZD0110704), the Shanghai Artificial Intelligence Laboratory, and XPLORER PRIZE grants.

Author information

Corresponding authors

Correspondence to Yong-Lu Li or Cewu Lu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 792 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Xu, X., Luo, S., Yang, Y., Li, YL., Lu, C. (2025). DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-Level Control. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15076. Springer, Cham. https://doi.org/10.1007/978-3-031-72649-1_7

  • DOI: https://doi.org/10.1007/978-3-031-72649-1_7

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72648-4

  • Online ISBN: 978-3-031-72649-1

  • eBook Packages: Computer Science, Computer Science (R0)
