
DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-Level Control

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Building a general-purpose intelligent home-assistant agent skilled in diverse tasks specified by human commands is a long-term goal of embodied AI research, posing requirements on task planning, environment modeling, and object interaction. In this work, we study primitive mobile manipulations for embodied agents, i.e., how to navigate and interact based on an instructed verb-noun pair. We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient control. In particular, DISCO incorporates differentiable scene representations with rich object and affordance semantics, which are learned dynamically on the fly and facilitate navigation planning. In addition, we propose dual-level coarse-to-fine action controls that leverage both global and local cues to accomplish mobile manipulation tasks efficiently. DISCO integrates easily into embodied tasks such as embodied instruction following. To validate our approach, we take the ALFRED benchmark of large-scale, long-horizon vision-language navigation and interaction tasks as a test bed. In extensive experiments, we conduct comprehensive evaluations and demonstrate that DISCO outperforms the state of the art by a sizable +8.6% success-rate margin in unseen scenes, even without step-by-step instructions. Our code is publicly released at https://github.com/AllenXuuu/DISCO.
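
The abstract highlights two components: a differentiable scene representation that accumulates object and affordance semantics on the fly, and a dual-level coarse-to-fine controller that plans navigation from global map cues and refines interaction from local cues. The minimal PyTorch sketch below illustrates how such a pipeline could be wired together; it is our own illustrative assumption, not the authors' implementation, and every name in it (SemanticMap, DualLevelController, near_dist, and so on) is hypothetical rather than taken from the paper or its released code.

import torch


class SemanticMap:
    """Top-down grid of per-class scores, accumulated differentiably on the fly (illustrative only)."""

    def __init__(self, num_classes, size=64):
        self.size = size
        self.grid = torch.zeros(num_classes, size, size)  # (C, H, W)

    def update(self, cell_xy, class_probs):
        """Scatter per-point class probabilities into map cells.

        cell_xy:     (N, 2) integer cell indices of egocentric points projected top-down
        class_probs: (N, C) per-point class probabilities from a perception network
        """
        flat_idx = cell_xy[:, 1] * self.size + cell_xy[:, 0]   # (N,) flattened cell index
        flat = self.grid.view(self.grid.shape[0], -1)          # (C, H*W)
        # Out-of-place index_add keeps the autograd graph, so a loss defined on the
        # map can back-propagate into class_probs (the "differentiable" part).
        flat = flat.index_add(1, flat_idx, class_probs.t())
        self.grid = flat.view(self.grid.shape[0], self.size, self.size)

    def best_cell(self, class_id):
        """Coarse cue: the map cell with the highest score for the target class."""
        idx = int(torch.argmax(self.grid[class_id]))
        return idx % self.size, idx // self.size               # (x, y)


class DualLevelController:
    """Coarse level: navigate toward the best global cell.
    Fine level: pick an interaction point from local affordance scores."""

    def __init__(self, semantic_map, near_dist=1.5):
        self.map = semantic_map
        self.near_dist = near_dist  # hypothetical distance threshold for switching levels

    def step(self, agent_xy, target_class, local_affordance):
        goal = torch.tensor(self.map.best_cell(target_class), dtype=torch.float)
        if torch.dist(agent_xy, goal) > self.near_dist:
            direction = goal - agent_xy
            return "navigate", direction / (direction.norm() + 1e-6)   # coarse move toward goal
        w = local_affordance.shape[1]
        idx = int(torch.argmax(local_affordance))
        return "interact", (idx % w, idx // w)                         # fine interaction point


if __name__ == "__main__":
    smap = SemanticMap(num_classes=10)
    ctrl = DualLevelController(smap)
    probs = torch.rand(100, 10, requires_grad=True).softmax(dim=1)     # stand-in perception output
    cells = torch.randint(0, 64, (100, 2))                             # stand-in projected map cells
    smap.update(cells, probs)
    action, arg = ctrl.step(torch.tensor([0.0, 0.0]), 3, torch.rand(32, 32))
    print(action, arg)

The property the sketch tries to capture is that the map update is an ordinary tensor operation, so supervision applied at the map level could train the perception module end to end; the paper itself should be consulted for the actual representation and control design.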

Acknowledgments

This work was supported by the National Key Research and Development Project of China (Nos. 2022ZD0160102 and 2021ZD0110704), the Shanghai Artificial Intelligence Laboratory, and XPLORER PRIZE grants.

Author information

Corresponding authors

Correspondence to Yong-Lu Li or Cewu Lu.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 792 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Xu, X., Luo, S., Yang, Y., Li, YL., Lu, C. (2025). DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-Level Control. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15076. Springer, Cham. https://doi.org/10.1007/978-3-031-72649-1_7

  • DOI: https://doi.org/10.1007/978-3-031-72649-1_7

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72648-4

  • Online ISBN: 978-3-031-72649-1

  • eBook Packages: Computer Science, Computer Science (R0)
