Abstract
Building a general-purpose, intelligent home-assistant agent that can carry out diverse tasks from human commands is a long-term goal of embodied AI research, and it places demands on task planning, environment modeling, and object interaction. In this work, we study primitive mobile manipulation for embodied agents, i.e., how to navigate and interact given an instructed verb-noun pair. We propose DISCO, which features non-trivial advances in contextualized scene modeling and efficient control. In particular, DISCO incorporates differentiable scene representations with rich object and affordance semantics, which are learned dynamically on the fly and facilitate navigation planning. In addition, we propose dual-level coarse-to-fine action control that leverages both global and local cues to accomplish mobile manipulation tasks efficiently. DISCO integrates easily into embodied tasks such as embodied instruction following. To validate our approach, we use the ALFRED benchmark of large-scale, long-horizon vision-language navigation and interaction tasks as a test bed. In extensive experiments, we conduct comprehensive evaluations and demonstrate that DISCO outperforms the state of the art by a sizable +8.6% success-rate margin in unseen scenes, even without step-by-step instructions. Our code is publicly released at https://github.com/AllenXuuu/DISCO.
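To give an informal picture of the two ingredients named above, the sketch below maintains a top-down semantic grid whose per-cell class logits are fitted on the fly by gradient descent, then queries it for a coarse navigation goal and a fine one-step adjustment. This is only a minimal PyTorch sketch under assumptions made for illustration, not the DISCO implementation; all names here (SemanticGrid, coarse_goal, fine_step) are invented for this example.

# Illustrative sketch only: an on-the-fly-learned semantic grid plus a
# coarse-to-fine action choice. Not the authors' code; names are invented.
import torch

class SemanticGrid(torch.nn.Module):
    """Top-down H x W grid of per-cell logits over object/affordance classes."""

    def __init__(self, num_classes: int = 10, size: int = 64, lr: float = 0.5):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(num_classes, size, size))
        self.opt = torch.optim.SGD(self.parameters(), lr=lr)

    def update(self, labels: torch.Tensor, visible: torch.Tensor, steps: int = 5):
        """Fit the grid to this step's observation by gradient descent.

        labels:  (H, W) long tensor of observed class ids
        visible: (H, W) bool tensor marking cells seen in the current frame
        """
        for _ in range(steps):
            self.opt.zero_grad()
            loss = torch.nn.functional.cross_entropy(
                self.logits.permute(1, 2, 0)[visible],  # (N_visible, C)
                labels[visible],
            )
            loss.backward()
            self.opt.step()

    def class_prob(self, cls: int) -> torch.Tensor:
        """Per-cell probability that the cell contains class `cls`."""
        return self.logits.softmax(dim=0)[cls]

def coarse_goal(prob: torch.Tensor) -> tuple[int, int]:
    """Coarse level: the grid cell most likely to contain the target class."""
    idx = int(torch.argmax(prob))
    return idx // prob.shape[1], idx % prob.shape[1]

def fine_step(agent: tuple[int, int], goal: tuple[int, int]) -> tuple[int, int]:
    """Fine level: a one-cell greedy move toward the coarse goal."""
    return (max(-1, min(1, goal[0] - agent[0])), max(-1, min(1, goal[1] - agent[1])))

if __name__ == "__main__":
    grid = SemanticGrid()
    labels = torch.randint(0, 10, (64, 64))          # stand-in for a semantic frame
    visible = torch.zeros(64, 64, dtype=torch.bool)
    visible[:32, :] = True                           # only half the scene is observed
    grid.update(labels, visible)
    goal = coarse_goal(grid.class_prob(cls=3))       # e.g. navigate toward class 3
    print("coarse goal cell:", goal, "next fine move:", fine_step((5, 5), goal))

In the paper, the scene representation carries richer object and affordance semantics and the fine level handles manipulation poses; the sketch only mirrors the overall coarse-to-fine structure.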
Acknowledgments
This work was supported by the National Key Research and Development Project of China (Nos. 2022ZD0160102 and 2021ZD0110704), the Shanghai Artificial Intelligence Laboratory, and XPLORER PRIZE grants.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Xu, X., Luo, S., Yang, Y., Li, YL., Lu, C. (2025). DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-Level Control. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15076. Springer, Cham. https://doi.org/10.1007/978-3-031-72649-1_7
DOI: https://doi.org/10.1007/978-3-031-72649-1_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72648-4
Online ISBN: 978-3-031-72649-1
eBook Packages: Computer Science (R0)