research-article

LLM-enhanced Scene Graph Learning for Household Rearrangement

Authors:

Kai Xu

SA '24: SIGGRAPH Asia 2024 Conference Papers

Article No.: 32, Pages 1 - 11

https://doi.org/10.1145/3680528.3687607

Published: 03 December 2024

Abstract

The household rearrangement task involves spotting misplaced objects in a scene and moving them to proper places. It depends both on common-sense knowledge on the objective side and on the user's preferences on the subjective side. To accomplish this task, we propose to mine object functionality, aligned with user preference, directly from the scene itself, without relying on human intervention. To do so, we work with a scene graph representation and propose LLM-enhanced scene graph learning, which transforms the input scene graph into an affordance-enhanced graph (AEG) with information-enhanced nodes and newly discovered edges (relations). In the AEG, the nodes corresponding to receptacle objects are augmented with context-induced affordance, which encodes what kinds of carriable objects can be placed on them. The new edges capture newly discovered non-local relations. With the AEG, we perform task planning for scene rearrangement by detecting misplaced carriables and determining a proper placement for each of them. We test our method by implementing a tidying robot in a simulator and evaluate it on a new benchmark that we built. Extensive evaluations demonstrate that our method achieves state-of-the-art performance in misplacement detection and in the subsequent rearrangement planning.
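To make the two planning steps in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes each receptacle node already carries its affordance as a set of accepted carriable categories; in the paper that affordance is inferred from the scene with an LLM, whereas here it is hard-coded. All names (`Receptacle`, `Carriable`, `detect_misplaced`, `plan_rearrangement`) and the example scene are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Receptacle:
    name: str
    # Affordance: categories of carriables this receptacle should hold.
    # Assumption: hard-coded here; the paper derives it with an LLM.
    affordance: Set[str] = field(default_factory=set)


@dataclass
class Carriable:
    name: str
    category: str
    on: str  # name of the receptacle currently holding this object


def detect_misplaced(carriables: List[Carriable],
                     receptacles: List[Receptacle]) -> List[Carriable]:
    """Return carriables whose category is not afforded by their receptacle."""
    by_name = {r.name: r for r in receptacles}
    return [c for c in carriables
            if c.category not in by_name[c.on].affordance]


def plan_rearrangement(carriables: List[Carriable],
                       receptacles: List[Receptacle]
                       ) -> List[Tuple[str, str, Optional[str]]]:
    """For each misplaced carriable, pick the first receptacle that affords it.

    Returns (object, current receptacle, target receptacle) triples; the
    target is None when no receptacle in the scene affords the object.
    """
    plan = []
    for c in detect_misplaced(carriables, receptacles):
        target = next((r.name for r in receptacles
                       if c.category in r.affordance), None)
        plan.append((c.name, c.on, target))
    return plan


# Hypothetical toy scene: a mug left on the desk should go to the sink.
receptacles = [
    Receptacle("desk", {"book", "stationery"}),
    Receptacle("sink", {"dishware"}),
    Receptacle("sofa", {"cushion"}),
]
carriables = [
    Carriable("mug", "dishware", on="desk"),
    Carriable("novel", "book", on="desk"),
    Carriable("pillow", "cushion", on="sofa"),
]
plan = plan_rearrangement(carriables, receptacles)
print(plan)
```

The sketch omits the non-local relations and user-preference alignment that the abstract describes; it only illustrates how per-receptacle affordance can drive both misplacement detection and target selection.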




Information & Contributors

Information

Published In

SA '24: SIGGRAPH Asia 2024 Conference Papers

December 2024

1620 pages

ISBN:9798400711312

DOI:10.1145/3680528

Copyright © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGGRAPH: ACM Special Interest Group on Computer Graphics and Interactive Techniques

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

SA '24

Sponsor:

SIGGRAPH

SA '24: SIGGRAPH Asia 2024 Conference Papers

December 3 - 6, 2024

Tokyo, Japan

Acceptance Rates

Overall Acceptance Rate 178 of 869 submissions, 20%
