Abstract
We introduce Housekeep, a benchmark for evaluating commonsense reasoning in the home for embodied AI. In Housekeep, an embodied agent must tidy a house by rearranging misplaced objects, without explicit instructions specifying which objects need to be rearranged. Instead, the agent must learn from, and is evaluated against, human preferences about where objects belong in a tidy house. Specifically, we collect a dataset of where humans typically place objects in tidy and untidy houses, comprising 1799 objects, 268 object categories, 585 placements, and 105 rooms. Next, we propose a modular baseline approach for Housekeep that integrates planning, exploration, and navigation. It leverages a fine-tuned large language model (LLM), pretrained on an internet text corpus, for effective planning. We find that our baseline planner generalizes to some extent when rearranging objects in unknown environments. Code, data, and more details are available on our webpage: https://yashkant.github.io/housekeep/.
Y. Kant—Work done partially when visiting Georgia Tech.
A. Szot and H. Agrawal—Equal Contribution.
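To make the planning idea in the abstract concrete: the baseline uses an LLM to decide where misplaced objects belong. The sketch below is not the authors' implementation; it is a minimal illustration, under our own assumptions, of one way a pretrained causal LM can rank candidate receptacles for an object, by scoring a templated placement sentence with the model's token log-likelihood. The model choice (off-the-shelf GPT-2 via Hugging Face transformers), the prompt template, and the function names are all illustrative assumptions.

```python
# Hypothetical sketch (not the Housekeep authors' code): rank candidate
# receptacles for an object by the mean token log-likelihood a pretrained
# causal LM assigns to a templated placement sentence.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def placement_score(obj: str, receptacle: str) -> float:
    """Mean per-token log-likelihood of a templated placement sentence."""
    text = f"In a tidy house, the {obj} belongs in the {receptacle}."
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # cross-entropy (negative log-likelihood) over the sequence.
        out = model(ids, labels=ids)
    return -out.loss.item()

def rank_receptacles(obj: str, candidates: list[str]) -> list[str]:
    """Order candidate receptacles from most to least plausible."""
    return sorted(candidates, key=lambda r: placement_score(obj, r), reverse=True)

print(rank_receptacles("dirty plate",
                       ["kitchen sink", "bathroom shelf", "bedroom dresser"]))
```

A fine-tuned model, as used in the paper, would replace the off-the-shelf weights with ones adapted to human placement preferences; the ranking interface would stay the same.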
Acknowledgements
We thank the Habitat team for their support. The Georgia Tech effort was supported in part by NSF, AFRL, DARPA, ONR YIPs, ARO PECASE, and Amazon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government or any sponsor.
Cite this paper
Kant, Y., et al.: Housekeep: tidying virtual households using commonsense reasoning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) Computer Vision – ECCV 2022. LNCS, vol. 13699. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19842-7_21