Housekeep: Tidying Virtual Households Using Commonsense Reasoning

  • Conference paper in Computer Vision – ECCV 2022 (ECCV 2022)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13699)

Abstract

We introduce Housekeep, a benchmark to evaluate commonsense reasoning in the home for embodied AI. In Housekeep, an embodied agent must tidy a house by rearranging misplaced objects without explicit instructions specifying which objects need to be rearranged. Instead, the agent must learn from and is evaluated against human preferences of which objects belong where in a tidy house. Specifically, we collect a dataset of where humans typically place objects in tidy and untidy houses constituting 1799 objects, 268 object categories, 585 placements, and 105 rooms. Next, we propose a modular baseline approach for Housekeep that integrates planning, exploration, and navigation. It leverages a fine-tuned large language model (LLM) trained on an internet text corpus for effective planning. We find that our baseline planner generalizes to some extent when rearranging objects in unknown environments. See our webpage for code, data and more details: https://yashkant.github.io/housekeep/.

Y. Kant—Work done partially when visiting Georgia Tech.

A. Szot and H. Agrawal—Equal Contribution.
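To make the planning step described above concrete, here is a minimal sketch of how candidate placements can be ranked with a pretrained language model. It is not the authors' implementation: the Housekeep baseline fine-tunes an LLM on the collected human preference data, whereas this illustration substitutes off-the-shelf sentence embeddings (the sentence-transformers library and the all-MiniLM-L6-v2 model are assumptions), and the object and receptacle names are purely hypothetical.

```python
# A minimal sketch (not the paper's fine-tuned LLM): rank candidate receptacles
# for a misplaced object using generic, off-the-shelf sentence embeddings.
# Object and receptacle names below are hypothetical examples.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # generic pretrained encoder

def rank_receptacles(obj: str, receptacles: list[str]) -> list[tuple[str, float]]:
    """Sort candidate receptacles by semantic similarity to the object phrase."""
    obj_emb = model.encode(f"a {obj} in a tidy house", convert_to_tensor=True)
    rec_embs = model.encode(receptacles, convert_to_tensor=True)
    scores = util.cos_sim(obj_emb, rec_embs)[0]  # one similarity per receptacle
    return sorted(zip(receptacles, scores.tolist()), key=lambda p: -p[1])

if __name__ == "__main__":
    candidates = [
        "bathroom sink cabinet",
        "kitchen counter",
        "bedroom nightstand",
        "living room sofa",
    ]
    for receptacle, score in rank_receptacles("toothbrush", candidates):
        print(f"{score:.3f}  {receptacle}")
```

In the full baseline, a placement scorer of this kind is combined with exploration and navigation modules: the agent maps the scene, ranks the discovered receptacles for each misplaced object, and rearranges accordingly.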



Acknowledgements

We thank the Habitat team for their support. The Georgia Tech effort was supported in part by NSF, AFRL, DARPA, ONR YIPs, ARO PECASE, and Amazon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government or any sponsor.

Author information


Corresponding author

Correspondence to Yash Kant.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 8964 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kant, Y. et al. (2022). Housekeep: Tidying Virtual Households Using Commonsense Reasoning. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13699. Springer, Cham. https://doi.org/10.1007/978-3-031-19842-7_21


  • DOI: https://doi.org/10.1007/978-3-031-19842-7_21

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19841-0

  • Online ISBN: 978-3-031-19842-7

  • eBook Packages: Computer Science, Computer Science (R0)
