Abstract
Synthesizing natural interactions between virtual humans and their 3D environments is critical for numerous applications, such as computer games and AR/VR experiences. Recent methods mainly focus on modeling the geometric relations between 3D environments and humans, while the high-level semantics of the human-scene interaction are frequently ignored. Our goal is to synthesize humans interacting with a given 3D scene, controlled by high-level semantic specifications given as pairs of action categories and object instances, e.g., “sit on the chair”. The key challenge of incorporating interaction semantics into the generation framework is to learn a joint representation that effectively captures heterogeneous information, including human body articulation, 3D object geometry, and the intent of the interaction. To address this challenge, we design a novel transformer-based generative model, in which the articulated 3D human body surface points and 3D objects are jointly encoded in a unified latent space, and the semantics of the interaction between the human and objects are embedded via positional encoding. Furthermore, inspired by the compositional nature of interactions, in which humans can simultaneously interact with multiple objects, we define interaction semantics as the composition of varying numbers of atomic action-object pairs. Our proposed generative model can naturally incorporate varying numbers of atomic interactions, which enables synthesizing compositional human-scene interactions without requiring composite interaction data. We extend the PROX dataset with interaction semantic labels and scene instance segmentation to evaluate our method, and demonstrate that our method can generate realistic human-scene interactions with semantic control. Our perceptual study shows that our synthesized virtual humans interact naturally with 3D scenes, considerably outperforming existing methods. We name our method COINS, for COmpositional INteraction Synthesis with Semantic Control. Code and data are available at https://github.com/zkf1997/COINS.
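To make the architecture described in the abstract concrete, below is a minimal PyTorch sketch of the core idea: body surface points and object points become tokens of a single transformer operating in a shared latent space, and the action category of each atomic action-object pair is injected as a learned embedding added to the object tokens, in the spirit of a positional encoding. All class names, dimensions, and design details here are illustrative assumptions, not the authors' released COINS implementation (see the linked repository for that).

# A hypothetical sketch of a joint human-object transformer encoder with
# action semantics injected as an additive (positional-encoding-style) code.
import torch
import torch.nn as nn

class CompositionalInteractionEncoder(nn.Module):
    def __init__(self, num_actions: int, d_model: int = 256):
        super().__init__()
        self.point_proj = nn.Linear(3, d_model)               # lift xyz points to tokens
        self.action_emb = nn.Embedding(num_actions, d_model)  # semantic "positional" code
        self.body_tag = nn.Parameter(torch.zeros(1, 1, d_model))    # marks body tokens
        self.object_tag = nn.Parameter(torch.zeros(1, 1, d_model))  # marks object tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_latent = nn.Linear(d_model, 2 * d_model)      # mean and log-variance of z

    def forward(self, body_pts, atomic_pairs):
        # body_pts: (B, Nb, 3) sampled body surface points.
        # atomic_pairs: list of (action_id, object_pts) tuples, one per atomic
        # action-object pair; action_id is (B,), object_pts is (B, No, 3).
        # The list length may vary, which is what makes composition natural.
        tokens = [self.point_proj(body_pts) + self.body_tag]
        for action_id, object_pts in atomic_pairs:
            sem = self.action_emb(action_id).unsqueeze(1)     # (B, 1, d_model)
            tokens.append(self.point_proj(object_pts) + self.object_tag + sem)
        h = self.encoder(torch.cat(tokens, dim=1))            # joint token sequence
        mu, logvar = self.to_latent(h.mean(dim=1)).chunk(2, dim=-1)
        return mu, logvar                                     # Gaussian posterior over z

Sampling z from this Gaussian and decoding it with a matching transformer decoder into body surface points would complete a conditional VAE along the lines sketched in the abstract; composing interactions then amounts to appending further (action, object) entries to atomic_pairs, so no composite interaction data would be needed at training time.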
Acknowledgements
We sincerely thank the anonymous reviewers for their insightful suggestions. We thank Francis Engelmann for help with scene segmentation and proofreading, and Siwei Zhang for providing body fitting results. This work was supported by the SNF grant 200021 204840.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Zhao, K., Wang, S., Zhang, Y., Beeler, T., Tang, S. (2022). Compositional Human-Scene Interaction Synthesis with Semantic Control. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13666. Springer, Cham. https://doi.org/10.1007/978-3-031-20068-7_18
DOI: https://doi.org/10.1007/978-3-031-20068-7_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20067-0
Online ISBN: 978-3-031-20068-7