
HUMOS: Human Motion Model Conditioned on Body Shape

  • Conference paper

Computer Vision – ECCV 2024 (ECCV 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15074)

Abstract

Generating realistic human motion is crucial for many computer vision and graphics applications. The rich diversity of human body shapes and sizes significantly influences how people move. However, existing motion models typically overlook these differences, using a normalized, average body instead. This homogenizes motion across human bodies, producing motions that do not align with a body's physical attributes and limiting diversity. To address this, we propose a novel approach to learn a generative motion model conditioned on body shape. We demonstrate that it is possible to learn such a model from unpaired training data using cycle consistency, intuitive physics, and stability constraints that model the correlation between identity and movement. The resulting model produces diverse, physically plausible, and dynamically stable human motions that are quantitatively and qualitatively more realistic than the existing state of the art. More details are available on our project page: https://github.com/CarstenEpic/humos.
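For readers who want a concrete picture of the training signal sketched in the abstract, the snippet below is a minimal, hypothetical PyTorch illustration of cycle-consistent, shape-conditioned motion transfer with a toy stability term. Every name and dimension here (`ShapeConditionedMotionModel`, the 135-dimensional motion representation, the MSE cycle loss) is an assumption for illustration only, not the paper's actual architecture, motion representation, or physics terms.

```python
# Minimal, hypothetical sketch of the training signals named in the abstract:
# cycle consistency for unpaired shape/motion data, plus a toy stability term.
# Names, dimensions, and losses are illustrative assumptions, not the
# authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShapeConditionedMotionModel(nn.Module):
    """Hypothetical generator: maps a motion clip and a target body shape
    (e.g., SMPL betas) to a motion adapted to that body."""

    def __init__(self, motion_dim=135, shape_dim=10, hidden_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + shape_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, motion_dim),
        )

    def forward(self, motion, shape):
        # motion: (batch, frames, motion_dim); shape: (batch, shape_dim)
        shape_per_frame = shape[:, None, :].expand(-1, motion.size(1), -1)
        return self.net(torch.cat([motion, shape_per_frame], dim=-1))


def cycle_consistency_loss(model, motion_a, shape_a, shape_b):
    """Retarget A -> B -> A and require recovery of the original motion.
    This is what allows training on unpaired data: the same motion never
    has to be observed on two different bodies."""
    motion_b = model(motion_a, shape_b)      # adapt motion to body B
    motion_a_rec = model(motion_b, shape_a)  # map it back to body A
    return F.mse_loss(motion_a_rec, motion_a)


def stability_loss(com_xy, support_center_xy):
    """Toy stand-in for intuitive-physics/stability constraints: keep the
    ground projection of the center of mass near the base of support."""
    return ((com_xy - support_center_xy) ** 2).sum(dim=-1).mean()


# Illustrative training step on unpaired samples.
model = ShapeConditionedMotionModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

motion_a = torch.randn(8, 60, 135)  # 8 clips, 60 frames each
shape_a = torch.randn(8, 10)        # shapes the clips were captured on
shape_b = torch.randn(8, 10)        # random target shapes

optimizer.zero_grad()
loss = cycle_consistency_loss(model, motion_a, shape_a, shape_b)
loss.backward()
optimizer.step()
```

In practice, a cycle term like this would be combined with the paper's intuitive-physics and dynamic-stability losses; `stability_loss` above only gestures at the idea of keeping the projected center of mass near the base of support.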

S. Tripathi—work done during an internship at Epic Games.

Notes

  1. All datasets were obtained and used only by the authors affiliated with academic institutions.

Acknowledgements

We sincerely thank Tsvetelina Alexiadis, Alpar Cseke, Tomasz Niewiadomski, and Taylor McConnell for facilitating the perceptual study, and Giorgio Becherini for his help with the Rokoko baseline. We are grateful to Iain Matthews, Brian Karis, Nikos Athanasiou, Markos Diomataris, and Mathis Petrovich for valuable discussions and advice. Their invaluable contributions enriched this research significantly.

Author information

Corresponding author

Correspondence to Shashank Tripathi.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 6897 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Tripathi, S., Taheri, O., Lassner, C., Black, M., Holden, D., Stoll, C. (2025). HUMOS: Human Motion Model Conditioned on Body Shape. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15074. Springer, Cham. https://doi.org/10.1007/978-3-031-72640-8_8

  • DOI: https://doi.org/10.1007/978-3-031-72640-8_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-72639-2

  • Online ISBN: 978-3-031-72640-8

  • eBook Packages: Computer Science, Computer Science (R0)
