Learning Uncoupled-Modulation CVAE for 3D Action-Conditioned Human Motion Synthesis

  • Conference paper
  • Published in: Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Motion capture data has been in high demand in the movie and game industries in recent years. Since motion capture systems are expensive and require manual post-processing, motion synthesis is a plausible way to acquire more motion data. However, generating action-conditioned, realistic, and diverse 3D human motions from semantic action labels remains challenging, because the mapping from semantic labels to real motion sequences is hard to model. Previous work has made promising attempts, such as appending label tokens to the pose encoding and applying an action bias in the latent space. However, how to synthesize diverse motions that accurately match a given label is still not fully explored. In this paper, we propose the Uncoupled-Modulation Conditional Variational AutoEncoder (UM-CVAE), which generates action-conditioned motions from scratch in an uncoupled manner. The main idea is twofold: (i) training an action-agnostic encoder that weakens action-related information so as to learn an easily modulated latent representation; (ii) strengthening the action-conditioning process with FiLM-based action-aware modulation. We conduct extensive experiments on the HumanAct12, UESTC, and BABEL datasets, demonstrating that our method achieves state-of-the-art performance both qualitatively and quantitatively, with potential practical applications.
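
For readers who want a concrete picture of the two ideas the abstract names, below is a minimal PyTorch sketch of (i) an action-agnostic encoder, which never sees the label, and (ii) FiLM-based action-aware modulation of the decoder. It is an illustration under assumed design choices (GRU backbones, the `ActionConditionedCVAE` wrapper, and arbitrary dimensions are all ours), not the authors' UM-CVAE implementation.

```python
# Sketch of the abstract's two components (assumptions: PyTorch, GRU backbones,
# illustrative layer sizes and pose representation -- not the paper's code).
import torch
import torch.nn as nn


class FiLM(nn.Module):
    """Feature-wise linear modulation: y = gamma(a) * x + beta(a)."""

    def __init__(self, action_dim: int, feature_dim: int):
        super().__init__()
        self.to_gamma_beta = nn.Linear(action_dim, 2 * feature_dim)

    def forward(self, x, action_emb):
        # gamma and beta are computed from the action embedding only.
        gamma, beta = self.to_gamma_beta(action_emb).chunk(2, dim=-1)
        # x: (batch, time, feature_dim); broadcast the modulation over time.
        return gamma.unsqueeze(1) * x + beta.unsqueeze(1)


class ActionConditionedCVAE(nn.Module):
    def __init__(self, pose_dim=63, latent_dim=256, action_dim=32, num_actions=12):
        super().__init__()
        # (i) Action-agnostic encoder: the label is NOT an input here, so the
        # latent code is pushed toward label-free, easily modulated content.
        self.encoder = nn.GRU(pose_dim, latent_dim, batch_first=True)
        self.to_mu = nn.Linear(latent_dim, latent_dim)
        self.to_logvar = nn.Linear(latent_dim, latent_dim)
        # (ii) Action-aware decoding: the label enters only through FiLM.
        self.action_emb = nn.Embedding(num_actions, action_dim)
        self.decoder = nn.GRU(latent_dim, latent_dim, batch_first=True)
        self.film = FiLM(action_dim, latent_dim)
        self.to_pose = nn.Linear(latent_dim, pose_dim)

    def forward(self, poses, action):
        # poses: (batch, time, pose_dim); action: (batch,) integer labels.
        _, h = self.encoder(poses)                            # (1, batch, latent)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        dec_in = z.unsqueeze(1).expand(-1, poses.shape[1], -1)
        feats, _ = self.decoder(dec_in)
        feats = self.film(feats, self.action_emb(action))     # modulate by label
        return self.to_pose(feats), mu, logvar


# Example: a batch of 4 sequences, 60 frames, labels from 12 classes
# (matching HumanAct12's class count; the 63-D pose vector is arbitrary).
model = ActionConditionedCVAE()
recon, mu, logvar = model(torch.randn(4, 60, 63), torch.randint(0, 12, (4,)))
print(recon.shape)  # torch.Size([4, 60, 63])
```

At generation time one would sample z from a standard normal and run only the decoder branch with the desired label: diversity comes from the sampled z, while the action enters exclusively through the FiLM parameters, which is the uncoupling the paper's title refers to.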



Acknowledgement

This work was supported by the National Key R&D Program of Science and Technology for Winter Olympics (No. 2020YFF0304701) and the National Natural Science Foundation of China (No. 61772499).

Author information

Corresponding author

Correspondence to Shihong Xia.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 2 (PDF 2091 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Zhong, C., Hu, L., Zhang, Z., Xia, S. (2022). Learning Uncoupled-Modulation CVAE for 3D Action-Conditioned Human Motion Synthesis. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13681. Springer, Cham. https://doi.org/10.1007/978-3-031-19803-8_42

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-19803-8_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19802-1

  • Online ISBN: 978-3-031-19803-8

  • eBook Packages: Computer Science, Computer Science (R0)
