
Motion synthesis via distilled absorbing discrete diffusion model

  • Regular Paper
  • Published:
Multimedia Systems

Abstract

In this work, we explore the potential of discrete diffusion models for text-driven motion synthesis. Previous methods that aimed to improve the quality of generated motions often increased the number of model parameters while neglecting the diversity of the generated results. Here we introduce our Motion Absorbing Discrete Diffusion Model (MADDM), which combines the high diversity of continuous diffusion models with the high-quality generation of discrete autoregressive models. Our results show that an absorbing discrete diffusion model yields more precise discrete motion latent codes than previous autoregressive generation models. In MADDM, a lightweight discrete denoising model is designed to achieve more accurate generation results; it uses cross-layer parameter sharing to reduce the number of model parameters. A reweighted distribution loss is used during distillation so that the distillation process adapts more effectively to the discrete diffusion model. Our approach achieves a state-of-the-art result on the HumanML3D dataset with an FID of 0.073, while using only one-third of the parameters of the previous discrete autoregressive model.
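To make the two ingredients named above concrete, the sketch below illustrates, under stated assumptions, (i) an absorbing forward process that progressively replaces discrete motion tokens with a [MASK] state and (ii) a denoiser whose single transformer layer is reused across depth (cross-layer parameter sharing). All names and sizes here (absorb, SharedLayerDenoiser, MASK_ID, the codebook size, the text-embedding interface) are illustrative assumptions, not the paper's actual implementation.

    # Hypothetical sketch, not the authors' code: an absorbing discrete diffusion
    # forward step over vector-quantized motion tokens, plus a denoiser whose one
    # transformer layer is shared across depth (cross-layer parameter sharing).
    import torch
    import torch.nn as nn

    MASK_ID = 512          # assumed: a 512-entry motion codebook plus one absorbing [MASK] state


    def absorb(tokens: torch.Tensor, t: int, T: int) -> torch.Tensor:
        """Replace each token with the absorbing [MASK] state with probability t/T."""
        mask = torch.rand_like(tokens, dtype=torch.float) < (t / T)
        return torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)


    class SharedLayerDenoiser(nn.Module):
        """Predicts the original tokens from masked ones; one layer reused n_reuse times."""

        def __init__(self, vocab=MASK_ID + 1, dim=256, n_reuse=6):
            super().__init__()
            self.embed = nn.Embedding(vocab, dim)
            self.layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            self.n_reuse = n_reuse
            self.head = nn.Linear(dim, vocab - 1)    # logits over the real motion tokens only

        def forward(self, tokens, text_emb):
            # tokens: (batch, length) token ids; text_emb: e.g. a pooled sentence
            # embedding of shape (batch, 1, dim), broadcast over the sequence.
            x = self.embed(tokens) + text_emb
            for _ in range(self.n_reuse):            # same weights at every "depth"
                x = self.layer(x)
            return self.head(x)

    # Usage (shapes only): tokens = torch.randint(0, 512, (2, 49)); noisy = absorb(tokens, t=5, T=10)

At sampling time, such a model would start from an all-[MASK] sequence and iteratively replace masked positions with predicted tokens; the distillation with a reweighted distribution loss described in the abstract would then train a few-step student against this denoiser, but those details are specific to the paper.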


Data availability

Data will be made available on request.


Acknowledgements

This work was supported by the Chunhui Plan Cooperative Project of the Ministry of Education under Grant HZKY20220424, the Guangdong Basic and Applied Basic Research Foundation (2022A1515140126, 2023A1515011172), the Young and Middle-aged Science and Technology Innovation Talent of Shenyang (RC220485), and the Fundamental Research Funds for the Central Universities under Grant N2426002.

Author information


Corresponding author

Correspondence to Bangli Liu.

Additional information

Communicated by Bing-kun Bao.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, J., Zheng, C., Liu, B. et al. Motion synthesis via distilled absorbing discrete diffusion model. Multimedia Systems 30, 320 (2024). https://doi.org/10.1007/s00530-024-01492-9


  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s00530-024-01492-9

Keywords