Abstract
In this work, we explore the potential of discrete diffusion models for text-driven motion synthesis. Previous methods that aimed to improve the quality of generated motions often increased the number of model parameters while neglecting the diversity of the generated results. We introduce the Motion Absorbing Discrete Diffusion Model (MADDM), which combines the high diversity of continuous diffusion models with the high-quality outputs of discrete autoregressive models. Our results show that an absorbing discrete diffusion model can yield more precise discrete motion latent codes than previous autoregressive generation models. In MADDM, a lightweight discrete denoising model achieves more accurate generation while using cross-layer parameter sharing to reduce the parameter count. A reweighted distribution loss is used during distillation so that the process adapts more effectively to the discrete diffusion model. Our approach achieves a state-of-the-art result on the HumanML3D dataset with an FID of 0.073, using only one-third of the parameters of the previous discrete autoregressive model.
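To make the two ingredients named above concrete, the sketch below illustrates (i) an absorbing forward process that progressively replaces discrete motion tokens with a special mask token, and (ii) ALBERT-style cross-layer parameter sharing in a transformer denoiser. This is a minimal PyTorch sketch, not the authors' released code: the codebook size, step count, linear masking schedule, and all names (MASK_ID, NUM_STEPS, SharedLayerDenoiser, q_sample) are illustrative assumptions, and text conditioning is omitted for brevity.

```python
# Minimal sketch (illustrative assumptions, not the authors' code) of
# absorbing discrete diffusion over motion tokens plus a weight-shared denoiser.
import torch
import torch.nn as nn

MASK_ID = 512    # assumed codebook of 512 motion tokens; id 512 is the absorbing [MASK] state
NUM_STEPS = 100  # assumed number of diffusion steps T

def q_sample(x0: torch.LongTensor, t: torch.LongTensor) -> torch.LongTensor:
    """Corrupt clean token sequences x0 of shape (B, L) into x_t.

    Under an assumed linear schedule, each token is independently replaced
    by MASK_ID with probability t/T; once masked, a token stays masked,
    which is what makes the chain "absorbing".
    """
    keep = torch.rand(x0.shape) < (1.0 - t.float() / NUM_STEPS)[:, None]
    return torch.where(keep, x0, torch.full_like(x0, MASK_ID))

class SharedLayerDenoiser(nn.Module):
    """Denoiser with ALBERT-style cross-layer parameter sharing: a single
    transformer layer is reused at every depth step, so the parameter count
    grows with width but not with depth."""
    def __init__(self, d_model: int = 256, nhead: int = 4, depth: int = 6):
        super().__init__()
        self.embed = nn.Embedding(MASK_ID + 1, d_model)  # +1 for the mask token
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.head = nn.Linear(d_model, MASK_ID)          # logits over the clean codebook
        self.depth = depth

    def forward(self, x_t: torch.LongTensor) -> torch.Tensor:
        h = self.embed(x_t)
        for _ in range(self.depth):   # same weights applied `depth` times
            h = self.layer(h)
        return self.head(h)           # (B, L, codebook) predictions of the clean tokens

# Usage: corrupt a toy batch and run one denoising pass.
x0 = torch.randint(0, MASK_ID, (2, 8))
t = torch.randint(1, NUM_STEPS + 1, (2,))
logits = SharedLayerDenoiser()(q_sample(x0, t))
print(logits.shape)  # torch.Size([2, 8, 512])
```

At sampling time, such models typically start from an all-mask sequence and iteratively unmask tokens with the denoiser's predictions, which is what allows parallel (non-autoregressive) generation.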




Data availability
Data will be made available on request.
Acknowledgements
This work was supported by the Chunhui Plan Cooperative Project of the Ministry of Education under Grant HZKY20220424, the Guangdong Basic and Applied Basic Research Foundation under Grants 2022A1515140126 and 2023A1515011172, the Young and Middle-aged Science and Technology Innovation Talent of Shenyang under Grant RC220485, and the Fundamental Research Funds for the Central Universities under Grant N2426002.
Additional information
Communicated by Bing-kun Bao.
About this article
Cite this article
Wang, J., Zheng, C., Liu, B. et al. Motion synthesis via distilled absorbing discrete diffusion model. Multimedia Systems 30, 320 (2024). https://doi.org/10.1007/s00530-024-01492-9