Abstract
In this work, we explore the potential of discrete diffusion models for text-driven motion synthesis. Previous methods that aimed to improve the quality of generated motions often increased the number of model parameters while neglecting the diversity of the generated results. We introduce the Motion Absorbing Discrete Diffusion Model (MADDM), which combines the high diversity of continuous diffusion models with the high-quality outputs of discrete autoregressive models. Our results show that an absorbing discrete diffusion model can yield more precise discrete motion latent codes than previous autoregressive generation models. In MADDM, a lightweight discrete denoising model achieves more accurate generation while using cross-layer parameter sharing to reduce the parameter count. A reweighted distribution loss is used during distillation so that the process adapts more effectively to the discrete diffusion model. Our approach achieves a state-of-the-art result on the HumanML3D dataset with an FID of 0.073, using only one-third of the parameters of the previous discrete autoregressive model.
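To make the two ingredients named above concrete, the sketch below illustrates (i) an absorbing forward process that progressively replaces discrete motion tokens with a special mask token, and (ii) ALBERT-style cross-layer parameter sharing in a transformer denoiser. This is a minimal PyTorch sketch, not the authors' released code: the codebook size, step count, linear masking schedule, and all names (MASK_ID, NUM_STEPS, SharedLayerDenoiser, q_sample) are illustrative assumptions, and text conditioning is omitted for brevity.

```python
# Minimal sketch (illustrative assumptions, not the authors' code) of
# absorbing discrete diffusion over motion tokens plus a weight-shared denoiser.
import torch
import torch.nn as nn

MASK_ID = 512    # assumed codebook of 512 motion tokens; id 512 is the absorbing [MASK] state
NUM_STEPS = 100  # assumed number of diffusion steps T

def q_sample(x0: torch.LongTensor, t: torch.LongTensor) -> torch.LongTensor:
    """Corrupt clean token sequences x0 of shape (B, L) into x_t.

    Under an assumed linear schedule, each token is independently replaced
    by MASK_ID with probability t/T; once masked, a token stays masked,
    which is what makes the chain "absorbing".
    """
    keep = torch.rand(x0.shape) < (1.0 - t.float() / NUM_STEPS)[:, None]
    return torch.where(keep, x0, torch.full_like(x0, MASK_ID))

class SharedLayerDenoiser(nn.Module):
    """Denoiser with ALBERT-style cross-layer parameter sharing: a single
    transformer layer is reused at every depth step, so the parameter count
    grows with width but not with depth."""
    def __init__(self, d_model: int = 256, nhead: int = 4, depth: int = 6):
        super().__init__()
        self.embed = nn.Embedding(MASK_ID + 1, d_model)  # +1 for the mask token
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.head = nn.Linear(d_model, MASK_ID)          # logits over the clean codebook
        self.depth = depth

    def forward(self, x_t: torch.LongTensor) -> torch.Tensor:
        h = self.embed(x_t)
        for _ in range(self.depth):   # same weights applied `depth` times
            h = self.layer(h)
        return self.head(h)           # (B, L, codebook) predictions of the clean tokens

# Usage: corrupt a toy batch and run one denoising pass.
x0 = torch.randint(0, MASK_ID, (2, 8))
t = torch.randint(1, NUM_STEPS + 1, (2,))
logits = SharedLayerDenoiser()(q_sample(x0, t))
print(logits.shape)  # torch.Size([2, 8, 512])
```

At sampling time, such models typically start from an all-mask sequence and iteratively unmask tokens with the denoiser's predictions, which is what allows parallel (non-autoregressive) generation.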




Data availability
Data will be made available on request.
Acknowledgements
This work was supported by the Chunhui Plan Cooperative Project of the Ministry of Education under Grant HZKY20220424, the Guangdong Basic and Applied Basic Research Foundation under Grants 2022A1515140126 and 2023A1515011172, the Young and Middle-aged Science and Technology Innovation Talent of Shenyang under Grant RC220485, and the Fundamental Research Funds for the Central Universities under Grant N2426002.
Additional information
Communicated by Bing-kun Bao.
About this article
Cite this article
Wang, J., Zheng, C., Liu, B. et al. Motion synthesis via distilled absorbing discrete diffusion model. Multimedia Systems 30, 320 (2024). https://doi.org/10.1007/s00530-024-01492-9