
Improved Text-Driven Human Motion Generation via Out-of-Distribution Detection and Rectification

  • Conference paper
Computational Visual Media (CVM 2024)

Abstract

Text-driven human motion generation is gaining momentum thanks to its great potential in shaping a new pathway for interactive computer graphics in the era of AI. Despite the enormous efforts made so far, existing methods still struggle to ensure fluidity and body coordination in the generated motions, which seriously hinders their application in a wide spectrum of areas such as gaming, animation, and the emerging metaverse. One of the many causes is that learning directly from motion data is prone to interference from noise within the data, resulting in reduced quality of the generated motions. In this study, we propose, for the first time, to improve text-to-motion generation via out-of-distribution detection in the embedding space. Leveraging a Z-score-based outlier detection algorithm, we mask out-of-distribution motion data within the motion encoder and replace the flagged entries with their means, ensuring the consistency of the data distribution. To verify the effectiveness of the proposed method, we conducted extensive experiments on the widely used KIT-ML dataset. Experimental results indicate that, compared to previous frameworks, our solution significantly improves the quality of text-driven human motion generation.
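The following is a minimal, illustrative sketch of the Z-score-based detection-and-rectification step described above: entries whose Z-score exceeds a threshold are masked and replaced with the per-feature mean. The function name, threshold value, and data shapes are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def rectify_outliers(motion, z_thresh=3.0):
    """Z-score-based outlier detection and rectification on motion features.

    motion:   array of shape (num_frames, num_features)
    z_thresh: |z| beyond which an entry is treated as out of distribution
              (illustrative threshold, not taken from the paper)
    Returns the rectified array and a boolean outlier mask.
    """
    mean = motion.mean(axis=0, keepdims=True)
    std = motion.std(axis=0, keepdims=True) + 1e-8   # avoid division by zero
    z_scores = (motion - mean) / std
    mask = np.abs(z_scores) > z_thresh               # mark out-of-distribution entries
    rectified = np.where(mask, mean, motion)         # replace masked entries with the mean
    return rectified, mask

# Example: inject one corrupted frame into a synthetic clip and rectify it
clip = np.random.randn(60, 63).astype(np.float32)    # 60 frames, 21 joints x 3 dims (hypothetical)
clip[30] += 50.0                                      # simulated out-of-distribution frame
clean_clip, outlier_mask = rectify_outliers(clip)
```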



Acknowledgments

This study was funded by the Natural Science Foundation of Guangdong Province (2023A1515011639), the National Key R&D Program of the Ministry of Science and Technology, China (2022YFF0903103), the Natural Science Foundation of China (62371305), and the Fundamental Research Funds for the Central Universities, Sun Yat-sen University (23xkjc019).

Author information


Corresponding author

Correspondence to Baoquan Zhao.


Ethics declarations

Disclosure of Interests

None.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Fu, Y., Zhao, B., Lv, C., Yue, G., Wang, R., Zhou, F. (2024). Improved Text-Driven Human Motion Generation via Out-of-Distribution Detection and Rectification. In: Zhang, FL., Sharf, A. (eds) Computational Visual Media. CVM 2024. Lecture Notes in Computer Science, vol 14592. Springer, Singapore. https://doi.org/10.1007/978-981-97-2095-8_12


  • DOI: https://doi.org/10.1007/978-981-97-2095-8_12

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-2094-1

  • Online ISBN: 978-981-97-2095-8

  • eBook Packages: Computer Science, Computer Science (R0)
