Abstract
Text-driven human motion generation has been gaining momentum thanks to its great potential for shaping a new pathway of interactive computer graphics in the era of AI. Despite the substantial efforts made so far, existing methods still struggle to ensure fluidity and body coordination in the generated motions, which seriously hinders their application in areas such as gaming, animation, and the emerging metaverse. One cause is that learning directly from motion data is prone to interference from noise within the data, which degrades the quality of the generated motions. In this study, we propose, for the first time, to improve text-to-motion generation via out-of-distribution detection in the embedding space. Leveraging a Z-score-based outlier detection algorithm, we mask out-of-distribution motion data within the motion encoder and replace the flagged values with the corresponding means, ensuring the consistency of the data distribution. To verify the effectiveness of the proposed method, we conduct extensive experiments on the widely used KIT-ML dataset. The results indicate that, compared to previous frameworks, our solution significantly improves the quality of text-driven human motion generation.
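The rectification step described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: function and variable names (`zscore_rectify`, `threshold`) are hypothetical, and the exact placement of the operation inside the motion encoder is assumed. It computes per-dimension Z-scores over a batch of motion features, masks entries whose absolute Z-score exceeds a threshold, and replaces them with the per-dimension mean.

```python
import numpy as np

def zscore_rectify(features: np.ndarray, threshold: float = 3.0):
    """Mask out-of-distribution entries and replace them with dimension means.

    features: array of shape (num_samples, feature_dim).
    threshold: |Z| cutoff beyond which a value is treated as an outlier
               (3.0 is a common convention; the paper's choice may differ).
    Returns the rectified features and the boolean outlier mask.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0) + 1e-8   # epsilon avoids division by zero
    z = (features - mean) / std         # per-dimension Z-scores
    mask = np.abs(z) > threshold        # True where a value is an outlier
    rectified = np.where(mask, mean, features)  # substitute the mean
    return rectified, mask
```

Replacing outliers with the mean (rather than dropping samples) keeps the tensor shape intact, which matters when the features feed directly into subsequent encoder layers.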
Acknowledgments
This study was funded by the Natural Science Foundation of Guangdong Province (2023A1515011639), the National Key R&D Program of the Ministry of Science and Technology, China (2022YFF0903103), the Natural Science Foundation of China (62371305), and the Fundamental Research Funds for the Central Universities, Sun Yat-sen University (23xkjc019).
Ethics declarations
Disclosure of Interests
None.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Fu, Y., Zhao, B., Lv, C., Yue, G., Wang, R., Zhou, F. (2024). Improved Text-Driven Human Motion Generation via Out-of-Distribution Detection and Rectification. In: Zhang, FL., Sharf, A. (eds) Computational Visual Media. CVM 2024. Lecture Notes in Computer Science, vol 14592. Springer, Singapore. https://doi.org/10.1007/978-981-97-2095-8_12
Print ISBN: 978-981-97-2094-1
Online ISBN: 978-981-97-2095-8