DOI: 10.1145/3610543.3626164

Motion to Dance Music Generation using Latent Diffusion Model

Published: 28 November 2023

Editorial Notes

The authors have requested minor, non-substantive changes to the Version of Record and, in accordance with ACM policies, a Corrected Version of Record was published on December 29, 2023. For reference purposes, the VoR may still be accessed via the Supplemental Material section on this page.

Abstract

The role of music in games and animation, particularly in dance content, is essential for creating immersive and entertaining experiences. Although recent studies have made strides in generating dance music from videos, their practicality for integrating music into games and animation remains limited. In this context, we present a method capable of generating plausible dance music from 3D motion data and genre labels. Our approach combines a UNET-based latent diffusion model with a pre-trained VAE model. To evaluate the performance of the proposed model, we employ metrics that assess various audio properties, including beat alignment, audio quality, motion-music correlation, and genre score. The quantitative results show that our approach outperforms previous methods. Furthermore, we demonstrate that our model can generate audio that fits seamlessly to in-the-wild motion data. This capability enables us to create plausible dance music that complements the dynamic movements of characters and enhances the overall audiovisual experience in interactive media. Examples from our proposed model are available at this link: https://dmdproject.github.io/.
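
To make the abstract's pipeline concrete, the sketch below illustrates, under stated assumptions, how a UNET-based latent diffusion model might be conditioned on 3D motion features and a genre label and then sampled; a pre-trained VAE decoder would map the resulting latent to a mel spectrogram. This is a minimal illustration, not the authors' implementation: the module `TinyConditionedUNet`, the feature dimensions, the DDPM noise schedule, and the latent shape are all assumptions.

```python
# Minimal sketch (not the authors' code) of motion- and genre-conditioned
# latent diffusion sampling. All names, sizes, and hyperparameters are assumed.
import torch
import torch.nn as nn


class TinyConditionedUNet(nn.Module):
    """Stand-in denoiser: predicts the noise in a latent given the diffusion
    timestep, temporally pooled motion features, and a genre embedding."""

    def __init__(self, latent_ch=4, motion_dim=147, n_genres=10, hidden=64,
                 max_steps=1000):
        super().__init__()
        self.time_emb = nn.Embedding(max_steps, hidden)
        self.genre_emb = nn.Embedding(n_genres, hidden)
        self.motion_proj = nn.Linear(motion_dim, hidden)
        self.conv_in = nn.Conv2d(latent_ch, hidden, 3, padding=1)
        self.conv_mid = nn.Conv2d(hidden, hidden, 3, padding=1)
        self.conv_out = nn.Conv2d(hidden, latent_ch, 3, padding=1)

    def forward(self, z, t, motion, genre):
        # Fuse timestep, genre, and motion conditioning into one vector.
        cond = (self.time_emb(t) + self.genre_emb(genre)
                + self.motion_proj(motion).mean(dim=1))    # (B, hidden)
        h = self.conv_in(z) + cond[:, :, None, None]       # broadcast over H, W
        h = torch.relu(self.conv_mid(torch.relu(h)))
        return self.conv_out(h)                            # predicted noise


@torch.no_grad()
def sample_latent(unet, motion, genre, steps=50, shape=(1, 4, 16, 128)):
    """Plain DDPM ancestral sampling over the VAE latent space."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    z = torch.randn(shape)
    for t in reversed(range(steps)):
        tt = torch.full((shape[0],), t, dtype=torch.long)
        eps = unet(z, tt, motion, genre)
        z = (z - (1 - alphas[t]) / torch.sqrt(1 - alpha_bar[t]) * eps) \
            / torch.sqrt(alphas[t])
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
    return z  # a pre-trained VAE decoder would map this to a mel spectrogram


motion = torch.randn(1, 240, 147)  # e.g. 240 frames of SMPL pose features
genre = torch.tensor([3])          # index into an assumed genre-label table
latent = sample_latent(TinyConditionedUNet(), motion, genre)
print(latent.shape)                # torch.Size([1, 4, 16, 128])
```

The abstract also lists beat alignment among the evaluation metrics. The sketch below follows the beat-alignment score commonly used in the dance-generation literature (e.g. the AIST++ formulation): each kinematic beat is scored by its distance to the nearest musical beat. The motion-beat heuristic (local minima of mean joint speed) and the sigma value are assumptions, not necessarily the paper's exact recipe.

```python
# Hedged sketch of a beat-alignment score; heuristics are assumptions.
import librosa
import numpy as np


def beat_alignment(audio_path, joints, fps=60.0, sigma=0.1):
    """joints: (T, J, 3) array of 3D joint positions sampled at `fps`."""
    y, sr = librosa.load(audio_path)
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    music_beats = librosa.frames_to_time(beat_frames, sr=sr)

    # Kinematic beats: local minima of the mean joint-speed curve.
    speed = np.linalg.norm(np.diff(joints, axis=0), axis=2).mean(axis=1)
    is_min = (speed[1:-1] < speed[:-2]) & (speed[1:-1] < speed[2:])
    motion_beats = (np.where(is_min)[0] + 1) / fps

    if len(motion_beats) == 0 or len(music_beats) == 0:
        return 0.0
    # For each motion beat, distance to the nearest music beat.
    dists = np.abs(motion_beats[:, None] - music_beats[None, :]).min(axis=1)
    return float(np.exp(-(dists ** 2) / (2 * sigma ** 2)).mean())
    # usage: beat_alignment("generated.wav", joints)  # hypothetical file
```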

Supplementary Material

Version of Record for "Motion to Dance Music Generation using Latent Diffusion Model" by Tan et al., SIGGRAPH Asia 2023 Technical Communications (3626164-vor.pdf)
Appendix (Supplementary.pdf)
Demo video (Video.mp4)




Published In

SA '23: SIGGRAPH Asia 2023 Technical Communications
November 2023
127 pages
ISBN: 9798400703140
DOI: 10.1145/3610543
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. 3D motion to music
  2. latent diffusion model
  3. music generation

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

  • National Research Foundation of Korea (NRF)

Conference

SA '23: SIGGRAPH Asia 2023
December 12-15, 2023
Sydney, NSW, Australia

Acceptance Rates

Overall acceptance rate: 178 of 869 submissions (20%)

Article Metrics

  • Downloads (last 12 months): 186
  • Downloads (last 6 weeks): 11
Reflects downloads up to 01 Mar 2025

Cited By
  • Dance2Music-Diffusion: Leveraging latent diffusion models for music generation from dance videos. EURASIP Journal on Audio, Speech, and Music Processing 2024:1 (2024). DOI: 10.1186/s13636-024-00370-6. Online publication date: 30-Sep-2024.
  • Spectrogrand: Computational Creativity Driven Audiovisuals' Generation From Text Prompts. Proceedings of the Fifteenth Indian Conference on Computer Vision Graphics and Image Processing (2024), 1-10. DOI: 10.1145/3702250.3702280. Online publication date: 13-Dec-2024.
  • Dance-to-Music Generation with Encoder-based Textual Inversion. SIGGRAPH Asia 2024 Conference Papers (2024), 1-11. DOI: 10.1145/3680528.3687562. Online publication date: 3-Dec-2024.
  • DanceComposer: Dance-to-Music Generation Using a Progressive Conditional Music Generator. IEEE Transactions on Multimedia 26 (2024), 10237-10250. DOI: 10.1109/TMM.2024.3405734.
