DOI: 10.1145/3652583.3658045

BeatDance: A Beat-Based Model-Agnostic Contrastive Learning Framework for Music-Dance Retrieval

Published: 07 June 2024

Abstract

Dance and music are closely related forms of expression, and mutual retrieval between dance videos and music is a fundamental task in fields such as education, art, and sports. However, existing methods often suffer from unnatural generation effects or fail to fully explore the correlation between music and dance. To overcome these challenges, we propose BeatDance, a novel beat-based, model-agnostic contrastive learning framework. BeatDance incorporates a Beat-Aware Music-Dance InfoExtractor, a Trans-Temporal Beat Blender, and a Beat-Enhanced Hubness Reducer, improving music-dance retrieval performance by exploiting the alignment between music beats and dance movements. We also introduce the Music-Dance (M-D) dataset, a large-scale collection of over 10,000 music-dance video pairs for training and testing. Experimental results on the M-D dataset demonstrate the superiority of our method over existing baselines, achieving state-of-the-art performance. The code and dataset are available at https://github.com/XulongT/BeatDance.
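At its core, the framework the abstract describes is a cross-modal contrastive retrieval objective: music and dance clips are embedded into a shared space where matched pairs score higher than mismatched ones. The sketch below shows a symmetric InfoNCE loss of the kind such a framework typically optimizes; it is a minimal illustration of the contrastive backbone only, not the authors' implementation (the beat-aware modules are omitted, and names such as music_dance_contrastive_loss and the temperature value are illustrative assumptions).

```python
# Minimal sketch of a symmetric cross-modal contrastive (InfoNCE) loss
# over paired music/dance embeddings. Illustrative only; see the linked
# repository for the actual BeatDance implementation.
import torch
import torch.nn.functional as F

def music_dance_contrastive_loss(music_emb: torch.Tensor,
                                 dance_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """music_emb, dance_emb: (batch, dim) tensors where row i of each
    tensor comes from the same music-dance video pair."""
    # Cosine similarity via L2-normalized dot products.
    music = F.normalize(music_emb, dim=-1)
    dance = F.normalize(dance_emb, dim=-1)
    logits = music @ dance.t() / temperature  # (batch, batch)

    # The diagonal of the similarity matrix holds the ground-truth pairs.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the music-to-dance and dance-to-music retrieval losses.
    loss_m2d = F.cross_entropy(logits, targets)
    loss_d2m = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_m2d + loss_d2m)
```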

Published In

ICMR '24: Proceedings of the 2024 International Conference on Multimedia Retrieval
May 2024, 1379 pages
ISBN: 9798400706196
DOI: 10.1145/3652583

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. multimodal learning
2. music retrieval
3. video retrieval

Qualifiers

• Research-article

Acceptance Rates

Overall Acceptance Rate: 254 of 830 submissions, 31%
