MCLEMCD: multimodal collaborative learning encoder for enhanced music classification from dances

  • Regular Paper
  • Published:
Multimedia Systems

Abstract

Music classification is widely applied in the automatic organization of music archives and in intelligent music interfaces. Music is frequently accompanied by other media, such as image sequences. Combining different types of media for a given task is natural for humans but remains difficult for machines. In this work, we propose a collaborative learning method that combines dancing motions and music cues for music classification, and we apply it to music recommendation from dancing motions. Dancing motions, represented as 3D joint positions, contain cyclic movements synchronized with music beats, and a collaborative autoencoder is designed to fuse music cues into the dancing-motion feature extraction module. The proposed method achieves a classification accuracy of 98.07% on the MusicToDance dataset and 65.29% on the AIST++ dataset. The code to run all experiments is available at https://github.com/wenjgong/musicmotion.
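The full architecture is described in the article itself; the listing below is only a minimal, hypothetical PyTorch sketch of how a collaborative autoencoder of this kind could fuse music cues into a skeleton-motion feature extractor for classification. All module names, layer sizes, and the loss formulation are illustrative assumptions, not the authors' implementation (the actual code is at the GitHub link above).

# Minimal sketch (not the authors' code): a collaborative autoencoder that
# aligns a skeleton-motion embedding with a music embedding during training,
# so that music can be classified (or recommended) from motion alone at test time.
# All class names and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Encodes a sequence of 3D joint positions (B, T, J, 3) into a feature vector."""
    def __init__(self, num_joints=21, hidden=128, feat_dim=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=num_joints * 3, hidden_size=hidden, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, motion):                    # motion: (B, T, J, 3)
        b, t, j, c = motion.shape
        _, h = self.rnn(motion.reshape(b, t, j * c))
        return self.proj(h[-1])                   # (B, feat_dim)

class MusicEncoder(nn.Module):
    """Encodes a music feature sequence (e.g., MFCC frames) into the same space."""
    def __init__(self, n_mfcc=20, hidden=128, feat_dim=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=n_mfcc, hidden_size=hidden, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, music):                     # music: (B, T, n_mfcc)
        _, h = self.rnn(music)
        return self.proj(h[-1])

class CollaborativeAutoencoder(nn.Module):
    """Fuses music cues into the motion branch via an embedding-alignment loss."""
    def __init__(self, feat_dim=64, num_classes=10):
        super().__init__()
        self.motion_enc = MotionEncoder(feat_dim=feat_dim)
        self.music_enc = MusicEncoder(feat_dim=feat_dim)
        self.decoder = nn.Linear(feat_dim, feat_dim)   # reconstructs the music embedding from motion
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, motion, music=None):
        z_motion = self.motion_enc(motion)
        logits = self.classifier(z_motion)
        if music is None:                              # inference: motion only
            return logits
        z_music = self.music_enc(music)
        recon = self.decoder(z_motion)
        align_loss = nn.functional.mse_loss(recon, z_music.detach())
        return logits, align_loss

Under this sketch, training would jointly minimize the classification loss on the logits and the alignment loss, while inference needs only the motion branch, so a music class can be predicted (and a track recommended) from dance motion alone.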


Data availability

The MusicToDance dataset that supports the findings of this study is openly available at https://github.com/Musicto-dance-motion-synthesis/dataset, with reference number [56]. The AIST++ dataset that supports the findings of this study is openly available at https://google.github.io/aistplusplus_dataset/download.html, with reference number [43].

Notes

  1. Please refer to Table 1 ("Quantitative Evaluations on the Test Data") in [32].

References

  1. Goto, M., Dannenberg, R.B.: Music interfaces based on automatic music signal analysis: new ways to create and listen to music. IEEE Signal Process. Mag. 36(1), 74–81 (2018)

  2. Schedl, M.: Intelligent user interfaces for social music discovery and exploration of large-scale music repositories. In: Proceedings of the 2017 ACM Workshop on Theory-Informed User Modeling for Tailoring and Personalizing Interfaces. HUMANIZE ’17, pp. 7–11 (2017)

  3. Oramas, S., Nieto, O., Barbieri, F., Serra, X.: Multi-label music genre classification from audio, text, and images using deep features. CoRR abs/1707.04916 (2017)

  4. Mayer, R., Rauber, A.: Musical genre classification by ensembles of audio and lyrics features. In: Proceedings of International Conference on Music Information Retrieval, pp. 675–680 (2011)

  5. Cai, X., Zhang, H.: Music genre classification based on auditory image, spectral and acoustic features. Multimedia Systems 28(3), 779–791 (2022)

  6. Chaturvedi, V., Kaur, A.B., Varshney, V., Garg, A., Chhabra, G.S., Kumar, M.: Music mood and human emotion recognition based on physiological signals: a systematic review. Multimedia Systems, 1–24 (2021)

  7. Yang, Y.-H., Chen, H.H.: Machine recognition of music emotion: A review. ACM Transactions on Intelligent Systems and Technology (TIST) 3(3), 1–30 (2012)

  8. Huq, A., Bello, J.P., Rowe, R.: Automated music emotion recognition: A systematic evaluation. Journal of New Music Research 39(3), 227–244 (2010)

  9. Knees, P., Schedl, M.: Music Similarity and Retrieval: an Introduction to Audio-and Web-based Strategies vol. 9, (2016)

  10. Karydis, I., Kermanidis, K.L., Sioutas, S., Iliadis, L.: Comparing content and context based similarity for musical data. Neurocomputing 107, 69–76 (2013)

  11. Krumhansl, C.L., Schenck, D.L.: Can dance reflect the structural and expressive qualities of music? A perceptual experiment on Balanchine's choreography of Mozart's Divertimento No. 15. Musicae Scientiae 1(1), 63–85 (1997)

  12. Su, Y.-H.: Rhythm of music seen through dance: Probing music-dance coupling by audiovisual meter perception (2017)

  13. Alemi, O., Françoise, J., Pasquier, P.: GrooveNet: Real-time music-driven dance movement generation using artificial neural networks. networks 8(17), 26 (2017)

  14. Lee, J., Kim, S., Lee, K.: Listen to dance: Music-driven choreography generation using autoregressive encoder-decoder network. CoRR abs/1811.00818 (2018)

  15. Manfrè, A., Infantino, I., Vella, F., Gaglio, S.: An automatic system for humanoid dance creation. Biologically Inspired Cognitive Architectures 15, 1–9 (2016)

  16. Fan, R., Xu, S., Geng, W.: Example-based automatic music-driven conventional dance motion synthesis. IEEE Transactions on Visualization and Computer Graphics 18(3), 501–515 (2011)

  17. Lee, M., Lee, K., Park, J.: Music similarity-based approach to generating dance motion sequence. Multimedia Tools and Applications 62(3), 895–912 (2013)

  18. Ofli, F., Erzin, E., Yemez, Y., Tekalp, A.M.: Learn2dance: Learning statistical music-to-dance mappings for choreography synthesis. IEEE Transactions on Multimedia 14(3), 747–759 (2011)

  19. Lee, H.-Y., Yang, X., Liu, M.-Y., Wang, T.-C., Lu, Y.-D., Yang, M.-H., Kautz, J.: Dancing to music. arXiv e-prints, 1911 (2019)

  20. Tsuchida, S., Fukayama, S., Goto, M.: Query-by-dancing: a dance music retrieval system based on body-motion similarity. In: International Conference on Multimedia Modeling, pp. 251–263 (2019)

  21. Ohkushi, H., Ogawa, T., Haseyama, M.: Music recommendation according to human motion based on kernel CCA-based relationship. EURASIP Journal on Advances in Signal Processing 2011(1), 1–14 (2011)

  22. Davis, S., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 28(4), 357–366 (1980)

  23. Schörkhuber, C., Klapuri, A.: Constant-Q transform toolbox for music processing. In: 7th Sound and Music Computing Conference, Barcelona, Spain, pp. 3–64 (2010)

  24. Böck, S., Widmer, G.: Maximum filter vibrato suppression for onset detection. In: Proc. of the 16th Int. Conf. on Digital Audio Effects (DAFx). Maynooth, Ireland (Sept 2013), vol. 7, p. 4 (2013)

  25. Grosche, P., Müller, M., Kurth, F.: Cyclic tempogram: a mid-level tempo representation for music signals. In: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5522–5525 (2010)

  26. Bae, H.-S., Lee, H.-J., Lee, S.-G.: Voice recognition based on adaptive MFCC and deep learning. In: 2016 IEEE 11th Conference on Industrial Electronics and Applications (ICIEA), pp. 1542–1546 (2016)

  27. Deng, M., Meng, T., Cao, J., Wang, S., Zhang, J., Fan, H.: Heart sound classification based on improved MFCC features and convolutional recurrent neural networks. Neural Networks 130, 22–32 (2020)

  28. Boles, A., Rad, P.: Voice biometrics: Deep learning-based voiceprint authentication system. In: 2017 12th System of Systems Engineering Conference (SoSE), pp. 1–6 (2017)

  29. Shiratori, T., Nakazawa, A., Ikeuchi, K.: Synthesizing dance performance using musical and motion features. In: Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006., pp. 3654–3659 (2006)

  30. Verma, P., Sah, A., Srivastava, R.: Deep learning-based multi-modal approach using RGB and skeleton sequences for human activity recognition. Multimedia Systems 26(6), 671–685 (2020)

  31. Niizumi, D., Takeuchi, D., Ohishi, Y., Harada, N., Kashino, K.: Masked modeling duo: Learning representations by encouraging both networks to model the input. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023)

  32. Gong, W., Yu, Q.: A deep music recommendation method based on human motion analysis. IEEE Access 9, 26290–26300 (2021)

  33. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-second AAAI Conference on Artificial Intelligence (2018)

  34. Liu, C., Feng, L., Liu, G., Wang, H., Liu, S.: Bottom-up broadcast neural network for music genre classification. Multimedia Tools and Applications 80(5), 7313–7331 (2021)

  35. Favory, X., Drossos, K., Virtanen, T., Serra, X.: Coala: Co-aligned autoencoders for learning semantically enriched audio representations. arXiv preprint arXiv:2006.08386 (2020)

  36. Drake, C., Jones, M.R., Baruch, C.: The development of rhythmic attending in auditory sequences: attunement, referent period, focal attending. Cognition 77(3), 251–288 (2000)

  37. McKinney, M.F., Moelants, D.: Ambiguity in tempo perception: What draws listeners to different metrical levels? Music Perception 24(2), 155–166 (2006)

  38. Burger, B., Thompson, M.R., Luck, G., Saarikallio, S.H., Toiviainen, P.: Hunting for the beat in the body: on period and phase locking in music-induced movement. Frontiers in Human Neuroscience 8, 903 (2014)

  39. Burger, B., Thompson, M.R., Luck, G., Saarikallio, S., Toiviainen, P.: Influences of rhythm- and timbre-related musical features on characteristics of music-induced movement. Frontiers in Psychology 4, 183 (2013)

  40. Chu, W.-T., Tsai, S.-Y.: Rhythm of motion extraction and rhythm-based cross-media alignment for dance videos. IEEE Transactions on Multimedia 14(1), 129–141 (2011)

  41. Rubinstein, M., et al.: Analysis and visualization of temporal variations in video. PhD thesis, Massachusetts Institute of Technology (2014)

  42. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12026–12035 (2019)

  43. Li, R., Yang, S., Ross, D.A., Kanazawa, A.: Learn to dance with AIST++: Music conditioned 3D dance generation. CoRR abs/2101.08779 (2021)

  44. Levin, A., Viola, P., Freund, Y.: Unsupervised improvement of visual detectors using co-training. In: Computer Vision, IEEE International Conference On, vol. 2, pp. 626–626 (2003)

  45. Christoudias, C.M., Saenko, K., Morency, L.-P., Darrell, T.: Co-adaptation of audio-visual speech and gesture classifiers. In: Proceedings of the 8th International Conference on Multimodal Interfaces, pp. 84–91 (2006)

  46. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100 (1998)

  47. Qi, Y., Liu, Y., Sun, Q.: Music-driven dance generation. IEEE Access 7, 166540–166550 (2019)

  48. Plizzari, C., Cannici, M., Matteucci, M.: Skeleton-based action recognition via spatial and temporal transformer networks. Computer Vision and Image Understanding 208–209, 103219 (2021)

  49. Chen, T., Zhou, D., Wang, J., Wang, S., Guan, Y., He, X., Ding, E.: Learning multi-granular spatio-temporal graph network for skeleton-based action recognition. In: Proceedings of the 29th ACM International Conference on Multimedia. MM ’21, pp. 4334–4342. Association for Computing Machinery, New York, NY, USA (2021)

  50. Lee, J., Lee, M., Lee, D., Lee, S.: Hierarchically decomposed graph convolutional networks for skeleton-based action recognition. In: ICCV (2023)

  51. Li, X., Li, X.: ATST: Audio representation learning with teacher-student transformer. In: Interspeech 2022 - Proceedings, pp. 4172–4176 (2022)

  52. Wu, H., Seetharaman, P., Kumar, K., Bello, J.: Wav2CLIP: Learning robust audio representations from CLIP. In: 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings, pp. 4563–4567 (2022)

  53. Dieleman, S., Schrauwen, B.: End-to-end learning for music audio. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6964–6968 (2014)

  54. Lidy, T., Schindler, A.: CQT-based convolutional neural networks for audio scene classification. In: Workshop on Detection and Classification of Acoustic Scenes and Events (2016)

  55. Alves, A.A.C., Andrietta, L.T., Lopes, R.Z., Bussiman, F.O., Silva, F.F.e., Carvalheiro, R., Brito, L.F., Balieiro, J.C.d.C., Albuquerque, L.G., Ventura, R.V.: Integrating audio signal processing and deep learning algorithms for gait pattern classification in Brazilian gaited horses. Frontiers in Animal Science 2 (2021)

  56. Tang, T., Jia, J., Mao, H.: Dance with melody: An LSTM-autoencoder approach to music-oriented dance synthesis. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 1598–1606 (2018)

  57. McFee, B., Metsai, A., McVicar, M., Balke, S., Thomé, C., Raffel, C., Zalkow, F., Malek, A., Dana, Lee, K., Nieto, O., Ellis, D., Mason, J., Battenberg, E., Seyfarth, S., Yamamoto, R., viktorandreevichmorozov, Choi, K., Moore, J., Bittner, R., Hidaka, S., Wei, Z., nullmightybofo, Weiss, A., Hereñú, D., Stöter, F.-R., Friesch, P., Vollrath, M., Kim, T., Thassilo: librosa/librosa: 0.9.1 (2022)

Download references

Acknowledgements

Jordi Gonzàlez acknowledges support from the Spanish Ministry of Economy and Competitiveness (MINECO) and the European Regional Development Fund (ERDF) under Project PID2020-120311RB-I00/AEI/10.13039/501100011033.

Author information

Contributions

Wenjuan Gong conceived the idea, designed the algorithm, and wrote the manuscript. Qingshuang Yu and Wendong Huang implemented and optimized the algorithm. Haoran Sun helped run all required experiments during revisions. Peng Cheng visualized the demo. Jordi Gonzàlez revised the manuscript.

Corresponding author

Correspondence to Wenjuan Gong.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Communicated by M. Mu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gong, W., Yu, Q., Sun, H. et al. MCLEMCD: multimodal collaborative learning encoder for enhanced music classification from dances. Multimedia Systems 30, 37 (2024). https://doi.org/10.1007/s00530-023-01207-6

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s00530-023-01207-6
