MCLEMCD: multimodal collaborative learning encoder for enhanced music classification from dances

  • Regular Paper
  • Published:
Multimedia Systems

Abstract

Music classification is widely applied in the automatic organization of music archives and in intelligent music interfaces. Music is frequently accompanied by other media, such as image sequences. Combining different types of media for a given task is natural for humans but remains difficult for machines. In this work, we propose a collaborative learning method that combines dancing motions and music cues for music classification, and we apply it to music recommendation from dancing motions. Dancing motions, represented as 3D joint positions, contain cyclic movements synchronized with music beats, and a collaborative autoencoder is designed to fuse music cues into the dancing-motion feature extraction module. The proposed method achieves a classification accuracy of 98.07% on the MusicToDance dataset and 65.29% on the AIST++ dataset. The code to run all experiments is available at https://github.com/wenjgong/musicmotion.
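The full architecture is described in the article itself; the listing below is only a minimal, hypothetical PyTorch sketch of how a collaborative autoencoder of this kind could fuse music cues into a skeleton-motion feature extractor for classification. All module names, layer sizes, and the loss formulation are illustrative assumptions, not the authors' implementation (the actual code is at the GitHub link above).

# Minimal sketch (not the authors' code): a collaborative autoencoder that
# aligns a skeleton-motion embedding with a music embedding during training,
# so that music can be classified (or recommended) from motion alone at test time.
# All class names and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Encodes a sequence of 3D joint positions (B, T, J, 3) into a feature vector."""
    def __init__(self, num_joints=21, hidden=128, feat_dim=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=num_joints * 3, hidden_size=hidden, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, motion):                    # motion: (B, T, J, 3)
        b, t, j, c = motion.shape
        _, h = self.rnn(motion.reshape(b, t, j * c))
        return self.proj(h[-1])                   # (B, feat_dim)

class MusicEncoder(nn.Module):
    """Encodes a music feature sequence (e.g., MFCC frames) into the same space."""
    def __init__(self, n_mfcc=20, hidden=128, feat_dim=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=n_mfcc, hidden_size=hidden, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, music):                     # music: (B, T, n_mfcc)
        _, h = self.rnn(music)
        return self.proj(h[-1])

class CollaborativeAutoencoder(nn.Module):
    """Fuses music cues into the motion branch via an embedding-alignment loss."""
    def __init__(self, feat_dim=64, num_classes=10):
        super().__init__()
        self.motion_enc = MotionEncoder(feat_dim=feat_dim)
        self.music_enc = MusicEncoder(feat_dim=feat_dim)
        self.decoder = nn.Linear(feat_dim, feat_dim)   # reconstructs the music embedding from motion
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, motion, music=None):
        z_motion = self.motion_enc(motion)
        logits = self.classifier(z_motion)
        if music is None:                              # inference: motion only
            return logits
        z_music = self.music_enc(music)
        recon = self.decoder(z_motion)
        align_loss = nn.functional.mse_loss(recon, z_music.detach())
        return logits, align_loss

Under this sketch, training would jointly minimize the classification loss on the logits and the alignment loss, while inference needs only the motion branch, so a music class can be predicted (and a track recommended) from dance motion alone.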


Data availability

The MusicToDance dataset that supports the findings of this study is openly available at https://github.com/Musicto-dance-motion-synthesis/dataset, with reference number [56]. The AIST++ dataset that supports the findings of this study is openly available at https://google.github.io/aistplusplus_dataset/download.html, with reference number [43].

Notes

  1. Please refer to Table 1 ("Quantitative Evaluations on the Test Data") in [32].

References

  1. Goto, M., Dannenberg, R.B.: Music interfaces based on automatic music signal analysis: new ways to create and listen to music. IEEE Signal Process. Mag. 36(1), 74–81 (2018)

  2. Schedl, M.: Intelligent user interfaces for social music discovery and exploration of large-scale music repositories. In: Proceedings of the 2017 ACM Workshop on Theory-Informed User Modeling for Tailoring and Personalizing Interfaces. HUMANIZE ’17, pp. 7–11 (2017)

  3. Oramas, S., Nieto, O., Barbieri, F., Serra, X.: Multi-label music genre classification from audio, text, and images using deep features. CoRR abs/1707.04916 (2017)

  4. Mayer, R., Rauber, A.: Musical genre classification by ensembles of audio and lyrics features. In: Proceedings of International Conference on Music Information Retrieval, pp. 675–680 (2011)

  5. Cai, X., Zhang, H.: Music genre classification based on auditory image, spectral and acoustic features. Multimedia Systems 28(3), 779–791 (2022)

  6. Chaturvedi, V., Kaur, A.B., Varshney, V., Garg, A., Chhabra, G.S., Kumar, M.: Music mood and human emotion recognition based on physiological signals: a systematic review. Multimedia Systems, 1–24 (2021)

  7. Yang, Y.-H., Chen, H.H.: Machine recognition of music emotion: A review. ACM Transactions on Intelligent Systems and Technology (TIST) 3(3), 1–30 (2012)

  8. Huq, A., Bello, J.P., Rowe, R.: Automated music emotion recognition: A systematic evaluation. Journal of New Music Research 39(3), 227–244 (2010)

  9. Knees, P., Schedl, M.: Music Similarity and Retrieval: an Introduction to Audio-and Web-based Strategies vol. 9, (2016)

  10. Karydis, I., Kermanidis, K.L., Sioutas, S., Iliadis, L.: Comparing content and context based similarity for musical data. Neurocomputing 107, 69–76 (2013)

  11. Krumhansl, C.L., Schenck, D.L.: Can dance reflect the structural and expressive qualities of music? A perceptual experiment on Balanchine's choreography of Mozart's Divertimento No. 15. Musicae Scientiae 1(1), 63–85 (1997)

  12. Su, Y.-H.: Rhythm of music seen through dance: Probing music-dance coupling by audiovisual meter perception (2017)

  13. Alemi, O., Françoise, J., Pasquier, P.: GrooveNet: Real-time music-driven dance movement generation using artificial neural networks. networks 8(17), 26 (2017)

  14. Lee, J., Kim, S., Lee, K.: Listen to dance: Music-driven choreography generation using autoregressive encoder-decoder network. CoRR abs/1811.00818 (2018)

  15. Manfrè, A., Infantino, I., Vella, F., Gaglio, S.: An automatic system for humanoid dance creation. Biologically Inspired Cognitive Architectures 15, 1–9 (2016)

  16. Fan, R., Xu, S., Geng, W.: Example-based automatic music-driven conventional dance motion synthesis. IEEE Transactions on Visualization and Computer Graphics 18(3), 501–515 (2011)

  17. Lee, M., Lee, K., Park, J.: Music similarity-based approach to generating dance motion sequence. Multimedia Tools and Applications 62(3), 895–912 (2013)

  18. Ofli, F., Erzin, E., Yemez, Y., Tekalp, A.M.: Learn2dance: Learning statistical music-to-dance mappings for choreography synthesis. IEEE Transactions on Multimedia 14(3), 747–759 (2011)

  19. Lee, H.-Y., Yang, X., Liu, M.-Y., Wang, T.-C., Lu, Y.-D., Yang, M.-H., Kautz, J.: Dancing to music. arXiv e-prints, 1911 (2019)

  20. Tsuchida, S., Fukayama, S., Goto, M.: Query-by-dancing: a dance music retrieval system based on body-motion similarity. In: International Conference on Multimedia Modeling, pp. 251–263 (2019)

  21. Ohkushi, H., Ogawa, T., Haseyama, M.: Music recommendation according to human motion based on kernel CCA-based relationship. EURASIP Journal on Advances in Signal Processing 2011(1), 1–14 (2011)

  22. Davis, S., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 28(4), 357–366 (1980)

  23. Schörkhuber, C., Klapuri, A.: Constant-Q transform toolbox for music processing. In: 7th Sound and Music Computing Conference, Barcelona, Spain, pp. 3–64 (2010)

  24. Böck, S., Widmer, G.: Maximum filter vibrato suppression for onset detection. In: Proc. of the 16th Int. Conf. on Digital Audio Effects (DAFx). Maynooth, Ireland (Sept 2013), vol. 7, p. 4 (2013)

  25. Grosche, P., Müller, M., Kurth, F.: Cyclic tempogram: a mid-level tempo representation for music signals. In: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5522–5525 (2010)

  26. Bae, H.-S., Lee, H.-J., Lee, S.-G.: Voice recognition based on adaptive MFCC and deep learning. In: 2016 IEEE 11th Conference on Industrial Electronics and Applications (ICIEA), pp. 1542–1546 (2016)

  27. Deng, M., Meng, T., Cao, J., Wang, S., Zhang, J., Fan, H.: Heart sound classification based on improved MFCC features and convolutional recurrent neural networks. Neural Networks 130, 22–32 (2020)

  28. Boles, A., Rad, P.: Voice biometrics: Deep learning-based voiceprint authentication system. In: 2017 12th System of Systems Engineering Conference (SoSE), pp. 1–6 (2017)

  29. Shiratori, T., Nakazawa, A., Ikeuchi, K.: Synthesizing dance performance using musical and motion features. In: Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006., pp. 3654–3659 (2006)

  30. Verma, P., Sah, A., Srivastava, R.: Deep learning-based multi-modal approach using RGB and skeleton sequences for human activity recognition. Multimedia Systems 26(6), 671–685 (2020)

  31. Niizumi, D., Takeuchi, D., Ohishi, Y., Harada, N., Kashino, K.: Masked modeling duo: Learning representations by encouraging both networks to model the input. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023)

  32. Gong, W., Yu, Q.: A deep music recommendation method based on human motion analysis. IEEE Access 9, 26290–26300 (2021)

  33. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-second AAAI Conference on Artificial Intelligence (2018)

  34. Liu, C., Feng, L., Liu, G., Wang, H., Liu, S.: Bottom-up broadcast neural network for music genre classification. Multimedia Tools and Applications 80(5), 7313–7331 (2021)

  35. Favory, X., Drossos, K., Virtanen, T., Serra, X.: Coala: Co-aligned autoencoders for learning semantically enriched audio representations. arXiv preprint arXiv:2006.08386 (2020)

  36. Drake, C., Jones, M.R., Baruch, C.: The development of rhythmic attending in auditory sequences: attunement, referent period, focal attending. Cognition 77(3), 251–288 (2000)

  37. McKinney, M.F., Moelants, D.: Ambiguity in tempo perception: What draws listeners to different metrical levels? Music Perception 24(2), 155–166 (2006)

  38. Burger, B., Thompson, M.R., Luck, G., Saarikallio, S.H., Toiviainen, P.: Hunting for the beat in the body: on period and phase locking in music-induced movement. Frontiers in Human Neuroscience 8, 903 (2014)

  39. Burger, B., Thompson, M.R., Luck, G., Saarikallio, S., Toiviainen, P.: Influences of rhythm- and timbre-related musical features on characteristics of music-induced movement. Frontiers in Psychology 4, 183 (2013)

  40. Chu, W.-T., Tsai, S.-Y.: Rhythm of motion extraction and rhythm-based cross-media alignment for dance videos. IEEE Transactions on Multimedia 14(1), 129–141 (2011)

  41. Rubinstein, M., et al.: Analysis and visualization of temporal variations in video. PhD thesis, Massachusetts Institute of Technology (2014)

  42. Shi, L., Zhang, Y., Cheng, J., Lu, H.: Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12026–12035 (2019)

  43. Li, R., Yang, S., Ross, D.A., Kanazawa, A.: Learn to dance with AIST++: Music conditioned 3D dance generation. CoRR abs/2101.08779 (2021)

  44. Levin, A., Viola, P., Freund, Y.: Unsupervised improvement of visual detectors using co-training. In: Computer Vision, IEEE International Conference On, vol. 2, pp. 626–626 (2003)

  45. Christoudias, C.M., Saenko, K., Morency, L.-P., Darrell, T.: Co-adaptation of audio-visual speech and gesture classifiers. In: Proceedings of the 8th International Conference on Multimodal Interfaces, pp. 84–91 (2006)

  46. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100 (1998)

  47. Qi, Y., Liu, Y., Sun, Q.: Music-driven dance generation. IEEE Access 7, 166540–166550 (2019)

  48. Plizzari, C., Cannici, M., Matteucci, M.: Skeleton-based action recognition via spatial and temporal transformer networks. Computer Vision and Image Understanding 208–209, 103219 (2021)

  49. Chen, T., Zhou, D., Wang, J., Wang, S., Guan, Y., He, X., Ding, E.: Learning multi-granular spatio-temporal graph network for skeleton-based action recognition. In: Proceedings of the 29th ACM International Conference on Multimedia. MM ’21, pp. 4334–4342. Association for Computing Machinery, New York, NY, USA (2021)

  50. Lee, J., Lee, M., Lee, D., Lee, S.: Hierarchically decomposed graph convolutional networks for skeleton-based action recognition. In: ICCV (2023)

  51. Li, X., Li, X.: ATST: Audio representation learning with teacher-student transformer. In: Interspeech 2022 - Proceedings, pp. 4172–4176 (2022)

  52. Wu, H., Seetharaman, P., Kumar, K., Bello, J.: Wav2CLIP: Learning robust audio representations from CLIP. In: 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings, pp. 4563–4567 (2022)

  53. Dieleman, S., Schrauwen, B.: End-to-end learning for music audio. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6964–6968 (2014)

  54. Lidy, T., Schindler, A.: CQT-based convolutional neural networks for audio scene classification. In: Workshop on Detection and Classification of Acoustic Scenes and Events (2016)

  55. Alves, A.A.C., Andrietta, L.T., Lopes, R.Z., Bussiman, F.O., Silva, F.F.e., Carvalheiro, R., Brito, L.F., Balieiro, J.C.d.C., Albuquerque, L.G., Ventura, R.V.: Integrating audio signal processing and deep learning algorithms for gait pattern classification in Brazilian gaited horses. Frontiers in Animal Science 2 (2021)

  56. Tang, T., Jia, J., Mao, H.: Dance with melody: An LSTM-autoencoder approach to music-oriented dance synthesis. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 1598–1606 (2018)

  57. McFee, B., Metsai, A., McVicar, M., Balke, S., Thomé, C., Raffel, C., Zalkow, F., Malek, A., Dana, Lee, K., Nieto, O., Ellis, D., Mason, J., Battenberg, E., Seyfarth, S., Yamamoto, R., viktorandreevichmorozov, Choi, K., Moore, J., Bittner, R., Hidaka, S., Wei, Z., nullmightybofo, Weiss, A., Hereñú, D., Stöter, F.-R., Friesch, P., Vollrath, M., Kim, T., Thassilo: librosa/librosa: 0.9.1 (2022)

Download references

Acknowledgements

Jordi Gonzàlez acknowledges support from the Spanish Ministry of Economy and Competitiveness (MINECO) and the European Regional Development Fund (ERDF) under Project PID2020-120311RB-I00/AEI/10.13039/501100011033.

Author information

Contributions

Wenjuan Gong conceived the idea, designed the algorithm, and wrote the manuscript. Qingshuang Yu and Wendong Huang implemented and optimized the algorithm. Haoran Sun helped run all required experiments during revisions. Peng Cheng visualized the demo. Jordi Gonzàlez revised the manuscript.

Corresponding author

Correspondence to Wenjuan Gong.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Communicated by M. Mu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Gong, W., Yu, Q., Sun, H. et al. MCLEMCD: multimodal collaborative learning encoder for enhanced music classification from dances. Multimedia Systems 30, 37 (2024). https://doi.org/10.1007/s00530-023-01207-6

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s00530-023-01207-6
