Abstract
Music classification is widely applied in the automatic organization of music archives and in intelligent music interfaces. Music is frequently accompanied by other media, such as image sequences. Combining different types of media for a given task is natural for humans but remains difficult for machines. In this work, we propose a collaborative learning method that combines dancing motions and music cues for music classification and apply it to music recommendation from dancing motions. Dancing motions, represented as 3D joint positions, contain cyclic movements synchronized with music beats, and a collaborative autoencoder is designed to fuse music cues into the dancing motion feature extraction module. The proposed method achieves a classification accuracy of \(98.07\%\) on the MusicToDance data set and \(65.29\%\) on the AIST++ data set. The code to run all experiments is available at https://github.com/wenjgong/musicmotion.
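The abstract describes the fusion scheme only at a high level. As a rough illustration of how a collaborative autoencoder can inject music cues into a motion feature extractor, the following PyTorch sketch encodes a motion window, reconstructs a music feature vector from the motion latent, and classifies from that same latent. All layer sizes, module names, feature dimensions, and the loss weighting are assumptions made for illustration; this is not the released implementation (see the repository linked above for the actual code).

```python
# Minimal sketch of collaborative-autoencoder-style fusion (illustrative only).
import torch
import torch.nn as nn

class MotionEncoder(nn.Module):
    """Encodes a flattened window of 3D joint positions into a latent vector."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, latent_dim))

    def forward(self, x):
        return self.net(x)

class CollaborativeAutoencoder(nn.Module):
    """Reconstructs music features from the motion latent so that music cues
    supervise (collaborate with) the motion feature extractor, while the same
    latent feeds a music classifier."""
    def __init__(self, motion_dim, music_dim, latent_dim, n_classes):
        super().__init__()
        self.motion_encoder = MotionEncoder(motion_dim, latent_dim)
        self.music_decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                           nn.Linear(256, music_dim))
        self.classifier = nn.Linear(latent_dim, n_classes)

    def forward(self, motion):
        z = self.motion_encoder(motion)
        return self.classifier(z), self.music_decoder(z)

# Toy usage: batch of 8 windows, 24 joints x 3 coords x 10 frames = 720 dims,
# a 40-dim music feature target (e.g., MFCC-like), and 4 music classes.
model = CollaborativeAutoencoder(motion_dim=720, music_dim=40, latent_dim=128, n_classes=4)
motion = torch.randn(8, 720)
music = torch.randn(8, 40)
labels = torch.randint(0, 4, (8,))
logits, music_hat = model(motion)
# Joint objective: classification loss plus music reconstruction loss
# (the 0.5 weight is an arbitrary choice for this sketch).
loss = nn.functional.cross_entropy(logits, labels) \
       + 0.5 * nn.functional.mse_loss(music_hat, music)
loss.backward()
```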






Data availability
The MusicToDance data set that supports the findings of this study is openly available at https://github.com/Musicto-dance-motion-synthesis/dataset, with reference number [56]. The AIST++ data set that supports the findings of this study is openly available at https://google.github.io/aistplusplus_dataset/download.html, with reference number [43].
Notes
Please refer to Table 1 ("Quantitative Evaluations on the Test Data") in [32].
References
Goto, M., Dannenberg, R.B.: Music interfaces based on automatic music signal analysis: new ways to create and listen to music. IEEE Signal Process. Mag. 36(1), 74–81 (2018)
Schedl, M.: Intelligent user interfaces for social music discovery and exploration of large-scale music repositories. In: Proceedings of the 2017 ACM Workshop on Theory-Informed User Modeling for Tailoring and Personalizing Interfaces. HUMANIZE ’17, pp. 7–11 (2017)
Oramas, S., Nieto, O., Barbieri, F., Serra, X.: Multi-label music genre classification from audio, text, and images using deep features. CoRR abs/1707.04916 (2017)
Mayer, R., Rauber, A.: Musical genre classification by ensembles of audio and lyrics features. In: Proceedings of International Conference on Music Information Retrieval, pp. 675–680 (2011)
Cai, X., Zhang, H.: Music genre classification based on auditory image, spectral and acoustic features. Multimedia Systems 28(3), 779–791 (2022)
Chaturvedi, V., Kaur, A.B., Varshney, V., Garg, A., Chhabra, G.S., Kumar, M.: Music mood and human emotion recognition based on physiological signals: a systematic review. Multimedia Systems, 1–24 (2021)
Yang, Y.-H., Chen, H.H.: Machine recognition of music emotion: A review. ACM Transactions on Intelligent Systems and Technology (TIST) 3(3), 1–30 (2012)
Huq, A., Bello, J.P., Rowe, R.: Automated music emotion recognition: A systematic evaluation. Journal of New Music Research 39(3), 227–244 (2010)
Knees, P., Schedl, M.: Music Similarity and Retrieval: an Introduction to Audio- and Web-based Strategies, vol. 9 (2016)
Karydis, I., Kermanidis, K.L., Sioutas, S., Iliadis, L.: Comparing content and context based similarity for musical data. Neurocomputing 107, 69–76 (2013)
Krumhansl, C.L., Schenck, D.L.: Can dance reflect the structural and expressive qualities of music? A perceptual experiment on Balanchine's choreography of Mozart's Divertimento No. 15. Musicae Scientiae 1(1), 63–85 (1997)
Su, Y.-H.: Rhythm of music seen through dance: Probing music-dance coupling by audiovisual meter perception (2017)
Alemi, O., Françoise, J., Pasquier, P.: Groovenet: Real-time music-driven dance movement generation using artificial neural networks. networks 8(17), 26 (2017)
Lee, J., Kim, S., Lee, K.: Listen to dance: Music-driven choreography generation using autoregressive encoder-decoder network. CoRR abs/1811.00818 (2018)
Manfrè, A., Infantino, I., Vella, F., Gaglio, S.: An automatic system for humanoid dance creation. Biologically Inspired Cognitive Architectures 15, 1–9 (2016)
Fan, R., Xu, S., Geng, W.: Example-based automatic music-driven conventional dance motion synthesis. IEEE Transactions on Visualization and Computer Graphics 18(3), 501–515 (2011)
Lee, M., Lee, K., Park, J.: Music similarity-based approach to generating dance motion sequence. Multimedia Tools and Applications 62(3), 895–912 (2013)
Ofli, F., Erzin, E., Yemez, Y., Tekalp, A.M.: Learn2dance: Learning statistical music-to-dance mappings for choreography synthesis. IEEE Transactions on Multimedia 14(3), 747–759 (2011)
Lee, H.-Y., Yang, X., Liu, M.-Y., Wang, T.-C., Lu, Y.-D., Yang, M.-H., Kautz, J.: Dancing to music. arXiv e-prints, 1911 (2019)
Tsuchida, S., Fukayama, S., Goto, M.: Query-by-dancing: a dance music retrieval system based on body-motion similarity. In: International Conference on Multimedia Modeling, pp. 251–263 (2019)
Ohkushi, H., Ogawa, T., Haseyama, M.: Music recommendation according to human motion based on kernel cca-based relationship. EURASIP Journal on Advances in Signal Processing 2011(1), 1–14 (2011)
Davis, S., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 28(4), 357–366 (1980)
Schörkhuber, C., Klapuri, A.: Constant-q transform toolbox for music processing. In: 7th Sound and Music Computing Conference, Barcelona, Spain, pp. 3–64 (2010)
Böck, S., Widmer, G.: Maximum filter vibrato suppression for onset detection. In: Proc. of the 16th Int. Conf. on Digital Audio Effects (DAFx). Maynooth, Ireland (Sept 2013), vol. 7, p. 4 (2013)
Grosche, P., Müller, M., Kurth, F.: Cyclic tempogram - a mid-level tempo representation for music signals. In: 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 5522–5525 (2010)
Bae, H.-S., Lee, H.-J., Lee, S.-G.: Voice recognition based on adaptive mfcc and deep learning. In: 2016 IEEE 11th Conference on Industrial Electronics and Applications (ICIEA), pp. 1542–1546 (2016)
Deng, M., Meng, T., Cao, J., Wang, S., Zhang, J., Fan, H.: Heart sound classification based on improved mfcc features and convolutional recurrent neural networks. Neural Networks 130, 22–32 (2020)
Boles, A., Rad, P.: Voice biometrics: Deep learning-based voiceprint authentication system. In: 2017 12th System of Systems Engineering Conference (SoSE), pp. 1–6 (2017)
Shiratori, T., Nakazawa, A., Ikeuchi, K.: Synthesizing dance performance using musical and motion features. In: Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006., pp. 3654–3659 (2006)
Verma, P., Sah, A., Srivastava, R.: Deep learning-based multi-modal approach using rgb and skeleton sequences for human activity recognition. Multimedia Systems 26(6), 671–685 (2020)
Niizumi, D., Takeuchi, D., Ohishi, Y., Harada, N., Kashino, K.: Masked modeling duo: Learning representations by encouraging both networks to model the input. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2023)
Gong, W., Yu, Q.: A deep music recommendation method based on human motion analysis. IEEE Access 9, 26290–26300 (2021)
Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Thirty-second AAAI Conference on Artificial Intelligence (2018)
Liu, C., Feng, L., Liu, G., Wang, H., Liu, S.: Bottom-up broadcast neural network for music genre classification. Multimedia Tools and Applications 80(5), 7313–7331 (2021)
Favory, X., Drossos, K., Virtanen, T., Serra, X.: Coala: Co-aligned autoencoders for learning semantically enriched audio representations. arXiv preprint arXiv:2006.08386 (2020)
Drake, C., Jones, M.R., Baruch, C.: The development of rhythmic attending in auditory sequences: attunement, referent period, focal attending. Cognition 77(3), 251–288 (2000)
McKinney, M.F., Moelants, D.: Ambiguity in tempo perception: What draws listeners to different metrical levels? Music Perception 24(2), 155–166 (2006)
Burger, B., Thompson, M.R., Luck, G., Saarikallio, S.H., Toiviainen, P.: Hunting for the beat in the body: on period and phase locking in music-induced movement. Frontiers in Human Neuroscience 8, 903 (2014)
Burger, B., Thompson, M.R., Luck, G., Saarikallio, S., Toiviainen, P.: Influences of rhythm- and timbre-related musical features on characteristics of music-induced movement. Frontiers in Psychology 4, 183 (2013)
Chu, W.-T., Tsai, S.-Y.: Rhythm of motion extraction and rhythm-based cross-media alignment for dance videos. IEEE Transactions on Multimedia 14(1), 129–141 (2011)
Rubinstein, M., et al.: Analysis and visualization of temporal variations in video. PhD thesis, Massachusetts Institute of Technology (2014)
Shi, L., Zhang, Y., Cheng, J., Lu, H.: Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12026–12035 (2019)
Li, R., Yang, S., Ross, D.A., Kanazawa, A.: Learn to dance with aist++: Music conditioned 3d dance generation. CoRR abs/2101.08779 (2021)
Levin, A., Viola, P., Freund, Y.: Unsupervised improvement of visual detectors using co-training. In: IEEE International Conference on Computer Vision, vol. 2, pp. 626–626 (2003)
Christoudias, C.M., Saenko, K., Morency, L.-P., Darrell, T.: Co-adaptation of audio-visual speech and gesture classifiers. In: Proceedings of the 8th International Conference on Multimodal Interfaces, pp. 84–91 (2006)
Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100 (1998)
Qi, Y., Liu, Y., Sun, Q.: Music-driven dance generation. IEEE. Access 7, 166540–166550 (2019)
Plizzari, C., Cannici, M., Matteucci, M.: Skeleton-based action recognition via spatial and temporal transformer networks. Computer Vision and Image Understanding 208–209, 103219 (2021)
Chen, T., Zhou, D., Wang, J., Wang, S., Guan, Y., He, X., Ding, E.: Learning multi-granular spatio-temporal graph network for skeleton-based action recognition. In: Proceedings of the 29th ACM International Conference on Multimedia. MM ’21, pp. 4334–4342. Association for Computing Machinery, New York, NY, USA (2021)
Lee, J., Lee, M., Lee, D., Lee, S.: Hierarchically decomposed graph convolutional networks for skeleton-based action recognition. In: ICCV (2023)
Li, X., Li, X.: Atst: Audio representation learning with teacher-student transformer. In: Interspeech 2022 - Proceedings, pp. 4172–4176 (2022)
Wu, H., Seetharaman, P., Kumar, K., Bello, J.: Wav2clip: Learning robust audio representations from clip. In: 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings, pp. 4563–4567 (2022)
Dieleman, S., Schrauwen, B.: End-to-end learning for music audio. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6964–6968 (2014)
Lidy, T., Schindler, A.: Cqt-based convolutional neural networks for audio scene classification. In: Workshop on Detection and Classification of Acoustic Scenes and Events (2016)
Alves, A.A.C., Andrietta, L.T., Lopes, R.Z., Bussiman, F.O., Silva, F.F.e., Carvalheiro, R., Brito, L.F., Balieiro, J.C.d.C., Albuquerque, L.G., Ventura, R.V.: Integrating audio signal processing and deep learning algorithms for gait pattern classification in brazilian gaited horses. Frontiers in Animal Science 2 (2021)
Tang, T., Jia, J., Mao, H.: Dance with melody: An lstm-autoencoder approach to music-oriented dance synthesis. In: Proceedings of the 26th ACM International Conference on Multimedia, pp. 1598–1606 (2018)
McFee, B., Metsai, A., McVicar, M., Balke, S., Thomé, C., Raffel, C., Zalkow, F., Malek, A., Dana, Lee, K., Nieto, O., Ellis, D., Mason, J., Battenberg, E., Seyfarth, S., Yamamoto, R., viktorandreevichmorozov, Choi, K., Moore, J., Bittner, R., Hidaka, S., Wei, Z., nullmightybofo, Weiss, A., Hereñú, D., Stöter, F.-R., Friesch, P., Vollrath, M., Kim, T., Thassilo: librosa/librosa: 0.9.1 (2022)
Acknowledgements
Jordi Gonzàlez acknowledges the support by the Spanish Ministry of Economy and Competitiveness (MINECO) and the European Regional Development Fund (ERDF) under Project PID2020-120311RB-I00/AEI/10.13039/501100011033.
Author information
Authors and Affiliations
Contributions
Wenjuan Gong conceived the idea, designed the algorithm, and wrote the manuscript. Qingshuang Yu and Wendong Huang implemented and optimized the algorithm. Haoran Sun ran all required experiments during the revisions. Peng Cheng visualized the demo. Jordi Gonzalez revised the manuscript.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Communicated by M. Mu.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Gong, W., Yu, Q., Sun, H. et al. MCLEMCD: multimodal collaborative learning encoder for enhanced music classification from dances. Multimedia Systems 30, 37 (2024). https://doi.org/10.1007/s00530-023-01207-6
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s00530-023-01207-6