Abstract
Music Emotion Recognition (MER) has attracted much interest in the past decades, and many deep learning methods have recently been applied to this field. However, previous MER methods mostly relied on simple convolutional layers to extract features from the original audio signals, which fail to capture representative emotion-related features. In this paper, we propose a novel method named Modularized Composite Attention Network (MCAN) for continuous MER. A sample reconstruction technique is proposed to enhance the stability of the network. Specifically, a feature augmentation module is constructed to extract salient features, and we design a weighted attention module to control the focus of the whole network. Furthermore, a style embedding module is introduced to enhance the network's capacity for processing fine-grained detail. We conduct experiments on two datasets, namely the benchmark dataset DEAM and the newly proposed dataset PMEmo. The superior results demonstrate the effectiveness of the proposed MCAN. In particular, qualitative analyses are provided to explain the performance of our model.
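The article itself does not publish code, so purely as an illustration of the general idea behind a weighted attention module (not MCAN's actual architecture, whose layers and parameters are not specified here), the following sketch shows how frame-level audio features can be re-weighted by learned attention scores; the score vector `w` and feature shapes are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def weighted_attention(features, w):
    """Toy weighted attention over time frames.

    features: (T, D) array of frame-level audio features.
    w:        (D,)  hypothetical learned scoring vector.
    Returns a (D,) context vector and the (T,) attention weights.
    """
    scores = features @ w            # one relevance score per frame
    alphas = softmax(scores)         # weights over frames, sum to 1
    context = alphas @ features      # attention-weighted summary of the clip
    return context, alphas

# Toy usage with random "features" standing in for a spectrogram embedding.
rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 8))   # 5 frames, 8-dim features
w = rng.standard_normal(8)
context, alphas = weighted_attention(feats, w)
print(alphas.shape, context.shape)    # (5,) (8,)
```

In a trained network, `w` (or a small scoring sub-network in its place) would be learned jointly with the rest of the model, so the attention weights steer the network's focus toward emotionally salient frames.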
Data availability
Data and material are fully available without restriction.
Notes
Last.fm. Available: https://www.last.fm/
MIREX. Available: http://www.music-ir.org/mirex/wiki/
MediaEval2019. Available: http://www.multimediaeval.org/mediaeval2019/
AVEC. Available: https://avec-db.sspnet.eu/
DEAM. Available: http://cvml.unige.ch/databases/DEAM/
1000 Songs. Available: http://cvml.unige.ch/databases/emoMusic/
PMEmo. Available: http://www.next.zju.edu.cn/research/pmemo/amp/
Code availability
Custom code is not available without restriction.
Author information
Authors and Affiliations
Contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Meixian Zhang and Yonghua Zhu. The first draft of the manuscript was written by Meixian Zhang and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Conflict of interest
The authors declare that they have no conflicts of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
ESM 1
(PDF 71 kb)
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, M., Zhu, Y., Zhang, W. et al. Modularized composite attention network for continuous music emotion recognition. Multimed Tools Appl 82, 7319–7341 (2023). https://doi.org/10.1007/s11042-022-13577-6
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-022-13577-6