
Modularized composite attention network for continuous music emotion recognition

Multimedia Tools and Applications

Abstract

Music Emotion Recognition (MER) has attracted much interest over the past decades, and many deep learning methods have recently been applied to this field. However, previous MER methods mostly relied on simple convolutional layers to extract features from the original audio signals, which fail to capture representative emotion-related features. In this paper, we propose a novel method named Modularized Composite Attention Network (MCAN) for continuous MER. A sample reconstruction technique is proposed to enhance the stability of the network. Specifically, a feature augmentation module is constructed to extract salient features, and a weighted attention module is designed to control the focus of the whole network. Furthermore, a style embedding module is introduced to enhance the detail-processing capability of the network. We conduct experiments on two datasets: the benchmark dataset DEAM and the newly proposed dataset PMEmo. The superior results demonstrate the effectiveness of the proposed MCAN. In particular, qualitative analyses are provided to explain the performance of our model.
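To make the module composition described above concrete, the sketch below shows one plausible way the three named modules could be wired together for continuous valence-arousal regression. Only the module names come from the abstract; the choice of PyTorch, a mel-spectrogram input, the layer types and sizes, and the way the style embedding is formed are all assumptions for illustration, not the authors' MCAN implementation.

```python
# Hypothetical sketch of a modularized composite attention network for
# continuous MER. Module names follow the abstract; all architectural
# details (layers, sizes, input format) are assumptions, not the paper's MCAN.
import torch
import torch.nn as nn

class WeightedAttention(nn.Module):
    """Assumed form of the weighted attention module: learns per-frame
    weights and pools frame features over time."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                         # x: (batch, time, dim)
        w = torch.softmax(self.score(x), dim=1)   # attention weights over time
        return (w * x).sum(dim=1)                 # pooled features: (batch, dim)

class MCANSketch(nn.Module):
    """Illustrative composition of the three modules named in the abstract."""
    def __init__(self, n_mels=128, dim=64):
        super().__init__()
        # Feature augmentation module (assumed: small convolutional front end)
        self.feature_aug = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.attention = WeightedAttention(dim)
        # Style embedding module (assumed: projection of global spectral statistics)
        self.style = nn.Linear(n_mels, dim)
        # Regression head for continuous valence and arousal
        self.head = nn.Linear(2 * dim, 2)

    def forward(self, mel):                              # mel: (batch, n_mels, time)
        feats = self.feature_aug(mel).transpose(1, 2)    # (batch, time, dim)
        pooled = self.attention(feats)                   # attention-pooled features
        style = self.style(mel.mean(dim=2))              # global style embedding
        return self.head(torch.cat([pooled, style], dim=1))  # (batch, 2): valence, arousal

# Example usage with a random mel-spectrogram batch
va = MCANSketch()(torch.randn(4, 128, 400))  # -> shape (4, 2)
```

The design choice illustrated here is simply that a learned temporal attention pooling replaces average pooling over frames, while a separate global embedding supplies song-level context to the regression head; how MCAN actually combines its modules is described in the full paper.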


Data availability

Data and material are fully available without restriction.

Notes

  1. Last.fm. Available: https://www.last.fm/

  2. MIREX. Available: http://www.music-ir.org/mirex/wiki/

  3. MediaEval2019. Available: http://www.multimediaeval.org/mediaeval2019/

  4. AVEC. Available: https://avec-db.sspnet.eu/

  5. DEAM. Available: http://cvml.unige.ch/databases/DEAM/

  6. 1000 Songs. Available: http://cvml.unige.ch/databases/emoMusic/

  7. PMEmo. Available: http://www.next.zju.edu.cn/research/pmemo/amp/


Code availability

Custom code is not available without restriction.

Author information

Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Meixian Zhang and Yonghua Zhu. The first draft of the manuscript was written by Meixian Zhang and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yonghua Zhu.

Ethics declarations

Ethics approval

Not applicable.

Consent to participate

Not applicable.

Consent for publication

Not applicable.

Conflict of interest

The authors declare that they have no conflicts of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

ESM 1

(PDF 71 kb)

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, M., Zhu, Y., Zhang, W. et al. Modularized composite attention network for continuous music emotion recognition. Multimed Tools Appl 82, 7319–7341 (2023). https://doi.org/10.1007/s11042-022-13577-6


  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11042-022-13577-6
