End-to-end music emotion variation detection using iteratively reconstructed deep features

  • 1193: Intelligent Processing of Multimedia Signals
Multimedia Tools and Applications

Abstract

Automatic music emotion recognition (MER) has received increased attention in music information retrieval and user-interface development. Music emotion variation detection (or dynamic MER) also captures the temporal changes of emotion, so the emotional content of music is expressed as a series of valence-arousal predictions. One of the key issues in MER is the extraction of emotional characteristics from the audio signal. We propose a deep-neural-network-based solution for mining music emotion-related salient features directly from the raw audio waveform. The proposed architecture stacks a one-dimensional convolution layer, an autoencoder-based layer with iterative reconstruction, and a bidirectional gated recurrent unit. Tests on the DEAM dataset have shown that, in comparison with other state-of-the-art systems, the proposed solution brings a significant improvement in regression accuracy, notably for the valence dimension. It is also shown that the proposed iterative reconstruction layer enhances the discriminative properties of the features and further increases regression accuracy.
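To make the stacking described above concrete, here is a minimal NumPy sketch of the pipeline: raw waveform → 1-D convolution → iteratively reconstructed autoencoder features → recurrent aggregation → per-frame valence-arousal outputs. All layer sizes, the residual-feedback scheme inside `iterative_reconstruction`, and the uni-directional GRU (the paper uses a bidirectional one) are illustrative assumptions, not the authors' exact model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d(x, kernels, stride=4):
    """1-D convolution over a raw waveform. x: (T,), kernels: (F, K) -> (T', F)."""
    F, K = kernels.shape
    t_out = (len(x) - K) // stride + 1
    out = np.empty((t_out, F))
    for t in range(t_out):
        out[t] = kernels @ x[t * stride : t * stride + K]
    return np.maximum(out, 0.0)  # ReLU

def iterative_reconstruction(h, W_enc, W_dec, n_iter=3):
    """Autoencoder-style layer (hypothetical scheme): repeatedly decode the code
    and feed the reconstruction residual back into the input before re-encoding."""
    x = h.copy()
    for _ in range(n_iter):
        z = np.tanh(x @ W_enc)   # encode
        x_hat = z @ W_dec        # decode
        x = x + (h - x_hat)      # add back what the code missed
    return np.tanh(x @ W_enc)

def gru(z, Wz, Wr, Wh, Uz, Ur, Uh):
    """Minimal uni-directional GRU over the feature sequence z: (T, D)."""
    h = np.zeros(Uz.shape[0])
    outs = np.empty((len(z), Uz.shape[0]))
    for t, x in enumerate(z):
        u = sigmoid(x @ Wz + h @ Uz)         # update gate
        r = sigmoid(x @ Wr + h @ Ur)         # reset gate
        c = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
        h = (1.0 - u) * h + u * c
        outs[t] = h
    return outs

# Toy forward pass with random weights (shapes are illustrative only).
wave = rng.standard_normal(2000)                          # stand-in for raw audio
feats = conv1d(wave, rng.standard_normal((8, 32)) * 0.1)  # (493, 8)
codes = iterative_reconstruction(
    feats,
    rng.standard_normal((8, 16)) * 0.1,
    rng.standard_normal((16, 8)) * 0.1,
)                                                         # (493, 16)
states = gru(
    codes,
    *(rng.standard_normal((16, 12)) * 0.1 for _ in range(3)),
    *(rng.standard_normal((12, 12)) * 0.1 for _ in range(3)),
)                                                         # (493, 12)
va = np.tanh(states @ rng.standard_normal((12, 2)) * 0.1)  # per-frame (valence, arousal)
```

In a trained system the weights would of course be learned end-to-end against the annotated valence-arousal curves rather than drawn at random; the sketch only shows how the three layer types compose into a sequence of per-frame predictions.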



Author information

Corresponding author

Correspondence to Roman Jarina.

Ethics declarations

Conflicts of interest

The authors declare that they have no conflict of interest.


About this article

Cite this article

Orjesek, R., Jarina, R. & Chmulik, M. End-to-end music emotion variation detection using iteratively reconstructed deep features. Multimed Tools Appl 81, 5017–5031 (2022). https://doi.org/10.1007/s11042-021-11584-7

