Abstract
Automatic music emotion recognition (MER) has received increased attention in the areas of music information retrieval and user interface development. Music emotion variation detection (or dynamic MER) also captures temporal changes of emotion, so the emotional content of music is expressed as a time series of valence-arousal predictions. One of the key issues in MER is the extraction of emotion-related characteristics from the audio signal. We propose a deep neural network based solution for mining music emotion-related salient features directly from the raw audio waveform. The proposed architecture stacks a one-dimensional convolutional layer, an autoencoder-based layer with iterative reconstruction, and a bidirectional gated recurrent unit. Tests on the DEAM dataset have shown that, in comparison with other state-of-the-art systems, the proposed solution brings a significant improvement in regression accuracy, notably for the valence dimension. We also show that the proposed iterative reconstruction layer enhances the discriminative properties of the features and further increases regression accuracy.
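The pipeline named in the abstract (raw waveform → 1-D convolution → iterative-reconstruction layer → bidirectional GRU → per-frame valence-arousal regression) can be sketched in plain numpy as below. All layer sizes, the single-GRU-layer setup, and in particular the reading of "iterative reconstruction" (re-encoding each reconstruction a few times and keeping the final latent code) are illustrative assumptions for exposition, not the paper's actual architecture or hyper-parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_frames(x, W, stride):
    """Strided 1-D convolution of raw waveform x with a filter bank W (F, k)."""
    F, k = W.shape
    n = (len(x) - k) // stride + 1
    frames = np.stack([x[i * stride : i * stride + k] for i in range(n)])  # (n, k)
    return np.maximum(frames @ W.T, 0.0)                                   # (n, F), ReLU

def iterative_reconstruction(H, We, Wd, n_iter=3):
    """Encode/decode the features several times, re-encoding each
    reconstruction (an assumed reading of the iterative layer)."""
    R = H
    for _ in range(n_iter):
        Z = np.tanh(R @ We)   # encode
        R = Z @ Wd            # reconstruct
    return Z                  # final latent code as the refined feature

def gru_layer(X, Wz, Wr, Wh):
    """Minimal GRU over a sequence X (T, d_in); each weight acts on [h, x]."""
    T = X.shape[0]
    d = Wz.shape[1]
    h = np.zeros(d)
    out = np.zeros((T, d))
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    for t in range(T):
        hx = np.concatenate([h, X[t]])
        z = sig(hx @ Wz)                                        # update gate
        r = sig(hx @ Wr)                                        # reset gate
        h_tilde = np.tanh(np.concatenate([r * h, X[t]]) @ Wh)   # candidate state
        h = (1 - z) * h + z * h_tilde
        out[t] = h
    return out

# toy dimensions (illustrative only)
n_samples, k, stride, F, d_lat, d_gru = 8000, 256, 128, 32, 16, 8
x = rng.standard_normal(n_samples)                  # stand-in for a raw audio excerpt

H = conv1d_frames(x, rng.standard_normal((F, k)) * 0.1, stride)
Z = iterative_reconstruction(H,
                             rng.standard_normal((F, d_lat)) * 0.1,
                             rng.standard_normal((d_lat, F)) * 0.1)

Wz, Wr, Wh = (rng.standard_normal((d_gru + d_lat, d_gru)) * 0.1 for _ in range(3))
fwd = gru_layer(Z, Wz, Wr, Wh)
bwd = gru_layer(Z[::-1], Wz, Wr, Wh)[::-1]          # bidirectional: reversed pass
states = np.concatenate([fwd, bwd], axis=1)         # (T, 2 * d_gru)

W_out = rng.standard_normal((2 * d_gru, 2)) * 0.1
va = states @ W_out                                 # per-frame (valence, arousal)
print(va.shape)
```

In this toy setup the 8000-sample input yields 61 frames, so the output is one (valence, arousal) pair per frame, matching the time-series prediction format that dynamic MER systems evaluated on DEAM produce.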




Ethics declarations
Conflicts of interest
The authors declare that they have no conflict of interest.
Cite this article
Orjesek, R., Jarina, R. & Chmulik, M. End-to-end music emotion variation detection using iteratively reconstructed deep features. Multimed Tools Appl 81, 5017–5031 (2022). https://doi.org/10.1007/s11042-021-11584-7