The Influence of Blind Source Separation on Mixed Audio Speech and Music Emotion Recognition

ABSTRACT
While both speech emotion recognition and music emotion recognition have been studied extensively in their respective communities, little research has gone into recognizing emotion from mixed audio sources, i.e., audio in which both speech and music are present. However, many application scenarios, such as television content, require models that can extract emotion from such mixed sources. This paper studies how mixed audio affects both speech and music emotion recognition using random forest and deep neural network models, and investigates whether applying blind source separation to the mixed signal beforehand is beneficial. We created a mixed audio dataset with 25% speech-music overlap and no contextual relationship between the two. We show that specialized models for speech-only or music-only audio achieved merely chance-level performance on mixed audio. For speech, models trained on raw mixed audio performed above chance level, but the best performance was achieved when the audio was blind-source-separated beforehand. Music emotion recognition models on mixed audio achieved performance approaching, or even surpassing, their performance on music-only audio, both with and without blind source separation. Our results are important for estimating emotion from real-world data, where separate speech and music tracks are often unavailable.
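To make the separate-then-classify pipeline concrete, the sketch below shows one possible realization in Python. The abstract does not specify the tooling, so Spleeter (a pretrained source-separation model), librosa MFCC features, and scikit-learn's random forest are illustrative assumptions, and all file paths and emotion labels are hypothetical placeholders.

```python
# A minimal sketch of a separate-then-classify pipeline, assuming Spleeter,
# librosa, and scikit-learn; not the paper's exact implementation.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from spleeter.separator import Separator

# 1. Blind source separation: split a mixed clip into a vocal stem and an
#    accompaniment (music) stem with a pretrained 2-stem model.
separator = Separator("spleeter:2stems")
separator.separate_to_file("mixed_clip.wav", "separated/")
# -> separated/mixed_clip/vocals.wav and separated/mixed_clip/accompaniment.wav

def clip_features(path: str) -> np.ndarray:
    """Summarize a clip with mean/std MFCCs (a stand-in for richer feature sets)."""
    y, sr = librosa.load(path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# 2. Emotion classification: train a random forest on features extracted from
#    the separated stems (placeholder training files and labels).
X_train = np.stack([clip_features(p) for p in ["sep1_vocals.wav", "sep2_vocals.wav"]])
y_train = ["happy", "sad"]  # placeholder emotion labels
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
pred = clf.predict(clip_features("separated/mixed_clip/vocals.wav")[None, :])
```

A deep neural network classifier could be substituted for the random forest in step 2; the design question the paper examines is whether performing step 1, separation before feature extraction, improves recognition on mixed audio at all.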