DOI: 10.1145/3395035.3425252 (ICMI '20 Companion, short paper)

The Influence of Blind Source Separation on Mixed Audio Speech and Music Emotion Recognition

Published: 27 December 2020

ABSTRACT

While both speech emotion recognition and music emotion recognition have been studied extensively in different communities, little research has gone into recognizing emotion from mixed audio sources, i.e., when both speech and music are present. However, many application scenarios, such as television content, require models that can extract emotions from mixed audio sources. This paper studies how mixed audio affects both speech and music emotion recognition using a random forest and a deep neural network model, and investigates whether blind source separation of the mixed signal beforehand is beneficial. We created a mixed audio dataset with 25% speech-music overlap and no contextual relationship between the two. We show that specialized models for speech-only or music-only audio achieved merely chance-level performance on mixed audio. For speech, above-chance performance was achieved when training on raw mixed audio, but optimal performance was achieved with audio that was blind-source-separated beforehand. Music emotion recognition models on mixed audio approach, or even surpass, performance on music-only audio, both with and without blind source separation. Our results are important for estimating emotion from real-world data, where individual speech and music tracks are often not available.
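The pipeline the abstract describes (separate a mixed recording into speech and music stems, then run each stem through its own emotion recognizer) can be sketched as follows. This is a minimal illustration only: it assumes Spleeter's pretrained two-stem model for the blind source separation step, and uses librosa MFCC statistics with pre-trained scikit-learn classifiers as stand-ins for feature extraction and the emotion models; the paper's actual feature sets, model configurations, and training data are not specified in this abstract.

```python
# Hypothetical sketch: blind source separation followed by per-stem emotion
# recognition. Assumes Spleeter's 2-stem vocals/accompaniment model, librosa,
# and two already-trained scikit-learn classifiers -- not the authors' exact setup.
import numpy as np
import librosa
from spleeter.separator import Separator


def mfcc_stats(waveform, sr):
    """Simple stand-in features: per-coefficient MFCC means and standard deviations."""
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])


def predict_emotions(mixed_wav_path, speech_model, music_model, sr=16000):
    # 1) Blind source separation: split the mixture into a 'vocals' (speech)
    #    stem and an 'accompaniment' (music) stem.
    separator = Separator("spleeter:2stems")
    waveform, _ = librosa.load(mixed_wav_path, sr=44100, mono=False)
    if waveform.ndim == 1:                      # Spleeter expects (samples, channels)
        waveform = np.stack([waveform, waveform], axis=0)
    stems = separator.separate(waveform.T)      # {'vocals': ..., 'accompaniment': ...}

    # 2) Per-stem feature extraction (mono, resampled) and classification.
    results = {}
    for stem_name, model in [("vocals", speech_model), ("accompaniment", music_model)]:
        mono = stems[stem_name].mean(axis=1)
        mono = librosa.resample(mono, orig_sr=44100, target_sr=sr)
        features = mfcc_stats(mono, sr).reshape(1, -1)
        results[stem_name] = model.predict(features)[0]
    return results  # e.g. {'vocals': 'angry', 'accompaniment': 'happy'}
```

Usage would pass a mixed clip plus two classifiers fitted beforehand on speech-only and music-only data, e.g. `predict_emotions("clip.wav", speech_rf, music_rf)`, where `speech_rf` and `music_rf` are hypothetical `RandomForestClassifier` instances.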


Published in

ICMI '20 Companion: Companion Publication of the 2020 International Conference on Multimodal Interaction
October 2020, 548 pages
ISBN: 9781450380027
DOI: 10.1145/3395035

Copyright © 2020 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States

Published: 27 December 2020
