The Influence of Blind Source Separation on Mixed Audio Speech and Music Emotion Recognition

ABSTRACT
While both speech emotion recognition and music emotion recognition have been studied extensively in their respective communities, little research has gone into recognizing emotion from mixed audio sources, i.e., audio in which both speech and music are present. However, many application scenarios, such as television content, require models that can extract emotion from such mixed sources. This paper studies how mixed audio affects both speech and music emotion recognition using random forest and deep neural network models, and investigates whether applying blind source separation to the mixed signal beforehand is beneficial. We created a mixed audio dataset with 25% speech-music overlap and no contextual relationship between the two. We show that specialized models for speech-only or music-only audio achieved merely chance-level performance on mixed audio. For speech, models trained on raw mixed audio performed above chance level, but the best performance was achieved when the audio was blind-source-separated beforehand. Music emotion recognition models on mixed audio achieved performance approaching, or even surpassing, their performance on music-only audio, both with and without blind source separation. Our results are important for estimating emotion from real-world data, where separate speech and music tracks are often unavailable.
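To make the separate-then-classify pipeline concrete, the sketch below shows one possible realization in Python. The abstract does not specify the tooling, so Spleeter (a pretrained source-separation model), librosa MFCC features, and scikit-learn's random forest are illustrative assumptions, and all file paths and emotion labels are hypothetical placeholders.

```python
# A minimal sketch of a separate-then-classify pipeline, assuming Spleeter,
# librosa, and scikit-learn; not the paper's exact implementation.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from spleeter.separator import Separator

# 1. Blind source separation: split a mixed clip into a vocal stem and an
#    accompaniment (music) stem with a pretrained 2-stem model.
separator = Separator("spleeter:2stems")
separator.separate_to_file("mixed_clip.wav", "separated/")
# -> separated/mixed_clip/vocals.wav and separated/mixed_clip/accompaniment.wav

def clip_features(path: str) -> np.ndarray:
    """Summarize a clip with mean/std MFCCs (a stand-in for richer feature sets)."""
    y, sr = librosa.load(path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# 2. Emotion classification: train a random forest on features extracted from
#    the separated stems (placeholder training files and labels).
X_train = np.stack([clip_features(p) for p in ["sep1_vocals.wav", "sep2_vocals.wav"]])
y_train = ["happy", "sad"]  # placeholder emotion labels
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
pred = clf.predict(clip_features("separated/mixed_clip/vocals.wav")[None, :])
```

A deep neural network classifier could be substituted for the random forest in step 2; the design question the paper examines is whether performing step 1, separation before feature extraction, improves recognition on mixed audio at all.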