Abstract
Emotion plays a significant role in human-computer interaction. Continuing improvements in speech technology have enabled many new and fascinating applications in human-computer interaction, context-aware computing, and computer-mediated communication. Such applications require reliable online recognition of the user's affect, yet most emotion recognition systems operate on speech from an isolated short sentence or word. We present a framework for online emotion recognition from speech. On the front end, a voice activity detection algorithm segments the input speech, and features are estimated to model long-term properties. Dimensional and continuous emotion recognition is then performed with a Relevance Units Machine (RUM). The advantages of the proposed system are: (i) computational efficiency at run time, so that regression outputs can be produced continuously in pseudo real time; (ii) sparsity superior to the well-known Support Vector Regression (SVR) and the Relevance Vector Machine for regression (RVR); and (iii) predictive performance comparable to SVR and RVR.
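The pipeline described above (voice activity detection, long-term feature estimation, sparse kernel regression) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the energy-threshold VAD, the mean/std features, the relevance units, and the weights are all hypothetical placeholders, and only the RUM prediction form (a weighted sum of kernels over a sparse set of units) is shown, not its training.

```python
import numpy as np

def energy_vad(signal, frame_len=256, threshold=0.01):
    """Crude energy-based voice activity detection (illustrative only)."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return energy > threshold  # boolean mask of voiced frames

def long_term_features(signal, mask, frame_len=256):
    """Summarise the voiced frames with long-term statistics
    (here just mean and std of frame energy, as a stand-in)."""
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = (frames[mask] ** 2).mean(axis=1)
    return np.array([energy.mean(), energy.std()])

def rum_predict(x, units, weights, gamma=1.0):
    """Sparse kernel prediction, y = sum_i w_i * k(x, u_i),
    over a small set of relevance units (training not shown)."""
    k = np.exp(-gamma * ((units - x) ** 2).sum(axis=1))
    return float(weights @ k)

# Toy end-to-end run on synthetic "audio": 8 silent frames, 8 loud frames.
rng = np.random.default_rng(0)
sig = np.concatenate([0.001 * rng.standard_normal(2048),   # near-silence
                      0.5 * rng.standard_normal(2048)])    # "speech"
mask = energy_vad(sig)
feat = long_term_features(sig, mask)
units = np.array([[0.1, 0.05], [0.3, 0.2]])  # hypothetical relevance units
weights = np.array([0.4, -0.2])              # hypothetical learned weights
arousal = rum_predict(feat, units, weights)  # one continuous dimension
```

Because the prediction is a sum over only a handful of units, each incoming segment costs a few kernel evaluations, which is what makes continuous pseudo-real-time output feasible.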
Acknowledgments
The research reported in this paper has been supported in part by the CSC-VUB scholarship grant [2009]3012, and the EU FP7 project ALIZ-E grant 248116.
Wang, F., Sahli, H., Gao, J. et al. Relevance units machine based dimensional and continuous speech emotion prediction. Multimed Tools Appl 74, 9983–10000 (2015). https://doi.org/10.1007/s11042-014-2319-1