
Relevance units machine based dimensional and continuous speech emotion prediction

Published in Multimedia Tools and Applications

Abstract

Emotion plays a significant role in human-computer interaction. Continuing improvements in speech technology have led to many new and fascinating applications in human-computer interaction, context-aware computing, and computer-mediated communication. Such applications require reliable online recognition of the user's affect, yet most emotion recognition systems operate on isolated short sentences or words rather than on continuous speech. We present a framework for online emotion recognition from speech. On the front end, a voice activity detection algorithm segments the input speech, and features modeling long-term properties are estimated for each segment. Dimensional and continuous emotion recognition is then performed via a Relevance Units Machine (RUM). The advantages of the proposed system are: (i) it is computationally efficient at run time (regression outputs can be produced continuously in pseudo real-time); (ii) RUM yields sparser models than the well-known Support Vector Regression (SVR) and Relevance Vector Machine for regression (RVR); and (iii) RUM's predictive performance is comparable to that of SVR and RVR.
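As a rough illustration of the pipeline described above, the sketch below segments incoming audio with a simple energy-based voice activity detector, pools frame-level descriptors into long-term segment features, and evaluates a RUM-style sparse kernel predictor, i.e. a weighted sum of kernel evaluations at a small set of learned relevance units. Every concrete choice here (the energy-threshold VAD, mean/std pooling, the Gaussian kernel, and all function names) is an illustrative assumption standing in for the paper's actual front end and trained model, not a reproduction of them.

```python
# Minimal sketch of the online pipeline, under assumed components:
# an energy-threshold VAD, mean/std pooling, and a Gaussian kernel.
import numpy as np

def energy_vad(frames, threshold=1e-3):
    """Hypothetical VAD: flag a frame as speech when its energy exceeds a threshold."""
    return np.sum(frames ** 2, axis=1) > threshold

def longterm_features(speech_frames):
    """Illustrative long-term statistics pooled over one speech segment."""
    return np.concatenate([speech_frames.mean(axis=0), speech_frames.std(axis=0)])

def rum_predict(x, units, weights, bias=0.0, gamma=1.0):
    """RUM-style sparse kernel regression: the output is a weighted sum of
    kernel evaluations at a small set of learned relevance units, so run-time
    cost grows with the number of units rather than the training-set size."""
    k = np.exp(-gamma * np.sum((units - x) ** 2, axis=1))  # Gaussian kernel
    return float(weights @ k + bias)

# Toy usage: one regression output (e.g. valence) per detected speech segment.
rng = np.random.default_rng(0)
frames = rng.normal(scale=0.1, size=(500, 13))   # stand-in frame descriptors
units = rng.normal(size=(10, 26))                # learned relevance units
weights = rng.normal(size=10)                    # learned unit weights
speech = frames[energy_vad(frames)]
if speech.size:
    print(rum_predict(longterm_features(speech), units, weights))
```

Because a prediction only touches the small set of learned units, producing outputs continuously in pseudo real-time reduces to one VAD pass plus a short kernel expansion per segment, consistent with the run-time efficiency and sparsity advantages claimed above.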



Acknowledgments

The research reported in this paper has been supported in part by the CSC-VUB scholarship grant [2009]3012, and the EU FP7 project ALIZ-E grant 248116.

Author information

Correspondence to Fengna Wang.


Cite this article

Wang, F., Sahli, H., Gao, J. et al. Relevance units machine based dimensional and continuous speech emotion prediction. Multimed Tools Appl 74, 9983–10000 (2015). https://doi.org/10.1007/s11042-014-2319-1
