Audiovisual emotion recognition using ANOVA feature selection method and multi-classifier neural networks

  • Original Article
  • Published:
Neural Computing and Applications

Abstract

To make human–computer interaction more natural and friendly, computers must be able to understand human affective states the way humans do. People express their feelings through several modalities, such as the face, body gestures, and speech. In this study, we simulate human perception of emotion by combining emotion-related information from facial expressions and speech. The speech emotion recognition system is based on prosody features and mel-frequency cepstral coefficients (a representation of the short-term power spectrum of a sound); the facial expression recognition system is based on the integrated time motion image and the quantized image matrix, which can be seen as extensions of temporal templates. Experimental results showed that using hybrid features and decision-level fusion improves on the unimodal systems, raising the recognition rate by about 15 % relative to the speech-only system and by about 30 % relative to the facial expression system. The proposed multi-classifier system, an improved hybrid system, further increases the recognition rate: by up to 7.5 % over hybrid features with decision-level fusion using an RBF network, by up to 22.7 % over the speech-based system, and by up to 38 % over the facial expression-based system.
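The abstract names three concrete ingredients: MFCC-based speech features, ANOVA feature selection, and decision-level fusion of the speech and face classifiers. The sketch below illustrates that pipeline; it is not the authors' implementation. It assumes scikit-learn and librosa, substitutes synthetic feature matrices for the real MFCC/prosody and ITMI/QIM vectors, and uses MLP classifiers as stand-ins for the paper's RBF and multi-classifier networks; the selected feature count k and fusion weight w are illustrative choices.

import numpy as np
import librosa
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neural_network import MLPClassifier

def speech_features(path, n_mfcc=13):
    """Fixed-length speech vector: per-coefficient mean and std of the MFCCs."""
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Synthetic stand-ins for the real feature matrices (one row per utterance/clip).
rng = np.random.default_rng(0)
n = 120
X_speech = rng.normal(size=(n, 26))   # would come from speech_features(...)
X_face = rng.normal(size=(n, 50))     # would come from ITMI/QIM face features
y = rng.integers(0, 6, size=n)        # six basic emotion labels

# ANOVA feature selection: keep the k features whose class means differ most
# according to a univariate F-test, as the title's "ANOVA feature selection
# method" suggests.
selector = SelectKBest(f_classif, k=20)
X_speech_sel = selector.fit_transform(X_speech, y)

# One classifier per modality (hypothetical MLPs, not the paper's networks).
speech_clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
face_clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
speech_clf.fit(X_speech_sel, y)
face_clf.fit(X_face, y)

# Decision-level fusion: combine the per-class posteriors of the two unimodal
# classifiers, here by a simple weighted average.
def fuse(p_speech, p_face, w=0.6):
    return w * p_speech + (1.0 - w) * p_face

probs = fuse(speech_clf.predict_proba(X_speech_sel), face_clf.predict_proba(X_face))
predictions = speech_clf.classes_[probs.argmax(axis=1)]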



Author information


Corresponding author

Correspondence to Davood Gharavian.


About this article

Cite this article

Bejani, M., Gharavian, D. & Charkari, N.M. Audiovisual emotion recognition using ANOVA feature selection method and multi-classifier neural networks. Neural Comput & Applic 24, 399–412 (2014). https://doi.org/10.1007/s00521-012-1228-3

