
Recognition of emotions from video using acoustic and facial features

  • Original Paper
  • Published in: Signal, Image and Video Processing

Abstract

In this paper, acoustic and facial features extracted from video are explored for recognizing emotions. The temporal variation of the gray values of pixels within the eye and mouth regions is used as a feature to capture emotion-specific knowledge from facial expressions. Acoustic features representing spectral and prosodic information are explored for recognizing emotions from the speech signal. Autoassociative neural network (AANN) models are used to capture the emotion-specific information present in the acoustic and facial features. The basic objective of this work is to examine how well the proposed acoustic and facial features capture emotion-specific information. Further, the correlations among the feature sets are analyzed by combining the evidence at different levels. The recognition performance of the systems developed using acoustic and facial features is observed to be 85.71 and 88.14%, respectively. Combining the evidence of the models developed using acoustic and facial features improves the recognition performance to 93.62%. The performance of the emotion recognition systems developed using neural network models is compared with that of hidden Markov models, Gaussian mixture models and support vector machine models. The proposed features and models are evaluated on a real-life emotional database, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database, recently collected at the University of Southern California.
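The abstract describes a pipeline in which one AANN is trained per emotion on each feature stream, and the per-model evidence is then combined across modalities. The exact network configuration, feature dimensions and fusion rule are in the full text; the snippet below is only a minimal sketch of that pipeline. It assumes grey-scale face frames are already available as a NumPy array, treats each trained AANN as an opaque object with a hypothetical reconstruct() method, and uses an assumed exponential error-to-confidence mapping with a weighted sum for score-level fusion; none of these specifics come from the paper itself.

```python
import numpy as np


def grey_delta_features(frames, region):
    """Temporal variation of grey values inside one facial region.

    frames : (T, H, W) array of grey-scale face images.
    region : (top, bottom, left, right) bounding box around the eyes or
             the mouth.  How the paper localises these regions is not
             reproduced here; the box is assumed to be given.
    Returns a (T - 1, pixels) array of frame-to-frame differences.
    """
    top, bottom, left, right = region
    patch = frames[:, top:bottom, left:right].astype(np.float64)
    return np.diff(patch, axis=0).reshape(patch.shape[0] - 1, -1)


def reconstruction_error(model, features):
    """Mean squared reconstruction error under one emotion-specific
    AANN.  `model` is any object exposing a hypothetical
    reconstruct(features) method standing in for a trained network."""
    return float(np.mean((features - model.reconstruct(features)) ** 2))


def confidences(models, features):
    """Map per-emotion reconstruction errors to normalised confidence
    scores (smaller error -> higher confidence).  The exponential
    mapping is an assumption, not the paper's formula."""
    errors = np.array([reconstruction_error(m, features) for m in models])
    scores = np.exp(-errors)
    return scores / scores.sum()


def fuse_and_decide(acoustic_conf, facial_conf, weight=0.5):
    """Weighted score-level fusion of the two modalities, followed by a
    max-confidence decision.  `weight` is a hypothetical tuning knob."""
    combined = (weight * np.asarray(acoustic_conf)
                + (1.0 - weight) * np.asarray(facial_conf))
    return int(np.argmax(combined))
```

Under this reading, each emotion's AANN acts as a one-class model of its own feature distribution: the model that reconstructs a test segment with the least error claims it, and score-level fusion lets the two modalities vote with real-valued confidences rather than hard labels.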



Author information

Correspondence to K. Sreenivasa Rao.


About this article

Cite this article

Rao, K.S., Koolagudi, S.G. Recognition of emotions from video using acoustic and facial features. SIViP 9, 1029–1045 (2015). https://doi.org/10.1007/s11760-013-0522-6

