
Robust emotion recognition by spectro-temporal modulation statistic features

  • Original Research
  • Journal of Ambient Intelligence and Humanized Computing

Abstract

Most speech emotion recognition studies consider only clean speech. In this study, statistics of joint spectro-temporal modulation features are extracted from an auditory perceptual model and used to recognize the emotional status of speech under noisy conditions. Speech samples were taken from the Berlin Emotional Speech database and corrupted with white and babble noise at various SNR levels. A clean-train/noisy-test scenario is investigated to simulate practical conditions with unknown noise sources. Simulations reveal redundancy among the proposed spectro-temporal modulation features, so dimensionality reduction is also examined. Under noisy conditions, the proposed modulation features achieve higher emotion recognition rates than (1) conventional mel-frequency cepstral coefficients combined with prosodic features and (2) the official acoustic features adopted in the INTERSPEECH 2009 Emotion Challenge. Adding the modulation features to the INTERSPEECH feature set increased recognition rates by approximately 7% across all tested SNR conditions (20 dB to 0 dB).
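The pipeline the abstract describes (mix noise into a clean utterance at a target SNR, extract temporal modulation energies, summarize them with statistics, then train on clean and test on noisy data) can be sketched compactly. The NumPy sketch below is a minimal illustration only: it substitutes a plain magnitude spectrogram for the paper's auditory perceptual model, and the function names, window sizes, and the `n_mod` parameter are assumptions, not the authors' implementation.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture clean + noise has the requested SNR in dB."""
    noise = np.resize(noise, clean.shape)                 # loop/crop noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

def modulation_statistics(x, fs, win_s=0.025, hop_s=0.010, n_mod=64):
    """Magnitude spectrogram, then an FFT along time within each frequency
    channel (temporal modulation energies), summarized by the mean and
    standard deviation over frequency channels at each modulation rate."""
    n_win, n_hop = int(win_s * fs), int(hop_s * fs)
    frames = np.lib.stride_tricks.sliding_window_view(x, n_win)[::n_hop]
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_win), axis=1))  # (time, freq)
    mod = np.abs(np.fft.rfft(spec, n=n_mod, axis=0))                # (mod rate, freq)
    return np.concatenate([mod.mean(axis=1), mod.std(axis=1)])      # statistic features

# Clean-train/noisy-test style corruption at 10 dB SNR; random signals stand in
# for a Berlin-database utterance and a white/babble noise sample.
rng = np.random.default_rng(0)
fs = 16000
clean = rng.standard_normal(2 * fs)
noise = rng.standard_normal(2 * fs)
features = modulation_statistics(add_noise_at_snr(clean, noise, snr_db=10), fs)
print(features.shape)  # one fixed-length feature vector per utterance
```

In the paper itself, the spectrogram stage is replaced by the auditory perceptual model's joint spectro-temporal modulation analysis, and the resulting statistic features feed an emotion classifier.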



Acknowledgments

This study was supported in part by the National Science Council, Taiwan, under Grant No. NSC 99-2220-E-009-056.

Author information

Corresponding author

Correspondence to Tai-Shih Chi.


About this article

Cite this article

Chi, TS., Yeh, LY. & Hsu, CC. Robust emotion recognition by spectro-temporal modulation statistic features. J Ambient Intell Human Comput 3, 47–60 (2012). https://doi.org/10.1007/s12652-011-0088-5
