
A statistical feature extraction for deep speech emotion recognition in a bilingual scenario

Part of the collection: Deep-Patterns Emotion Recognition in the Wild

Published in Multimedia Tools and Applications

Abstract

To date, most work on Speech Emotion Recognition (SER) has been oriented towards a monolingual approach. The present study extends monolingual SER to a bilingual setting. However, to build an emotion recognition system that generalizes across languages, the choice of feature extraction and classification methods remains an open question. To address this issue, a promising method is proposed and evaluated. On the one hand, the proposed method relies on a statistical parameterization framework that represents each speech utterance as a fixed-length vector. On the other hand, we propose a deep learning approach that combines three convolutional neural network architectures. On this basis, monolingual and bilingual emotion recognition experiments were conducted using the English RAVDESS and Italian EMOVO corpora. The experiments demonstrate the effectiveness of the proposed SER model against state-of-the-art monolingual methods, with average accuracies of 87.08% and 83.90% on the RAVDESS and EMOVO datasets, respectively.
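To make the two components of the proposed pipeline concrete, the sketches below illustrate them in Python; they are minimal illustrations under stated assumptions, not the authors' exact configuration. The first shows the statistical parameterization idea: frame-level acoustic features are summarized by utterance-level statistics, so that every recording, whatever its duration, maps to a fixed-length vector. The use of MFCCs and of these particular statistics is an assumption.

```python
# Minimal sketch: map a variable-length utterance to a fixed-length vector
# by summarizing frame-level MFCC trajectories with simple statistics.
# MFCCs and the chosen statistics are assumptions, not the paper's recipe.
import numpy as np
import librosa

def utterance_vector(path, sr=16000, n_mfcc=13):
    """Return a fixed-length statistical summary of one utterance."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    # Per-coefficient statistics over time give a duration-independent vector.
    stats = [mfcc.mean(axis=1), mfcc.std(axis=1),
             mfcc.min(axis=1), mfcc.max(axis=1)]
    return np.concatenate(stats)  # shape: (4 * n_mfcc,)
```

The second sketches a model that combines three convolutional branches. How the three architectures are combined is not detailed in the abstract; here the combination is assumed to be concatenation of the branches' pooled outputs before a softmax classifier, with illustrative layer sizes.

```python
# Hypothetical Keras sketch of combining three 1-D CNN branches over the
# fixed-length feature vector; kernel sizes differentiate the branches.
import tensorflow as tf
from tensorflow.keras import layers, models

def cnn_branch(inputs, kernel_size):
    """One 1-D convolutional branch ending in a pooled embedding."""
    x = layers.Conv1D(32, kernel_size, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Conv1D(64, kernel_size, padding="same", activation="relu")(x)
    return layers.GlobalAveragePooling1D()(x)

def build_model(input_dim, n_classes=8):
    inputs = tf.keras.Input(shape=(input_dim, 1))
    merged = layers.concatenate([cnn_branch(inputs, k) for k in (3, 5, 7)])
    merged = layers.Dropout(0.5)(merged)  # regularization against overfitting
    outputs = layers.Dense(n_classes, activation="softmax")(merged)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

A vector produced by utterance_vector, reshaped to (length, 1), can be fed directly to a model built this way.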


Data Availability

The RAVDESS and EMOVO data that support the findings of this study are publicly available at https://zenodo.org/record/1188976#.YwSinnbMK3A and http://voice.fub.it/activities/corpora/emovo/index.html, respectively.
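For readers downloading the corpora, RAVDESS encodes utterance metadata in hyphen-separated filename fields (e.g. 03-01-06-01-02-01-12.wav), with the third field giving the emotion code. A small helper along these lines can turn the unpacked archive into (path, label) pairs; the directory layout is an assumption about how the archive was extracted.

```python
# Hypothetical loader: collect (wav_path, emotion_label) pairs from an
# unpacked RAVDESS download. The emotion codes follow the dataset's
# documented filename convention.
from pathlib import Path

RAVDESS_EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy",
                    "04": "sad", "05": "angry", "06": "fearful",
                    "07": "disgust", "08": "surprised"}

def ravdess_items(root):
    """Yield (wav_path, emotion_label) for every utterance under root."""
    for wav in sorted(Path(root).rglob("*.wav")):
        code = wav.stem.split("-")[2]  # third hyphen-separated field
        yield wav, RAVDESS_EMOTIONS[code]
```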


Acknowledgements

The research presented in this paper was supported by the Ministry of Higher Education, Scientific Research and Innovation, the Digital Development Agency (ADA) and the CNRST of Morocco (Alkhawarizmi/2020/01).

Author information

Corresponding author

Correspondence to Sara Sekkate.

Ethics declarations

Conflict of interest

The authors certify that they have no affiliations with or involvement in any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this manuscript.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sekkate, S., Khalil, M. & Adib, A. A statistical feature extraction for deep speech emotion recognition in a bilingual scenario. Multimed Tools Appl 82, 11443–11460 (2023). https://doi.org/10.1007/s11042-022-14051-z

