Abstract
The COVID-19 pandemic has affected many aspects of people's lives, including the economy, tourism and, in particular, healthcare, where many people have suffered from psychological and emotional disorders. Speech Emotion Recognition (SER) can therefore help medical teams understand the emotional state of their patients. The central contribution of this research is the creation of new features, called Stationary Mel Frequency Cepstral Coefficients (SMFCC) and Discrete Mel Frequency Cepstral Coefficients (DMFCC), obtained by combining the Multilevel Wavelet Transform (MWT) with conventional MFCC features. The proposed method was evaluated under several configurations: within/cross-language, speaker dependency and gender dependency. Recognition rates of \(91.4\%\), \(74.4\%\) and \(80.8\%\) were reached for the EMO-DB (German), RAVDESS (English) and EMOVO (Italian) target databases, respectively, in speaker-dependent (SD) experiments covering both genders (female and male). A conclusive performance matrix is also provided to give additional insight into the model's behaviour across the various experiments. The experimental results show that the proposed SER system outperforms previous SER studies.
Data availability
Not applicable.
Acknowledgements
This work was supported by the Ministry of Higher Education, Scientific Research and Innovation, the Digital Development Agency (DDA) and the CNRST of Morocco (Alkhawarizmi/2020/01).
Author information
Authors and Affiliations
Corresponding authors
Ethics declarations
Conflict of interest
The authors have no financial or proprietary interests in any material discussed in this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix 1: Mathematical representation
The overall mathematical representation adopted in the current research is summarized in the following Table 13.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chakhtouna, A., Sekkate, S. & Adib, A. Speaker and gender dependencies in within/cross linguistic Speech Emotion Recognition. Int J Speech Technol 26, 609–625 (2023). https://doi.org/10.1007/s10772-023-10038-9