Speaker identification in stressful talking environments based on convolutional neural network

Published in International Journal of Speech Technology (2021).

Abstract

Speaker identification accuracy in stressful talking environments is markedly lower than in neutral environments. This research employs and assesses a modern classifier to improve the degraded text-independent speaker identification accuracy in stressful environments. The classifier is a supervised Convolutional Neural Network (CNN) evaluated on three distinct speech databases: an Arabic Emirati-accented database, the English "Speech Under Simulated and Actual Stress" (SUSAS) database, and the English "Ryerson Audio-Visual Database of Emotional Speech and Song" (RAVDESS), using the concatenation of Mel-Frequency Cepstral Coefficients (MFCCs), MFCC deltas, and MFCC delta-deltas as the extracted features. On the Emirati-accented corpus, the CNN surpasses all the shallow classifiers, yielding an average accuracy of 81.6% compared to 53.4%, 47.8%, 43.1%, 31.8%, and 19.5% for the Support Vector Machine, K-Nearest Neighbor, Multi-Layer Perceptron, Radial Basis Function, and Naïve Bayes classifiers, respectively. The CNN is likewise superior to the shallow classifiers on the SUSAS and RAVDESS databases. The reported speaker identification accuracies were obtained after tuning the CNN with the Grid Search hyperparameter optimization algorithm.
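To make the pipeline described above concrete, the following is a minimal Python sketch of how the concatenated MFCC, delta, and delta-delta features might be extracted and how a small CNN could be swept over a hyperparameter grid. It is an illustration only, not the authors' implementation: the sampling rate, coefficient count, network architecture, number of speakers, frame count, and grid values are all assumptions.

```python
# Illustrative sketch of MFCC + delta + delta-delta extraction and a
# Grid Search over CNN hyperparameters. All settings are assumptions,
# not the paper's configuration.
import numpy as np
import librosa
import tensorflow as tf
from sklearn.model_selection import ParameterGrid

def extract_features(wav_path, sr=16000, n_mfcc=13):
    """Stack MFCCs with their first and second derivatives: (3*n_mfcc, frames)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)            # first-order derivative
    delta2 = librosa.feature.delta(mfcc, order=2)  # second-order derivative
    return np.concatenate([mfcc, delta, delta2], axis=0)

def build_cnn(n_classes, input_shape, filters=32, dropout=0.3):
    """A small 2-D CNN over the feature map with a softmax output per speaker."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(filters, 3, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(filters * 2, 3, activation="relu", padding="same"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

# Exhaustive sweep over a hypothetical hyperparameter grid.
grid = ParameterGrid({"filters": [16, 32], "dropout": [0.2, 0.4]})
for params in grid:
    # 39 = 3 * 13 coefficients; 200 frames per utterance is an assumption.
    model = build_cnn(n_classes=30, input_shape=(39, 200, 1), **params)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=20)
```

In practice, each utterance's feature matrix would be padded or cropped to a fixed number of frames before training, and the configuration with the highest validation accuracy across the grid would be retained.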


Data availability

In this work, the authors used three speech corpora, two of which (SUSAS and RAVDESS) are public and one of which (Emirati-accented) is private. The two public databases are described in subsections 4.2 and 4.3, while consent was obtained from the adult participants, as well as from the minors' parents, before the experiments on the private dataset were performed (subsection 4.1).


Funding

Ismail Shahin, Ali Bou Nassif, and Noor Hindawi would like to express their gratitude to the University of Sharjah for its support through the competitive research project entitled "Emirati-Accented Speaker and Emotion Recognition Based on Deep Neural Network," No. 19020403139.

Author information

Corresponding author

Correspondence to Ismail Shahin.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Shahin, I., Nassif, A.B. & Hindawi, N. Speaker identification in stressful talking environments based on convolutional neural network. Int J Speech Technol 24, 1055–1066 (2021). https://doi.org/10.1007/s10772-021-09869-1

