Speaker identification in stressful talking environments based on convolutional neural network

Published in International Journal of Speech Technology (2021).

Abstract

Speaker identification accuracy in stressful talking environments is markedly lower than in neutral environments. This research employs and assesses a modern classifier to improve the degraded text-independent speaker identification accuracy in stressful environments. The classifier is a supervised Convolutional Neural Network (CNN) evaluated on three distinct speech databases: an Arabic Emirati-accented database, the English "Speech Under Simulated and Actual Stress" (SUSAS) database, and the English "Ryerson Audio-Visual Database of Emotional Speech and Song" (RAVDESS), using the concatenation of Mel-Frequency Cepstral Coefficients (MFCCs), MFCC deltas, and MFCC delta-deltas as the extracted features. On the Emirati-accented corpus, the CNN surpasses all the shallow classifiers, yielding an average accuracy of 81.6% compared to 53.4%, 47.8%, 43.1%, 31.8%, and 19.5% for the Support Vector Machine, K-Nearest Neighbor, Multi-Layer Perceptron, Radial Basis Function, and Naïve Bayes classifiers, respectively. The CNN is likewise superior to the shallow classifiers on the SUSAS and RAVDESS databases. The reported speaker identification accuracies were obtained after tuning the CNN with the Grid Search hyperparameter optimization algorithm.
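To make the pipeline described above concrete, the following is a minimal Python sketch of how the concatenated MFCC, delta, and delta-delta features might be extracted and how a small CNN could be swept over a hyperparameter grid. It is an illustration only, not the authors' implementation: the sampling rate, coefficient count, network architecture, number of speakers, frame count, and grid values are all assumptions.

```python
# Illustrative sketch of MFCC + delta + delta-delta extraction and a
# Grid Search over CNN hyperparameters. All settings are assumptions,
# not the paper's configuration.
import numpy as np
import librosa
import tensorflow as tf
from sklearn.model_selection import ParameterGrid

def extract_features(wav_path, sr=16000, n_mfcc=13):
    """Stack MFCCs with their first and second derivatives: (3*n_mfcc, frames)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)            # first-order derivative
    delta2 = librosa.feature.delta(mfcc, order=2)  # second-order derivative
    return np.concatenate([mfcc, delta, delta2], axis=0)

def build_cnn(n_classes, input_shape, filters=32, dropout=0.3):
    """A small 2-D CNN over the feature map with a softmax output per speaker."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(filters, 3, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(filters * 2, 3, activation="relu", padding="same"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dropout(dropout),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])

# Exhaustive sweep over a hypothetical hyperparameter grid.
grid = ParameterGrid({"filters": [16, 32], "dropout": [0.2, 0.4]})
for params in grid:
    # 39 = 3 * 13 coefficients; 200 frames per utterance is an assumption.
    model = build_cnn(n_classes=30, input_shape=(39, 200, 1), **params)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=20)
```

In practice, each utterance's feature matrix would be padded or cropped to a fixed number of frames before training, and the configuration with the highest validation accuracy across the grid would be retained.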


Data availability

In this work, the authors used three speech corpora, two of which (SUSAS and RAVDESS) are public and one of which (Emirati-accented) is private. The two public databases are described in subsections 4.2 and 4.3, while consent was obtained from the adult participants, as well as from the minors' parents, before the experiments on the private dataset were performed (subsection 4.1).


Funding

Ismail Shahin, Ali Bou Nassif, and Noor Hindawi would like to express their gratitude to the University of Sharjah for its support through the competitive research project entitled "Emirati-Accented Speaker and Emotion Recognition Based on Deep Neural Network," No. 19020403139.

Author information

Corresponding author

Correspondence to Ismail Shahin.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Shahin, I., Nassif, A.B. & Hindawi, N. Speaker identification in stressful talking environments based on convolutional neural network. Int J Speech Technol 24, 1055–1066 (2021). https://doi.org/10.1007/s10772-021-09869-1

