Abstract
Speech signals used for training and testing may differ because of environmental mismatch, channel variation, or physiological changes in the speaker. These factors significantly degrade the performance of a speaker identification system. In this paper, we propose a robust speaker identification system for real-world speech based on a deep learning architecture built on a convolutional neural network (CNN). Mel-frequency cepstral coefficient (MFCC) features are augmented with chroma energy normalized statistics (CENS) features to train the CNN model. Experiments on the VoxCeleb1 dataset show that the proposed method achieves better identification accuracy than existing speaker identification methods.
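The feature-fusion step described above, stacking MFCC and CENS features into a single matrix before feeding it to the CNN, can be sketched as follows. This is a minimal NumPy illustration using synthetic feature matrices in place of real MFCC/CENS extraction (which in practice would come from an audio library); the dimensions (13 MFCCs, 12 CENS bins) and the function name `fuse_features` are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def fuse_features(mfcc, cens):
    """Stack MFCC and CENS features along the feature axis.

    mfcc : (n_mfcc, T) array of mel-frequency cepstral coefficients
    cens : (12, T)     array of chroma energy normalized statistics
    Returns an (n_mfcc + 12, T) matrix usable as a single-channel CNN input.
    """
    if mfcc.shape[1] != cens.shape[1]:        # align frame counts if they differ
        t = min(mfcc.shape[1], cens.shape[1])
        mfcc, cens = mfcc[:, :t], cens[:, :t]
    return np.vstack([mfcc, cens])

# Synthetic stand-ins for one utterance: 13 MFCCs and 12 CENS bins over 100 frames.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((13, 100))
cens = rng.random((12, 100))

fused = fuse_features(mfcc, cens)
print(fused.shape)  # (25, 100)
```

The fused matrix is treated like a single-channel "image", which is what makes a CNN a natural choice for the classifier.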
Cite this article
Abraham, J.V.T., Khan, A.N. & Shahina, A. A deep learning approach for robust speaker identification using chroma energy normalized statistics and mel frequency cepstral coefficients. Int J Speech Technol 26, 579–587 (2023). https://doi.org/10.1007/s10772-021-09888-y