A deep learning approach for robust speaker identification using chroma energy normalized statistics and mel frequency cepstral coefficients

International Journal of Speech Technology

Abstract

Speech signals used for training and testing may differ because of environmental mismatch, variation in the recording channel, or physiological changes in the speaker, and the performance of a speaker identification system drops significantly under these conditions. In this paper, we propose a robust speaker identification system suitable for real-world speech signals, built on a deep learning architecture based on a convolutional neural network (CNN). Mel frequency cepstral coefficient (MFCC) features are augmented with chroma energy normalized statistics (CENS) features to train the CNN model. Experiments on the VoxCeleb1 dataset show that the proposed method achieves better identification accuracy than existing speaker identification methods.
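As a rough illustration of the pipeline the abstract describes (frame-level MFCCs stacked with CENS features, fed to a CNN classifier), the following Python sketch uses librosa for feature extraction and Keras for the network. The helper names (extract_features, build_cnn), the feature dimensions, and the layer sizes are illustrative assumptions, not the exact configuration reported in the paper.

    import numpy as np
    import librosa
    from tensorflow.keras import layers, models

    def extract_features(path, sr=16000, n_mfcc=20):
        """Stack frame-level MFCC and CENS features for one utterance."""
        y, sr = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, T)
        cens = librosa.feature.chroma_cens(y=y, sr=sr)          # (12, T)
        t = min(mfcc.shape[1], cens.shape[1])                   # align frame counts
        return np.vstack([mfcc[:, :t], cens[:, :t]])            # (n_mfcc + 12, T)

    def build_cnn(n_feats, n_frames, n_speakers):
        """Small 2-D CNN over the stacked feature map; sizes are placeholders."""
        model = models.Sequential([
            layers.Input(shape=(n_feats, n_frames, 1)),
            layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
            layers.MaxPooling2D((2, 2)),
            layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
            layers.MaxPooling2D((2, 2)),
            layers.Flatten(),
            layers.Dense(256, activation="relu"),
            layers.Dropout(0.5),
            layers.Dense(n_speakers, activation="softmax"),  # one class per speaker
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

In practice, each utterance would be cropped or padded to a fixed number of frames so that batches share a common input shape.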


References

  • Abraham, J. V. T., Shahina, A., & Khan, A. N. (2019). Enhancing noisy speech using WEMD. International Journal of Recent Technology and Engineering, 7, 705–708.

  • Alias, F., Carrié, J. C., & Sevillano, X. (2016). A review of physical and perceptual feature extraction techniques for speech, music and environmental sounds. Applied Sciences, 6, 143.

  • Arsikere, H., An, H., & Alwan, A. (2014). Speaker recognition via fusion of subglottal features and MFCCs. In INTERSPEECH 2014.

  • Bartsch, M., & Wakefield, G. (2005). Audio thumbnailing of popular music using chroma-based representations. IEEE Transactions on Multimedia, 7, 96–104.

  • Bell, P., Gales, M. J. F., Hain, T., Kilgour, J., Lanchantin, P., Liu, X., McParland, A., Renals, S., Saz, O., Wester, M., & Woodland, P. C. (2015). The MGB challenge: Evaluating multi-genre broadcast media recognition. In IEEE workshop on automatic speech recognition and understanding (ASRU) (pp. 687–693).

  • Campbell, J., Reynolds, D., & Dunn, R. (2003). Fusing high- and low-level features for speaker recognition. In INTERSPEECH (pp. 2665–2668).

  • Campbell, W., Campbell, J., Reynolds, D., Singer, E., & Torres-Carrasquillo, P. (2006). Support vector machines for speaker and language recognition. Computer Speech & Language, 20, 210–229.

  • Chang, J., & Wang, D. (2017). Robust speaker recognition based on DNN/i-vectors and speech separation. In IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5415–5419).

  • Chowdhury, A., & Ross, A. (2020). Fusing MFCC and LPC features using 1D triplet CNN for speaker recognition in severely degraded audio signals. IEEE Transactions on Information Forensics and Security, 15, 1616–1629.

  • Convolutional Neural Networks. (2018). https://www.datasciencecentral.com/profiles/blogs/understanding-neural-networks-from-neuron-to-rnn-cnn-and-deep.

  • Dehak, N., Dehak, R., Kenny, P., Brummer, N., Ouellet, P., & Dumouchel, P. (2009). Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification. In Proceedings of the annual conference of the international speech communication association, INTERSPEECH (vol. 1, pp. 1559–1562).

  • El-Fattah, M. A. A., Dessouky, M. I., Abbas, A. M., Diab, S. M., El-Rabaie, E.-S.M., Al-Nuaimy, W., et al. (2014). Speech enhancement with an adaptive Wiener filter. International Journal of Speech Technology, 17(1), 53–64.

  • Friedland, G., Vinyals, O., Huang, C., & Müller, C. (2009). Fusing short term and long term features for improved speaker diarization. In IEEE international conference on acoustics, speech and signal processing (pp. 4077–4080).

  • Garofolo, J., Lamel, L., Fisher, W., Fiscus, J., & Pallett, D. (1993). DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1–1.1. NASA STI/Recon Technical Report, 93, 27403.

  • Guo, J., Yang, R., Arsikere, H., & Alwan, A. (2017). Robust speaker identification via fusion of subglottal resonances and cepstral features. The Journal of the Acoustical Society of America, 141(4), EL420–EL426.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 770–778).

  • He, J., Liu, L., & Palm, G. (1997). A new codebook training algorithm for VQ-based speaker recognition. IEEE International Conference on Acoustics, Speech, and Signal Processing, 2, 1091–1094.

  • Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., Peskin, B., Pfau, T., Shriberg, E., Stolcke, A., & Wooters, C. (2003). The ICSI meeting corpus. In IEEE international conference on acoustics, speech, and signal processing (vol. 1).

  • Kanagasundaram, A., Vogt, R., Dean, D., Sridharan, S., & Mason, M. (2011). i-vector based speaker recognition on short utterances. In Proceedings of the annual conference of the international speech communication association, INTERSPEECH.

  • Kinnunen, T., & Li, H. (2010). An overview of text-independent speaker recognition: From features to supervectors. Speech Communication, 52, 12–40.

  • Lawson, A., Vabishchevich, P., Huggins, M., Ardis, P., Battles, B., & Stauffer, A. (2011). Survey and evaluation of acoustic features for speaker recognition. In IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5444–5447).

  • Lei, Y., Burget, L., & Scheffer, N. (2013). A noise robust i-vector extractor using vector taylor series for speaker recognition. In IEEE international conference on acoustics, speech and signal processing (pp. 6788–6791).

  • McCool, C., & Marcel, S. (2009). MOBIO database for the ICPR 2010 face and speech competition. Idiap-Com-02-2009. Idiap.

  • McCowan, I., Carletta, J., Kraaij, W., Ashby, S., Bourban, S., Flynn, M., Guillemot, M., Hain, T., Kadlec, J., Karaiskos, V., Kronenthal, M., Lathoud, G., Lincoln, M., Lisowska, A., Post, W., Reidsma, D., & Wellner, P. (2005). The AMI meeting corpus. In International conference on methods and techniques in behavioral research.

  • Millar, J. B., Vonwiller, J. P., Harrington, J. M., & Dermody, P. J. (1994). The Australian National Database of Spoken Language. In Proceedings of IEEE international conference on acoustics, speech and signal processing (vol. 1, pp. I/97–I/100).

  • Morrison, G. S., & Enzinger, E. (2016). Multi-laboratory evaluation of forensic voice comparison systems under conditions reflecting those of a real forensic case (forensic_eval_01): Introduction. Speech Communication, 85, 119–126.

  • Müller, M., Kurth, F., & Clausen, M. (2005). Audio matching via chroma-based statistical features. In 6th International conference on music information retrieval, ISMIR (pp. 288–295).

  • Nagrani, A., Chung, J. S., & Zisserman, A. (2017). VoxCeleb: a large-scale speaker identification dataset. In INTERSPEECH.

  • Petrovska-Delacrétaz, D., Hennebert, J., Melin, H., & Genoud, D. (2000). POLYCOST: A telephone-speech database for speaker recognition. Speech Communication, 31, 265–270.

  • Prince, S. J. D., & Elder, J. H. (2007). Probabilistic linear discriminant analysis for inferences about identity. In IEEE 11th international conference on computer vision (pp. 1–8).

  • Reynolds, D., & Rose, R. (1995). Robust text-independent speaker identification using Gaussian Mixture speaker models. IEEE Transactions on Speech and Audio Processing, 3, 72–83.

  • Richardson, F., Reynolds, D., & Dehak, N. (2015). Deep neural network approaches to speaker and language recognition. IEEE Signal Processing Letters, 22(10), 1671–1675.

  • Sell, G., & Clark, P. (2014). Music tonality features for speech/music discrimination. In IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 2489–2493).

  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

  • Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). X-vectors: Robust DNN embeddings for speaker recognition. In IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 5329–5333).

  • Szegedy, C., Ioffe, S., & Vanhoucke, V. (2016). Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261.

  • Tavares, R., & Coelho, R. (2016). Speech enhancement with nonstationary acoustic noise detection in time domain. IEEE Signal Processing Letters, 23(1), 6–10.

  • Torfi, A., Dawson, J., & Nasrabadi, N. M. (2018). Text-independent speaker verification using 3D convolutional neural networks. In IEEE international conference on multimedia and expo (ICME) (pp. 1–6).

  • Variani, E., Lei, X., McDermott, E., Moreno, I. L., & Gonzalez-Dominguez, J. (2014). Deep neural networks for small footprint text-dependent speaker verification. In IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 4052–4056).

  • van der Vloed, D., Bouten, J., & van Leeuwen, D. (2014). NFI-FRITS: A forensic speaker recognition database and some first experiments. In Proceedings of Odyssey speaker and language recognition workshop (pp. 6–13).

  • Woo, R. H., Park, A., & Hazen, T. J. (2006). The MIT mobile device speaker verification corpus: Data collection and preliminary experiments. In IEEE Odyssey—the speaker and language recognition workshop (pp. 1–6).

  • Yu, H., Tan, Z.-H., Ma, Z., & Guo, J. (2017). Adversarial network bottleneck features for noise robust speaker verification. In INTERSPEECH (pp. 1492–1496).

Author information

Correspondence to J. V. Thomas Abraham.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Abraham, J.V.T., Khan, A.N. & Shahina, A. A deep learning approach for robust speaker identification using chroma energy normalized statistics and mel frequency cepstral coefficients. Int J Speech Technol 26, 579–587 (2023). https://doi.org/10.1007/s10772-021-09888-y

