Abstract
Speech signals used for training and testing may differ because of environmental mismatch, channel variation, or physiological changes in the speaker. These factors significantly degrade the performance of a speaker identification system. In this paper, we propose a robust speaker identification system for real-world speech based on a deep learning architecture built on a convolutional neural network (CNN). Mel-frequency cepstral coefficient (MFCC) features are augmented with chroma energy normalized statistics (CENS) features to train the CNN model. Experiments on the VoxCeleb1 dataset show that the proposed method achieves better identification accuracy than existing speaker identification methods.
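The feature-fusion step described above, stacking MFCC and CENS features into a single matrix before feeding it to the CNN, can be sketched as follows. This is a minimal NumPy illustration using synthetic feature matrices in place of real MFCC/CENS extraction (which in practice would come from an audio library); the dimensions (13 MFCCs, 12 CENS bins) and the function name `fuse_features` are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def fuse_features(mfcc, cens):
    """Stack MFCC and CENS features along the feature axis.

    mfcc : (n_mfcc, T) array of mel-frequency cepstral coefficients
    cens : (12, T)     array of chroma energy normalized statistics
    Returns an (n_mfcc + 12, T) matrix usable as a single-channel CNN input.
    """
    if mfcc.shape[1] != cens.shape[1]:        # align frame counts if they differ
        t = min(mfcc.shape[1], cens.shape[1])
        mfcc, cens = mfcc[:, :t], cens[:, :t]
    return np.vstack([mfcc, cens])

# Synthetic stand-ins for one utterance: 13 MFCCs and 12 CENS bins over 100 frames.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((13, 100))
cens = rng.random((12, 100))

fused = fuse_features(mfcc, cens)
print(fused.shape)  # (25, 100)
```

The fused matrix is treated like a single-channel "image", which is what makes a CNN a natural choice for the classifier.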
Cite this article
Abraham, J.V.T., Khan, A.N. & Shahina, A. A deep learning approach for robust speaker identification using chroma energy normalized statistics and mel frequency cepstral coefficients. Int J Speech Technol 26, 579–587 (2023). https://doi.org/10.1007/s10772-021-09888-y