Abstract
Speaker identification is a biometric task that determines which of a set of known speakers is talking. It has vital applications in areas such as security, surveillance, and forensic investigations. Speaker identification systems achieve high accuracy on clean speech, but their performance degrades under noisy and mismatched conditions. Recently, hybrid networks combining convolutional neural networks (CNN) with enhanced recurrent neural network (RNN) variants have performed strongly in speech recognition, image classification, and other pattern-recognition tasks. Moreover, cochleogram features have yielded better accuracy in speech and speaker recognition under noisy conditions. However, no prior work has combined a hybrid CNN and enhanced RNN variants with cochleogram input to improve model accuracy in noisy environments. This study proposes a speaker identification model for noisy conditions that applies a hybrid CNN and bidirectional gated recurrent unit (BiGRU) network to cochleogram input. The model was evaluated on the VoxCeleb1 speech dataset with real-world noise, with white Gaussian noise (WGN), and without additive noise. Real-world noise and WGN were added to the dataset at signal-to-noise ratios (SNR) from −5 dB to 20 dB in 5 dB steps. The proposed model attained accuracies of 93.15%, 97.55%, and 98.60% on the dataset with real-world noise at SNRs of −5 dB, 10 dB, and 20 dB, respectively, and performed approximately the same on WGN at the corresponding SNR levels. On the dataset without additive noise, the model achieved 98.85% accuracy. The evaluation results and the comparison with previous works indicate that our model attains better accuracy.
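The noise-mixing protocol described in the abstract (adding noise to clean speech at a prescribed SNR) can be sketched as follows. This is a minimal illustrative implementation, not the authors' code; the function name `mix_at_snr` and the pure-Python list representation of audio samples are assumptions for the example.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals
    `snr_db`, then add it to `speech` sample-by-sample.

    `speech` and `noise` are equal-length sequences of float samples.
    """
    # Mean power of each signal.
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # SNR(dB) = 10 * log10(P_speech / P_noise)  =>  solve for the
    # noise power that yields the target SNR, then scale the noise.
    target_noise_power = p_speech / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / p_noise)
    return [s + scale * n for s, n in zip(speech, noise)]

# Example: mix at 0 dB, i.e. equal speech and noise power.
mixed = mix_at_snr([1.0, -1.0] * 4, [0.5, -0.5] * 4, snr_db=0.0)
```

In the paper's setup this mixing would be repeated for each SNR level from −5 dB to 20 dB in 5 dB steps before cochleogram extraction.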
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Lambamo, W., Srinivasagan, R., Jifara, W. (2024). Speaker Identification Under Noisy Conditions Using Hybrid Deep Learning Model. In: Debelee, T.G., Ibenthal, A., Schwenker, F., Megersa Ayano, Y. (eds) Pan-African Conference on Artificial Intelligence. PanAfriConAI 2023. Communications in Computer and Information Science, vol 2068. Springer, Cham. https://doi.org/10.1007/978-3-031-57624-9_9
Print ISBN: 978-3-031-57623-2
Online ISBN: 978-3-031-57624-9