Abstract
Speaker identification is a biometric task that determines which of a set of known speakers is talking. It has vital applications in areas such as security, surveillance, and forensic investigations. Speaker identification systems achieve high accuracy on clean speech, but their performance degrades under noisy and mismatched conditions. Recently, hybrid networks combining convolutional neural networks (CNN) with enhanced recurrent neural network (RNN) variants have performed strongly in speech recognition, image classification, and other pattern-recognition tasks. Moreover, cochleogram features have yielded better accuracy in speech and speaker recognition under noisy conditions. However, no prior work has combined a hybrid CNN and enhanced RNN variants with cochleogram input to improve model accuracy in noisy environments. This study proposes a speaker identification model for noisy conditions that applies a hybrid CNN and bidirectional gated recurrent unit (BiGRU) network to cochleogram input. The model was evaluated on the VoxCeleb1 speech dataset with real-world noise, with white Gaussian noise (WGN), and without additive noise. Real-world noise and WGN were added to the dataset at signal-to-noise ratios (SNR) from −5 dB to 20 dB in 5 dB steps. The proposed model attained accuracies of 93.15%, 97.55%, and 98.60% on the dataset with real-world noise at SNRs of −5 dB, 10 dB, and 20 dB, respectively, and performed approximately the same on WGN at the corresponding SNR levels. On the dataset without additive noise, the model achieved 98.85% accuracy. The evaluation results and the comparison with previous works indicate that our model attains better accuracy.
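The noise-mixing protocol described in the abstract (adding noise to clean speech at a prescribed SNR) can be sketched as follows. This is a minimal illustrative implementation, not the authors' code; the function name `mix_at_snr` and the pure-Python list representation of audio samples are assumptions for the example.

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals
    `snr_db`, then add it to `speech` sample-by-sample.

    `speech` and `noise` are equal-length sequences of float samples.
    """
    # Mean power of each signal.
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # SNR(dB) = 10 * log10(P_speech / P_noise)  =>  solve for the
    # noise power that yields the target SNR, then scale the noise.
    target_noise_power = p_speech / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / p_noise)
    return [s + scale * n for s, n in zip(speech, noise)]

# Example: mix at 0 dB, i.e. equal speech and noise power.
mixed = mix_at_snr([1.0, -1.0] * 4, [0.5, -0.5] * 4, snr_db=0.0)
```

In the paper's setup this mixing would be repeated for each SNR level from −5 dB to 20 dB in 5 dB steps before cochleogram extraction.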
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Lambamo, W., Srinivasagan, R., Jifara, W. (2024). Speaker Identification Under Noisy Conditions Using Hybrid Deep Learning Model. In: Debelee, T.G., Ibenthal, A., Schwenker, F., Megersa Ayano, Y. (eds) Pan-African Conference on Artificial Intelligence. PanAfriConAI 2023. Communications in Computer and Information Science, vol 2068. Springer, Cham. https://doi.org/10.1007/978-3-031-57624-9_9
Print ISBN: 978-3-031-57623-2
Online ISBN: 978-3-031-57624-9