Abstract
In the current scenario, speaker recognition under noisy conditions is a major challenge in the field of speech processing, since a noisy environment significantly degrades system performance. The aim of the proposed work is to identify speakers under both clean and noisy backgrounds using a limited dataset. In this paper, we propose multitaper-based Mel-frequency cepstral coefficients (MFCC) and power-normalized cepstral coefficients (PNCC) techniques with fusion strategies. MFCC and PNCC with different multitapers are used to extract the desired features from the speech samples. Cepstral mean and variance normalization (CMVN) and feature warping (FW) are then applied to normalize the features obtained from both techniques. A low-dimensional i-vector model is used as the system model, and different fusion-score strategies, such as mean, maximum, weighted-sum, cumulative, and concatenated fusion, are utilized. Finally, an extreme learning machine (ELM), a single-hidden-layer feedforward neural network with lower complexity and training time than other neural networks, is used for classification in order to increase the system identification accuracy (SIA). Two databases, TIMIT and SITW 2016, are used to evaluate the proposed system under the limited-data condition, and the SIA is assessed under both clean and noisy background conditions.
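The ELM classifier mentioned above is a single-hidden-layer feedforward network whose input weights are random and fixed, so training reduces to a single least-squares solve for the output weights; this is what makes it faster than backprop-trained networks. The following is a minimal illustrative sketch, not the authors' implementation: function names, dimensions, and the toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, y, n_hidden=64):
    # Random input weights and biases -- never updated, which is what
    # distinguishes ELM training from gradient-based learning.
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)            # hidden-layer activations
    beta = np.linalg.pinv(H) @ y      # closed-form least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy usage: 100 hypothetical feature vectors (e.g. i-vectors), 3 speakers,
# one-hot targets; predicted speaker is the argmax of the output layer.
X = rng.standard_normal((100, 8))
y = np.eye(3)[rng.integers(0, 3, 100)]
W, b, beta = elm_train(X, y)
pred = elm_predict(X, W, b, beta).argmax(axis=1)
```

In practice the input here would be the fused, normalized cepstral features (or i-vectors) described in the abstract, and `n_hidden` would be tuned on held-out data.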
Acknowledgements
The first author, Bharath K P (CSIR Senior Research Fellow), would like to thank the Council of Scientific & Industrial Research (CSIR), Human Resource Development Group (HRDG), Govt. of India, for financial assistance during his Ph.D. (CSIR-SRF, Ack. No.: 143672/2 k18/1, File No.: 09/844(0084)/2019 EMR-I.)
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
K P, B., M, R. ELM speaker identification for limited dataset using multitaper based MFCC and PNCC features with fusion score. Multimed Tools Appl 79, 28859–28883 (2020). https://doi.org/10.1007/s11042-020-09353-z