Abstract
In the current scenario, speaker recognition under noisy conditions is a major challenge in the field of speech processing, since a noisy environment significantly degrades system performance. The aim of the proposed work is to identify speakers under both clean and noisy backgrounds using a limited dataset. In this paper, we propose multitaper-based Mel-frequency cepstral coefficients (MFCC) and power-normalized cepstral coefficients (PNCC) techniques with fusion strategies. MFCC and PNCC with different multitapers are used to extract the desired features from the speech samples. Cepstral mean and variance normalization (CMVN) and feature warping (FW) are then applied to normalize the features obtained from both techniques. A low-dimensional i-vector model is used as the system model, and different fusion-score strategies, such as mean, maximum, weighted-sum, cumulative, and concatenated fusion, are utilized. Finally, an extreme learning machine (ELM), a single-hidden-layer feedforward neural network with lower complexity and training time than other neural networks, is used for classification in order to increase the system identification accuracy (SIA). Two databases, TIMIT and SITW 2016, are used to evaluate the proposed system under the limited-data condition, and the SIA is assessed under both clean and noisy background conditions.
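The ELM classifier mentioned above is a single-hidden-layer feedforward network whose input weights are random and fixed, so training reduces to a single least-squares solve for the output weights; this is what makes it faster than backprop-trained networks. The following is a minimal illustrative sketch, not the authors' implementation: function names, dimensions, and the toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def elm_train(X, y, n_hidden=64):
    # Random input weights and biases -- never updated, which is what
    # distinguishes ELM training from gradient-based learning.
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)            # hidden-layer activations
    beta = np.linalg.pinv(H) @ y      # closed-form least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy usage: 100 hypothetical feature vectors (e.g. i-vectors), 3 speakers,
# one-hot targets; predicted speaker is the argmax of the output layer.
X = rng.standard_normal((100, 8))
y = np.eye(3)[rng.integers(0, 3, 100)]
W, b, beta = elm_train(X, y)
pred = elm_predict(X, W, b, beta).argmax(axis=1)
```

In practice the input here would be the fused, normalized cepstral features (or i-vectors) described in the abstract, and `n_hidden` would be tuned on held-out data.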
Acknowledgements
The first author, Bharath K P (CSIR Senior Research Fellow), would like to thank the Council of Scientific & Industrial Research (CSIR), Human Resource Development Group (HRDG), Govt. of India, for financial assistance during his Ph.D. (CSIR-SRF, Ack. No.: 143672/2 k18/1, File No.: 09/844(0084)/2019 EMR-I.)
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
K P, B., M, R. ELM speaker identification for limited dataset using multitaper based MFCC and PNCC features with fusion score. Multimed Tools Appl 79, 28859–28883 (2020). https://doi.org/10.1007/s11042-020-09353-z