
ELM speaker identification for limited dataset using multitaper based MFCC and PNCC features with fusion score

Published in: Multimedia Tools and Applications

Abstract

Speaker recognition under noisy conditions is currently a major challenge in speech processing, since a noisy environment significantly degrades system performance. The aim of the proposed work is to identify speakers under clean and noisy backgrounds using a limited dataset. In this paper, we propose multitaper-based Mel-frequency cepstral coefficient (MFCC) and power-normalized cepstral coefficient (PNCC) techniques with fusion strategies. MFCC and PNCC with different multitapers are used to extract the desired features from the speech samples. Cepstral mean and variance normalization (CMVN) and feature warping (FW) are then applied to normalize the features obtained from both techniques. A low-dimensional i-vector model is used as the system model, and fusion score strategies such as mean, maximum, weighted-sum, cumulative, and concatenated fusion are employed. Finally, an extreme learning machine (ELM), a single-hidden-layer feedforward neural network with lower complexity and training time than other neural networks, is used for classification to increase the system identification accuracy (SIA). The proposed system is evaluated on limited data from two databases, TIMIT and SITW 2016, under both clean and noisy background conditions.
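The multitaper idea is to replace the usual single Hamming window with several orthogonal tapers and average the resulting periodograms, which lowers the variance of the spectral estimate feeding the MFCC/PNCC filterbanks. A minimal sketch of that step, assuming DPSS (Slepian) tapers via SciPy; the paper's taper family, count, and weighting may differ:

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_spectrum(frame, n_tapers=6, nfft=512):
    """Average the periodograms of one speech frame computed with
    several DPSS tapers, weighting each taper by its energy
    concentration ratio, instead of using a single Hamming window."""
    tapers, ratios = dpss(len(frame), NW=(n_tapers + 1) / 2.0,
                          Kmax=n_tapers, return_ratios=True)
    weights = ratios / ratios.sum()                  # concentration weights
    spectra = np.abs(np.fft.rfft(tapers * frame, n=nfft)) ** 2
    return weights @ spectra                         # weighted-average spectrum

# Example: one 25 ms frame at 16 kHz (1 kHz tone in light noise)
rng = np.random.default_rng(0)
frame = (np.sin(2 * np.pi * 1000 * np.arange(400) / 16000)
         + 0.1 * rng.standard_normal(400))
spec = multitaper_spectrum(frame)                    # peak near bin 32 (1 kHz)
```

The resulting spectrum would then pass through the Mel (or PNCC power-law) filterbank stages as usual.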
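CMVN, one of the two feature normalizations used, can be sketched in a few lines. This is an utterance-level variant for illustration; the paper does not specify windowed versus utterance-level statistics:

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Cepstral mean and variance normalization: shift each cepstral
    dimension to zero mean and scale it to unit variance across the
    utterance, suppressing stationary channel effects."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / (sigma + eps)

# Example: 200 frames of 13-dimensional cepstral features
rng = np.random.default_rng(0)
feats = 5.0 + 2.0 * rng.standard_normal((200, 13))
norm = cmvn(feats)   # each column now has ~zero mean, ~unit variance
```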
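The score-fusion strategies named above reduce to simple element-wise combinations of the per-speaker score vectors from the MFCC and PNCC branches, while concatenated fusion stacks the two vectors. A hypothetical sketch (the weight `w` is illustrative, not the paper's tuned value):

```python
import numpy as np

def fuse_scores(s_mfcc, s_pncc, w=0.6):
    """Fuse two per-speaker score vectors from the MFCC and PNCC
    branches. `w` is a hypothetical weighting parameter."""
    return {
        "mean": (s_mfcc + s_pncc) / 2.0,
        "max": np.maximum(s_mfcc, s_pncc),
        "weighted_sum": w * s_mfcc + (1.0 - w) * s_pncc,
        "cumulative": s_mfcc + s_pncc,
        "concatenated": np.concatenate([s_mfcc, s_pncc]),
    }

# Example: scores for 4 enrolled speakers from the two branches
s1 = np.array([0.2, 0.9, 0.1, 0.4])
s2 = np.array([0.3, 0.8, 0.2, 0.1])
fused = fuse_scores(s1, s2)
# all fusion rules agree that speaker 1 scores highest here
```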
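What makes the ELM fast is that the input-to-hidden weights are random and fixed, so only the output weights need solving, and that solution is a closed-form least-squares fit rather than iterative backpropagation. A minimal sketch on toy data; the hidden-layer size and activation are illustrative, not the paper's settings:

```python
import numpy as np

class ELM:
    """Minimal extreme learning machine: random fixed input weights,
    closed-form (pseudo-inverse) output weights, no backpropagation."""
    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        n_classes = int(y.max()) + 1
        T = np.eye(n_classes)[y]                       # one-hot speaker targets
        self.W = self.rng.standard_normal((X.shape[1], self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = np.tanh(X @ self.W + self.b)               # random hidden layer
        self.beta = np.linalg.pinv(H) @ T              # least-squares solution
        return self

    def predict(self, X):
        return (np.tanh(X @ self.W + self.b) @ self.beta).argmax(axis=1)

# Toy example: two well-separated "speakers" in a 4-D feature space
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 4)),
               rng.normal(2.0, 0.3, (50, 4))])
y = np.array([0] * 50 + [1] * 50)
acc = (ELM().fit(X, y).predict(X) == y).mean()
```

In the paper's pipeline the inputs would be the fused i-vector features rather than raw points, but the training procedure is the same single pseudo-inverse solve.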





Acknowledgements

The first author, Bharath K P (CSIR Senior Research Fellow), would like to thank the Council of Scientific & Industrial Research (CSIR), Human Resource Development Group (HRDG), Govt. of India, for financial assistance during his Ph.D. (CSIR-SRF, Ack. No.: 143672/2 k18/1, File No.: 09/844(0084)/2019 EMR-I).


Corresponding author

Correspondence to Rajesh Kumar M.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

K P, B., M, R. ELM speaker identification for limited dataset using multitaper based MFCC and PNCC features with fusion score. Multimed Tools Appl 79, 28859–28883 (2020). https://doi.org/10.1007/s11042-020-09353-z
