Abstract
Identifying a person by voice is a challenging task under environmental noise. Selecting important, reliable frames for feature extraction from the noisy time-domain speech signal can play a significant role in improving speaker identification accuracy. This paper therefore presents a frame selection method based on a hybrid technique that combines voice activity detection (VAD) and variable frame rate (VFR) analysis. The hybrid efficiently captures the active speech parts and the changes in the temporal characteristics of the speech signal, taking the signal-to-noise ratio (SNR) into account, and thereby retains speaker-specific information. Experiments on noisy speech, generated by artificially adding various noise signals to clean YOHO speech at different SNRs, show that frame selection by the hybrid technique outperforms either of its constituent techniques alone. The proposed hybrid outperformed both the VFR method and the widely used Gaussian statistical-model-based VAD method in all noise scenarios at all SNRs, except for babble-noise-corrupted speech at 5 dB SNR, where VFR performed better. Averaged over the noise scenarios, the hybrid achieved a relative improvement in identification accuracy of 9.79% over VFR and 18.05% over the Gaussian statistical-model-based VAD method.
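The frame-selection idea described above can be sketched in code. The following is an illustrative approximation only, not the authors' implementation: a simple energy threshold stands in for the Gaussian statistical-model VAD, a Euclidean distance between log-magnitude spectra stands in for the paper's SNR-aware VFR criterion, and the fusion rule (logical AND of the two masks) is an assumption; all thresholds and frame parameters are arbitrary.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def vad_mask(frames, threshold_db=-30.0):
    """Energy-based VAD: keep frames within threshold_db of the loudest frame.
    (A crude stand-in for the Gaussian statistical-model VAD.)"""
    energy_db = 10.0 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    return energy_db > energy_db.max() + threshold_db

def vfr_mask(frames, n_fft=512, dist_threshold=50.0):
    """VFR selection: keep a frame once the accumulated spectral change
    since the previous frame exceeds a threshold, then reset."""
    spec = np.log(np.abs(np.fft.rfft(frames, n_fft)) + 1e-12)
    keep = np.zeros(len(frames), dtype=bool)
    keep[0] = True
    acc = 0.0
    for i in range(1, len(frames)):
        acc += np.linalg.norm(spec[i] - spec[i - 1])
        if acc > dist_threshold:
            keep[i] = True
            acc = 0.0
    return keep

def hybrid_mask(frames):
    """Hybrid selection: a frame survives only if it is both active (VAD)
    and spectrally informative (VFR)."""
    return vad_mask(frames) & vfr_mask(frames)
```

In this sketch, silence frames are rejected by the VAD mask while spectrally redundant frames inside active speech are thinned out by the VFR mask, so the surviving frames are both active and informative.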
Acknowledgements
This study was supported by the Erasmus Mundus Mobility for Life Project, funded by the European Commission. The work was carried out at the Center for TeleInFrastruktur, Department of Electronic Systems, Aalborg University, Aalborg 9220, Denmark.
Cite this article
Prasad, S., Tan, ZH. & Prasad, R. Frame Selection for Robust Speaker Identification: A Hybrid Approach. Wireless Pers Commun 97, 933–950 (2017). https://doi.org/10.1007/s11277-017-4544-1