Improved i-vector extraction technique for speaker verification with short utterances

International Journal of Speech Technology

Abstract

A major challenge in automatic speaker verification (ASV) is to improve performance with short speech segments for end-user convenience in real-world applications. In this paper, we present a detailed analysis of how duration variability affects state-of-the-art i-vector based ASV systems and classical Gaussian mixture model-universal background model (GMM-UBM) based ASV systems. We observe that the uncertainty of model parameter estimation in i-vector based ASV increases as the speech duration decreases. To compensate for the effect of duration variability in short utterances, we propose an adaptation technique for the Baum-Welch statistics estimation used in i-vector extraction. The adaptation uses information from pre-estimated background model parameters. The ASV performance with the proposed approach is considerably superior to that of the conventional i-vector based system. Furthermore, fusing the proposed i-vector based system with the GMM-UBM system further improves ASV performance, especially for short speech segments. Experiments conducted on two speech corpora, NIST SRE 2008 and 2010, show relative improvements in equal error rate (EER) in the range of 12–20%.
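For readers unfamiliar with the Baum-Welch statistics mentioned in the abstract, the following is a minimal sketch of how zeroth- and first-order statistics are commonly computed against a diagonal-covariance UBM. The smoothing step in `smooth_first_order` is an assumption made for illustration (a MAP-style interpolation toward the UBM means with a hypothetical `relevance` parameter); the abstract does not specify the adaptation actually proposed in the paper, so this should not be read as the authors' method. All function and variable names are hypothetical.

```python
import numpy as np


def baum_welch_stats(frames, weights, means, variances):
    """Zeroth- and first-order Baum-Welch statistics for one utterance.

    frames: (T, D) feature vectors; weights: (C,), means: (C, D) and
    variances: (C, D) are the parameters of a diagonal-covariance UBM.
    Returns N with shape (C,) and F with shape (C, D).
    """
    diff = frames[:, None, :] - means[None, :, :]              # (T, C, D)
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances, axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)
    )                                                          # (T, C)
    log_post = np.log(weights) + log_gauss
    # Normalise in the log domain so each frame's posteriors sum to one.
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)
    N = post.sum(axis=0)          # zeroth-order statistics
    F = post.T @ frames           # first-order statistics
    return N, F


def smooth_first_order(N, F, means, relevance=16.0):
    """ASSUMPTION: illustrative MAP-style smoothing, not the paper's method.

    Components with little data (small N_c) are pulled toward the value
    N_c * mu_c that the UBM mean alone would produce.
    """
    alpha = (N / (N + relevance))[:, None]                     # (C, 1)
    return alpha * F + (1.0 - alpha) * N[:, None] * means


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, D, T = 4, 3, 50            # toy sizes; a short utterance has small T
    weights = np.full(C, 1.0 / C)
    means = rng.normal(size=(C, D))
    variances = np.ones((C, D))
    frames = rng.normal(size=(T, D))

    N, F = baum_welch_stats(frames, weights, means, variances)
    F_smoothed = smooth_first_order(N, F, means)
    # N sums to the number of frames, up to floating-point rounding.
    print(N.sum(), F_smoothed.shape)
```

With fewer frames, each component's zeroth-order count shrinks, so the unsmoothed first-order statistics become noisier; interpolating toward background-model values is one generic way to limit that effect.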




Acknowledgements

The authors would like to thank the Indian Space Research Organization (ISRO) for partially funding this research. The authors would also like to express their gratitude to the members of the Audio and Bio-Signal Processing (ABSP) Lab, especially Mr. Monisankha Pal and Mrs. Shefali Waldekar, for thoughtful discussions and co-operation.

Author information

Corresponding author

Correspondence to Arnab Poddar.

Cite this article

Poddar, A., Sahidullah, M. & Saha, G. Improved i-vector extraction technique for speaker verification with short utterances. Int J Speech Technol 21, 473–488 (2018). https://doi.org/10.1007/s10772-017-9477-2
