
Multi-resolution time frequency feature and complementary combination for short utterance speaker recognition

Multimedia Tools and Applications

Abstract

A human expert in speaker recognition often inspects the speech spectrogram at multiple scales, especially under short utterance conditions. Inspired by this, this paper proposes a novel multi-resolution time frequency feature (MRTF) extraction method: a two-dimensional discrete cosine transform (DCT) is applied at multiple scales to the time-frequency spectrogram matrix, and the transformed elements from all scales are then selected and combined into the final feature. Compared with traditional Mel-frequency cepstral coefficient (MFCC) extraction, the proposed method makes better use of multi-resolution temporal-frequency information. In addition, we propose three complementary strategies for combining MFCC and MRTF: at the feature level, at the i-vector level, and at the score level. Comparing their performance, we find that combination at the i-vector level gives the best results. On three NIST 2008 Speaker Recognition Evaluation datasets, the proposed method yields larger improvements under short utterance conditions than under long utterance conditions. After combination, we achieve an equal error rate (EER) of 11.32% and a MinDCF of 0.054 on the 10sec-10sec trials of the male dataset, an absolute EER improvement of 3% over the best previously reported result in this field.
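To make the extraction idea concrete, the following is a minimal sketch of multi-scale 2-D DCT feature extraction in the spirit of MRTF. It assumes the transform is applied to sliding context windows of a log-Mel spectrogram and that low-order coefficients from each scale are kept and concatenated; the window sizes and coefficient counts are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of multi-resolution time-frequency (MRTF) style feature extraction.
# Assumptions (not taken from the paper): the 2-D DCT is computed over sliding
# context windows of a log-Mel spectrogram, and the low-order (top-left)
# coefficients of each scale are concatenated per frame.
import numpy as np
from scipy.fft import dctn


def mrtf_features(log_mel, scales=(5, 11, 21), keep=(4, 5)):
    """log_mel: (n_frames, n_mels) time-frequency matrix."""
    n_frames, n_mels = log_mel.shape
    features = []
    for t in range(n_frames):
        per_scale = []
        for w in scales:
            half = w // 2
            # Context window of w frames centred on frame t,
            # edge-padded at the utterance boundaries.
            lo, hi = t - half, t + half + 1
            block = log_mel[max(lo, 0):min(hi, n_frames)]
            pad_before = max(0, -lo)
            pad_after = max(0, hi - n_frames)
            block = np.pad(block, ((pad_before, pad_after), (0, 0)), mode="edge")
            # 2-D DCT of the time-frequency block at this scale.
            coeffs = dctn(block, type=2, norm="ortho")
            # Select the lowest-order temporal/spectral coefficients.
            per_scale.append(coeffs[:keep[0], :keep[1]].flatten())
        # Combine the selected elements from all scales into one vector.
        features.append(np.concatenate(per_scale))
    return np.stack(features)  # (n_frames, len(scales) * keep[0] * keep[1])
```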
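The i-vector level combination can likewise be sketched as follows. The abstract does not spell out the operation, so this sketch assumes the per-utterance i-vectors from the MFCC and MRTF streams are length-normalized, concatenated, and scored with cosine similarity; in practice a channel-compensation step such as LDA or WCCN would typically follow the concatenation.

```python
# Sketch of combination at the i-vector level (assumed concatenation scheme).
import numpy as np


def combine_ivectors(iv_mfcc, iv_mrtf):
    """Length-normalize each stream's i-vector, then concatenate them."""
    iv_mfcc = iv_mfcc / np.linalg.norm(iv_mfcc)
    iv_mrtf = iv_mrtf / np.linalg.norm(iv_mrtf)
    return np.concatenate([iv_mfcc, iv_mrtf])


def cosine_score(enroll_iv, test_iv):
    """Cosine similarity between combined enrolment and test i-vectors."""
    return float(np.dot(enroll_iv, test_iv) /
                 (np.linalg.norm(enroll_iv) * np.linalg.norm(test_iv)))
```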



Acknowledgements

This work is supported by the National Natural Science Foundation of China (Projects 61370034, 61273268, 61005019, and 90920302) and by the Beijing Natural Science Foundation Program (Project KZ201110005005).

Author information


Correspondence to Zhi-Yi Li.

About this article

Cite this article

Li, ZY., Zhang, WQ. & Liu, J. Multi-resolution time frequency feature and complementary combination for short utterance speaker recognition. Multimed Tools Appl 74, 937–953 (2015). https://doi.org/10.1007/s11042-013-1705-4
