Abstract
In this paper, we investigate the use of Multiple Background Models (M-BMs) in Speaker Verification (SV). We cluster the speakers using either their Vocal Tract Lengths (VTLs) or their speaker-specific Maximum Likelihood Linear Regression (MLLR) super-vectors, and build a separate Background Model (BM) for each such cluster. We show that the use of M-BMs provides improved performance when compared to the use of a single or gender-wise Universal Background Model (UBM). While the computational complexity during testing remains the same for both M-BMs and the UBM, M-BMs require switching of models depending on the claimant, and score normalization becomes difficult. To overcome these problems, we propose a novel method, inspired by the conventional Feature Mapping (FM) technique, which aggregates the information from the Multiple Background Models into a single gender-independent UBM. We show that this approach improves over the conventional UBM method while still permitting easy use of score-normalization techniques. The proposed method provides a relative improvement in Equal-Error Rate (EER) of 13.65 % in the case of VTL clustering, and 15.43 % in the case of MLLR super-vector clustering, when compared to the conventional single UBM system. When AT-norm score normalization is used, the proposed method provides a relative improvement in EER of 20.96 % for VTL clustering and 22.48 % for MLLR super-vector based clustering. Furthermore, the proposed method is compared with a gender-dependent speaker verification system using a Gaussian Mixture Model-Support Vector Machine (GMM-SVM) super-vector linear kernel. The experimental results show that the proposed method performs better than the gender-dependent speaker verification system.
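The abstract reports its gains as relative improvements in EER. As a rough illustration of how such a figure is typically obtained, the following minimal sketch estimates an EER from trial scores with a simple threshold sweep and then computes the relative reduction against a baseline. The function names, the sweep-based estimate, and the toy scores are assumptions made for illustration; they are not taken from the paper, which may use a different EER estimation procedure.

```python
import numpy as np


def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Illustrative estimate only (hypothetical helper, not the paper's tooling).
    """
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target, impostor]))
    # False rejection rate: fraction of target trials scoring below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: fraction of impostor trials at or above the threshold.
    far = np.array([(impostor >= t).mean() for t in thresholds])
    # The EER is taken where the two error rates are closest to each other.
    idx = np.argmin(np.abs(far - frr))
    return 0.5 * (far[idx] + frr[idx])


def relative_improvement(baseline_eer, proposed_eer):
    """Relative EER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_eer - proposed_eer) / baseline_eer


# Toy scores, purely hypothetical (not data from the paper).
toy_target = [2.0, 3.0, 4.0, 5.0]
toy_impostor = [0.0, 1.0, 2.5, 1.5]
toy_eer = equal_error_rate(toy_target, toy_impostor)
```

Read this way, a relative improvement of 13.65 % means the proposed system's EER is 0.8635 times the baseline EER; it is not an absolute drop of 13.65 percentage points.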
Acknowledgements
A part of this work was supported by SERC project fund SR/S3/EECE/058/2008 from the Department of Science and Technology, Ministry of Science and Technology, India.
Cite this article
Sarkar, A.K., Umesh, S. Multiple background models for speaker verification using the concept of vocal tract length and MLLR super-vector. Int J Speech Technol 15, 351–364 (2012). https://doi.org/10.1007/s10772-012-9149-1
DOI: https://doi.org/10.1007/s10772-012-9149-1