Generalized I-vector Representation with Phonetic Tokenizations and Tandem Features for both Text Independent and Text Dependent Speaker Verification

Li, Ming; Liu, Lun; Cai, Weicheng; Liu, Wenbo

doi:10.1007/s11265-015-1019-z

Published: 02 July 2015

Volume 82, pages 207–215, (2016)

Journal of Signal Processing Systems

Abstract

This paper presents a generalized i-vector representation framework with phonetic tokenization and tandem features for text independent as well as text dependent speaker verification. In the conventional i-vector framework, the tokens for calculating the zero-order and first-order Baum-Welch statistics are Gaussian Mixture Model (GMM) components trained on acoustic level MFCC features. Besides MFCC, we believe that phonetic information offers another direction that can benefit system performance. Our contribution in this paper lies in integrating phonetic information into the i-vector representation through several extensions, forming a more generalized i-vector framework. First, the tokens for calculating the zero-order statistics are extended from the MFCC trained GMM components to phonemes, trigrams and tandem feature trained GMM components, using phoneme posterior probabilities. Second, given the zero-order statistics (posterior probabilities on tokens), the feature used to calculate the first-order statistics is also extended from MFCC to tandem features, and is not necessarily the same feature employed by the tokenizer. Third, the zero-order and first-order statistics vectors are concatenated and represented by the simplified supervised i-vector approach, followed by the standard Probabilistic Linear Discriminant Analysis (PLDA) back-end. We study different token and feature combinations, and we show that the feature level fusion of acoustic level MFCC features and phonetic level tandem features with GMM based i-vector representation achieves the best performance for text independent speaker verification. Furthermore, we demonstrate that the phonetic level phoneme constraints introduced by the tandem features help the text dependent speaker verification system to reject wrong password trials and improve the performance dramatically.
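
The two kinds of statistics described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `baum_welch_stats`, the toy dimensions and the random inputs are assumptions made for illustration, the first-order statistics are left uncentered for simplicity, and the simplified supervised i-vector and PLDA stages are not shown. The point is only that the token posteriors and the feature stream are separate inputs, so they may come from different front-ends.

```python
import numpy as np


def baum_welch_stats(posteriors, features):
    """Accumulate zero-order and (uncentered) first-order statistics.

    posteriors: (T, C) array, per-frame posterior probability of each token
                (GMM component, phoneme, ...); hypothetical input.
    features:   (T, D) array, per-frame feature vectors (MFCC, tandem, ...);
                need not be the features the tokenizer was trained on.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    features = np.asarray(features, dtype=float)
    if posteriors.shape[0] != features.shape[0]:
        raise ValueError("posteriors and features must cover the same frames")
    zero_order = posteriors.sum(axis=0)    # shape (C,): soft frame count per token
    first_order = posteriors.T @ features  # shape (C, D): posterior-weighted feature sums
    return zero_order, first_order


# Toy data (assumed sizes): 50 frames, 4 tokens, 3-dimensional features.
rng = np.random.default_rng(0)
T, C, D = 50, 4, 3
logits = rng.normal(size=(T, C))
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
features = rng.normal(size=(T, D))

N, F = baum_welch_stats(posteriors, features)

# Concatenate both statistics into a single vector, as the abstract describes
# before the (not shown) simplified supervised i-vector modeling step.
stats_vector = np.concatenate([N, F.ravel()])
```

Because each row of `posteriors` sums to one, the zero-order statistics sum to the number of frames, which is a quick sanity check on any tokenizer output fed into such a routine.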
Experimental results are reported on the female part of the NIST SRE 2010 common condition 5 task and the female part of the RSR 2015 part 1 task for text independent and text dependent speaker verification, respectively. For the text independent speaker verification task, the proposed generalized i-vector representation outperforms the i-vector baseline by 53 % relative in terms of equal error rate (EER) and norm minDCF values. For the text dependent speaker verification task, our proposed approach also reduces the EER significantly, by 23 % to 90 % relative for different types of trials.
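
For readers unfamiliar with the term, a relative reduction is the improvement expressed as a fraction of the baseline value. The sketch below shows the arithmetic; the function name `relative_reduction` and the EER values are hypothetical and are not results from the paper.

```python
def relative_reduction(baseline, proposed):
    """Return the relative reduction of an error metric, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - proposed) / baseline


# Hypothetical values: a baseline EER of 4.0 % lowered to 2.0 %
# corresponds to a 50 % relative reduction (2.0 points absolute).
example = relative_reduction(4.0, 2.0)
```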

Acknowledgments

This research is funded in part by the National Natural Science Foundation of China (NSFC 61401524), the Natural Science Foundation of Guangdong Province (2014A030313123), the SYSU-CMU Shunde International Joint Research Institute and the CMU-SYSU Collaborative Innovation Research Center.

Author information

Authors and Affiliations

SYSU-CMU Joint Institute of Engineering, Sun Yat-sen University, Guangdong, China
Ming Li & Wenbo Liu
SYSU-CMU Shunde International Joint Research Institute, Guangdong, China
Ming Li
School of Mobile Information Engineering, Sun Yat-sen University, Guangdong, China
Lun Liu
School of Information Science and Technology, Sun Yat-sen University, Guangdong, China
Weicheng Cai
Department of ECE, Carnegie Mellon University, Pittsburgh, PA, USA
Wenbo Liu

Corresponding author

Correspondence to Ming Li.

About this article

Cite this article

Li, M., Liu, L., Cai, W. et al. Generalized I-vector Representation with Phonetic Tokenizations and Tandem Features for both Text Independent and Text Dependent Speaker Verification. J Sign Process Syst 82, 207–215 (2016). https://doi.org/10.1007/s11265-015-1019-z

Received: 14 January 2015
Revised: 26 May 2015
Accepted: 17 June 2015
Published: 02 July 2015
Issue Date: February 2016
DOI: https://doi.org/10.1007/s11265-015-1019-z
