Abstract
Gaussian mixture models (GMMs) have been the most popular approach in speaker recognition and verification for over two decades. The inefficiencies of this model for signals such as speech are well documented and include an inability to model temporal dependencies that result from nonlinearities in the speech signal. The resulting models are often complex and overdetermined, which leads to a lack of generalization. In this paper, we present a nonlinear mixture autoregressive model (MixAR) that attempts to directly model nonlinearities in the trajectories of the speech features. We apply this model to the problem of speaker verification. Experiments with synthetic data demonstrate the viability of the model. Evaluations on standard speech databases, including TIMIT, NTIMIT, and NIST-2001, demonstrate that MixAR, using half as many parameters and only static features, can achieve a lower equal error rate than GMMs, particularly in the presence of previously unseen noise. We also analyze performance as a function of the duration of both the training and evaluation utterances.
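The equal error rate used as the comparison metric above is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch of how it could be approximated from verification scores, assuming higher scores mean "more likely the claimed speaker". The function name `equal_error_rate` and the simple threshold sweep are illustrative assumptions, not the scoring procedure used in the paper.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by trying every observed score as a threshold.

    Assumption (not from the paper): a trial is accepted when its
    score is greater than or equal to the threshold.
    """
    if not target_scores or not impostor_scores:
        raise ValueError("need at least one target and one impostor score")

    best_gap, best_eer = None, None
    for threshold in sorted(set(target_scores) | set(impostor_scores)):
        # False rejection: a true-speaker trial scores below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: an impostor trial scores at or above the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    targets = [0.2, 0.6, 0.8, 0.9]
    impostors = [0.1, 0.3, 0.4, 0.7]
    print(equal_error_rate(targets, impostors))
```

With a finite number of trials the two error rates may never be exactly equal, so the sketch reports the average of the two at the threshold where they are closest.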






Additional information
This material is based upon work supported by the National Science Foundation under Grant No. IIS-0414450. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
About this article
Cite this article
Srinivasan, S., Ma, T., Lazarou, G. et al. A nonlinear autoregressive model for speaker verification. Int J Speech Technol 17, 17–25 (2014). https://doi.org/10.1007/s10772-013-9201-9
DOI: https://doi.org/10.1007/s10772-013-9201-9