i-Vectors in speech processing applications: a survey

International Journal of Speech Technology

Abstract

Many methods have been proposed for speech recognition over time, including Gaussian mixture models (GMM), GMM with a universal background model (the GMM-UBM framework), and joint factor analysis. i-Vector subspace modeling is a more recent method that has become the state-of-the-art technique in this domain. Its main benefit is that it models both intra-domain and inter-domain variabilities in a single low-dimensional space. In this survey, we present a comprehensive collection of research work on i-vectors since their inception. We also discuss recent trends in combining i-vectors with other approaches, and we review their application in various fields of speech recognition, such as speaker, language, and accent recognition. We conclude with a brief discussion of the future of i-vectors. This paper should serve as a good starting point for anyone interested in working with i-vectors for speech processing in general.
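
For orientation, the single low-dimensional space mentioned above is usually described by the total variability model. The equation below is the standard formulation from the i-vector literature and is not quoted from this abstract; the symbols M, m, T, and w are notation assumed here for illustration.

```latex
% Total variability model (standard formulation; notation assumed, not taken from the abstract)
M = m + T\,w, \qquad w \sim \mathcal{N}(0, I)
```

Here M is the utterance-dependent GMM mean supervector, m is the UBM mean supervector, T is a low-rank rectangular matrix spanning the total variability subspace, and w is a latent vector with a standard normal prior. The posterior mean of w given an utterance is the i-vector. Because a single matrix T captures all sources of variability, no separate speaker and channel subspaces are needed, in contrast to joint factor analysis.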

Notes

  1. https://catalog.ldc.upenn.edu/LDC93S1.

  2. https://catalog.ldc.upenn.edu/LDC97S62.

  3. http://kaldi.sourceforge.net.

  4. http://alize.univ-avignon.fr.

  5. http://research.microsoft.com/en-us/downloads/2476c44a-1f63-4fe0-b805-8c2de395bb2c/.

  6. http://www-lium.univ-lemans.fr/diarization/.

  7. https://ivectorchallenge.nist.gov.

References

  • Adami, A., Mihaescu, R., Reynolds, D., & Godfrey. J. (2003). Modeling prosodic dynamics for speaker recognition. 2003 IEEE international conference on, acoustics, speech, and signal processing, 2003, proceedings, (ICASSP ’03). (Vol. 4), pp. IV-788-91. doi:10.1109/ICASSP.2003.1202761.

  • Adami, A. G. (2007). Modeling prosodic differences for speaker recognition. Speech Communications, 49(4), 277–291. doi:10.1016/j.specom.2007.02.005.

    Article  Google Scholar 

  • Alam, M. J., Ouellet, P., Kenny, P., & O’Shaughnessy, D. D. (2011) Comparative evaluation of feature normalization techniques for speaker verification. Advances in nonlinear speech processing—proceedings of 5th international conference on nonlinear speech processing, NOLISP 2011, Las Palmas de Gran Canaria. Retrieved November 7–9, 2011, pp. 246–253. doi:10.1007/978-3-642-25020-0_32.

  • Aronowitz, H. (2014). Inter dataset variability compensation for speaker recognition. IEEE international conference on acoustics, speech and signal processing (ICASSP), 2014, pp. 4002–4006. doi:10.1109/ICASSP.2014.6854353.

  • Aronowitz, H., & Barkan, O. (2012). Efficient approximated i-vector extraction. 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4789–4792. doi:10.1109/ICASSP.2012.6288990.

  • Aronowitz, H., & Rendel, A. (2014). Domain adaptation for text dependent speaker verification. INTERSPEECH 2014, 15th annual conference of the international speech communication Association, Singapore. Retrieved September 14–18, 2014, pp. 1337–1341. http://www.isca-speech.org/archive/interspeech_2014/i14_1337.html.

  • Bahari, M., Saeidi, R., Van hamme, H., & Van Leeuwen, D. (2013). Accent recognition using i-vector, gaussian mean supervector and gaussian posterior probability supervector for spontaneous telephone speech. 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7344–7348. doi:10.1109/ICASSP.2013.6639089.

  • Bahari, M. H., McLaren, M., Hamme, H. V., & van Leeuwen, D. A. (2014). Speaker age estimation using i-vectors. Engineering Applications of AI, 34, 99–108. doi:10.1016/j.engappai.2014.05.003.

    Google Scholar 

  • Behravan, H., Hautamäki, V., & Kinnunen, T. (2013). Foreign accent detection from spoken Finnish using i-vectors. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. Retrieved August 25–29, 2013, pp. 79–83. http://www.isca-speech.org/archive/interspeech_2013/i13_0079.html.

  • Behravan, H., Hautamäki, V., & Kinnunen, T. (2015). Factors affecting i-vector based foreign accent recognition: A case study in spoken finnish. Speech Communication, 66, 118–129. doi:10.1016/j.specom.2014.

    Article  Google Scholar 

  • Biswas, S., Rohdin, J., & Shinoda, K. (2014). i-Vector selection for effective PLDA modeling in speaker recognition. Proceedings Odyssey 2014—The speaker and language recognition workshop. pp. 100–105.

  • Bousquet, P., Matrouf, D., & Bonastre, J. (2011). Intersession compensation and scoring methods in the i-vectors space for speaker recognition. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. Retrieved August 27–31, 2011. pp. 485–488. http://www.isca-speech.org/archive/interspeech_2011/i11_0485.html.

  • Brümmer, N., Strasheim, A., Hubeika, V., Matejka, P., Burget, L., & Glembek, O. (2009). Discriminative acoustic language recognition via channel-compensated GMM statistics. INTERSPEECH 2009, 10th annual conference of the international speech communication association, Brighton. Retrieved September 6–10, 2009. pp. 2187–2190. http://www.isca-speech.org/archive/interspeech_2009/i09_2187.html.

  • Burget, L., Plchot, O., Cumani, S., Glembek, O., Matejka, P., & Brümmer, N. (2011). Discriminatively trained probabilistic linear discriminant analysis for speaker verification. Proceedings of the IEEE international conference on acoustics, speech, and signal processing, ICASSP 2011. Retrieved May 22–27, 2011. Prague Congress Center, Prague. pp. 4832–4835, doi10.1109/ICASSP.2011.5947437.

  • Busso, C., Bulut, M., Lee, C. C., Kazemzadeh, A., Mower, E., Kim, S., et al. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4), 335–359. doi:10.1007/s10579-008-9076-6.

    Article  Google Scholar 

  • Chen, L., & Yang, Y. (2011). Applying emotional factor analysis and i-Vector to emotional speaker recognition. In Z. Sun, J. Lai, X. Chen, & T. Tan (Eds.), Biometric recognition, lecture notes in computer science (Vol. 7098, pp. 174–179). Berlin: Springer. doi:10.1007/978-3-642-25449-9-22.

    Chapter  Google Scholar 

  • Chen, L., & Yang, Y. (2013). Emotional speaker recognition based on i-vector through atom aligned sparse representation. 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7760–7764. doi:10.1109/ICASSP.2013.6639174.

  • Chen, N., Shen, W., & Campbell, J. (2010). A linguistically-informative approach to dialect recognition using dialect-discriminating context-dependent phonetic models. 2010 IEEE international conference on acoustics speech and signal processing (ICASSP), pp. 5014–5017. doi10.1109/ICASSP.2010.5495068.

  • Cheng, Y. C., Hautamaki, V., Huang, Z., Li, K., & Lee, C. H. (2014). An i-vector based descriptor for alphabetical gesture recognition. 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 6593–6597. doi10.1109/ICASSP.2014.6854875.

  • Cumani, S., Glembek, O., Brümmer, N., de Villiers, E., & Laface, P. (2012). Gender independent discriminative speaker recognition in i-vector space. 2012 IEEE international conference on acoustics, speech and signal processing, ICASSP 2012, Kyoto. Retrieved March 25–30, 2012. pp. 4361–4364. doi:10.1109/ICASSP.2012.6288885.

  • Dehak, N. (2009). Discriminative and generative approaches for long- and short-term speaker characteristics modeling: Application to speaker verification. PhD thesis, Ecole de Technologie Superieure (Canada), aAINR50490.

  • Dehak, N., & Shum, S. (2011). Low-dimensional speech representation based on factor analysis and its applications. Johns Hopkins CLSP Lecture.

  • Dehak, N., Dumouchel, P., & Kenny, P. (2007). Modeling prosodic features with joint factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 15(7), 2095–2103. doi:10.1109/TASL.2007.902758.

    Article  Google Scholar 

  • Dehak, N., Kenny, P., & Dumouchel, P. (2007b) Continuous prosodic features and formant modeling with joint factor analysis for speaker verification. INTERSPEECH 2007, 8th annual conference of the international speech communication association, Antwerp. Retrieved August 27–31, 2007. pp 1234–1237. http://www.isca-speech.org/archive/interspeech_2007/i07_1234.html.

  • Dehak, N., Dehak, R., Glass, J. R., Reynolds, D. A., & Kenny, P. (2010). Cosine similarity scoring without score normalization techniques. Proceedings Odyssey 2010—The speaker and language recognition workshop. pp. 15–19.

  • Dehak, N., Karam, Z. N., Reynolds, D. A., Dehak, R., Campbell, W. M., & Glass, J. R. (2011a). A channel-blind system for speaker verification. Proceedings of the IEEE international conference on acoustics, speech, and signal processing, ICASSP 2011. Retrieved May 22–27, 2011, Prague Congress Center, Prague. pp. 4536–4539. doi:10.1109/ICASSP.2011.5947363.

  • Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., & Ouellet, P. (2011b). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 788–798. doi:10.1109/TASL.2010.2064307.

    Article  Google Scholar 

  • Dehak, N., Torres-Carrasquillo, P. A., Reynolds, D. A., & Dehak, R. (2011c). Language recognition via i-vectors and dimensionality reduction. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. Retrieved August 27–31, 2011. pp. 857–860. http://www.isca-speech.org/archive/interspeech_2011/i11_0857.html.

  • DeMarco, A., & Cox, S. J. (2012). Iterative classification of regional British accents in i-vector space. 2012 Symposium on machine learning in speech and language processing, MLSLP 2012, Portland. Retrieved September 14, 2012, pp. 1–4.

  • DeMarco, A., & Cox, S. J. (2013). Native accent classification via I-vectors and speaker compensation fusion. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. Retrieved August 25–29, 2013. pp. 1472–1476. http://www.isca-speech.org/archive/interspeech_2013/i13_1472.html.

  • Dupuy, G., Rouvier, M., Meignier, S., & Estève, Y. (2012). i-Vectors and ILP clustering adapted to cross-show speaker diarization. INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. Retrieved September 9–13, 2012. pp. 2174–2177. http://www.isca-speech.org/archive/interspeech_2012/i12_2174.html.

  • Ferrer, L., Scheffer, N., & Shriberg, E. (2010). A comparison of approaches for modeling prosodic features in speaker recognition. 2010 IEEE international conference on acoustics speech and signal processing (ICASSP). pp. 4414–4417. doi:10.1109/ICASSP.2010.5495632.

  • Foil, J. (1986). Language identification using noisy speech. IEEE international conference on ICASSP ’86. acoustics, speech, and signal processing (Vol. 11)

  • Gaida, C., Lange, P., Petrick, R., Proba, P., Malatawy, A., & Suendermann-Oeft, D. (2014). Comparing open-source speech recognition toolkits. http://suendermann.com/su/pdf/oasis2014.pdf.

  • Garcia-Romero, D., & Espy-Wilson, C. Y. (2011). Analysis of i-vector length nNormalization in speaker recognition systems. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence Retrieved August 27–31, 2011. pp. 249–252. http://www.isca-speech.org/archive/interspeech_2011/i11_0249.html.

  • Garcia-Romero, D., Zhou, X., & Espy-Wilson, C. Y. (2012). Multicondition training of Gaussian PLDA models in i-vector space for noise and reverberation robust speaker recognition. 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4257–4260. doi:10.1109/ICASSP.2012.6288859.

  • Ghahabi, O., & Hernando, J. (2014a). Deep belief networks for i-vector based speaker recognition. IEEE International conference on acoustics, speech and signal processing, ICASSP 2014, Florence. Retrieved May 4–9, 2014. pp. 1700–1704. doi:10.1109/ICASSP.2014.6853888.

  • Ghahabi, O., & Hernando, J. (2014b). Global impostor selection for dbns in multi-session i-vector speaker recognition. Proceedings of advances in speech and language technologies for Iberian languages—Second international conference, IberSPEECH 2014, Las Palmas de Gran Canaria. Retrieved November 19–21, 2014. pp. 89–98, doi:10.1007/978-3-319-13623-3.

  • Glembek, O., Burget, L., Dehak, N., Brummer, N., & Kenny, P. (2009). Comparison of scoring methods used in speaker recognition with joint factor analysis. IEEE International conference on acoustics, speech and signal processing, ICASSP 2009. pp. 4057–4060. doi:10.1109/ICASSP.2009.4960519.

  • Glembek, O., Burget, L., Matejka, P., Karafiat, M., & Kenny, P. (2011). Simplification and optimization of i-vector extraction. IEEE International conference on acoustics, speech and signal processing (ICASSP), 2011. pp. 4516–4519. doi:10.1109/ICASSP.2011.5947358.

  • Glembek, O., Ma, J., Matejka, P., Zhang, B., Plchot, O., Burget, L., & Matsoukas, S. (2014). Domain adaptation via within-class covariance correction in i-vector based speaker recognition systems. 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 4032–4036. doi10.1109/ICASSP.2014.6854359.

  • González, D. M., Plchot, O., Burget, L., Glembek, O., & Matejka, P. (2011). Language recognition in iVectors space. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. Retrieved August 27–31, 2011. pp. 861–864. http://www.isca-speech.org/archive/interspeech_2011/i11_0861.html.

  • Gupta, V., Kenny, P., Ouellet, P., & Stafylakis, T. (2014). i-Vector-based speaker adaptation of deep neural networks for French broadcast audio transcription. 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 6334–6338. doi:10.1109/ICASSP.2014.6854823.

  • Hasan, T., Saeidi, R., Hansen, J., & van Leeuwen, D. (2013). Duration mismatch compensation for i-vector based speaker recognition systems. 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 7663–7667. doi:10.1109/ICASSP.2013.6639154.

  • Hautamäki, V., Cheng, Y., Rajan, P., & Lee, C. (2013). Minimax i-vector extractor for short duration speaker verification. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. Retrieved August 25–29, 2013. pp. 3708–3712. http://www.isca-speech.org/archive/interspeech_2013/i13_3708.html.

  • Huang, Z., Cheng, Y., Li, K., Hautamäki, V., & Lee, C. (2013). A blind segmentation approach to acoustic event detection based on i-vector. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. Retrieved August 25–29, 2013. pp. 2282–2286. http://www.isca-speech.org/archive/interspeech_2013/i13_2282.html.

  • Huggins-Daines, D., Kumar, M., Chan, A., Black, A., Ravishankar, M., & Rudnicky, A. (2006), Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices. 2006 IEEE international conference on acoustics, speech and signal processing, 2006, ICASSP 2006 proceedings (Vol. 1), pp. I-I. doi:10.1109/ICASSP.2006.1659988.

  • Jancik, Z., Plchot, O., Brummer, N., Burget, L., Glembek, O., Hubeika, V., et al. (2010). Data selection and calibration issues in automatic language recognition—investigation with BUT-AGNITIO NIST LRE 2009 system. Proceedings Odyssey 2010—The speaker and language recognition workshop. pp. 215–221.

  • Jiang, Y., Lee, K., Tang, Z., Ma, B., Larcher, A., & Li, H. (2012), PLDA modeling in i-vector and supervector space for speaker verification. INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. Retrieved September 9–13, 2012. pp. 1680–1683. http://www.isca-speech.org/archive/interspeech_2012/i12_1680.html.

  • Kanagasundaram, A., Vogt, R., Dean, D., Sridharan, S., & Mason, M. (2011). i-Vector based speaker recognition on short utterances. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. Retrieved August 27–31, 2011. pp. 2341–2344. http://www.isca-speech.org/archive/interspeech_2011/i11_2341.html.

  • Kanagasundaram, A., Dean, D., Vogt, R., McLaren, M., Subramanian, S.,& Mason, M. (2012a). Weighted LDA techniques for i-vector based speaker verification. 2012 IEEE international conference on acoustics, speech and signal processing, ICASSP 2012, Kyoto. Retrieved March 25–30, 2012. pp. 4781–4784. doi:10.1109/ICASSP.2012.6288988.

  • Kanagasundaram, A., Vogt, R., Dean, D., & Sridharan, S. (2012b). PLDA based speaker recognition on short utterances. Odyssey 2012: The speaker and language recognition workshop, Singapore. Retrieved June 25–28, 2012. pp 28–33.

  • Kanagasundaram, A., Dean, D., Gonzalez-Dominguez, J., Sridharan, S., Ramos, D., & Gonzalez-Rodriguez, J. (2013). Improving short utterance based i-vector speaker recognition using source and utterance-duration normalization techniques. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon, Retrieved August 25–29, 2013. pp. 2465–2469. http://www.isca-speech.org/archive/interspeech_2013/i13_2465.html.

  • Kanagasundaram, A., Dean, D., Sridharan, S., Gonzalez-Dominguez, J., Gonzalez-Rodriguez, J., & Ramos, D. (2014). Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques. Speech Communication, 59, 69–82. doi:10.1016/j.specom.2014.01.004.

    Article  Google Scholar 

  • Kanagasundaram, A., Dean, D., Sridharan, S., Gonzalez-Dominguez, J., González-Rodríguez, J., & Ramos, D. (2014). Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques. Speech Communication, 59, 69–82. doi:10.1016/j.specom.2014.01.004.

    Article  Google Scholar 

  • Karafiát, M., Burget, L., Matejka, P., Glembek, O., & Cernocký, J. (2011). iVector-based discriminative adaptation for automatic speech recognition. 2011 IEEE workshop on automatic speech recognition & understanding, ASRU 2011, Waikoloa. Retrieved December 11–15, 2011. pp. 152–157. doi:10.1109/ASRU.2011.6163922.

  • Karanasou, P., Wang, Y., Gales, M. J. F., & Woodland, P. C. (2014). Adaptation of deep neural network acoustic models using factorised i-Vectors. INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. Retrieved September 14–18, 2014. pp. 2180–2184. http://www.isca-speech.org/archive/interspeech_2014/i14_2180.html.

  • Kenny, P. (2005). Joint factor analysis of speaker and session variability: Theory and algorithms. CRIM, Montreal, (Report) CRIM-06/08-13.

  • Kenny, P., Boulianne, G., Ouellet, P., & Dumouchel, P. (2007a). Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(4), 1435–1447. doi:10.1109/TASL.2006.881693.

    Article  Google Scholar 

  • Kenny, P., Boulianne, G., Ouellet, P., & Dumouchel, P. (2007b). Speaker and session variability in GMM-based speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 15(4), 1448–1460. doi:10.1109/TASL.2007.894527.

    Article  Google Scholar 

  • Kenny, P., Ouellet, P., Dehak, N., Gupta, V., & Dumouchel, P. (2008). A study of interspeaker variability in speaker verification. Audio, Speech, and Language Processing, IEEE Transactions on, 16(5), 980–988. doi:10.1109/TASL.2008.925147.

    Article  Google Scholar 

  • Kenny, P., Stafylakis, T., Ouellet, P., Alam, M., & Dumouchel, P. (2013). PLDA for speaker verification with utterances of arbitrary duration. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7649–7653. doi:10.1109/ICASSP.2013.6639151.

  • Kockmann, M., Burget, L., & Cernocký, J. (2010). Brno University of Technology System for Interspeech 2010 Paralinguistic Challenge. In: INTERSPEECH 2010, 11th annual conference of the international speech communication association, Makuhari, Chiba. September 26–30, 2010, pp 2822–2825. http://www.isca-speech.org/archive/interspeech_2010/i10_2822.html

  • Kockmann, M., Ferrer, L., Burget, L., & Cernocký, J. (2011). iVector fusion of prosodic and cepstral features for speaker verification. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence,. August 27–31, 2011, pp 265–268. http://www.isca-speech.org/archive/interspeech_2011/i11_0265.html

  • Lamere, P., Kwok, P., Gouvea, E., Raj, B., Singh, R., Walker, W., et al. (2003). The CMU SPHINX-4 speech recognition system. In: IEEE international conference on acoustics, speech and signal processing (ICASSP 2003). Hong Kong, 1, 2–5.

  • Larcher, A., Bousquet, P., Lee, K. A., Matrouf, D., Li, H., & Bonastre, J. F. (2012) i-Vectors in the context of phonetically-constrained short utterances for speaker verification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 4773–4776. doi:10.1109/ICASSP.2012.6288986

  • Larcher, A., Bonastre, J., Fauve, B. G. B., Lee, K., Lévy, C., Li, H., et al. (2013). ALIZE 3.0: Open source toolkit for state-of-the-art speaker recognition. In: INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. August 25–29, 2013, pp 2768–2772, http://www.isca-speech.org/archive/interspeech_2013/i13_2768.html

  • Le,V. B., Mella, O., & Fohr, D. (2007). Speaker diarization using normalized cross likelihood ratio. In: INTERSPEECH 2007, 8th annual conference of the international speech communication association, Antwerp. August 27–31, 2007, pp. 1869–1872, http://www.isca-speech.org/archive/interspeech_2007/i07_1869.html

  • Lei, Y., Burget, L., Ferrer, L., Graciarena, M., & Scheffer, N. (2012a). Towards noise-robust speaker recognition using probabilistic linear discriminant analysis. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4253–4256. 10.1109/ICASSP.2012.6288858

  • Lei, Y., Burget, L., & Scheffer, N. (2012b). Bilinear factor analysis for ivector based speaker verification. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp 1588–1591, http://www.isca-speech.org/archive/interspeech_2012/i12_1588.html

  • Lei, Y., Burget, L., & Scheffer, N. (2013). A noise robust i-vector extractor using vector taylor series for speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6788–6791. doi:10.1109/ICASSP.2013.6638976

  • Lei, Y., McLaren, M., Ferrer, L., & Scheffer, N. (2014a). Simplified VTS-based i-Vector extraction in noise-robust speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 4037–4041. doi:10.1109/ICASSP.2014.6854360.

  • Lei, Y., Scheffer, N., Ferrer, L., & McLaren, M. (2014b). A novel scheme for speaker recognition using a phonetically-aware deep neural network. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 1695–1699. doi:10.1109/ICASSP.2014.6853887

  • Li, M., & Liu, W. (2014). Speaker verification and spoken language identification using a generalized i-vector framework with phonetic tokenizations and tandem features. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp 1120–1124, http://www.isca-speech.org/archive/interspeech_2014/i14_1120.html

  • Li, M., Zhang, X., Yan, Y., & Narayanan, S. S. (2011). Speaker verification using sparse representations on total variability i-vectors. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp 2729–2732, http://www.isca-speech.org/archive/interspeech_2011/i11_2729.html

  • Mandasari, M., McLaren, M., & van Leeuwen, D. (2012). The effect of noise on modern automatic speaker recognition systems. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4249–4252. doi:10.1109/ICASSP.2012.6288857

  • Mandasari, M. I., McLaren, M., & van Leeuwen, D. A. (2011). Evaluation of i-vector speaker recognition systems for forensic application. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 21–24, http://www.isca-speech.org/archive/interspeech_2011/i11_0021.html

  • Mariooryad, S., & Busso, C. (2014). Compensating for speaker or lexical variabilities in speech for emotion recognition. Speech Communication, 57(0):1–12. doi:10.1016/j.specom.2013.07.011, http://www.sciencedirect.com/science/article/pii/S0167639313001015

  • Martinez, D., Burget, L., Ferrer, L., & Scheffer, N. (2012). iVector-based prosodic system for language identification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4861–4864. doi:10.1109/ICASSP.2012.6289008

  • Martinez, D., Lleida, E., Ortega, A., & Miguel, A. (2013). Prosodic features and formant modeling for an ivector-based language recognition system. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6847–6851. doi:10.1109/ICASSP.2013.6638988

  • Martinez, D., Burget, L., Stafylakis, T., Lei, Y., Kenny, P., & Lleida, E. (2014). Unscented transform for ivector-based noisy speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4042–4046. doi:10.1109/ICASSP.2014.6854361

  • Matejka, P., Glembek, O., Castaldo, F., Alam, M., Plchot, O., Kenny, P., et al. (2011). Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4828–4831. doi:10.1109/ICASSP.2011.5947436

  • McLaren, M., & van Leeuwen, D. (2011a). Source-normalised-and-weighted LDA for robust speaker recognition using i-vectors. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5456–5459, DOI 10.1109/ICASSP.2011.5947593.

  • McLaren, M., & van Leeuwen, D. (2012a). Gender-independent speaker recognition using source normalisation. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4373–4376. doi:10.1109/ICASSP.2012.6288888

  • McLaren, M., & van Leeuwen, D. (2012b). Source-normalized LDA for Robust speaker recognition using i-vectors from multiple speech sources. IEEE Transactions on Audio, Speech, and Language Processing, 20(3), 755–766. doi:10.1109/TASL.2011.2164533.

    Article  Google Scholar 

  • McLaren, M., & van Leeuwen, D. A. (2011b). To weight or not to weight: source-normalised LDA for speaker recognition using i-vectors. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 2709–2712, http://www.isca-speech.org/archive/interspeech_2011/i11_2709.html

  • Meignier, S., & Merlin, T. (2010). LIUM SpkDiarization: An open source toolkit for diarization. In: CMU SPUD workshop (Vol. 2010)

  • Novoselov, S., Pekhovsky, T., Simonchik, K., & Shulipa, A. (2014). RBM-PLDA subsystem for the NIST i-vector challenge. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore, September 14–18, 2014, pp 378–382, http://www.isca-speech.org/archive/interspeech_2014/i14_0378.html

  • Picone, J. (1993). Signal modeling techniques in speech recognition. Proceedings of the IEEE, 81(9), 1215–1247. doi:10.1109/5.237532.

    Article  Google Scholar 

  • Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., et al. (2011). The kaldi speech recognition toolkit. In: IEEE 2011 workshop on automatic speech recognition and understanding, IEEE Signal Processing Society, iEEE Catalog No.: CFP11SRW-USB.

  • Reynolds, D. A. (1992). A Gaussian mixture modeling approach to text-independent speaker identification. Ph.D. dissertation, Georgia Institute of Technology.

  • Reynolds, D. A., Quatieri, T. F., & Dunn, R. B. (2000). Speaker verification using adapted gaussian mixture models. Digital Signal Processing, 10(13), 19–41. doi:10.1006/dspr.1999.0361.

    Article  Google Scholar 

  • Rouvier, M., & Favre, B. (2014). Speaker adaptation of DNN-based ASR with i-vectors: Does it actually adapt models to speakers? In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore, September 14–18, 2014, pp. 3007–3011, http://www.isca-speech.org/archive/interspeech_2014/i14_3007.html

  • Rouvier, M., Dupuy, G., Gay, P., el Khoury, E., Merlin, T., & Meignier, S. (2013). An open-source state-of-the-art toolbox for broadcast news diarization. In: INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. August 25–29, 2013, pp. 1477–1481, http://www.isca-speech.org/archive/interspeech_2013/i13_1477.html

  • Sadjadi, S. O., Slaney, M., & Heck, L. (2013). Msr identity toolbox v1.0: A matlab toolbox for speaker-recognition research. Speech and Language Processing Technical Committee Newsletter http://research.microsoft.com/apps/pubs/default.aspx?id=205119

  • Sarkar, A. K., Matrouf, D., Bousquet, P., & Bonastre, J. (2012). Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp. 2662–2665, http://www.isca-speech.org/archive/interspeech_2012/i12_2662.html

  • Sarkar, S., & Rao, K. S. (2014). A novel boosting algorithm for improved i-vector based speaker verification in noisy environments. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp. 671–675, http://www.isca-speech.org/archive/interspeech_2014/i14_0671.html

  • Segbroeck, M. V., Travadi, R., & Narayanan, S. S. (2014a) UBM fused total variability modeling for language identification. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp. 3027–3031, http://www.isca-speech.org/archive/interspeech_2014/i14_3027.html

  • Segbroeck, M. V., Travadi, R., Vaz, C., Kim, J., Black, M. P., Potamianos, A., et al. (2014b). Classification of cognitive load from speech using an i-vector framework. in: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp. 751–755, http://www.isca-speech.org/archive/interspeech_2014/i14_0751.html

  • Senior, A., & Lopez-Moreno, I. (2014). Improving DNN speaker independence with i-vector inputs. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 225–229. doi:10.1109/ICASSP.2014.6853591

  • Senoussaoui, M., Kenny, P., Dehak, N., & Dumouchel, P. (2010). An i-vector extractor suitable for speaker recognition with both microphone and telephone speech. In: Odyssey 2010: the speaker and language recognition workshop, Brno, June 28–July 1, 2010, p. 6

  • Senoussaoui, M., Kenny, P., Brümmer, N., de Villiers, E., & Dumouchel, P. (2011). Mixture of PLDA models in i-vector space for gender-independent speaker recognition. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 25–28, http://www.isca-speech.org/archive/interspeech_2011/i11_0025.html

  • Shum, S., Dehak, N., Chuangsuwanich, E., Reynolds, D. A., & Glass, J. R. (2011). Exploiting intra-conversation variability for speaker diarization. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 945–948, http://www.isca-speech.org/archive/interspeech_2011/i11_0945.html

  • Silovsky, J., & Prazak, J. (2012). Speaker diarization of broadcast streams using two-stage clustering based on i-vectors and cosine distance scoring. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4193–4196. doi:10.1109/ICASSP.2012.6288843

  • Simonchik, K., Pekhovsky, T., Shulipa, A., & Afanasyev, A. (2012). Supervized mixture of PLDA models for cross-channel speaker verification. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp. 1684–1687, http://www.isca-speech.org/archive/interspeech_2012/i12_1684.html

  • Sizov, A., el Khoury, E., Kinnunen, T., Wu, Z., & Marcel, S. (2015). Joint speaker verification and antispoofing in the i-vector space. IEEE Transactions on Information Forensics and Security, 10(4), 821–832. doi:10.1109/TIFS.2015.2407362.

  • Soufifar, M., Kockmann, M., Burget, L., Plchot, O., Glembek, O., & Svendsen, T. (2011). iVector approach to phonotactic language recognition. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 2913–2916, http://www.isca-speech.org/archive/interspeech_2011/i11_2913.html

  • Travadi, R., Segbroeck, M. V., & Narayanan, S. S. (2014). Modified-prior i-vector estimation for language identification of short duration utterances. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp. 3037–3041, http://www.isca-speech.org/archive/interspeech_2014/i14_3037.html

  • Varga, A., & Steeneken, H. J. (1993). Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3), 247–251. doi:10.1016/0167-6393(93)90095-3, http://www.sciencedirect.com/science/article/pii/0167639393900953

  • Variani, E., Lei, X., McDermott, E., Lopez Moreno, I., & Gonzalez-Dominguez, J. (2014). Deep neural networks for small footprint text-dependent speaker verification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4052–4056. doi:10.1109/ICASSP.2014.6854363

  • Villalba, J., & Lleida, E. (2013). Handling i-vectors from different recording conditions using multi-channel simplified PLDA in speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6763–6767. doi:10.1109/ICASSP.2013.6638971

  • Wolf, J. J. (1972). Efficient acoustic parameters for speaker recognition. The Journal of the Acoustical Society of America, 51(6B), 2044–2056. doi:10.1121/1.1913065, http://scitation.aip.org/content/asa/journal/jasa/51/6B/10.1121/1.1913065

  • Wu, T., Yang, Y., Wu, Z., & Li, D. (2006). MASC: A speech corpus in mandarin for emotion analysis and affective speaker recognition. In: IEEE Odyssey 2006: the speaker and language recognition workshop, pp. 1–5. doi:10.1109/ODYSSEY.2006.248084

  • Xia, R., & Liu, Y. (2012). Using i-vector space model for emotion recognition. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp. 2230–2233, http://www.isca-speech.org/archive/interspeech_2012/i12_2230.html

  • Yin, S. C., Kenny, P., & Rose, R. (2006). Experiments in speaker adaptation for factor analysis based speaker verification. In: IEEE Odyssey 2006: the speaker and language recognition workshop, pp. 1–6. doi:10.1109/ODYSSEY.2006.248130

  • Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X. A., et al. (2006). The HTK book (for HTK version 3.4).

  • Yu, C., Liu, G., Hahm, S., & Hansen, J. (2014). Uncertainty propagation in front end factor analysis for noise robust speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4017–4021. doi:10.1109/ICASSP.2014.6854356

  • Zheng, R., Zhang, C., Zhang, S., & Xu, B. (2014). Variational Bayes based i-vector for speaker diarization of telephone conversations. In: IEEE international conference on acoustics, speech and signal processing, ICASSP, Florence. May 4–9, 2014, pp. 91–95. doi:10.1109/ICASSP.2014.6853564

  • Zhuang, X., Tsakalidis, S., Wu, S., Natarajan, P., Prasad, R., & Natarajan, P. (2012). Compact audio representation for event detection in consumer media. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp. 2089–2092, http://www.isca-speech.org/archive/interspeech_2012/i12_2089.html

Acknowledgments

The authors wish to acknowledge UNICEF India and the DST, Government of India, for the funding provided under their FIST scheme, which greatly aided in the work reported herein.

Author information

Corresponding author

Correspondence to Pulkit Verma.

About this article

Cite this article

Verma, P., Das, P.K. i-Vectors in speech processing applications: a survey. Int J Speech Technol 18, 529–546 (2015). https://doi.org/10.1007/s10772-015-9295-3

  • DOI: https://doi.org/10.1007/s10772-015-9295-3
