i-Vectors in speech processing applications: a survey

Verma, Pulkit; Das, Pradip K.

doi:10.1007/s10772-015-9295-3

Published: 06 August 2015

Volume 18, pages 529–546, (2015)

International Journal of Speech Technology

Abstract

In the domain of speech recognition, many methods have been proposed over time, such as Gaussian mixture models (GMM), GMM with a universal background model (the GMM-UBM framework), and joint factor analysis. i-Vector subspace modeling is one of the recent methods that has become the state-of-the-art technique in this domain. Its main benefit is that it models both the intra-domain and inter-domain variabilities in the same low-dimensional space. In this survey, we present a comprehensive collection of research work related to i-vectors since their inception. Some recent trends of using i-vectors in combination with other approaches are also discussed. The application of i-vectors in various fields of speech recognition, such as speaker, language, and accent recognition, is also presented. This paper should serve as a good starting point for anyone interested in working with i-vectors for speech processing in general. We conclude the paper with a brief discussion on the future of i-vectors.

Notes

References

Adami, A., Mihaescu, R., Reynolds, D., & Godfrey. J. (2003). Modeling prosodic dynamics for speaker recognition. 2003 IEEE international conference on, acoustics, speech, and signal processing, 2003, proceedings, (ICASSP ’03). (Vol. 4), pp. IV-788-91. doi:10.1109/ICASSP.2003.1202761.
Adami, A. G. (2007). Modeling prosodic differences for speaker recognition. Speech Communications, 49(4), 277–291. doi:10.1016/j.specom.2007.02.005.
Article Google Scholar
Alam, M. J., Ouellet, P., Kenny, P., & O’Shaughnessy, D. D. (2011) Comparative evaluation of feature normalization techniques for speaker verification. Advances in nonlinear speech processing—proceedings of 5th international conference on nonlinear speech processing, NOLISP 2011, Las Palmas de Gran Canaria. Retrieved November 7–9, 2011, pp. 246–253. doi:10.1007/978-3-642-25020-0_32.
Aronowitz, H. (2014). Inter dataset variability compensation for speaker recognition. IEEE international conference on acoustics, speech and signal processing (ICASSP), 2014, pp. 4002–4006. doi:10.1109/ICASSP.2014.6854353.
Aronowitz, H., & Barkan, O. (2012). Efficient approximated i-vector extraction. 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4789–4792. doi:10.1109/ICASSP.2012.6288990.
Aronowitz, H., & Rendel, A. (2014). Domain adaptation for text dependent speaker verification. INTERSPEECH 2014, 15th annual conference of the international speech communication Association, Singapore. Retrieved September 14–18, 2014, pp. 1337–1341. http://www.isca-speech.org/archive/interspeech_2014/i14_1337.html.
Bahari, M., Saeidi, R., Van hamme, H., & Van Leeuwen, D. (2013). Accent recognition using i-vector, gaussian mean supervector and gaussian posterior probability supervector for spontaneous telephone speech. 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7344–7348. doi:10.1109/ICASSP.2013.6639089.
Bahari, M. H., McLaren, M., Hamme, H. V., & van Leeuwen, D. A. (2014). Speaker age estimation using i-vectors. Engineering Applications of AI, 34, 99–108. doi:10.1016/j.engappai.2014.05.003.
Google Scholar
Behravan, H., Hautamäki, V., & Kinnunen, T. (2013). Foreign accent detection from spoken Finnish using i-vectors. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. Retrieved August 25–29, 2013, pp. 79–83. http://www.isca-speech.org/archive/interspeech_2013/i13_0079.html.
Behravan, H., Hautamäki, V., & Kinnunen, T. (2015). Factors affecting i-vector based foreign accent recognition: A case study in spoken finnish. Speech Communication, 66, 118–129. doi:10.1016/j.specom.2014.
Article Google Scholar
Biswas, S., Rohdin, J., & Shinoda, K. (2014). i-Vector selection for effective PLDA modeling in speaker recognition. Proceedings Odyssey 2014—The speaker and language recognition workshop. pp. 100–105.
Bousquet, P., Matrouf, D., & Bonastre, J. (2011). Intersession compensation and scoring methods in the i-vectors space for speaker recognition. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. Retrieved August 27–31, 2011. pp. 485–488. http://www.isca-speech.org/archive/interspeech_2011/i11_0485.html.
Brümmer, N., Strasheim, A., Hubeika, V., Matejka, P., Burget, L., & Glembek, O. (2009). Discriminative acoustic language recognition via channel-compensated GMM statistics. INTERSPEECH 2009, 10th annual conference of the international speech communication association, Brighton. Retrieved September 6–10, 2009. pp. 2187–2190. http://www.isca-speech.org/archive/interspeech_2009/i09_2187.html.
Burget, L., Plchot, O., Cumani, S., Glembek, O., Matejka, P., & Brümmer, N. (2011). Discriminatively trained probabilistic linear discriminant analysis for speaker verification. Proceedings of the IEEE international conference on acoustics, speech, and signal processing, ICASSP 2011. Retrieved May 22–27, 2011. Prague Congress Center, Prague. pp. 4832–4835, doi10.1109/ICASSP.2011.5947437.
Busso, C., Bulut, M., Lee, C. C., Kazemzadeh, A., Mower, E., Kim, S., et al. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4), 335–359. doi:10.1007/s10579-008-9076-6.
Article Google Scholar
Chen, L., & Yang, Y. (2011). Applying emotional factor analysis and i-Vector to emotional speaker recognition. In Z. Sun, J. Lai, X. Chen, & T. Tan (Eds.), Biometric recognition, lecture notes in computer science (Vol. 7098, pp. 174–179). Berlin: Springer. doi:10.1007/978-3-642-25449-9-22.
Chapter Google Scholar
Chen, L., & Yang, Y. (2013). Emotional speaker recognition based on i-vector through atom aligned sparse representation. 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7760–7764. doi:10.1109/ICASSP.2013.6639174.
Chen, N., Shen, W., & Campbell, J. (2010). A linguistically-informative approach to dialect recognition using dialect-discriminating context-dependent phonetic models. 2010 IEEE international conference on acoustics speech and signal processing (ICASSP), pp. 5014–5017. doi10.1109/ICASSP.2010.5495068.
Cheng, Y. C., Hautamaki, V., Huang, Z., Li, K., & Lee, C. H. (2014). An i-vector based descriptor for alphabetical gesture recognition. 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 6593–6597. doi10.1109/ICASSP.2014.6854875.
Cumani, S., Glembek, O., Brümmer, N., de Villiers, E., & Laface, P. (2012). Gender independent discriminative speaker recognition in i-vector space. 2012 IEEE international conference on acoustics, speech and signal processing, ICASSP 2012, Kyoto. Retrieved March 25–30, 2012. pp. 4361–4364. doi:10.1109/ICASSP.2012.6288885.
Dehak, N. (2009). Discriminative and generative approaches for long- and short-term speaker characteristics modeling: Application to speaker verification. PhD thesis, Ecole de Technologie Superieure (Canada), aAINR50490.
Dehak, N., & Shum, S. (2011). Low-dimensional speech representation based on factor analysis and its applications. Johns Hopkins CLSP Lecture.
Dehak, N., Dumouchel, P., & Kenny, P. (2007). Modeling prosodic features with joint factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 15(7), 2095–2103. doi:10.1109/TASL.2007.902758.
Article Google Scholar
Dehak, N., Kenny, P., & Dumouchel, P. (2007b) Continuous prosodic features and formant modeling with joint factor analysis for speaker verification. INTERSPEECH 2007, 8th annual conference of the international speech communication association, Antwerp. Retrieved August 27–31, 2007. pp 1234–1237. http://www.isca-speech.org/archive/interspeech_2007/i07_1234.html.
Dehak, N., Dehak, R., Glass, J. R., Reynolds, D. A., & Kenny, P. (2010). Cosine similarity scoring without score normalization techniques. Proceedings Odyssey 2010—The speaker and language recognition workshop. pp. 15–19.
Dehak, N., Karam, Z. N., Reynolds, D. A., Dehak, R., Campbell, W. M., & Glass, J. R. (2011a). A channel-blind system for speaker verification. Proceedings of the IEEE international conference on acoustics, speech, and signal processing, ICASSP 2011. Retrieved May 22–27, 2011, Prague Congress Center, Prague. pp. 4536–4539. doi:10.1109/ICASSP.2011.5947363.
Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., & Ouellet, P. (2011b). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 788–798. doi:10.1109/TASL.2010.2064307.
Article Google Scholar
Dehak, N., Torres-Carrasquillo, P. A., Reynolds, D. A., & Dehak, R. (2011c). Language recognition via i-vectors and dimensionality reduction. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. Retrieved August 27–31, 2011. pp. 857–860. http://www.isca-speech.org/archive/interspeech_2011/i11_0857.html.
DeMarco, A., & Cox, S. J. (2012). Iterative classification of regional British accents in i-vector space. 2012 Symposium on machine learning in speech and language processing, MLSLP 2012, Portland. Retrieved September 14, 2012, pp. 1–4.
DeMarco, A., & Cox, S. J. (2013). Native accent classification via I-vectors and speaker compensation fusion. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. Retrieved August 25–29, 2013. pp. 1472–1476. http://www.isca-speech.org/archive/interspeech_2013/i13_1472.html.
Dupuy, G., Rouvier, M., Meignier, S., & Estève, Y. (2012). i-Vectors and ILP clustering adapted to cross-show speaker diarization. INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. Retrieved September 9–13, 2012. pp. 2174–2177. http://www.isca-speech.org/archive/interspeech_2012/i12_2174.html.
Ferrer, L., Scheffer, N., & Shriberg, E. (2010). A comparison of approaches for modeling prosodic features in speaker recognition. 2010 IEEE international conference on acoustics speech and signal processing (ICASSP). pp. 4414–4417. doi:10.1109/ICASSP.2010.5495632.
Foil, J. (1986). Language identification using noisy speech. IEEE international conference on ICASSP ’86. acoustics, speech, and signal processing (Vol. 11)
Gaida, C., Lange, P., Petrick, R., Proba, P., Malatawy, A., & Suendermann-Oeft, D. (2014). Comparing open-source speech recognition toolkits. http://suendermann.com/su/pdf/oasis2014.pdf.
Garcia-Romero, D., & Espy-Wilson, C. Y. (2011). Analysis of i-vector length nNormalization in speaker recognition systems. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence Retrieved August 27–31, 2011. pp. 249–252. http://www.isca-speech.org/archive/interspeech_2011/i11_0249.html.
Garcia-Romero, D., Zhou, X., & Espy-Wilson, C. Y. (2012). Multicondition training of Gaussian PLDA models in i-vector space for noise and reverberation robust speaker recognition. 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4257–4260. doi:10.1109/ICASSP.2012.6288859.
Ghahabi, O., & Hernando, J. (2014a). Deep belief networks for i-vector based speaker recognition. IEEE International conference on acoustics, speech and signal processing, ICASSP 2014, Florence. Retrieved May 4–9, 2014. pp. 1700–1704. doi:10.1109/ICASSP.2014.6853888.
Ghahabi, O., & Hernando, J. (2014b). Global impostor selection for dbns in multi-session i-vector speaker recognition. Proceedings of advances in speech and language technologies for Iberian languages—Second international conference, IberSPEECH 2014, Las Palmas de Gran Canaria. Retrieved November 19–21, 2014. pp. 89–98, doi:10.1007/978-3-319-13623-3.
Glembek, O., Burget, L., Dehak, N., Brummer, N., & Kenny, P. (2009). Comparison of scoring methods used in speaker recognition with joint factor analysis. IEEE International conference on acoustics, speech and signal processing, ICASSP 2009. pp. 4057–4060. doi:10.1109/ICASSP.2009.4960519.
Glembek, O., Burget, L., Matejka, P., Karafiat, M., & Kenny, P. (2011). Simplification and optimization of i-vector extraction. IEEE International conference on acoustics, speech and signal processing (ICASSP), 2011. pp. 4516–4519. doi:10.1109/ICASSP.2011.5947358.
Glembek, O., Ma, J., Matejka, P., Zhang, B., Plchot, O., Burget, L., & Matsoukas, S. (2014). Domain adaptation via within-class covariance correction in i-vector based speaker recognition systems. 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 4032–4036. doi10.1109/ICASSP.2014.6854359.
González, D. M., Plchot, O., Burget, L., Glembek, O., & Matejka, P. (2011). Language recognition in iVectors space. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. Retrieved August 27–31, 2011. pp. 861–864. http://www.isca-speech.org/archive/interspeech_2011/i11_0861.html.
Gupta, V., Kenny, P., Ouellet, P., & Stafylakis, T. (2014). i-Vector-based speaker adaptation of deep neural networks for French broadcast audio transcription. 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 6334–6338. doi:10.1109/ICASSP.2014.6854823.
Hasan, T., Saeidi, R., Hansen, J., & van Leeuwen, D. (2013). Duration mismatch compensation for i-vector based speaker recognition systems. 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 7663–7667. doi:10.1109/ICASSP.2013.6639154.
Hautamäki, V., Cheng, Y., Rajan, P., & Lee, C. (2013). Minimax i-vector extractor for short duration speaker verification. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. Retrieved August 25–29, 2013. pp. 3708–3712. http://www.isca-speech.org/archive/interspeech_2013/i13_3708.html.
Huang, Z., Cheng, Y., Li, K., Hautamäki, V., & Lee, C. (2013). A blind segmentation approach to acoustic event detection based on i-vector. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. Retrieved August 25–29, 2013. pp. 2282–2286. http://www.isca-speech.org/archive/interspeech_2013/i13_2282.html.
Huggins-Daines, D., Kumar, M., Chan, A., Black, A., Ravishankar, M., & Rudnicky, A. (2006), Pocketsphinx: A free, real-time continuous speech recognition system for hand-held devices. 2006 IEEE international conference on acoustics, speech and signal processing, 2006, ICASSP 2006 proceedings (Vol. 1), pp. I-I. doi:10.1109/ICASSP.2006.1659988.
Jancik, Z., Plchot, O., Brummer, N., Burget, L., Glembek, O., Hubeika, V., et al. (2010). Data selection and calibration issues in automatic language recognition—investigation with BUT-AGNITIO NIST LRE 2009 system. Proceedings Odyssey 2010—The speaker and language recognition workshop. pp. 215–221.
Jiang, Y., Lee, K., Tang, Z., Ma, B., Larcher, A., & Li, H. (2012), PLDA modeling in i-vector and supervector space for speaker verification. INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. Retrieved September 9–13, 2012. pp. 1680–1683. http://www.isca-speech.org/archive/interspeech_2012/i12_1680.html.
Kanagasundaram, A., Vogt, R., Dean, D., Sridharan, S., & Mason, M. (2011). i-Vector based speaker recognition on short utterances. INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. Retrieved August 27–31, 2011. pp. 2341–2344. http://www.isca-speech.org/archive/interspeech_2011/i11_2341.html.
Kanagasundaram, A., Dean, D., Vogt, R., McLaren, M., Subramanian, S.,& Mason, M. (2012a). Weighted LDA techniques for i-vector based speaker verification. 2012 IEEE international conference on acoustics, speech and signal processing, ICASSP 2012, Kyoto. Retrieved March 25–30, 2012. pp. 4781–4784. doi:10.1109/ICASSP.2012.6288988.
Kanagasundaram, A., Vogt, R., Dean, D., & Sridharan, S. (2012b). PLDA based speaker recognition on short utterances. Odyssey 2012: The speaker and language recognition workshop, Singapore. Retrieved June 25–28, 2012. pp 28–33.
Kanagasundaram, A., Dean, D., Gonzalez-Dominguez, J., Sridharan, S., Ramos, D., & Gonzalez-Rodriguez, J. (2013). Improving short utterance based i-vector speaker recognition using source and utterance-duration normalization techniques. INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon, Retrieved August 25–29, 2013. pp. 2465–2469. http://www.isca-speech.org/archive/interspeech_2013/i13_2465.html.
Kanagasundaram, A., Dean, D., Sridharan, S., Gonzalez-Dominguez, J., Gonzalez-Rodriguez, J., & Ramos, D. (2014). Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques. Speech Communication, 59, 69–82. doi:10.1016/j.specom.2014.01.004.
Article Google Scholar
Kanagasundaram, A., Dean, D., Sridharan, S., Gonzalez-Dominguez, J., González-Rodríguez, J., & Ramos, D. (2014). Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques. Speech Communication, 59, 69–82. doi:10.1016/j.specom.2014.01.004.
Article Google Scholar
Karafiát, M., Burget, L., Matejka, P., Glembek, O., & Cernocký, J. (2011). iVector-based discriminative adaptation for automatic speech recognition. 2011 IEEE workshop on automatic speech recognition & understanding, ASRU 2011, Waikoloa. Retrieved December 11–15, 2011. pp. 152–157. doi:10.1109/ASRU.2011.6163922.
Karanasou, P., Wang, Y., Gales, M. J. F., & Woodland, P. C. (2014). Adaptation of deep neural network acoustic models using factorised i-Vectors. INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. Retrieved September 14–18, 2014. pp. 2180–2184. http://www.isca-speech.org/archive/interspeech_2014/i14_2180.html.
Kenny, P. (2005). Joint factor analysis of speaker and session variability: Theory and algorithms. CRIM, Montreal, (Report) CRIM-06/08-13.
Kenny, P., Boulianne, G., Ouellet, P., & Dumouchel, P. (2007a). Joint factor analysis versus eigenchannels in speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing, 15(4), 1435–1447. doi:10.1109/TASL.2006.881693.
Article Google Scholar
Kenny, P., Boulianne, G., Ouellet, P., & Dumouchel, P. (2007b). Speaker and session variability in GMM-based speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 15(4), 1448–1460. doi:10.1109/TASL.2007.894527.
Article Google Scholar
Kenny, P., Ouellet, P., Dehak, N., Gupta, V., & Dumouchel, P. (2008). A study of interspeaker variability in speaker verification. Audio, Speech, and Language Processing, IEEE Transactions on, 16(5), 980–988. doi:10.1109/TASL.2008.925147.
Article Google Scholar
Kenny, P., Stafylakis, T., Ouellet, P., Alam, M., & Dumouchel, P. (2013). PLDA for speaker verification with utterances of arbitrary duration. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 7649–7653. doi:10.1109/ICASSP.2013.6639151.
Kockmann, M., Burget, L., & Cernocký, J. (2010). Brno University of Technology System for Interspeech 2010 Paralinguistic Challenge. In: INTERSPEECH 2010, 11th annual conference of the international speech communication association, Makuhari, Chiba. September 26–30, 2010, pp 2822–2825. http://www.isca-speech.org/archive/interspeech_2010/i10_2822.html
Kockmann, M., Ferrer, L., Burget, L., & Cernocký, J. (2011). iVector fusion of prosodic and cepstral features for speaker verification. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence,. August 27–31, 2011, pp 265–268. http://www.isca-speech.org/archive/interspeech_2011/i11_0265.html
Lamere, P., Kwok, P., Gouvea, E., Raj, B., Singh, R., Walker, W., et al. (2003). The CMU SPHINX-4 speech recognition system. In: IEEE international conference on acoustics, speech and signal processing (ICASSP 2003). Hong Kong, 1, 2–5.
Larcher, A., Bousquet, P., Lee, K. A., Matrouf, D., Li, H., & Bonastre, J. F. (2012) i-Vectors in the context of phonetically-constrained short utterances for speaker verification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 4773–4776. doi:10.1109/ICASSP.2012.6288986
Larcher, A., Bonastre, J., Fauve, B. G. B., Lee, K., Lévy, C., Li, H., et al. (2013). ALIZE 3.0: Open source toolkit for state-of-the-art speaker recognition. In: INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. August 25–29, 2013, pp 2768–2772, http://www.isca-speech.org/archive/interspeech_2013/i13_2768.html
Le,V. B., Mella, O., & Fohr, D. (2007). Speaker diarization using normalized cross likelihood ratio. In: INTERSPEECH 2007, 8th annual conference of the international speech communication association, Antwerp. August 27–31, 2007, pp. 1869–1872, http://www.isca-speech.org/archive/interspeech_2007/i07_1869.html
Lei, Y., Burget, L., Ferrer, L., Graciarena, M., & Scheffer, N. (2012a). Towards noise-robust speaker recognition using probabilistic linear discriminant analysis. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4253–4256. 10.1109/ICASSP.2012.6288858
Lei, Y., Burget, L., & Scheffer, N. (2012b). Bilinear factor analysis for ivector based speaker verification. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp 1588–1591, http://www.isca-speech.org/archive/interspeech_2012/i12_1588.html
Lei, Y., Burget, L., & Scheffer, N. (2013). A noise robust i-vector extractor using vector taylor series for speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6788–6791. doi:10.1109/ICASSP.2013.6638976
Lei, Y., McLaren, M., Ferrer, L., & Scheffer, N. (2014a). Simplified VTS-based i-Vector extraction in noise-robust speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 4037–4041. doi:10.1109/ICASSP.2014.6854360.
Lei, Y., Scheffer, N., Ferrer, L., & McLaren, M. (2014b). A novel scheme for speaker recognition using a phonetically-aware deep neural network. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 1695–1699. doi:10.1109/ICASSP.2014.6853887
Li, M., & Liu, W. (2014). Speaker verification and spoken language identification using a generalized i-vector framework with phonetic tokenizations and tandem features. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp 1120–1124, http://www.isca-speech.org/archive/interspeech_2014/i14_1120.html
Li, M., Zhang, X., Yan, Y., & Narayanan, S. S. (2011). Speaker verification using sparse representations on total variability i-vectors. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp 2729–2732, http://www.isca-speech.org/archive/interspeech_2011/i11_2729.html
Mandasari, M., McLaren, M., & van Leeuwen, D. (2012). The effect of noise on modern automatic speaker recognition systems. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4249–4252. doi:10.1109/ICASSP.2012.6288857
Mandasari, M. I., McLaren, M., & van Leeuwen, D. A. (2011). Evaluation of i-vector speaker recognition systems for forensic application. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 21–24, http://www.isca-speech.org/archive/interspeech_2011/i11_0021.html
Mariooryad, S., & Busso, C. (2014). Compensating for speaker or lexical variabilities in speech for emotion recognition. Speech Communication, 57(0):1–12. doi:10.1016/j.specom.2013.07.011, http://www.sciencedirect.com/science/article/pii/S0167639313001015
Martinez, D., Burget, L., Ferrer, L., & Scheffer, N. (2012). iVector-based prosodic system for language identification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4861–4864. doi:10.1109/ICASSP.2012.6289008
Martinez, D., Lleida, E., Ortega, A., & Miguel, A. (2013). Prosodic features and formant modeling for an ivector-based language recognition system. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6847–6851. doi:10.1109/ICASSP.2013.6638988
Martinez, D., Burget, L., Stafylakis, T., Lei, Y., Kenny, P., & Lleida, E. (2014). Unscented transform for ivector-based noisy speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4042–4046. doi:10.1109/ICASSP.2014.6854361
Matejka, P., Glembek, O., Castaldo, F., Alam, M., Plchot, O., Kenny, P., et al. (2011). Full-covariance UBM and heavy-tailed PLDA in i-vector speaker verification. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4828–4831. doi:10.1109/ICASSP.2011.5947436
McLaren, M., & van Leeuwen, D. (2011a). Source-normalised-and-weighted LDA for robust speaker recognition using i-vectors. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5456–5459, DOI 10.1109/ICASSP.2011.5947593.
McLaren, M., & van Leeuwen, D. (2012a). Gender-independent speaker recognition using source normalisation. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4373–4376. doi:10.1109/ICASSP.2012.6288888
McLaren, M., & van Leeuwen, D. (2012b). Source-normalized LDA for Robust speaker recognition using i-vectors from multiple speech sources. IEEE Transactions on Audio, Speech, and Language Processing, 20(3), 755–766. doi:10.1109/TASL.2011.2164533.
Article Google Scholar
McLaren, M., & van Leeuwen, D. A. (2011b). To weight or not to weight: source-normalised LDA for speaker recognition using i-vectors. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 2709–2712, http://www.isca-speech.org/archive/interspeech_2011/i11_2709.html
Meignier, S., & Merlin, T. (2010). LIUM SpkDiarization: An open source toolkit for diarization. In: CMU SPUD workshop (Vol. 2010)
Novoselov, S., Pekhovsky, T., Simonchik, K., & Shulipa, A. (2014). RBM-PLDA subsystem for the NIST i-vector challenge. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore, September 14–18, 2014, pp 378–382, http://www.isca-speech.org/archive/interspeech_2014/i14_0378.html
Picone, J. (1993). Signal modeling techniques in speech recognition. Proceedings of the IEEE, 81(9), 1215–1247. doi:10.1109/5.237532.
Article Google Scholar
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., et al. (2011). The kaldi speech recognition toolkit. In: IEEE 2011 workshop on automatic speech recognition and understanding, IEEE Signal Processing Society, iEEE Catalog No.: CFP11SRW-USB.
Reynolds, D. A. (1992). A Gaussian mixture modeling approach to text-independent speaker identification. Ph.D. dissertation, Georgia Institute of Technology.
Reynolds, D. A., Quatieri, T. F., & Dunn, R. B. (2000). Speaker verification using adapted gaussian mixture models. Digital Signal Processing, 10(13), 19–41. doi:10.1006/dspr.1999.0361.
Article Google Scholar
Rouvier, M., & Favre, B. (2014). Speaker adaptation of DNN-based ASR with i-vectors: Does it actually adapt models to speakers? In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore, September 14–18, 2014, pp. 3007–3011, http://www.isca-speech.org/archive/interspeech_2014/i14_3007.html
Rouvier, M., Dupuy, G., Gay, P., el Khoury, E., Merlin, T., & Meignier, S. (2013). An open-source state-of-the-art toolbox for broadcast news diarization. In: INTERSPEECH 2013, 14th annual conference of the international speech communication association, Lyon. August 25–29, 2013, pp. 1477–1481, http://www.isca-speech.org/archive/interspeech_2013/i13_1477.html
Sadjadi, S. O., Slaney, M., & Heck, L. (2013). Msr identity toolbox v1.0: A matlab toolbox for speaker-recognition research. Speech and Language Processing Technical Committee Newsletter http://research.microsoft.com/apps/pubs/default.aspx?id=205119
Sarkar, A. K., Matrouf, D., Bousquet, P., & Bonastre, J. (2012). Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp. 2662–2665, http://www.isca-speech.org/archive/interspeech_2012/i12_2662.html
Sarkar, S., & Rao, K. S. (2014). A novel boosting algorithm for improved i-vector based speaker verification in noisy environments. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp. 671–675, http://www.isca-speech.org/archive/interspeech_2014/i14_0671.html
Segbroeck, M. V., Travadi, R., & Narayanan, S. S. (2014a) UBM fused total variability modeling for language identification. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp. 3027–3031, http://www.isca-speech.org/archive/interspeech_2014/i14_3027.html
Segbroeck, M. V., Travadi, R., Vaz, C., Kim, J., Black, M. P., Potamianos, A., et al. (2014b). Classification of cognitive load from speech using an i-vector framework. in: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp. 751–755, http://www.isca-speech.org/archive/interspeech_2014/i14_0751.html
Senior, A., & Lopez-Moreno, I. (2014). Improving DNN speaker independence with i-vector inputs. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 225–229. doi:10.1109/ICASSP.2014.6853591
Senoussaoui, M., Kenny, P., Dehak, N., & Dumouchel, P. (2010). An i-vector extractor suitable for speaker recognition with both microphone and telephone speech. In: Odyssey 2010: the speaker and language recognition workshop, Brno, June 28–July 1, 2010, p. 6
Senoussaoui, M., Kenny, P., Brümmer, N., de Villiers, E., & Dumouchel, P. (2011). Mixture of PLDA models in i-vector space for gender-independent speaker recognition. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 25–28, http://www.isca-speech.org/archive/interspeech_2011/i11_0025.html
Shum, S., Dehak, N., Chuangsuwanich, E., Reynolds, D. A., & Glass, J. R. (2011). Exploiting intra-conversation variability for speaker diarization. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 945–948, http://www.isca-speech.org/archive/interspeech_2011/i11_0945.html
Silovsky, J., & Prazak, J. (2012). Speaker diarization of broadcast streams using two-stage clustering based on i-vectors and cosine distance scoring. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4193–4196. doi:10.1109/ICASSP.2012.6288843
Simonchik, K., Pekhovsky, T., Shulipa, A., & Afanasyev, A. (2012). Supervized mixture of PLDA models for cross-channel speaker verification. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp. 1684–1687, http://www.isca-speech.org/archive/interspeech_2012/i12_1684.html
Sizov, A., el Khoury, E., Kinnunen, T., Wu, Z., & Marcel, S. (2015). Joint speaker verification and antispoofing in the i-vector space. IEEE Transactions on Information Forensics and Security, 10(4), 821–832. doi:10.1109/TIFS.2015.2407362.
Soufifar, M., Kockmann, M., Burget, L., Plchot, O., Glembek, O., & Svendsen, T. (2011). iVector approach to phonotactic language recognition. In: INTERSPEECH 2011, 12th annual conference of the international speech communication association, Florence. August 27–31, 2011, pp. 2913–2916, http://www.isca-speech.org/archive/interspeech_2011/i11_2913.html
Travadi, R., Segbroeck, M. V., & Narayanan, S. S. (2014). Modified-prior i-vector estimation for language identification of short duration utterances. In: INTERSPEECH 2014, 15th annual conference of the international speech communication association, Singapore. September 14–18, 2014, pp. 3037–3041, http://www.isca-speech.org/archive/interspeech_2014/i14_3037.html
Varga, A., & Steeneken, H. J. (1993). Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication 12(3):247–251. doi:10.1016/0167-6393(93)90095-3, http://www.sciencedirect.com/science/article/pii/0167639393900953
Variani, E., Lei, X., McDermott, E., Lopez Moreno, I., & Gonzalez-Dominguez, J. (2014). Deep neural networks for small footprint text-dependent speaker verification. In: Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 4052–4056. doi:10.1109/ICASSP.2014.6854363
Villalba, J., & Lleida, E. (2013). Handling i-vectors from different recording conditions using multi-channel simplified PLDA in speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6763–6767. doi:10.1109/ICASSP.2013.6638971
Wolf, J. J. (1972). Efficient acoustic parameters for speaker recognition. The Journal of the Acoustical Society of America, 51(6B):2044–2056. doi:10.1121/1.1913065, http://scitation.aip.org/content/asa/journal/jasa/51/6B/10.1121/1.1913065
Wu, T., Yang, Y., Wu, Z., & Li, D. (2006). MASC: A speech corpus in mandarin for emotion analysis and affective speaker recognition. In: Speaker and language recognition workshop, 2006. IEEE Odyssey 2006: The, pp. 1–5. doi:10.1109/ODYSSEY.2006.248084
Xia, R., & Liu, Y. (2012). Using i-vector space model for emotion recognition. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp. 2230–2233, http://www.isca-speech.org/archive/interspeech_2012/i12_2230.html
Yin, S. C., Kenny, P., & Rose, R. (2006). Experiments in speaker adaptation for factor analysis based speaker verification. In: Speaker and language recognition workshop, 2006. IEEE Odyssey 2006: The, pp. 1–6. doi:10.1109/ODYSSEY.2006.248130
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X. A., et al. (2006). The HTK book (for HTK version 3.4).
Yu, C., Liu, G., Hahm, S., & Hansen, J. (2014). Uncertainty propagation in front end factor analysis for noise robust speaker recognition. In: IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 4017–4021. doi:10.1109/ICASSP.2014.6854356
Zheng, R., Zhang, C., Zhang, S., & Xu, B. (2014). Variational bayes based i-vector for speaker diarization of telephone conversations. In: IEEE international conference on acoustics, speech and signal processing, ICASSP, Florence. May 4–9, 2014, pp. 91–95. doi:10.1109/ICASSP.2014.6853564
Zhuang, X., Tsakalidis, S., Wu, S., Natarajan, P., Prasad, R., & Natarajan, P. (2012). Compact audio representation for event detection in consumer media. In: INTERSPEECH 2012, 13th annual conference of the international speech communication association, Portland. September 9–13, 2012, pp. 2089–2092, http://www.isca-speech.org/archive/interspeech_2012/i12_2089.html

Acknowledgments

The authors wish to acknowledge UNICEF India and the DST, Government of India, for the funding provided under their FIST scheme, which greatly aided in the work reported herein.

Author information

Authors and Affiliations

Department of Computer Science & Engineering, Indian Institute of Technology Guwahati, Guwahati, Assam, 781039, India
Pulkit Verma & Pradip K. Das

Authors

Pulkit Verma
Pradip K. Das

Corresponding author

Correspondence to Pulkit Verma.

About this article

Cite this article

Verma, P., Das, P.K. i-Vectors in speech processing applications: a survey. Int J Speech Technol 18, 529–546 (2015). https://doi.org/10.1007/s10772-015-9295-3

Received: 14 May 2015
Accepted: 27 July 2015
Published: 06 August 2015
Issue Date: December 2015
DOI: https://doi.org/10.1007/s10772-015-9295-3
