Transfer learning for PLDA-based speaker verification
Introduction
The task of speaker verification entails confirming the identity of a speaker from a given speech utterance. The robustness of this verification process is affected by many factors, such as channel variation, utterance duration, noise level, and the speaker's emotional state. In the past ten years, many machine learning approaches, including the support vector machine (SVM) (Campbell, Sturim, Reynolds, 2006; Campbell, Sturim, Reynolds, Solomonoff, 2006), joint factor analysis (JFA) (Kenny, 2006; Matrouf, Scheffer, Fauve, Bonastre, 2007), and the i-vector (Dehak et al., 2011), have been proposed. Currently, state-of-the-art speaker verification systems are based on the i-vector and probabilistic linear discriminant analysis (PLDA) (Prince, Elder, 2007; Garcia-Romero, Espy-Wilson, 2011; Kenny, 2010). Using the posterior estimation of the hidden variables on the Baum–Welch statistics from the Gaussian components of a Universal Background Model (UBM) (Reynolds et al., 2000), each speech utterance can be represented as a fixed-length, low-dimensional vector, i.e., an i-vector. With this representation, speaker verification becomes a standard machine learning problem, so many machine learning approaches, such as PLDA, can be applied in this area.
PLDA was first proposed for face verification (Prince and Elder, 2007), and it has gained popularity as an elegant classification tool for finding target classes in recent NIST Speaker Recognition Evaluation (SRE) challenges. Despite this success, it still requires tens of thousands of labeled development sessions from many speakers. For the NIST evaluations this is not a problem, since sufficient data are provided, but it makes PLDA ill-suited for practical applications. Even if we have sufficient development data to obtain a well-optimized PLDA for a source domain, that model is not suitable for direct use on a new target domain with a different distribution. This is because the PLDA represents the distribution of the development data and works well only under the assumption that the development and evaluation data share the same distribution. Several studies have shown that when the development and evaluation data come from different domains, the performance of speaker verification deteriorates significantly due to domain mismatch (Villalba, Lleida, 2012; Garcia-Romero, McCree, 2014; Garcia-Romero, McCree, Shum, Brummer, Vaquero, 2014; Glembek, Ma, Matejka, Zhang, Plchot, Burget, Matsoukas, 2014; Aronowitz, 2014a; Aronowitz, 2014b; Kanagasundaram, Dean, Sridharan, 2015; Singer, Reynolds, 2015; Wang, Yamamoto, Koshinaka, 2016).
To minimize the performance gap between domains, Villalba and Lleida (2012) applied a variational Bayes method to a two-covariance model. Garcia-Romero et al. proposed several adaptation approaches with similar performance, among which the PLDA interpolation approach did not require keeping the i-vectors of the source domain to retrain the PLDA (Garcia-Romero, McCree, 2014; Garcia-Romero, McCree, Shum, Brummer, Vaquero, 2014). Aronowitz introduced the inter-dataset variability compensation (IDVC) method to compensate for mismatch shifts in the i-vector space (Aronowitz, 2014b) and in the PLDA hyper-parameters (Aronowitz, 2014a). Kanagasundaram et al. (2015) proposed an unsupervised inter-dataset variability approach to compensate for this mismatch, but only a linear discriminant analysis (LDA) projection was applied prior to PLDA modeling. Singer and Reynolds (2015) further analyzed the domain mismatch during i-vector length normalization and proposed an adaptive whitening method with a library of whiteners, achieving performance comparable to the IDVC approach (Aronowitz, 2014b). Wang et al. (2016) proposed a maximum likelihood linear transformation (MLLT) framework to infer the relationship between the datasets of two domains when training a PLDA with the two-covariance representation; the reported results showed that the fused system performed better than the in-domain PLDA.
The above domain adaptation methods improved speaker verification systems by remedying the problem of domain mismatch, and to some extent they can be viewed as transfer learning methods. As shown in Fig. 1, given sufficient development data in the source domain, the transfer learning task is to find a matched PLDA classifier for the target domain, which has limited development data or a different distribution. Transfer learning (Pan, Yang, 2010; Deng, Li, 2013) addresses cross-domain learning problems and has been successfully applied in many fields, such as face verification (Cao et al., 2013), spoken language understanding (Jeong and Lee, 2009), and visual categorization (Shao et al., 2015). Based on prior knowledge of the source domain, the objective of transfer learning is to reduce the mismatch between training and test conditions by finding a mapping that makes the classifier fit the specifics of the target domain. Many transfer learning techniques have been proposed, such as nonnegative matrix tri-factorization (Shao, Zhu, Li, 2015; Wang, Nie, Huang, Ding, 2011; Li, Ding, 2006) and dimensionality reduction (Pan et al., 2008). In the area of automatic speech recognition (ASR), maximum a posteriori (MAP) and maximum likelihood linear regression (MLLR) adaptation are two commonly used homogeneous adaptation techniques (Deng and Li, 2013).
The Kullback–Leibler (KL) divergence measures the distance, or difference, between two distributions, which makes it well suited as an optimization objective for minimizing domain mismatch. In Cao et al. (2013), this measure was used in transfer learning for face verification to relate the distributions of the source and target domains. To improve speaker verification in situations with scarce numbers of speakers and sessions, we previously proposed a transfer learning method (Hong et al., 2016) in which a KL regularization factor was added to the objective function of PLDA. Experimental results on the NIST SRE and Switchboard corpora showed that our method greatly improved verification performance compared with the Gaussian PLDA, and that it was more effective at reducing the performance gap than PLDA interpolation (Garcia-Romero and McCree, 2014).
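To make this measure concrete, the KL divergence between two multivariate Gaussians, the form relevant when comparing Gaussian models of i-vector distributions across domains, has a closed form. The sketch below is illustrative only (the function name and test values are our own, not the exact regularizer of the proposed method):

```python
import numpy as np

def kl_gaussians(mu0, cov0, mu1, cov1):
    """Closed-form KL(N0 || N1) between two multivariate Gaussians."""
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (
        np.trace(cov1_inv @ cov0)          # trace term
        + diff @ cov1_inv @ diff           # Mahalanobis term for the means
        - d                                # dimensionality offset
        + np.log(np.linalg.det(cov1) / np.linalg.det(cov0))  # log-det ratio
    )

# Identical distributions give zero divergence.
mu, cov = np.zeros(2), np.eye(2)
print(kl_gaussians(mu, cov, mu, cov))  # 0.0
```

The divergence is zero only when the two distributions coincide and grows with the mean shift and covariance mismatch, which is exactly the behavior wanted from a domain-mismatch penalty.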
For speaker verification, domain mismatch might also occur due to utterance variation. Since the statistics for extracting i-vectors are accumulated over time, the duration of a speech utterance has a great influence on the reliability of i-vector features (Nautsch et al., 2015). For short utterances with sparse statistics, the performance of speaker verification deteriorates greatly due to limited discriminative information. Many studies have investigated the influence of duration (Mandasari, Saeidi, Van Leeuwen, 2015; Mandasari, Saeidi, McLaren, Van Leeuwen, 2013; Nautsch, Rathgeb, Busch, Reininger, Kasper, 2014; Hasan, Saeidi, Hansen, Van Leeuwen, 2013; Kenny, Stafylakis, Ouellet, Alam, Dumouchel, 2013; Vogt, Baker, Sridharan; Kanagasundaram, Vogt, Dean, Sridharan, Mason, 2011; Sarkar, Matrouf, Bousquet, Bonastre, 2012; Kanagasundaram, Dean, Sridharan, Gonzalez-Dominguez, Gonzalez-Rodriguez, Ramos, 2014; Hong, Li, Li, Huang, Wan, Zhang, 2015), introducing improved methods for score calibration and duration modeling. In Mandasari et al. (2015, 2013), a quality measure function (QMF) for duration is adopted to adjust the score distributions that were shifted by duration mismatches. Duration modeling is usually based on the PLDA, and duration variation can propagate uncertainty into a PLDA classifier, especially for short utterances. In Kenny et al. (2013), Vogt et al., Kanagasundaram et al. (2011), Sarkar et al. (2012), and Kanagasundaram et al. (2014), the duration variability of the i-vector was included in the PLDA model and the performance improved accordingly. In Hong et al. (2015), we also proposed an effective modified-prior PLDA framework to deal with duration variation.
As shorter utterances tend to have larger covariance, the probability distribution function of the i-vector can be modified with a duration-scaled covariance matrix during PLDA training. The formulation of the likelihood for the standard Gaussian PLDA model is then revised according to the duration-dependent posterior distribution of the i-vector.
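Schematically, the idea can be written as follows. The scaling function $s(d)$ and the symbols here are illustrative assumptions, not the exact formulation of the modified-prior framework:

```latex
% Duration-dependent prior for an i-vector x from an utterance of
% duration d: shorter utterances receive a broader covariance.
p(\mathbf{x} \mid d) = \mathcal{N}\!\left(\mathbf{x};\; \boldsymbol{\mu},\; \tfrac{1}{s(d)}\,\boldsymbol{\Sigma}\right),
\qquad s(d) \text{ monotonically increasing in } d
```

Because $s(d)$ shrinks as $d$ decreases, a short utterance contributes a flatter, less confident prior, which is then propagated into the PLDA likelihood.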
Overall, these works only consider the duration variation within the target domain. Nevertheless, performance is still likely to deteriorate for short utterances, which carry limited discriminative information. Speaker verification on short utterances can be viewed as a task in a target domain that has only a limited amount of long utterances with sufficient linguistic content, whereas the linguistic content missing from short utterances may be present in development data with long-duration i-vectors. Therefore, transfer learning can also be adopted to learn valuable information from another domain with long utterances. In this study, we further extend the transfer learning framework to a target domain with short utterances: using a source domain with a large amount of long utterances, the proposed method trains an appropriate PLDA model for the target domain, which has only a limited amount of long utterances.
This paper is organized as follows. First, the baseline i-vector and the general theory of standard Gaussian PLDA are introduced, and then the re-estimation and scoring formulas of PLDA are given. After that, we describe in detail the proposed transfer learning method for PLDA, including the objective function and re-estimation formulas. Experiments based on the NIST SRE and Switchboard corpus are then conducted to verify the effectiveness of this proposed method.
Section snippets
The i-vector
In the state-of-the-art i-vector speaker verification system, an i-vector x is a fixed-length vector obtained by decomposing the Gaussian mean supervector through a total variability matrix T into a single low-dimensional subspace: M = m + Tx, where M is a speaker- and session-dependent Gaussian mean supervector, m is the speaker- and session-independent mean supervector of the UBM, and x is a hidden variable defined as the mean of the posterior distribution of the Baum–Welch statistics for an utterance. If we are given a UBM
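Under this model, the posterior mean of x given the centered Baum–Welch statistics has a standard closed form, x = (I + TᵀΣ⁻¹NT)⁻¹TᵀΣ⁻¹F̃. A minimal sketch follows; the function name, the dense-matrix layout of the zeroth-order statistics N (in practice block-diagonal with per-component occupation counts), and the toy values are illustrative assumptions:

```python
import numpy as np

def ivector_posterior_mean(T, Sigma_inv, N, F_centered):
    """Posterior mean of the hidden variable x given Baum-Welch statistics.

    T          : (CF, R) total variability matrix
    Sigma_inv  : (CF, CF) inverse UBM covariance (often diagonal)
    N          : (CF, CF) zeroth-order statistics expanded to supervector space
    F_centered : (CF,) first-order statistics centered on the UBM means
    """
    R = T.shape[1]
    # Posterior precision of x: identity prior plus data term.
    precision = np.eye(R) + T.T @ Sigma_inv @ N @ T
    # Solve the linear system rather than inverting the precision.
    return np.linalg.solve(precision, T.T @ Sigma_inv @ F_centered)
```

For an utterance with sparse statistics (small N), the data term shrinks and the posterior mean is pulled toward the zero prior, which is the mechanism behind the duration effects discussed in the introduction.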
Transfer learning
Due to the problem of scarce data, a PLDA model that is directly optimized on the limited development data of the target domain may be an over-fitted solution. This complication can be avoided by adding a regularization factor to the optimization function. Furthermore, this regularization factor can be designed to exploit the development data of another domain. In view of the distribution similarity between the data in the source domain and the target domain, we can utilize the information
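Schematically, such a regularized objective can be written as below. The trade-off weight λ, the direction of the divergence, and the notation are our own illustrative assumptions, not the paper's exact formulation:

```latex
% KL-regularized training objective (schematic): maximize the PLDA
% likelihood on target-domain data while keeping the adapted model
% close to the source-domain model.
\mathcal{J}(\theta) \;=\; \log p\!\left(X_{\mathrm{tgt}} \mid \theta\right)
\;-\; \lambda\, \mathrm{KL}\!\left( p(\mathbf{x} \mid \theta_{\mathrm{src}}) \,\middle\|\, p(\mathbf{x} \mid \theta) \right)
```

The first term fits the target-domain development data; the second penalizes parameter estimates that drift far from the well-trained source-domain model, preventing over-fitting when the target data are scarce.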
Experiments
To evaluate the effectiveness of the proposed transfer learning method for a target domain with limited development data or short utterances, experiments were conducted on the NIST SRE and Switchboard (SWB) corpora. We extracted 32-dimensional MFCCs with appended delta coefficients from each speech utterance. The total variability subspace of dimension 400 was estimated using Baum–Welch statistics, and the PLDA was trained with a speaker subspace of dimension 120. All of the
Conclusions
This study addresses the problem of domain mismatch in PLDA-based speaker verification. Generally, PLDA requires an immense amount of development data from many speakers, which makes it difficult to learn the PLDA parameters in a target domain with scarce data or with a distribution different from that of the source. In this paper, we designed a transfer learning method from a source domain with sufficient development data to a new target domain. The proposed method is based on the KL divergence, which
References (41)
- Multi-domain spoken language understanding with transfer learning. Speech Commun. (2009)
- Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques. Speech Commun. (2014)
- Quality measures based calibration with duration and noise dependency for speaker recognition. Speech Commun. (2015)
- Speaker verification using adapted Gaussian mixture models. Digital Signal Processing (2000)
- Compensating inter-dataset variability in PLDA hyper-parameters for robust speaker recognition. Proc. Speaker Odyssey, Joensuu, Finland (2014)
- Inter dataset variability compensation for speaker recognition. Proc. ICASSP, Florence, Italy (2014)
- SVM based speaker verification using a GMM supervector kernel and NAP variability compensation. Proc. ICASSP, Toulouse, France (2006)
- Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. (2006)
- A practical transfer learning algorithm for face verification. Proc. IEEE International Conference on Computer Vision (2013)
- Pairwise discriminative speaker verification in the i-vector space. IEEE Trans. Audio Speech Lang. Process. (2013)
- Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. (2011)
- Machine learning paradigms for speech recognition: an overview. IEEE Trans. Audio Speech Lang. Process. (2013)
- Analysis of i-vector length normalization in speaker recognition systems. Proc. INTERSPEECH, Florence, Italy (2011)
- Supervised domain adaptation for i-vector based speaker recognition. Proc. ICASSP, Florence, Italy (2014)
- Unsupervised domain adaptation for i-vector speaker recognition. Proc. Odyssey, The Speaker and Language Recognition Workshop, Joensuu, Finland (2014)
- Domain adaptation via within-class covariance correction in i-vector based speaker recognition systems. Proc. ICASSP, Florence, Italy (2014)
- Duration mismatch compensation for i-vector based speaker recognition systems. Proc. ICASSP, Vancouver, Canada (2013)
- Modified-prior PLDA and score calibration for duration mismatch compensation in speaker recognition system. Proc. INTERSPEECH, Dresden, Germany (2015)
- A transfer learning method for PLDA-based speaker verification. Proc. ICASSP, Shanghai, China (2016)