Speech Communication

Volume 92, September 2017, Pages 90-99

Transfer learning for PLDA-based speaker verification

https://doi.org/10.1016/j.specom.2017.05.004

Abstract

Currently, the majority of state-of-the-art speaker verification systems are based on i-vectors and PLDA; however, PLDA requires a large volume of development data from many different speakers. This makes it difficult to learn PLDA parameters for a domain with scarce data. In this paper, we study and extend an effective transfer learning method based on Bayesian joint probability, in which the Kullback–Leibler (KL) divergence between the source domain and the target domain is added as a regularization factor. This method utilizes the development data of the source domain to help find the optimal PLDA parameters for the target domain. In particular, speaker verification with short utterances can be viewed as a task in a domain that has only a limited amount of long-utterance data. Transfer learning for PLDA can therefore also be adopted to learn discriminative information from other domains with plentiful long utterances. Experimental results based on the NIST SRE and Switchboard corpora demonstrate that the proposed method offers a significant performance gain over traditional PLDA.

Introduction

The task of speaker verification is to confirm the claimed identity of a speaker given a speech utterance. The robustness of this verification process is affected by many factors, such as channel variation, utterance duration, noise level, and the speaker's emotion. Over the past ten years, many machine learning approaches have been proposed, including the support vector machine (SVM) (Campbell, Sturim, Reynolds, 2006; Campbell, Sturim, Reynolds, Solomonoff, 2006), joint factor analysis (JFA) (Kenny, 2006; Matrouf, Scheffer, Fauve, Bonastre, 2007), and the i-vector (Dehak et al., 2011). Currently, the state-of-the-art speaker verification systems are based on i-vectors and probabilistic linear discriminant analysis (PLDA) (Prince, Elder, 2007; Garcia-Romero, Espy-Wilson, 2011; Kenny, 2010). Using the posterior estimation of the hidden variables on the Baum–Welch statistics from the Gaussian components of a Universal Background Model (UBM) (Reynolds et al., 2000), each speech utterance can be represented as a fixed-length, low-dimensional vector, i.e., an i-vector. With i-vectors, speaker verification becomes a standard machine learning problem, so many machine learning approaches, such as PLDA, can be applied in this area.

PLDA was first proposed for face verification (Prince and Elder, 2007), and it has gained popularity as an elegant classification tool for finding target classes in recent NIST Speaker Recognition Evaluation (SRE) challenges. Despite this success, it still requires tens of thousands of labeled development sessions from many speakers. For the NIST evaluations this is not a problem, since sufficient data are provided, but it makes PLDA ill-suited for many practical applications. Even if we have sufficient development data to obtain a well-optimized PLDA for a source domain, that model is not suitable for direct use on a new target domain with a different data distribution. This is because the PLDA model represents the distribution of the development data and works well only under the assumption that the development and evaluation data share the same distribution. Several studies have shown that when the development and evaluation data come from different domains, speaker verification performance deteriorates significantly due to domain mismatch (Villalba, Lleida, 2012; Garcia-Romero, McCree, 2014; Garcia-Romero, McCree, Shum, Brummer, Vaquero, 2014; Glembek, Ma, Matejka, Zhang, Plchot, Burget, Matsoukas, 2014; Aronowitz, 2014a; Aronowitz, 2014b; Kanagasundaram, Dean, Sridharan, 2015; Singer, Reynolds, 2015; Wang, Yamamoto, Koshinaka, 2016).

To minimize the performance gap between domains, Villalba and Lleida (2012) applied a variational Bayes method to a two-covariance model. Garcia-Romero et al. proposed several adaptation approaches with similar performance, among which the PLDA interpolation approach does not require keeping the i-vectors of the source domain to retrain the PLDA (Garcia-Romero, McCree, 2014; Garcia-Romero, McCree, Shum, Brummer, Vaquero, 2014). Aronowitz introduced the inter-dataset variability compensation (IDVC) method to compensate for mismatch-induced shifts in the i-vector space (Aronowitz, 2014b) and in the PLDA hyper-parameters (Aronowitz, 2014a). Kanagasundaram et al. proposed an unsupervised inter-dataset variability approach to compensate for this mismatch, but only a linear discriminant analysis (LDA) projection was applied prior to PLDA modeling (Kanagasundaram et al., 2015). The domain mismatch arising during i-vector length normalization was further analyzed by Singer and Reynolds (2015), who proposed an adaptive whitening method with a library of whiteners that achieves performance comparable to the IDVC approach (Aronowitz, 2014b). Wang et al. (2016) proposed a maximum likelihood linear transformation (MLLT) framework to infer the relationship between the datasets of two domains when training a PLDA with a two-covariance representation. The reported results showed that the fused system performed better than the in-domain PLDA.

The above domain adaptation methods improved speaker verification systems by remedying the problem of domain mismatch. To some extent, these works can be viewed as transfer learning methods. As shown in Fig. 1, given sufficient development data for the source domain, the transfer learning task is to find a matched PLDA classifier for the target domain, which has limited development data or a different distribution. Transfer learning (Pan, Yang, 2010; Deng, Li, 2013) addresses cross-domain learning problems and has been successfully applied in many fields, such as face verification (Cao et al., 2013), spoken language understanding (Jeong and Lee, 2009), and visual categorization (Shao et al., 2015). Based on prior knowledge of the source domain, the objective of transfer learning is to reduce the mismatch between training and test conditions by finding a mapping for the classifier that fits the specifics of the target domain. Many transfer learning techniques have been proposed, such as nonnegative matrix tri-factorization (Shao, Zhu, Li, 2015; Wang, Nie, Huang, Ding, 2011; Li, Ding, 2006) and dimensionality reduction (Pan et al., 2008). In the area of automatic speech recognition (ASR), maximum a posteriori (MAP) and maximum likelihood linear regression (MLLR) adaptation are two commonly used homogeneous adaptation techniques (Deng and Li, 2013).

The Kullback–Leibler (KL) divergence measures the distance, or difference, between two distributions; it is therefore natural to adopt the KL divergence in the optimization objective to minimize domain mismatch. In Cao et al. (2013), this measure was used in transfer learning for face verification to compare the distributions of the source domain and the target domain. To improve speaker verification when only a few speakers and sessions are available, we previously proposed a transfer learning method (Hong et al., 2016) in which a KL regularization factor is added to the objective function of PLDA. Experimental results based on the NIST SRE and Switchboard corpora showed that our method greatly improved verification performance compared with the Gaussian PLDA. Furthermore, it was more effective at reducing the performance gap than PLDA interpolation (Garcia-Romero and McCree, 2014).
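
To make the role of such a regularizer concrete, a schematic form of the objective is sketched below; the trade-off weight λ and the exact pair of distributions being compared are illustrative assumptions rather than the paper's precise formulation.

```latex
% Schematic KL-regularized training objective: fit the target-domain
% data while staying close to the source-domain model (illustrative).
\mathcal{L}(\theta_t) = \log p(X_t \mid \theta_t)
  - \lambda \, \mathrm{KL}\big( p(x \mid \theta_s) \,\|\, p(x \mid \theta_t) \big)

% For two d-dimensional Gaussians N(\mu_s, \Sigma_s) and N(\mu_t, \Sigma_t),
% the KL term has the standard closed form
\mathrm{KL} = \frac{1}{2} \Big[ \operatorname{tr}\!\big(\Sigma_t^{-1}\Sigma_s\big)
  + (\mu_t - \mu_s)^\top \Sigma_t^{-1} (\mu_t - \mu_s)
  - d + \ln \frac{\det \Sigma_t}{\det \Sigma_s} \Big]
```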

For speaker verification, domain mismatches can also arise from utterance variations. Since the statistics for extracting i-vectors are accumulated over time, the duration of a speech utterance has a great influence on the reliability of i-vector features (Nautsch et al., 2015). For short utterances with sparse statistics, speaker verification performance deteriorates greatly due to limited discriminative information. Many studies have investigated the influence of duration (Mandasari, Saeidi, Van Leeuwen, 2015; Mandasari, Saeidi, McLaren, Van Leeuwen, 2013; Nautsch, Rathgeb, Busch, Reininger, Kasper, 2014; Hasan, Saeidi, Hansen, Van Leeuwen, 2013; Kenny, Stafylakis, Ouellet, Alam, Dumouchel, 2013; Vogt, Baker, Sridharan; Kanagasundaram, Vogt, Dean, Sridharan, Mason, 2011; Sarkar, Matrouf, Bousquet, Bonastre, 2012; Kanagasundaram, Dean, Sridharan, Gonzalez-Dominguez, Gonzalez-Rodriguez, Ramos, 2014; Hong, Li, Li, Huang, Wan, Zhang, 2015), introducing improved methods for score calibration and duration modeling. In Mandasari et al. (2015, 2013), a quality measure function (QMF) of duration is adopted to adjust score distributions that are shifted by duration mismatches. Duration modeling is usually based on the PLDA, since duration variation propagates uncertainty into a PLDA classifier, especially for short utterances. In Kenny et al. (2013), Vogt et al., Kanagasundaram et al. (2011), Sarkar et al. (2012), and Kanagasundaram et al. (2014), the duration variability of the i-vector was included in the PLDA model and performance improved accordingly. In Hong et al. (2015), we also proposed an effective modified-prior PLDA framework to deal with duration variation. As shorter utterances tend to have larger posterior covariance, the probability distribution of the i-vector can be modified with a duration-scaled covariance matrix during PLDA training. The likelihood of the standard Gaussian PLDA model is then revised according to the duration-dependent posterior distribution of the i-vector.
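
As a rough illustration of the modified-prior idea, the sketch below inflates an i-vector's posterior covariance by an inverse-duration factor before PLDA training or scoring. The scaling rule and the reference duration d_ref are illustrative assumptions, not the exact formulation of Hong et al. (2015).

```python
import numpy as np

def duration_scaled_covariance(cov, duration, d_ref=60.0):
    """Inflate an i-vector posterior covariance for short utterances.

    Illustrative rule: shorter utterances carry sparser Baum-Welch
    statistics, so their i-vector estimates are less certain. Here the
    covariance is scaled by d_ref / duration (an assumed form, not the
    paper's exact one), leaving long utterances nearly untouched.
    """
    scale = max(d_ref / max(duration, 1e-3), 1.0)  # never shrink below 1
    return scale * cov

# Example: a 10-second utterance gets a 6x inflated covariance
cov = np.eye(400) * 0.1            # toy 400-dim posterior covariance
cov_short = duration_scaled_covariance(cov, duration=10.0)
print(cov_short[0, 0])             # 0.6
```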

Overall, these works consider only the duration variations within the target domain. Nevertheless, performance is most likely to deteriorate for short utterances, which carry limited discriminative information. Speaker verification with short utterances can be viewed as a task in a domain that has only a limited number of long utterances with sufficient linguistic content. The discriminative information that short utterances lack may be found in development data with i-vectors extracted from long utterances. Therefore, transfer learning can also be adopted to learn valuable information from other domains with long utterances. In this study, we extend the transfer learning framework to a target domain with short utterances: using a source domain with a large number of long utterances, the proposed method trains an appropriate PLDA model for the target domain, which has only a limited number of long utterances.

This paper is organized as follows. First, the baseline i-vector and the general theory of standard Gaussian PLDA are introduced, and then the re-estimation and scoring formulas of PLDA are given. After that, we describe in detail the proposed transfer learning method for PLDA, including the objective function and re-estimation formulas. Experiments based on the NIST SRE and Switchboard corpus are then conducted to verify the effectiveness of this proposed method.

Section snippets

The i-vector

In the state-of-the-art i-vector speaker verification system, an i-vector x is a fixed-length vector obtained by projecting an utterance into a single low-dimensional subspace through a total variability matrix T:

M = m + Tx

where M is a speaker- and session-dependent Gaussian mean supervector, m is the speaker- and session-independent mean supervector of the UBM, and x is a hidden variable defined as the mean of the posterior distribution of the Baum–Welch statistics for an utterance. If we are given a UBM
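
The snippet below sketches the standard closed-form posterior-mean computation for x from zeroth- and first-order Baum–Welch statistics. The variable names are conventional, and the code assumes a diagonal-covariance UBM whose statistics have already been centered around the UBM means.

```python
import numpy as np

def ivector_posterior_mean(T, Sigma_inv, N, F_centered):
    """Posterior mean of the hidden variable x in M = m + Tx.

    T          : (CD, R) total variability matrix
    Sigma_inv  : (CD,) inverse of the diagonal UBM covariances, flattened
    N          : (CD,) zeroth-order stats, each component count repeated
                 D times to match the supervector layout
    F_centered : (CD,) first-order stats centered around the UBM means

    Standard closed form: x = (I + T' Sigma^-1 N T)^-1 T' Sigma^-1 F~
    """
    R = T.shape[1]
    TS = T.T * Sigma_inv                       # T' Sigma^-1, shape (R, CD)
    precision = np.eye(R) + TS @ (N[:, None] * T)
    return np.linalg.solve(precision, TS @ F_centered)
```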

Transfer learning

Due to the problem of scarce data, a PLDA directly optimized on the limited development data of the target domain may over-fit. This complication can be avoided by adding a regularization factor to the optimization function. Furthermore, this regularization factor can be designed to utilize the development data of another domain. In view of the distribution similarity between the data in the source domain and the target domain, we can utilize the information
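
To illustrate how such a regularizer can act during re-estimation, the sketch below pulls a target-domain covariance update toward the source-domain covariance with weight lam. This is a schematic MAP-style blend under stated assumptions, not the paper's exact M-step.

```python
import numpy as np

def regularized_covariance_update(S_target, n_target, Sigma_source, lam):
    """Blend the target-domain scatter with a source-domain covariance.

    S_target     : (d, d) scatter matrix accumulated on target data
    n_target     : number of target-domain samples behind S_target
    Sigma_source : (d, d) covariance of the well-trained source PLDA
    lam          : regularization weight; lam -> 0 recovers the
                   in-domain ML estimate, large lam trusts the source

    Schematic closed form of a regularized M-step (an assumed
    MAP-style blend, for illustration only).
    """
    return (S_target + lam * Sigma_source) / (n_target + lam)
```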

Experiments

To evaluate the effectiveness of the proposed transfer learning method for a target domain with limited development data or short utterances, experiments were conducted on the NIST SRE and Switchboard (SWB) corpora. We extracted 32-dimensional MFCCs with appended delta coefficients from each speech utterance. A total variability subspace of dimension 400 was estimated using Baum–Welch statistics, and the PLDA was trained with a speaker subspace of dimension 120. All of the
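
For reference, the front-end and model sizes stated above translate into roughly the following configuration; the parameter names are illustrative, not tied to any specific toolkit.

```python
# Hypothetical configuration mirroring the experimental setup above.
config = {
    "features": {
        "type": "MFCC",
        "num_coeffs": 32,          # 32-dimensional MFCCs
        "deltas": True,            # appended delta coefficients
    },
    "total_variability_dim": 400,  # i-vector dimension
    "plda_speaker_subspace": 120,  # speaker subspace dimension
    "corpora": ["NIST SRE", "Switchboard"],
}
```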

Conclusions

This study addresses the problem of domain mismatch in PLDA-based speaker verification. Generally, PLDA requires an immense amount of development data from many speakers, which makes it difficult to learn the PLDA parameters for a target domain with scarce data or with a distribution that differs from the development data. In this paper, we designed a transfer learning method that carries information from a source domain with sufficient development data over to a new target domain. The proposed method is based on the KL divergence, which

References (41)

  • NIST, 2010. The NIST year 2010 speaker recognition evaluation plan. Available at...
  • N. Dehak et al. Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. (2011).
  • L. Deng et al. Machine learning paradigms for speech recognition: an overview. IEEE Trans. Audio Speech Lang. Process. (2013).
  • D. Garcia-Romero et al. Analysis of i-vector length normalization in speaker recognition systems. Proc. INTERSPEECH, Florence, Italy (2011).
  • D. Garcia-Romero et al. Supervised domain adaptation for i-vector based speaker recognition. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Florence, Italy (2014).
  • D. Garcia-Romero et al. Unsupervised domain adaptation for i-vector speaker recognition. Proc. Odyssey, The Speaker and Language Recognition Workshop, Joensuu, Finland (2014).
  • O. Glembek et al. Domain adaptation via within-class covariance correction in i-vector based speaker recognition systems. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Florence, Italy (2014).
  • T. Hasan et al. Duration mismatch compensation for i-vector based speaker recognition systems. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada (2013).
  • Q.Y. Hong et al. Modified-prior PLDA and score calibration for duration mismatch compensation in speaker recognition system. Proc. INTERSPEECH, Dresden, Germany (2015).
  • Q.Y. Hong et al. A transfer learning method for PLDA-based speaker verification. International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Shanghai, China (2016).