Transfer learning for PLDA-based speaker verification
Introduction
The task of speaker verification entails confirming the identity of a speaker from a given speech utterance. The robustness of this verification process is affected by many factors, such as channel variation, utterance duration, noise level, and the speaker's emotional state. In the past ten years, many machine learning approaches, including the support vector machine (SVM) (Campbell, Sturim, Reynolds, 2006; Campbell, Sturim, Reynolds, Solomonoff, 2006), joint factor analysis (JFA) (Kenny, 2006; Matrouf, Scheffer, Fauve, Bonastre, 2007), and the i-vector (Dehak et al., 2011), have been proposed. Currently, state-of-the-art speaker verification systems are based on the i-vector and probabilistic linear discriminant analysis (PLDA) (Prince, Elder, 2007; Garcia-Romero, Espy-Wilson, 2011; Kenny, 2010). Using the posterior estimation of the hidden variables on the Baum–Welch statistics from the Gaussian components of a Universal Background Model (UBM) (Reynolds et al., 2000), each speech utterance can be represented as a fixed-length, low-dimensional vector, i.e., an i-vector. With this representation, speaker verification becomes a standard machine learning problem, so many machine learning approaches, such as PLDA, can be applied in this area.
PLDA was first proposed for face verification (Prince and Elder, 2007), and it has gained popularity as an elegant classification tool for finding target classes in recent NIST Speaker Recognition Evaluation (SRE) challenges. Despite this success, it still requires tens of thousands of labeled development sessions from many speakers. For the NIST evaluations this is not a problem, since sufficient data are provided, but it makes PLDA ill-suited for practical applications. Even if we have sufficient development data to obtain a well-optimized PLDA for a source domain, that model is not suitable for direct use on a new target domain with a different distribution. This is because the PLDA represents the distribution of the development data and works well only under the assumption that the development and evaluation data share the same distribution. Several studies have shown that when the development and evaluation data come from different domains, the performance of speaker verification deteriorates significantly due to domain mismatch (Villalba, Lleida, 2012; Garcia-Romero, McCree, 2014; Garcia-Romero, McCree, Shum, Brummer, Vaquero, 2014; Glembek, Ma, Matejka, Zhang, Plchot, Burget, Matsoukas, 2014; Aronowitz, 2014a; Aronowitz, 2014b; Kanagasundaram, Dean, Sridharan, 2015; Singer, Reynolds, 2015; Wang, Yamamoto, Koshinaka, 2016).
To minimize the performance gap between domains, Villalba and Lleida (2012) applied a variational Bayes method to a two-covariance model. Garcia-Romero et al. proposed several adaptation approaches with similar performance, among which the PLDA interpolation approach did not require keeping the i-vectors of the source domain to retrain the PLDA (Garcia-Romero, McCree, 2014; Garcia-Romero, McCree, Shum, Brummer, Vaquero, 2014). Aronowitz introduced the inter-dataset variability compensation (IDVC) method to compensate for mismatch shifts in the i-vector space (Aronowitz, 2014b) and in the PLDA hyper-parameters (Aronowitz, 2014a). Kanagasundaram et al. (2015) proposed an unsupervised inter-dataset variability approach to compensate for this mismatch, but only a linear discriminant analysis (LDA) projection was applied prior to PLDA modeling. Singer and Reynolds (2015) further analyzed the domain mismatch during i-vector length normalization and proposed an adaptive whitening method with a library of whiteners, achieving performance comparable to the IDVC approach (Aronowitz, 2014b). Wang et al. (2016) proposed a maximum likelihood linear transformation (MLLT) framework to infer the relationship between the datasets of two domains when training a PLDA with the two-covariance representation; the reported results showed that the fused system performed better than the in-domain PLDA.
The above domain adaptation methods improved speaker verification systems by remedying the problem of domain mismatch, and to some extent they can be viewed as transfer learning methods. As shown in Fig. 1, given sufficient development data in the source domain, the transfer learning task is to find a matched PLDA classifier for the target domain, which has limited development data or a different distribution. Transfer learning (Pan, Yang, 2010; Deng, Li, 2013) addresses cross-domain learning problems and has been successfully applied in many fields, such as face verification (Cao et al., 2013), spoken language understanding (Jeong and Lee, 2009), and visual categorization (Shao et al., 2015). Based on prior knowledge of the source domain, the objective of transfer learning is to reduce the mismatch between training and test conditions by finding a mapping that makes the classifier fit the specifics of the target domain. Many transfer learning techniques have been proposed, such as nonnegative matrix tri-factorization (Shao, Zhu, Li, 2015; Wang, Nie, Huang, Ding, 2011; Li, Ding, 2006) and dimensionality reduction (Pan et al., 2008). In the area of automatic speech recognition (ASR), maximum a posteriori (MAP) and maximum likelihood linear regression (MLLR) adaptation are two commonly used homogeneous adaptation techniques (Deng and Li, 2013).
The Kullback–Leibler (KL) divergence measures the distance, or difference, between two distributions, which makes it well suited as an optimization objective for minimizing domain mismatch. In Cao et al. (2013), this measure was used in transfer learning for face verification to relate the distributions of the source and target domains. To improve speaker verification in situations with scarce numbers of speakers and sessions, we previously proposed a transfer learning method (Hong et al., 2016) in which a KL regularization factor was added to the objective function of PLDA. Experimental results on the NIST SRE and Switchboard corpora showed that our method greatly improved verification performance compared with the Gaussian PLDA, and that it was more effective at reducing the performance gap than PLDA interpolation (Garcia-Romero and McCree, 2014).
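To make this measure concrete, the KL divergence between two multivariate Gaussians, the form relevant when comparing Gaussian models of i-vector distributions across domains, has a closed form. The sketch below is illustrative only (the function name and test values are our own, not the exact regularizer of the proposed method):

```python
import numpy as np

def kl_gaussians(mu0, cov0, mu1, cov1):
    """Closed-form KL(N0 || N1) between two multivariate Gaussians."""
    d = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (
        np.trace(cov1_inv @ cov0)          # trace term
        + diff @ cov1_inv @ diff           # Mahalanobis term for the means
        - d                                # dimensionality offset
        + np.log(np.linalg.det(cov1) / np.linalg.det(cov0))  # log-det ratio
    )

# Identical distributions give zero divergence.
mu, cov = np.zeros(2), np.eye(2)
print(kl_gaussians(mu, cov, mu, cov))  # 0.0
```

The divergence is zero only when the two distributions coincide and grows with the mean shift and covariance mismatch, which is exactly the behavior wanted from a domain-mismatch penalty.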
For speaker verification, domain mismatch might also occur due to utterance variation. Since the statistics for extracting i-vectors are accumulated over time, the duration of a speech utterance has a great influence on the reliability of i-vector features (Nautsch et al., 2015). For short utterances with sparse statistics, the performance of speaker verification deteriorates greatly due to limited discriminative information. Many studies have investigated the influence of duration (Mandasari, Saeidi, Van Leeuwen, 2015; Mandasari, Saeidi, McLaren, Van Leeuwen, 2013; Nautsch, Rathgeb, Busch, Reininger, Kasper, 2014; Hasan, Saeidi, Hansen, Van Leeuwen, 2013; Kenny, Stafylakis, Ouellet, Alam, Dumouchel, 2013; Vogt, Baker, Sridharan; Kanagasundaram, Vogt, Dean, Sridharan, Mason, 2011; Sarkar, Matrouf, Bousquet, Bonastre, 2012; Kanagasundaram, Dean, Sridharan, Gonzalez-Dominguez, Gonzalez-Rodriguez, Ramos, 2014; Hong, Li, Li, Huang, Wan, Zhang, 2015), introducing improved methods for score calibration and duration modeling. In Mandasari et al. (2015, 2013), a quality measure function (QMF) for duration is adopted to adjust the score distributions that were shifted by duration mismatches. Duration modeling is usually based on the PLDA, and duration variation can propagate uncertainty into a PLDA classifier, especially for short utterances. In Kenny et al. (2013), Vogt et al., Kanagasundaram et al. (2011), Sarkar et al. (2012), and Kanagasundaram et al. (2014), the duration variability of the i-vector was included in the PLDA model and the performance improved accordingly. In Hong et al. (2015), we also proposed an effective modified-prior PLDA framework to deal with duration variation.
As shorter utterances tend to have larger covariance, the probability distribution function of the i-vector can be modified with a duration-scaled covariance matrix during PLDA training. The formulation of the likelihood for the standard Gaussian PLDA model is then revised according to the duration-dependent posterior distribution of the i-vector.
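Schematically, the idea can be written as follows. The scaling function $s(d)$ and the symbols here are illustrative assumptions, not the exact formulation of the modified-prior framework:

```latex
% Duration-dependent prior for an i-vector x from an utterance of
% duration d: shorter utterances receive a broader covariance.
p(\mathbf{x} \mid d) = \mathcal{N}\!\left(\mathbf{x};\; \boldsymbol{\mu},\; \tfrac{1}{s(d)}\,\boldsymbol{\Sigma}\right),
\qquad s(d) \text{ monotonically increasing in } d
```

Because $s(d)$ shrinks as $d$ decreases, a short utterance contributes a flatter, less confident prior, which is then propagated into the PLDA likelihood.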
Overall, these works only consider the duration variation within the target domain. Nevertheless, performance is still likely to deteriorate for short utterances, which carry limited discriminative information. Speaker verification on short utterances can be viewed as a task in a target domain that has only a limited amount of long utterances with sufficient linguistic content, whereas the linguistic content missing from short utterances may be present in development data with long-duration i-vectors. Therefore, transfer learning can also be adopted to learn valuable information from another domain with long utterances. In this study, we further extend the transfer learning framework to a target domain with short utterances: using a source domain with a large amount of long utterances, the proposed method trains an appropriate PLDA model for the target domain, which has only a limited amount of long utterances.
This paper is organized as follows. First, the baseline i-vector and the general theory of standard Gaussian PLDA are introduced, and then the re-estimation and scoring formulas of PLDA are given. After that, we describe in detail the proposed transfer learning method for PLDA, including the objective function and re-estimation formulas. Experiments based on the NIST SRE and Switchboard corpus are then conducted to verify the effectiveness of this proposed method.
Section snippets
The i-vector
In the state-of-the-art i-vector speaker verification system, an i-vector x is a fixed-length vector obtained by decomposing the Gaussian mean supervector through a total variability matrix T into a single low-dimensional subspace: M = m + Tx, where M is a speaker- and session-dependent Gaussian mean supervector, m is the speaker- and session-independent mean supervector of the UBM, and x is a hidden variable defined as the mean of the posterior distribution of the Baum–Welch statistics for an utterance. If we are given a UBM
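Under this model, the posterior mean of x given the centered Baum–Welch statistics has a standard closed form, x = (I + TᵀΣ⁻¹NT)⁻¹TᵀΣ⁻¹F̃. A minimal sketch follows; the function name, the dense-matrix layout of the zeroth-order statistics N (in practice block-diagonal with per-component occupation counts), and the toy values are illustrative assumptions:

```python
import numpy as np

def ivector_posterior_mean(T, Sigma_inv, N, F_centered):
    """Posterior mean of the hidden variable x given Baum-Welch statistics.

    T          : (CF, R) total variability matrix
    Sigma_inv  : (CF, CF) inverse UBM covariance (often diagonal)
    N          : (CF, CF) zeroth-order statistics expanded to supervector space
    F_centered : (CF,) first-order statistics centered on the UBM means
    """
    R = T.shape[1]
    # Posterior precision of x: identity prior plus data term.
    precision = np.eye(R) + T.T @ Sigma_inv @ N @ T
    # Solve the linear system rather than inverting the precision.
    return np.linalg.solve(precision, T.T @ Sigma_inv @ F_centered)
```

For an utterance with sparse statistics (small N), the data term shrinks and the posterior mean is pulled toward the zero prior, which is the mechanism behind the duration effects discussed in the introduction.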
Transfer learning
Due to the problem of scarce data, a PLDA model that is directly optimized on the limited development data of the target domain may be an over-fitted solution. This complication can be avoided by adding a regularization factor to the optimization function. Furthermore, this regularization factor can be designed to exploit the development data of another domain. In view of the distribution similarity between the data in the source domain and the target domain, we can utilize the information
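Schematically, such a regularized objective can be written as below. The trade-off weight λ, the direction of the divergence, and the notation are our own illustrative assumptions, not the paper's exact formulation:

```latex
% KL-regularized training objective (schematic): maximize the PLDA
% likelihood on target-domain data while keeping the adapted model
% close to the source-domain model.
\mathcal{J}(\theta) \;=\; \log p\!\left(X_{\mathrm{tgt}} \mid \theta\right)
\;-\; \lambda\, \mathrm{KL}\!\left( p(\mathbf{x} \mid \theta_{\mathrm{src}}) \,\middle\|\, p(\mathbf{x} \mid \theta) \right)
```

The first term fits the target-domain development data; the second penalizes parameter estimates that drift far from the well-trained source-domain model, preventing over-fitting when the target data are scarce.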
Experiments
To evaluate the effectiveness of the proposed transfer learning method for a target domain with limited development data or short utterances, experiments were conducted on the NIST SRE and Switchboard (SWB) corpora. We extracted 32-dimensional MFCCs with appended delta coefficients from each speech utterance. The total variability subspace of dimension 400 was estimated using Baum–Welch statistics, and the PLDA was trained with a speaker subspace of dimension 120. All of the
Conclusions
This study addresses the problem of domain mismatch in PLDA-based speaker verification. Generally, PLDA requires an immense amount of development data from many speakers, which makes it difficult to learn the PLDA parameters in a target domain with scarce data or with a distribution different from that of the source. In this paper, we designed a transfer learning method from a source domain with sufficient development data to a new target domain. The proposed method is based on the KL divergence, which
References (41)
- Multi-domain spoken language understanding with transfer learning. Speech Commun. (2009)
- Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques. Speech Commun. (2014)
- Quality measures based calibration with duration and noise dependency for speaker recognition. Speech Commun. (2015)
- Speaker verification using adapted Gaussian mixture models. Digital Signal Processing (2000)
- Compensating inter-dataset variability in PLDA hyper-parameters for robust speaker recognition. Proc. Speaker Odyssey, Joensuu, Finland (2014)
- Inter dataset variability compensation for speaker recognition. Proc. ICASSP, Florence, Italy (2014)
- SVM based speaker verification using a GMM supervector kernel and NAP variability compensation. Proc. ICASSP, Toulouse, France (2006)
- Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. (2006)
- A practical transfer learning algorithm for face verification. Proc. IEEE International Conference on Computer Vision (2013)
- Pairwise discriminative speaker verification in the i-vector space. IEEE Trans. Audio Speech Lang. Process. (2013)
- Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. (2011)
- Machine learning paradigms for speech recognition: an overview. IEEE Trans. Audio Speech Lang. Process. (2013)
- Analysis of i-vector length normalization in speaker recognition systems. Proc. INTERSPEECH, Florence, Italy (2011)
- Supervised domain adaptation for i-vector based speaker recognition. Proc. ICASSP, Florence, Italy (2014)
- Unsupervised domain adaptation for i-vector speaker recognition. Proc. Odyssey, The Speaker and Language Recognition Workshop, Joensuu, Finland (2014)
- Domain adaptation via within-class covariance correction in i-vector based speaker recognition systems. Proc. ICASSP, Florence, Italy (2014)
- Duration mismatch compensation for i-vector based speaker recognition systems. Proc. ICASSP, Vancouver, Canada (2013)
- Modified-prior PLDA and score calibration for duration mismatch compensation in speaker recognition system. Proc. INTERSPEECH, Dresden, Germany (2015)
- A transfer learning method for PLDA-based speaker verification. Proc. ICASSP, Shanghai, China (2016)