A bilevel framework for joint optimization of session compensation and classification for speaker identification
Introduction
Research in speaker recognition has made significant advances in recent years [1], [2]. It mainly focuses on identifying a person from his/her voice characteristics. As one of the key techniques for multimedia data analysis, speaker recognition is widely used in access control, transaction authentication, law enforcement, speech data management, personalization, audio monitoring, etc. Speaker recognition includes speaker identification (SID) and speaker verification (SV). The former determines an unknown speaker's identity, while the latter verifies whether a given utterance was spoken by a claimed person. In this paper, we focus on speaker identification.
In SID, although the Gaussian mixture model (GMM) [3] and the Gaussian mixture model-universal background model (GMM-UBM) [4] remain landmark approaches [5], they fail to deal with session variability. To solve this problem, joint factor analysis (JFA) [6] models speaker and session variability in GMM supervectors. Subsequently, the i-vector approach [7] has been used as the front-end in state-of-the-art systems; it reduces the dimensionality of the supervector by factor analysis (FA). In [8], a speed-up i-vector extraction method that avoids evaluating the full posterior covariance is proposed. There are also other methods for obtaining different representations of an utterance, such as the discrete Karhunen-Loève transform representation for short sequences of speech frames [9], [10], the s-vector [11], and the x-vector [12]. In the back-end processing, session compensation and classification need to be considered. Session compensation methods reduce the impact of session variability caused by sources of inter-session variation such as different channels, languages, acoustic environments, or speaking styles. After session compensation, cosine distance scoring (CDS) [7] or probabilistic linear discriminant analysis (PLDA) [13] can be applied as the back-end classifier. In [14], a deep belief network (DBN) based universal model is proposed to model each target speaker.
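To make the CDS back-end concrete: it scores a trial by the cosine similarity of two i-vectors. A minimal sketch (the vectors below are random stand-ins for real i-vectors, not outputs of an actual front-end):

```python
import numpy as np

def cds_score(w1, w2):
    """Cosine distance score between two i-vectors; higher means more similar."""
    return float(np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2)))

rng = np.random.default_rng(0)
w_enroll = rng.standard_normal(600)                 # stand-in enrollment i-vector
w_same = w_enroll + 0.1 * rng.standard_normal(600)  # same speaker, new session
w_other = rng.standard_normal(600)                  # different speaker
```

A same-speaker pair scores near 1, while two unrelated vectors score near 0, which is why session compensation before CDS matters: session effects shrink that gap.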
In the framework of the i-vector approach, several session compensation techniques have been proposed. Among them, linear discriminant analysis (LDA) [15] is a typical session compensation method that learns a transformation matrix projecting the i-vectors into a subspace with less session variability, by maximizing the ratio of between-class separation to within-class variation. In addition, several other session compensation methods have been proposed to address session variability. In [16], a source-normalized LDA using i-vectors from multiple speech sources is proposed. In [17], the authors present an alternative non-parametric discriminant analysis (NDA) technique, where both the within- and between-speaker variation are measured on a local basis using the nearest neighbor rule. In [18], neural networks are employed to capture the nonlinear relationship of i-vectors. In [19] and [20], the authors compute the within-speaker scatter in a pairwise manner and then scale it by an affinity matrix, so that the within-class local structure is preserved.
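As an illustration, the LDA transform described above can be computed as a generalized eigenproblem on the between- and within-class scatter matrices. A toy NumPy/SciPy sketch on synthetic data (not a speaker-recognition pipeline; names like `lda_projection` are our own):

```python
import numpy as np
from scipy.linalg import eigh

def lda_projection(X, y, dim):
    """Projection maximizing between-class over within-class scatter."""
    mu = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))                 # within-class scatter
    Sb = np.zeros((d, d))                 # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    # generalized eigenproblem: Sb v = lambda * Sw v; top eigenvectors span the subspace
    vals, vecs = eigh(Sb, Sw + 1e-6 * np.eye(d))
    return vecs[:, np.argsort(vals)[::-1][:dim]]

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((50, 5)),
               rng.standard_normal((50, 5)) + [5, 0, 0, 0, 0]])
y = np.array([0] * 50 + [1] * 50)
A = lda_projection(X, y, 1)               # 5x1 projection matrix
z = (X @ A).ravel()                       # projected, session-compensated data
```

Note the Gaussian, unimodal-per-class assumption implicit in the scatter matrices; this is exactly the limitation the paper's sparse-coding alternative avoids.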
The above-mentioned methods provide different representations of i-vectors, which can then be used for classification. These hand-crafted session compensation techniques are effective but task-specific. Category information must be considered when designing session compensation methods, and the same category information is necessary for classifier training, so it is meaningful to jointly optimize both stages through this shared information. In current methods, however, the session compensation and the classifier are optimized independently. Moreover, session compensation is applied to reduce the impact of session variability for a better classification result, and the classifier is employed to better distinguish the session-compensated representations. Therefore, it is reasonable to propose a unified method that considers the supervision of the classifier while applying session compensation, and the characteristics of the session-compensated representation while training the classifier.
In this paper, we consider the joint optimization of session compensation and the classifier for SID. This problem can be converted into a back-end “end-to-end” problem. It is interesting to note that the problem is by nature an optimization problem (the classification task) that contains another optimization problem (session compensation) as a constraint. Here, the session compensation and subsequent classification stages are integrated and optimized jointly under a discriminative classifier criterion. The two stages can be viewed as a combination of an upper level classifier and a lower level session compensation. Equivalently, for the upper level classification, the solution of the corresponding lower level session compensation problem provides the optimal response for the upper level decision. This can be achieved by bilevel optimization [21], [22].
When the bilevel framework is applied to classification, a transform of the input features is usually learned jointly with a classifier. In particular, a stochastic gradient descent algorithm is used to solve the feature extraction problem in task-driven dictionary learning [23]. Works such as [24] and [25] extend this dictionary learning framework to other research fields. In this study, a session-compensated representation of the i-vector is required, which can also be adopted for the identification task. In addition, a lower level method without the Gaussian assumption is needed to overcome this limitation of the LDA. Meanwhile, since SID is a multiclass task rather than the binary-classification SV (which uses PLDA or CDS to score the similarity between two utterances), an upper level classifier that directly provides classification results is required. Therefore, we introduce a bilevel framework to jointly optimize the session compensation and a multiclass classifier, connected through category information (i.e. labels). In this framework, we use sparse coding (SC) to obtain the session-compensated representation by learning an overcomplete dictionary in the lower level model. Since SC makes no Gaussian assumption, it can handle i-vectors that do not follow a Gaussian distribution. In the upper level model, we employ two classifiers: the softmax classifier and the linear multiclass SVM. Moreover, we present a joint optimization of the dictionary and classifier parameters under a discriminative criterion for the classifier and a sparsity constraint for the dictionary. The main contributions of this paper are summarized as follows:
(1) It moves from methods that rely on traditional hand-crafted session-compensated features to an “end-to-end” architecture that learns the features required for SID.
(2) It proposes a joint optimization framework of session compensation and classifier, which applies session compensation under the supervision of the classifier to achieve better performance.
(3) It presents an adaptive moment estimation (Adam) [26] optimization scheme for training the bilevel framework.
(4) Compared with the LDA, our methods can handle i-vectors without the constraint of a Gaussian distribution.
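Contribution (3) refers to the standard Adam update. A generic single-parameter NumPy sketch (not the paper's full training loop, which applies this update to the dictionary and classifier parameters):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected moment estimates."""
    m = b1 * m + (1 - b1) * grad           # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)              # bias correction for step t (1-indexed)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# sanity run: minimize f(theta) = theta^2, whose gradient is 2 * theta
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
```

The per-coordinate scaling by the second-moment estimate is what makes Adam attractive here: the dictionary and classifier gradients can have very different magnitudes.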
The remainder of this paper is organized as follows. Section 2 gives a brief overview of the typical i-vector framework based SID system. The formulated bilevel method for SID is presented in Section 3. The evaluation of the proposed methods and a discussion of the results are reported in Section 4, and the conclusions of this study are given in Section 5.
Typical i-vector framework based speaker identification
In this section, we briefly introduce the i-vector framework based speaker identification system. An i-vector is obtained by mapping a sequence of feature vectors, which represents a speech utterance, to a fixed-length vector. Several methods can be used to implement this mapping, such as the methods in [27], [28] and our method [29]. Here, we introduce the traditional approach proposed in [7], which is used as a baseline in this study. This approach first computes the Mel-frequency cepstral coefficient (MFCC) features …
Bilevel framework for speaker identification
In this section, we propose a joint optimization framework of session compensation and classifier based on the bilevel method. In this framework, we use sparse coding to obtain the session-compensated representation by learning a dictionary, and employ the softmax classifier and the SVM as classifiers. We call the resulting methods sparse coding - softmax (SC-softmax) and sparse coding - support vector machine (SC-SVM), respectively.
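A toy forward-pass sketch of the two levels, under stated assumptions: the dictionary and classifier below are random stand-ins rather than jointly trained, regularizers are omitted, and all names (`ista`, `softmax_loss`, `hinge_loss`) are ours. The lower level computes a sparse code of an i-vector over a dictionary via iterative shrinkage-thresholding (ISTA); the upper level scores that code with either a softmax cross-entropy loss (SC-softmax) or a multiclass hinge loss (SC-SVM):

```python
import numpy as np

def ista(D, x, lam=0.1, n_iter=200):
    """Lower level: min_a 0.5*||x - D a||^2 + lam*||a||_1, solved by ISTA."""
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = a - D.T @ (D @ a - x) / L          # gradient step on the quadratic term
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)  # soft threshold
    return a

def softmax_loss(W, a, y):
    """Upper level (SC-softmax): cross-entropy of the class scores W @ a."""
    s = W @ a
    s = s - s.max()                            # shift for numerical stability
    p = np.exp(s) / np.exp(s).sum()
    return float(-np.log(p[y]))

def hinge_loss(W, a, y, margin=1.0):
    """Upper level (SC-SVM): multiclass hinge loss on the same scores."""
    s = W @ a
    viol = s - s[y] + margin                   # margin violation per wrong class
    viol[y] = 0.0
    return float(max(0.0, viol.max()))

rng = np.random.default_rng(0)
D = rng.standard_normal((10, 20))              # stand-in overcomplete dictionary
D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
x = rng.standard_normal(10)                    # stand-in i-vector
a = ista(D, x)                                 # session-compensated sparse code
W = rng.standard_normal((3, 20))               # stand-in 3-speaker classifier
```

In the actual framework, D and W are optimized jointly, with the classifier's gradient propagated through the sparse-coding solution; this sketch only shows the forward computation of each level.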
Evaluation and analysis
The performance of the proposed methods is evaluated on three databases: the King-ASR-010 [43], VoxCeleb [44] and RSR2015 [45]. The data of these three databases is recorded under various sessions which include different inter-session variability factors, i.e. non-channel variability, complex session variability and channel variability. To observe the performance of the system, we refer to the metrics in [44] and report the accuracies from top-1 to top-5.
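The top-1 to top-5 accuracies reported here have the usual definition: a trial counts as correct if the true speaker is among the k highest-scoring identities. A minimal sketch with a hypothetical 3-speaker score matrix:

```python
import numpy as np

def topk_accuracy(scores, labels, k):
    """Fraction of trials whose true speaker is among the k highest scores."""
    topk = np.argsort(scores, axis=1)[:, -k:]   # indices of the k best scores per trial
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

scores = np.array([[0.1, 0.7, 0.2],             # trial 0: true speaker ranked 1st
                   [0.5, 0.3, 0.2],             # trial 1: true speaker ranked 3rd
                   [0.2, 0.3, 0.5]])            # trial 2: true speaker ranked 1st
labels = np.array([1, 2, 2])
```

Here top-1 accuracy is 2/3 (trial 1 misses) while top-3 accuracy is 1.0, which is why reporting the full top-1 to top-5 range gives a fuller picture than top-1 alone.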
Conclusions
In this paper, we made a systematic investigation of jointly optimizing session compensation and the classifier in SID. We proposed a bilevel framework that integrates the session-compensated representation and the subsequent classification stage by jointly learning the dictionary and classifier parameters. Besides, we refined the bilevel framework into two methods, called SC-softmax and SC-SVM, to improve the performance. In addition, compared with typical session compensation techniques, such as …
Conflict of interest statement
There is no conflict of interest.
Acknowledgement
This research is supported by the National Natural Science Foundation of China under grants No. U1736210 and No. 61471145.
References (48)
Speaker identification and verification using Gaussian mixture speaker models, Speech Commun. (1995)
Speaker verification using adapted Gaussian mixture models, Digit. Signal Process. (2000)
A fast and scalable hybrid FA/PPCA-based framework for speaker recognition, Digit. Signal Process. (2014)
Text-dependent speaker verification: classifiers, databases and RSR2015, Speech Commun. (2014)
Speaker recognition by machines and humans: a tutorial review, IEEE Signal Process. Mag. (2015)
Deep neural network based discriminative training for i-vector/PLDA speaker verification
Text-independent speaker identification using robust statistics estimation, Speech Commun. (2017)
Joint factor analysis versus eigenchannels in speaker recognition, IEEE Trans. Audio Speech Lang. Process. (2007)
Front-end factor analysis for speaker verification, IEEE Trans. Audio Speech Lang. Process. (2011)
Generalizing i-vector estimation for rapid speaker recognition, IEEE/ACM Trans. Audio Speech Lang. Process. (2018)
An investigation on the accuracy of truncated DKLT representation for speaker identification with short sequences of speech frames, IEEE Trans. Cybern.
Speaker identification in noisy conditions using short sequences of speech frames
S-vector: a discriminative representation derived from i-vector for speaker verification
X-vectors: robust DNN embeddings for speaker recognition
Probabilistic linear discriminant analysis for inferences about identity
Deep learning backend for single and multisession i-vector speaker recognition, IEEE/ACM Trans. Audio Speech Lang. Process.
Discriminant NAP for SVM speaker recognition
Source-normalized LDA for robust speaker recognition using i-vectors from multiple speech sources, IEEE Trans. Audio Speech Lang. Process.
Nearest neighbor discriminant analysis for robust speaker recognition
Neural networks based channel compensation for i-vector speaker verification
Locally weighted linear discriminant analysis for robust speaker verification
Modelling and compensation for language mismatch in speaker verification, Speech Commun.
An overview of bilevel optimization, Ann. Oper. Res.
Differentiable sparse coding
Chen Chen received the M.E. degree in computer science and technology from Harbin Institute of Technology. She is a Ph.D. candidate of Harbin Institute of Technology. Her research interests include speaker recognition and speech recognition. (Email: [email protected])
Wei Wang received the Ph.D. degree in computer science from the Harbin Institute of Technology, Harbin, China in 2016. She is a lecturer in the School of Computer Science and Technology, Harbin Institute of Technology (Weihai). Her research interests include speaker recognition and language recognition. (Email: [email protected])
Yongjun He received the Ph.D. degree in computer science from the Harbin Institute of Technology, Harbin, China in 2008. Currently, he is an associate professor in the School of Computer Science and Technology, Harbin University of Science and Technology. His research interests include speaker recognition, speech recognition and machine learning. (Email: [email protected])
Jiqing Han (corresponding author) received the Ph.D. degree in computer science and technology from Harbin Institute of Technology, where he is a professor and doctoral supervisor in the School of Computer Science and Technology. He is a committee member of the Automatic Discipline, National Natural Science Foundation of China, a committee member of the National Science and Technology Awards of China, and the vice chairman of the Society of Speech Processing, Association for Chinese Information Processing. His main research fields are speech signal processing and audio information retrieval. He has won three Second-Prize and two Third-Prize Science and Technology awards at the ministry/province level. (Email: [email protected])