Elsevier

Digital Signal Processing

Volume 89, June 2019, Pages 104-115

A bilevel framework for joint optimization of session compensation and classification for speaker identification

https://doi.org/10.1016/j.dsp.2019.03.008

Abstract

The i-vector framework is one of the most popular approaches in speaker identification (SID). In this framework, session compensation is usually applied first, followed by the classifier. Every session-compensated representation of an i-vector yields a corresponding identification result, so the two stages are related. However, in current SID systems, session compensation and the classifier are usually optimized independently. Incomplete knowledge of how session compensation affects the identification task may introduce uncertainties. In this paper, we propose a bilevel framework that jointly optimizes session compensation and the classifier to strengthen the relationship between the two stages. In this framework, we use sparse coding (SC) to obtain the session-compensated feature by learning an overcomplete dictionary, and we employ the softmax classifier and the support vector machine (SVM) as classifiers. Moreover, we present a joint optimization of the dictionary and classifier parameters under a discriminative criterion for the classifier, subject to the SC constraints. The proposed methods are evaluated on the King-ASR-010, VoxCeleb and RSR2015 databases. Compared with typical session compensation techniques, such as linear discriminant analysis (LDA) and nonparametric discriminant analysis (NDA), our methods are more robust to complex session variability. Moreover, compared with the typical classifiers in the i-vector framework, i.e. cosine distance scoring (CDS) and probabilistic linear discriminant analysis (PLDA), our methods are better suited to SID (a multiclass task).

Introduction

Research in speaker recognition has made significant advances in recent years [1], [2]. It mainly focuses on identifying a person from his/her voice characteristics. As one of the key techniques for multimedia data analysis, speaker recognition is widely used in access control, transaction authentication, law enforcement, speech data management, personalization, audio monitoring, etc. Speaker recognition includes speaker identification (SID) and speaker verification (SV). The former determines an unknown speaker's identity, while the latter verifies whether a given utterance was spoken by a claimed person. In this paper, we focus on speaker identification.

In SID, although the Gaussian mixture model (GMM) [3] and the Gaussian mixture model-universal background model (GMM-UBM) [4] remain landmark approaches [5], they fail to deal with session variability. To solve this problem, joint factor analysis (JFA) [6] was formulated to model speaker and session variability in GMM supervectors. Subsequently, the i-vector approach [7] has been used as the front-end in state-of-the-art systems; it reduces the dimensionality of the supervector by using factor analysis (FA). In [8], a speed-up i-vector extraction method that avoids evaluating the full posterior covariance is proposed. There are also other methods for obtaining different representations of an utterance, such as the discrete Karhunen-Loève transform representation for short sequences of speech frames [9], [10], the s-vector [11], and the x-vector [12]. In the back-end processing, session compensation and classification need to be considered. Session compensation methods reduce the impact of session variability caused by sources of inter-session variation, such as different channels, languages, acoustic environments, or speaking styles. After session compensation, cosine distance scoring (CDS) [7] or probabilistic linear discriminant analysis (PLDA) [13] can be applied as the back-end classifier. In [14], a deep belief network (DBN) based universal model is proposed to model each target speaker.
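As context for the back-end, CDS simply scores the cosine similarity between two i-vectors. A minimal sketch, using toy 4-dimensional vectors as stand-ins for real i-vectors (which typically have a few hundred dimensions):

```python
import numpy as np

def cds_score(w_target, w_test):
    """Cosine distance score between a target and a test i-vector."""
    return float(np.dot(w_target, w_test) /
                 (np.linalg.norm(w_target) * np.linalg.norm(w_test)))

# Toy 4-dimensional "i-vectors"; real systems use a few hundred dimensions.
w_enroll = np.array([1.0, 0.5, -0.2, 0.3])
w_test = np.array([0.9, 0.6, -0.1, 0.2])
score = cds_score(w_enroll, w_test)   # near 1.0 for similar vectors
```

For identification, a test i-vector would be scored against every enrolled speaker and the highest-scoring speaker chosen.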

Within the i-vector framework, several session compensation techniques have been proposed. Among them, linear discriminant analysis (LDA) [15] is a typical method that learns a transformation matrix to project the i-vectors into a subspace with less session variability by maximizing the ratio of between-class separation to within-class variation. Several other session compensation methods have also been proposed to address session variability. In [16], a source-normalized LDA that uses i-vectors from multiple speech sources is proposed. In [17], the authors present an alternative nonparametric discriminant analysis (NDA) technique, in which both the within- and between-speaker variation are measured on a local basis using the nearest neighbor rule. In [18], neural networks are employed to capture the nonlinear relationships of i-vectors. In [19] and [20], the authors compute the within-speaker scatter in a pairwise manner and then scale it by an affinity matrix, so that the within-class local structure is preserved.
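The LDA projection described above can be sketched by an eigendecomposition of the scatter matrices; the synthetic two-speaker data below is an illustrative stand-in for real i-vectors:

```python
import numpy as np

def lda_projection(X, y, dim):
    """LDA: maximize between-class scatter relative to within-class scatter."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))   # within-class scatter
    Sb = np.zeros_like(Sw)                    # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mu, mc - mu)
    # Discriminative directions are the leading eigenvectors of Sw^{-1} Sb.
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-vals.real)
    return vecs[:, order[:dim]].real

rng = np.random.default_rng(0)
# Two synthetic "speakers" in 5 dimensions, separated along the first axis.
X = np.vstack([rng.normal(0, 1, (20, 5)) + np.array([4, 0, 0, 0, 0]),
               rng.normal(0, 1, (20, 5))])
y = np.array([0] * 20 + [1] * 20)
A = lda_projection(X, y, dim=1)    # projection matrix, shape (5, 1)
Z = X @ A                          # session-compensated representation
```

Note the implicit Gaussian assumption: LDA characterizes each class by a mean and a shared scatter, which is the limitation the methods of this paper aim to avoid.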

The above-mentioned methods provide different representations of i-vectors that can be used for classification. These hand-crafted session compensation techniques are effective but task-specific. Category information needs to be considered when designing session compensation methods, and the same category information is necessary for classifier training. It is therefore meaningful to jointly optimize both stages through their shared connection to category information. However, in current methods, session compensation and the classifier are optimized independently. Session compensation is applied to reduce the impact of session variability for a better classification result, while the classifier is employed to better distinguish the session-compensated representations. Therefore, it is reasonable to propose a unified method that considers the supervision of the classifier while applying session compensation, and the traits of the session-compensated representation while training the classifier.

In this paper, we consider the joint optimization of session compensation and the classifier for SID. This problem can be converted to a back-end “end-to-end” problem. It is interesting to note that the problem is, by nature, an optimization problem (the classification task) that contains another optimization problem (session compensation) as a constraint. Here, the session compensation and subsequent classification stages are integrated and optimized jointly under a discriminative criterion of the classifier. The two stages can be considered a combination of an upper-level classifier and lower-level session compensation. Equivalently, for the upper-level classification, the solution of the corresponding lower-level session compensation problem provides the optimal response for the upper-level decision. This can be achieved by bilevel optimization [21], [22].

When the bilevel framework is applied to a classification problem, a transform of the input features is usually learned jointly with a classifier. In particular, a stochastic gradient descent algorithm is used to solve the feature extraction problem via task-driven dictionary learning [23]. Some works, such as [24] and [25], extend this dictionary learning framework to other research fields. In this study, a session-compensated representation of the i-vector is required, and it should also be suited to the identification task. In addition, a lower-level method without a Gaussian assumption is required to overcome this limitation of LDA. Meanwhile, since SID is a multiclass task rather than the binary-classification SV (which uses PLDA or CDS to score the similarity between two utterances), an upper-level classifier that directly provides the classification results is required. Therefore, we introduce a bilevel framework to jointly optimize session compensation and a multiclass classifier through their connection to category information (i.e. labels). In this framework, we use sparse coding (SC) to obtain the session-compensated representation by learning an overcomplete dictionary in the lower-level model. Since SC makes no Gaussian assumption, it can deal with i-vectors without the constraint of a Gaussian distribution. In the upper-level model, we employ two classifiers: the softmax classifier and the linear multiclass SVM. Moreover, we present a joint optimization of the dictionary and classifier parameters under a discriminative criterion for the classifier and a sparsity constraint for the dictionary. The main contributions of this paper are summarized as follows:
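The lower-level sparse coding problem can be sketched with the classic ISTA iteration. The random dictionary, signal, and regularization weight below are illustrative stand-ins for the learned quantities in the paper:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, D, lam=0.1, n_iter=200):
    """ISTA for the lower-level problem  min_a 0.5||x - D a||^2 + lam ||a||_1."""
    L = np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a - (D.T @ (D @ a - x)) / L, lam / L)
    return a

rng = np.random.default_rng(0)
D = rng.normal(size=(10, 25))            # overcomplete: 25 atoms in 10 dimensions
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms
x = D[:, 3] + 0.5 * D[:, 7]              # signal built from two atoms
a = sparse_code(x, D)                    # sparse, session-compensated code
```

In the bilevel setting, this argmin is the lower-level response that the upper-level classifier consumes, and the dictionary D is learned rather than fixed.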

(1) It proposes moving from methods that rely on traditional hand-crafted session-compensated features to an “end-to-end” architecture that learns the features required for SID.

(2) It proposes a joint optimization framework of session compensation and classifier, which can apply session compensation under the supervision of classifier to achieve a better performance.

(3) It presents an adaptive moment estimation (Adam) [26] based optimization for training the bilevel framework.

(4) Compared with LDA, our methods can deal with i-vectors without the constraint of a Gaussian distribution.
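The Adam update mentioned in contribution (3) can be sketched as follows; the toy quadratic objective is only a sanity check, not the paper's bilevel objective:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected moment estimates."""
    m = b1 * m + (1 - b1) * grad             # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2        # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)                # bias correction, t starts at 1
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Sanity check on a toy quadratic f(theta) = ||theta||^2 (gradient 2*theta).
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.01)
```

In the paper's setting the same update would be applied to the dictionary and classifier parameters, with the gradients coming from the bilevel objective.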

The remainder of this paper is organized as follows. Section 2 gives a brief overview of the typical i-vector framework based SID system. The formulated bilevel method for SID is presented in Section 3. The evaluation of the proposed methods and a discussion of the results are reported in Section 4, and the conclusions of this study are given in Section 5.


Typical i-vector framework based speaker identification

In this section, we briefly introduce the i-vector framework based speaker identification system. An i-vector is obtained by mapping a sequence of feature vectors, which represents a speech utterance, to a fixed-length vector. Several methods can be used to implement this mapping, such as the methods in [27], [28] and our method [29]. Here, we introduce the traditional approach proposed in [7], which is used as a baseline in this study. This approach first computes the Mel-frequency cepstral coefficients (MFCCs) of the utterance.
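The MFCC front-end mentioned above follows the standard pipeline (framing, power spectrum, mel filterbank, log, DCT); the parameter values below are common defaults, not necessarily those used in the paper:

```python
import numpy as np
from scipy.fft import dct

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters equally spaced on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv_mel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_filters=26, n_ceps=13):
    """Frame -> power spectrum -> mel filterbank -> log -> DCT."""
    frames = np.lib.stride_tricks.sliding_window_view(signal, n_fft)[::hop]
    frames = frames * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    energies = np.maximum(power @ mel_filterbank(n_filters, n_fft, sr).T, 1e-10)
    return dct(np.log(energies), type=2, axis=1, norm='ortho')[:, :n_ceps]

# One second of a synthetic tone stands in for real speech.
t = np.arange(16000) / 16000.0
feats = mfcc(np.sin(2 * np.pi * 440 * t))   # shape: (n_frames, 13)
```

In the baseline system, such frame-level features would then be fed to a UBM and total-variability model to extract the i-vector.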

Bilevel framework for speaker identification

In this section, we propose a joint optimization framework of session compensation and the classifier based on the bilevel method. In this framework, we use sparse coding to obtain the session-compensated representation by learning a dictionary, and we employ the softmax classifier and the SVM as classifiers. We call the resulting methods sparse coding-softmax (SC-softmax) and sparse coding-support vector machine (SC-SVM), respectively.
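A heavily simplified sketch of SC-softmax training on synthetic data: the lower level computes a sparse code via ISTA, and the upper level takes softmax gradient steps. The dictionary update here follows the plain reconstruction gradient as a stand-in for the implicit task-driven gradient of the bilevel formulation, and all dimensions and learning rates are illustrative:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, D, lam, n_iter=50):
    # Lower level: ISTA for min_a 0.5||x - D a||^2 + lam ||a||_1
    L = np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a - (D.T @ (D @ a - x)) / L, lam / L)
    return a

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
n_dim, n_atoms, n_spk = 8, 16, 3
# Toy "i-vectors": each speaker clusters around its own mean vector.
means = rng.normal(0, 2, (n_spk, n_dim))
X = np.vstack([means[s] + 0.3 * rng.normal(size=(10, n_dim))
               for s in range(n_spk)])
y = np.repeat(np.arange(n_spk), 10)

D = rng.normal(size=(n_dim, n_atoms))
D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
W = np.zeros((n_spk, n_atoms))                 # softmax weights
lam, lr_w, lr_d = 0.05, 0.1, 0.01

for epoch in range(30):
    for x, s in zip(X, y):
        a = sparse_code(x, D, lam)             # lower-level response
        g = softmax(W @ a)
        g[s] -= 1.0                            # grad of cross-entropy w.r.t. logits
        W -= lr_w * np.outer(g, a)             # upper-level classifier step
        # Simplified dictionary step: reconstruction gradient as a stand-in
        # for the implicit task-driven gradient of the bilevel problem.
        D -= lr_d * np.outer(D @ a - x, a)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-8)

pred = np.array([np.argmax(W @ sparse_code(x, D, lam)) for x in X])
acc = float(np.mean(pred == y))
```

SC-SVM would replace the cross-entropy gradient with a multiclass hinge-loss gradient; the alternating structure stays the same.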

Evaluation and analysis

The performance of the proposed methods is evaluated on three databases: King-ASR-010 [43], VoxCeleb [44] and RSR2015 [45]. The data in these databases are recorded under various sessions that involve different inter-session variability factors, i.e. non-channel variability, complex session variability and channel variability. To assess system performance, we follow the metrics in [44] and report the accuracies from top-1 to top-5.
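The top-k accuracy metric can be computed as follows; the score matrix below is a toy example, not data from the paper:

```python
import numpy as np

def topk_accuracy(scores, labels, k):
    """Fraction of trials whose true speaker is among the k highest scores."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))

# Toy score matrix: 3 test utterances scored against 3 enrolled speakers.
scores = np.array([[0.90, 0.05, 0.05],
                   [0.10, 0.30, 0.60],
                   [0.20, 0.50, 0.30]])
labels = np.array([0, 1, 1])
top1 = topk_accuracy(scores, labels, 1)   # 2/3: one trial ranks its speaker second
top2 = topk_accuracy(scores, labels, 2)   # 1.0: widening to top-2 recovers it
```

Top-1 is the usual identification accuracy; top-5 credits a trial whenever the true speaker appears anywhere in the five best-scoring hypotheses.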

Conclusions

In this paper, we made a systematic investigation of the joint optimization of session compensation and the classifier in SID. We proposed a bilevel framework that integrates the session compensation and subsequent classification stages by jointly learning the dictionary and classifier parameters. We refined the bilevel framework into two methods, SC-softmax and SC-SVM, to improve performance. In addition, compared with typical session compensation techniques, such as

Conflict of interest statement

There is no conflict of interest.

Acknowledgement

This research is supported by the National Natural Science Foundation of China under grants No. U1736210 and No. 61471145.

Chen Chen received the M.E. degree in computer science and technology from Harbin Institute of Technology. She is a Ph.D. candidate of Harbin Institute of Technology. Her research interests include speaker recognition and speech recognition. (Email: [email protected])

References (48)

  • G. Biagetti et al., An investigation on the accuracy of truncated DKLT representation for speaker identification with short sequences of speech frames, IEEE Trans. Cybern. (2017)
  • G. Biagetti et al., Speaker Identification in Noisy Conditions Using Short Sequences of Speech Frames (2018)
  • Y.Z. Işik et al., S-vector: a discriminative representation derived from i-vector for speaker verification
  • D. Snyder et al., X-vectors: robust DNN embeddings for speaker recognition
  • S. Prince et al., Probabilistic linear discriminant analysis for inferences about identity
  • O. Ghahabi et al., Deep learning backend for single and multisession i-vector speaker recognition, IEEE/ACM Trans. Audio Speech Lang. Process. (2017)
  • R. Vogt et al., Discriminant NAP for SVM speaker recognition
  • M. Mclaren et al., Source-normalized LDA for robust speaker recognition using i-vectors from multiple speech sources, IEEE Trans. Audio Speech Lang. Process. (2012)
  • S. Sadjadi et al., Nearest neighbor discriminant analysis for robust speaker recognition
  • W. Rao et al., Neural networks based channel compensation for i-vector speaker verification
  • A. Misra et al., Locally weighted linear discriminant analysis for robust speaker verification
  • A. Misra et al., Modelling and compensation for language mismatch in speaker verification, Speech Commun. (2017)
  • B. Colson et al., An overview of bilevel optimization, Ann. Oper. Res. (2007)
  • D. Bradley et al., Differentiable sparse coding


Wei Wang received the Ph.D. degree from the School of Computer Science, Harbin Institute of Technology, Harbin, China, in 2016. She is a lecturer in the School of Computer Science and Technology, Harbin Institute of Technology (Weihai). Her research interests include speaker recognition and language recognition. (Email: [email protected])

Yongjun He received the Ph.D. degree from the School of Computer Science, Harbin Institute of Technology, Harbin, China, in 2008. Currently, he is an associate professor in the School of Computer Science and Technology, Harbin University of Science and Technology. His research interests include speaker recognition, speech recognition and machine learning. (Email: [email protected])

Jiqing Han (corresponding author) received the Ph.D. degree in computer science and technology from Harbin Institute of Technology. He is a professor and doctoral supervisor in the School of Computer Science and Technology. He is a committee member of the Automatic Discipline, National Natural Science Foundation of China, a committee member of the National Science and Technology Awards of China, and the vice chairman of the Society of Speech Processing, Association for Chinese Information Processing. His main research fields are speech signal processing and audio information retrieval. He has won three Second-Prize and two Third-Prize Science and Technology awards at the ministry/province level. (Email: [email protected])
