Exploring kernel discriminant analysis for speaker verification with limited test data
Introduction
Speaker verification (SV) deals with confirming the identity claim of a person using his/her spoken utterance [1]. Based on the content of speech, it can be broadly categorized into text-dependent and text-independent SV. The former requires users to reproduce the same phrase of around 2–3 s during training and testing, which is common across all speakers. The latter places no constraint on the speech content, and the typical speech required for training and testing is of the order of 2–3 min and 20–30 s, respectively. Text-independent SV has several advantages over text-dependent SV, the most important being the absence of any constraint on what the user speaks. Additionally, it captures speaker characteristics in a more generic way, since a large vocabulary is involved. In this regard, text-independent SV has greater scope for deployable systems. However, the recent trend of deploying SV systems in practice requires a short-utterance based framework. In many scenarios such as security or ID verification systems, it is desirable to reduce the utterance duration for user convenience and computational efficiency. Nevertheless, sufficient speech can be collected from the users once for training the speaker models. This work focuses on a text-independent SV framework with sufficient training data and short test utterances, preferably of less than 10 s, from the outlook of application-oriented systems.
The field of SV witnessed a breakthrough with the development of i-vector based speaker modeling [2]. Later, in [3], [4], the efficacy of i-vector based SV systems under short utterance conditions was also explored. In [2], various channel/session compensation techniques were explored at the back-end, among which linear discriminant analysis (LDA) followed by within-class covariance normalization (WCCN) provided the best results. Further, in [3], LDA followed by WCCN gave results comparable to those obtained with Gaussian probabilistic linear discriminant analysis (GPLDA) under the condition of sufficient training data with short test utterances (≤ 10 s). In these works, the role of LDA is to reduce the dimension of the i-vectors while minimizing the intra-speaker variability and maximizing the separation between speakers, and that of WCCN is to reduce the session variability among the i-vectors.
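As a rough illustration of this back-end chain, the following sketch fits an LDA projection followed by a WCCN whitening on labelled i-vectors. This is not the authors' implementation; the small regularization constant and the target dimension are illustrative assumptions.

```python
import numpy as np

def lda_projection(X, y, dim):
    """Fit an LDA projection maximizing between-class over within-class scatter.
    X: (n_samples, d) i-vectors; y: integer speaker labels; dim: target dimension."""
    d = X.shape[1]
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mu)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Generalized eigenproblem Sb v = lambda Sw v (small ridge keeps Sw invertible)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(-evals.real)
    return evecs[:, order[:dim]].real  # (d, dim); apply as X @ A

def wccn_transform(X, y):
    """WCCN: Cholesky factor of the inverse average within-class covariance."""
    d = X.shape[1]
    W = np.zeros((d, d))
    classes = np.unique(y)
    for c in classes:
        W += np.cov(X[y == c], rowvar=False, bias=True)
    W /= len(classes)
    return np.linalg.cholesky(np.linalg.inv(W + 1e-6 * np.eye(d)))  # apply as X @ B
```

In an i-vector pipeline, the LDA matrix and the WCCN factor are estimated on development data and then applied to both enrollment and test i-vectors before scoring.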
It is known that LDA is a powerful feature extraction technique when classification is the task [5]. However, as a dimensionality reduction technique, LDA can only transform the feature vectors onto a single hyperplane. Hence, LDA may not be the best option for many pattern recognition tasks, as it results in the loss of important information when the data are not linearly separable [6], [7]. To address this problem, many researchers have worked on kernel based discriminant analysis (KDA) techniques [8], [9], [10], [11]. The crux of KDA is to map the data onto a higher dimensional space where the classes are more separable, perform LDA in this space and then reduce the data to the desired dimension, thus separating the classes well. KDA has already been successfully used in various pattern recognition areas, a few of which are face recognition [12], [13], [14], facial-expression recognition [15], handwritten digit recognition [9], [16], human activity recognition [17] and speech recognition [18], [19]. In [18], [19], KDA is utilized to remove the speaker-dependent part of the features in order to make them robust to speech variations for the speech recognition task. Kim et al. [20] use a multi-modal KDA in an SV system at the front-end feature extraction level to improve the robustness of the extracted features.
The i-vectors are found to vary with the speaker, session and phonetic content of the utterance, and these variabilities are even larger for short utterances [21]. It is hypothesized that compensation techniques that minimize these variabilities by handling the data points linearly may not be suitable, as the variabilities may be non-linear in nature. Their impact is expected to be greater for short utterances owing to the larger variation in phonetic content in a text-independent SV framework. In this work, the efficacy of an i-vector based SV system is explored when KDA is used at the back-end to reduce the intra-speaker variability in place of the existing techniques. Since KDA reduces dimensionality non-linearly, it is assumed to preserve important information that is lost by linear techniques. In this context, KDA is expected to perform well, and an SV system is developed over the i-vector framework that uses KDA to non-linearly map the i-vectors and perform discriminant analysis. Further, higher KDA dimensions are explored, expecting an improvement since KDA maps the data points into a higher dimensional space to perform discriminant analysis. The proposed setup for SV using KDA is compared with existing and recent approaches for handling short utterances, highlighting the scope of this work under SV with limited test data.
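Once the i-vectors have been compensated (whether by LDA-WCCN or by KDA), a verification trial reduces to comparing the enrollment and test i-vectors; cosine similarity is the common choice in i-vector systems such as [2]. A minimal sketch, assuming cosine scoring as the decision backend:

```python
import numpy as np

def cosine_score(w_enroll, w_test):
    """Cosine similarity between an enrollment and a test i-vector.
    Higher scores indicate a more likely target (same-speaker) trial."""
    num = float(w_enroll @ w_test)
    den = float(np.linalg.norm(w_enroll) * np.linalg.norm(w_test))
    return num / den
```

The score is then compared against a threshold tuned on development data to accept or reject the identity claim.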
The rest of the work is organized as follows: Section 2 details the mathematical formulation of KDA. Section 3 provides the description of the proposed framework using KDA for channel/session compensation in i-vector based speaker modeling. The steps involved in implementing the system and the compensation strategies are described in Section 4. In Section 5, the results and observations of the experiments performed are discussed with a focus on the short test utterance scenario. Finally, the work is concluded in Section 6 with future scope and directions.
Kernel discriminant analysis
KDA performs a non-linear mapping from the actual feature space to a high dimensional space and then implements LDA on the mapped features. In this way, both non-linearity in the data and the class separation problems are addressed simultaneously. The mathematical formulation of KDA can be discussed as follows:
Suppose the training data points are given as belonging to ‘C’ classes and there is a non-linear mapping function φ, which transforms the data points to a space Γ. The
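Because the mapped space Γ may be very high dimensional, KDA is usually solved in the dual, where every quantity is expressed through the kernel matrix (the kernel Fisher discriminant of [9], [10]). The sketch below follows that dual formulation; the RBF kernel, the bandwidth `gamma` and the regularizer `reg` are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kda_fit(X, y, dim, gamma=1.0, reg=1e-3):
    """Dual-form KDA: returns coefficients alpha so that a point x projects
    to k(x, X) @ alpha, where k is evaluated against the training set X."""
    n = len(X)
    K = rbf_kernel(X, X, gamma)
    m_total = K.mean(axis=1)
    M = np.zeros((n, n))  # between-class scatter in the dual
    N = np.zeros((n, n))  # within-class scatter in the dual
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        Kc = K[:, idx]                       # (n, n_c)
        diff = (Kc.mean(axis=1) - m_total)[:, None]
        M += len(idx) * (diff @ diff.T)
        H = np.eye(len(idx)) - 1.0 / len(idx)  # centering matrix
        N += Kc @ H @ Kc.T
    # Regularized generalized eigenproblem M a = lambda N a
    evals, evecs = np.linalg.eig(np.linalg.solve(N + reg * np.eye(n), M))
    order = np.argsort(-evals.real)
    return evecs[:, order[:dim]].real

def kda_project(X_new, alpha, X_train, gamma=1.0):
    """Project new points onto the KDA directions via the kernel trick."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```

On data that is not linearly separable (e.g. two concentric rings), this dual formulation finds a discriminant that plain LDA cannot, which is the motivation for applying it to i-vectors in this work.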
Kernel discriminant analysis for speaker verification
The i-vectors, apart from possessing dominant speaker information, also contain information about the session, channel and phonetic content, which needs to be eliminated for robust speaker modeling. In this perspective, LDA followed by WCCN or GPLDA has proven its efficacy, as mentioned in [2], [3]. But it is hypothesized that these techniques may not be apropos in capturing the speaker dependent information as they tend to eliminate salient information when the classes are not linearly
System description
An i-vector based SV system as outlined in [2] has been developed over the standard NIST SRE 2003 dataset [23]. It contains data from 356 speakers, comprising 212 female and 144 male speakers. The NIST SRE 2003 database is considered for the initial exploration, as the study is based on the use of short test utterances with sufficient training data under clean conditions in a co-operative scenario for practical deployable systems. The utterances are processed using a 20 ms Hamming window
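The front-end framing step can be sketched as follows. The source specifies only the 20 ms Hamming window; the 10 ms hop and the 8 kHz sampling rate (typical for NIST SRE telephone speech) are assumptions made for illustration:

```python
import numpy as np

def frame_signal(x, fs=8000, win_ms=20, hop_ms=10):
    """Split a 1-D speech signal into overlapping Hamming-windowed frames.
    Returns an array of shape (n_frames, samples_per_frame)."""
    win = int(fs * win_ms / 1000)   # 160 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)   # 80 samples at 8 kHz
    n_frames = 1 + (len(x) - win) // hop
    frames = np.stack([x[i * hop : i * hop + win] for i in range(n_frames)])
    return frames * np.hamming(win)  # taper each frame
```

Each windowed frame then feeds the cepstral feature extraction from which the sufficient statistics for i-vector estimation are accumulated.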
Experimental results and analysis
Table 1 reports the performance in terms of EER and DCF obtained for the full and truncated short test utterance cases (2–10 s) with the classical baseline method, LDA followed by WCCN, and its comparison to the GPLDA based setup on the i-vector based SV system developed over the NIST SRE 2003 database, as explained in Section 4. After fine-tuning, LDA-WCCN and LDA-GPLDA of 150 dimensions are considered for the study. It is observed that LDA followed by GPLDA is better capable of handling
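The EER metric used throughout these comparisons is the operating point at which the false-acceptance and false-rejection rates are equal. A minimal sketch of computing it from target and non-target trial scores (a threshold sweep over the sorted scores; not the official NIST scoring tool):

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Equal error rate from target (same-speaker) and non-target trial scores."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones_like(target_scores),
                             np.zeros_like(nontarget_scores)])
    order = np.argsort(scores)
    labels = labels[order]
    # At threshold scores[k]: targets at or below it are falsely rejected,
    # non-targets above it are falsely accepted.
    frr = np.cumsum(labels) / labels.sum()
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2
```

DCF is computed from the same error-rate sweep by weighting the two error types with application-dependent costs and priors, then taking the minimum of the weighted sum.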
Conclusion
This work presents a novel framework based on a kernel based discriminant analysis technique and demonstrates its efficacy over the conventional channel/session compensation methods for i-vector based speaker modeling. The potency of this technique is demonstrated for SV, along with consideration of the short test utterance scenario from the outlook of practical systems, on the standard NIST SRE 2003 and NIST SRE 2008 databases. Based on the studies made with the existing LDA followed by WCCN and GPLDA based techniques
Acknowledgment
This work is in part supported by a project grant 12(6)/2012-ESD for the project entitled “Development of Speech-Based Multi-level Person Authentication System” funded by the Department of Electronics and Information Technology (DeitY), Govt. of India.
References (26)
- et al., An overview of text-independent speaker recognition: from features to supervectors, Speech Commun. (2010)
- et al., Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques, Speech Commun. (2014)
- et al., Front-end factor analysis for speaker verification, IEEE Trans. Audio Speech Lang. Process. (2011)
- et al., I-vector based speaker recognition on short utterances, Proceedings of the 12th Annual Conference of the International Speech Communication Association (Interspeech) (2011)
- R.K. Das, S.R.M. Prasanna, Speaker Verification for Variable Duration Segments and the Effect of Session Variability, ...
- et al., Pattern Classification (2012)
- Local fisher discriminant analysis for supervised dimensionality reduction, Proceedings of the 23rd International Conference on Machine Learning (2006)
- et al., Discriminant analysis for dimensionality reduction: an overview of recent developments, Biometrics: Theory, Methods, and Applications (2010)
- et al., Generalized discriminant analysis using a kernel approach, Neural Comput. (2000)
- et al., Fisher discriminant analysis with kernels, Proceedings of the IEEE Workshop on Neural Networks for Signal Processing IX (1999)
- Discriminant Analysis and Statistical Pattern Recognition
- Statistical Learning Theory
- Face recognition using kernel-based fisher discriminant analysis, Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition