
Pattern Recognition Letters

Volume 98, 15 October 2017, Pages 26-31

Exploring kernel discriminant analysis for speaker verification with limited test data

https://doi.org/10.1016/j.patrec.2017.08.004

Highlights

  • A novel framework for channel/session compensation in i-vector speaker modeling.

  • Explores the non-linearity of channel/session information in the i-vector framework.

  • Demonstrates the effectiveness of kernel discriminant analysis (KDA) at higher dimensions.

  • Significance of KDA for speaker verification with limited test data.

Abstract

Speaker verification (SV) under limited test data conditions is desirable for practical, application-oriented systems. i-vector based speaker modeling has proven effective for SV tasks, but its performance degrades as the test utterance becomes shorter. Although i-vectors are compact and dominant speaker representations, they also carry channel and session information, which must be compensated for robust speaker modeling. The conventional compensation techniques, linear discriminant analysis (LDA) followed by within-class covariance normalization (WCCN) and Gaussian probabilistic linear discriminant analysis (GPLDA), remove the channel/session variation across i-vectors under the assumption that it is linearly separable. In this work, a novel method for channel/session compensation is proposed using kernel discriminant analysis (KDA), which projects the i-vectors into a higher dimensional space and performs discriminant analysis there to remove the information that is unwanted for speaker modeling. SV studies on the standard NIST speaker recognition evaluation (SRE) 2003 and 2008 databases show the advantage of the proposed compensation over the conventional methods, and the advantage grows when short test utterances are used. The improvements are hypothesized to stem from the non-linearity of channel/session information in the i-vector domain.

Introduction

Speaker verification (SV) deals with confirming the identity claim of a person using his/her spoken utterance [1]. Based on the content of the speech, it can be broadly categorized into text-dependent and text-independent SV. The former requires users to reproduce, during both training and testing, the same phrase of around 2–3 s that is common across speakers. The latter places no constraint on the speech content, and the typical speech required for training and testing is of the order of 2–3 min and 20–30 s, respectively. Text-independent SV has several advantages over text-dependent SV, the most important being that there is no constraint on what the user must speak. Additionally, it captures speaker characteristics in a more generic way, since a large vocabulary is involved. In this regard, text-independent SV has greater scope for deployable systems. However, the recent trend of deploying SV systems in practice calls for short utterance based frameworks. In many scenarios, such as security or ID verification systems, it is desirable to reduce the utterance duration for user convenience and for computational reasons, while sufficient speech can be collected from the users once for training the speaker models. This work focuses on a text-independent SV framework with sufficient training data and short test utterances, preferably of less than 10 s of speech, from the outlook of application-oriented systems.

The field of SV witnessed a breakthrough with the development of i-vector based speaker modeling [2]. Later, in [3], [4], the efficacy of i-vector based SV systems under short utterance conditions was also explored. In [2], various channel/session compensation techniques were explored at the back-end, among which linear discriminant analysis (LDA) followed by within-class covariance normalization (WCCN) provided better results. Further, in [3], LDA followed by WCCN gave results comparable to those obtained with Gaussian probabilistic linear discriminant analysis (GPLDA) under the condition of sufficient training data with short test utterances ( ≤ 10 s). In these works, the role of LDA is to reduce the dimension of the i-vectors while minimizing the intra-speaker variability and maximizing the separation between speakers, and that of WCCN is to reduce the session variability among the i-vectors.
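To make the baseline back-end concrete, the following is a minimal sketch, not the authors' code, of LDA followed by WCCN applied to pooled training i-vectors. The 150-dimensional LDA output, the array shapes and the use of scikit-learn/SciPy are illustrative assumptions.

import numpy as np
from scipy.linalg import cholesky, inv
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_lda_wccn(ivectors, labels, lda_dim=150):
    # ivectors: (N, D) training i-vectors; labels: (N,) speaker identities
    labels = np.asarray(labels)
    lda = LinearDiscriminantAnalysis(n_components=lda_dim)
    projected = lda.fit_transform(ivectors, labels)

    # WCCN: whiten the LDA-projected space with the inverse of the
    # average within-class (per-speaker) covariance matrix.
    classes = np.unique(labels)
    W = np.zeros((lda_dim, lda_dim))
    for c in classes:
        Xc = projected[labels == c]
        Xc = Xc - Xc.mean(axis=0)
        W += Xc.T @ Xc / len(Xc)
    W /= len(classes)
    B = cholesky(inv(W), lower=True)   # WCCN projection: B @ B.T = inv(W)
    return lda, B

def apply_lda_wccn(lda, B, ivectors):
    # Returns channel/session-compensated i-vectors
    return lda.transform(ivectors) @ B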

It is known that LDA is a powerful feature extraction technique when classification is the task [5]. However, as a dimensionality reduction technique, LDA can only transform the feature vectors onto a single hyperplane. Hence, LDA may not be a good option for many pattern recognition tasks, as it loses important information when the data are not linearly separable [6], [7]. To address this problem, many researchers have worked on kernel based discriminant analysis techniques [8], [9], [10], [11]. The crux of kernel based discriminant analysis is to map the data onto a higher dimensional space where the classes are more separable, perform LDA in that space and then reduce to the desired dimension, thereby separating the classes well. KDA has already been used successfully in various pattern recognition areas, including face recognition [12], [13], [14], facial-expression recognition [15], hand-written digit recognition [9], [16], human activity recognition [17] and speech recognition [18], [19]. In [18], [19], KDA is utilized to remove the speaker dependent part of the features in order to make them robust to speaker variations for the speech recognition task. Kim et al. [20] use a multi-modal KDA in an SV system at the front-end feature extraction level to improve the robustness of the extracted features.

The i-vectors are found to vary with the speaker, the session and the phonetic content of the utterance, and these variabilities are even larger for short utterances [21]. It is hypothesized that compensation techniques which minimize these variabilities by treating the data points linearly may not be suitable, as the variabilities may be non-linear in nature. The impact is expected to be greater for short duration utterances, owing to the larger variation caused by different phonetic content in a text-independent SV framework. In this work, the efficacy of an i-vector based SV system is explored when KDA is used at the back-end, instead of the existing techniques, to reduce the intra-speaker variability. Since KDA reduces dimensionality non-linearly, it is assumed to preserve important information that is lost by linear techniques. In this context, KDA is expected to perform well, and an SV system is developed over the i-vector based framework that uses KDA to non-linearly map the i-vectors and perform discriminant analysis. Further, higher KDA dimensions are explored, expecting an improvement since the data points are mapped into a higher dimensional space before discriminant analysis is performed. The proposed KDA based setup is compared with existing and recent approaches for handling short utterances, highlighting the scope of this work for SV with limited test data.
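The paper's excerpt does not spell out the scoring stage of the back-end; for orientation, the following is a minimal sketch of cosine-similarity scoring, a common choice in i-vector systems [2], applied to compensated i-vectors. The threshold value is purely illustrative and is not taken from the paper.

import numpy as np

def cosine_score(enroll_ivec, test_ivec):
    # Cosine similarity between the enrolled-speaker i-vector and the test i-vector
    enroll = enroll_ivec / np.linalg.norm(enroll_ivec)
    test = test_ivec / np.linalg.norm(test_ivec)
    return float(enroll @ test)

def verify(enroll_ivec, test_ivec, threshold=0.3):
    # Accept the identity claim if the score exceeds the decision threshold
    return cosine_score(enroll_ivec, test_ivec) >= threshold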

The rest of the work is organized as follows: Section 2 details the mathematical formulation of KDA. Section 3 describes the proposed framework that uses KDA for channel/session compensation in i-vector based speaker modeling. The steps involved in implementing the system and the compensation strategies are given in Section 4. In Section 5, the results and observations of the experiments are discussed, with focus on the short test utterance scenario. Finally, the work is concluded in Section 6 with future scope and directions.

Section snippets

Kernel discriminant analysis

KDA performs a non-linear mapping from the original feature space to a high dimensional space and then applies LDA on the mapped features. In this way, both the non-linearity in the data and the class separation problem are addressed simultaneously. The mathematical formulation of KDA is as follows:

Suppose the training data points are given as X = {x1, x2, …, xn}, belonging to ‘C’ classes, and there is a non-linear mapping function φ which transforms the data points to a space Γ. The
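As an illustration of the formulation sketched above, the following is a minimal numerical sketch of multi-class KDA: the between- and within-class scatter matrices are built in the kernel-induced space and a generalized eigenproblem is solved for the discriminant directions. The RBF kernel, its width, the ridge regularizer and the interface are assumptions for the sketch, not the authors' implementation.

import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def rbf_kernel(X, Y, gamma=0.1):
    # k(x, y) = exp(-gamma * ||x - y||^2), an assumed choice of kernel
    return np.exp(-gamma * cdist(X, Y, 'sqeuclidean'))

def kda_fit(X, y, n_components, gamma=0.1, reg=1e-3):
    # X: (n, D) training vectors; y: (n,) class (speaker) labels
    X, y = np.asarray(X), np.asarray(y)
    n = len(X)
    K = rbf_kernel(X, X, gamma)                     # n x n kernel matrix
    mu = K.mean(axis=1, keepdims=True)              # overall mean of kernel columns

    Sb = np.zeros((n, n))                           # between-class scatter (kernel space)
    Sw = np.zeros((n, n))                           # within-class scatter (kernel space)
    for c in np.unique(y):
        Kc = K[:, y == c]                           # kernel columns of class c
        nc = Kc.shape[1]
        mc = Kc.mean(axis=1, keepdims=True)
        Sb += nc * (mc - mu) @ (mc - mu).T
        Sw += (Kc - mc) @ (Kc - mc).T

    # Generalized eigenproblem Sb a = lambda (Sw + reg I) a; keep the leading directions.
    vals, vecs = eigh(Sb, Sw + reg * np.eye(n))
    A = vecs[:, ::-1][:, :n_components]             # expansion coefficients of the discriminants
    return (X, A, gamma)

def kda_transform(model, Xnew):
    # Project new vectors using the kernel expansion over the training points
    Xtrain, A, gamma = model
    return rbf_kernel(np.asarray(Xnew), Xtrain, gamma) @ A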

Kernel discriminant analysis for speaker verification

The i-vectors, apart from possessing dominant speaker information, also contain information about the session, channel and phonetic content, which needs to be eliminated for robust speaker modeling. In this perspective, LDA followed by WCCN or GPLDA has proven its efficacy, as mentioned in [2], [3]. However, it is hypothesized that these techniques may not be apt for capturing the speaker dependent information, as they tend to eliminate salient information when the classes are not linearly

System description

An i-vector based SV system as outlined in [2] has been developed on the standard NIST SRE 2003 dataset [23]. It contains data from 356 speakers, comprising 212 female and 144 male speakers. The NIST SRE 2003 database is considered for the initial study, as the work is based on the use of short test utterances with sufficient training data under clean conditions in a co-operative scenario for practical deployable systems. The utterances are processed using a 20 ms Hamming window
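Only the 20 ms Hamming windowing of the front-end is stated above; the sketch below illustrates such framing for MFCC extraction. The 10 ms frame shift, the number of coefficients, the 8 kHz sampling rate and the use of librosa are assumptions, not parameters taken from the paper.

import librosa

def extract_mfcc(wav_path, n_mfcc=19):
    # Assumed 8 kHz telephone speech, as is typical of NIST SRE data
    signal, sr = librosa.load(wav_path, sr=8000)
    frame_len = int(0.020 * sr)                     # 20 ms Hamming window (stated in the text)
    hop_len = int(0.010 * sr)                       # assumed 10 ms frame shift
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=hop_len,
                                window='hamming')
    return mfcc.T                                   # (frames, n_mfcc)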

Experimental results and analysis

Table 1 reports the performance, in terms of EER and DCF, obtained for the full and truncated short test utterance cases (2–10 s) with the classical baseline of LDA followed by WCCN, and its comparison with the GPLDA based setup, on the i-vector based SV system developed over the NIST SRE 2003 database as explained in Section 4. After fine-tuning, LDA-WCCN and LDA-GPLDA of 150 dimensions are considered for the study. It is observed that LDA followed by GPLDA is better capable of handling
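For reference, the EER reported in Table 1 corresponds to the operating point at which the miss and false-alarm rates are equal; a minimal sketch of its computation from target and impostor trial scores is given below. This is not the authors' evaluation code, and the NIST DCF additionally involves cost and prior parameters not shown here.

import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    # Sweep candidate thresholds and return the point where miss ~= false alarm
    target_scores = np.asarray(target_scores)
    impostor_scores = np.asarray(impostor_scores)
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    eer, best_gap = 0.0, np.inf
    for t in thresholds:
        miss = np.mean(target_scores < t)           # targets wrongly rejected
        fa = np.mean(impostor_scores >= t)          # impostors wrongly accepted
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2
    return eer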

Conclusion

This work presents a novel framework based on a kernel based discriminant analysis technique and demonstrates its efficacy over the conventional channel/session compensation methods for i-vector based speaker modeling. The potency of this technique is demonstrated for SV, along with consideration of the short test utterance scenario from the outlook of practical systems, on the standard NIST SRE 2003 and NIST SRE 2008 databases. Based on the studies made with the existing LDA followed by WCCN and GPLDA based techniques

Acknowledgment

This work is in part supported by a project grant 12(6)/2012-ESD for the project entitled “Development of Speech-Based Multi-level Person Authentication System” funded by the Department of Electronics and Information Technology (DeitY), Govt. of India.

References (26)

  • T. Kinnunen et al.

    An overview of text-independent speaker recognition: from features to supervectors

    Speech Commun.

    (2010)
  • A. Kanagasundaram et al.

    Improving short utterance i-vector speaker verification using utterance variance modelling and compensation techniques

    Speech Commun.

    (2014)
  • N. Dehak et al.

    Front-end factor analysis for speaker verification

    IEEE Trans. Audio Speech Lang. Process.

    (2011)
  • A. Kanagasundaram et al.

    I-vector based speaker recognition on short utterances

    Proceedings of 12th Annual Conference of the International Speech Communication Association (Interspeech)

    (2011)
  • R.K. Das, S.R.M. Prasanna, Speaker Verification for Variable Duration Segments and the Effect of Session Variability,...
  • R.O. Duda et al.

    Pattern Classification

    (2012)
  • M. Sugiyama

    Local fisher discriminant analysis for supervised dimensionality reduction

    Proceedings of the 23rd International Conference on Machine Learning

    (2006)
  • J. Ye et al.

    Discriminant analysis for dimensionality reduction: an overview of recent developments

    Biometrics: Theory, Methods, and Applications

    (2010)
  • G. Baudat et al.

    Generalized discriminant analysis using a kernel approach

    Neural Comput.

    (2000)
  • B. Schölkopf et al.

    Fisher discriminant analysis with kernels

    Proceedings of IEEE Workshop on Neural Networks for Signal Processing IX

    (1999)
  • G. McLachlan

    Discriminant Analysis and Statistical Pattern Recognition

    (2004)
  • V.N. Vapnik

    Statistical Learning Theory

    (1998)
  • Q. Liu et al.

    Face recognition using kernel-based fisher discriminant analysis

    Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition

    (2002)