
Speech Communication

Volume 24, Issue 3, June 1998, Pages 193-209

Text-independent speaker recognition using non-linear frame likelihood transformation

https://doi.org/10.1016/S0167-6393(98)00010-7

Abstract

When the reference speakers are represented by Gaussian mixture models (GMM), the conventional approach is to accumulate the frame likelihoods over the whole test utterance and either compare the results, as in speaker identification, or apply a threshold, as in speaker verification. In this paper we describe a method in which the frame likelihoods are transformed into new scores by some non-linear function prior to their accumulation. We have studied two families of such functions. The first actually performs likelihood normalization, a technique widely used in speaker verification, but applied here at the frame level. The second kind of function transforms the likelihoods into weights according to some criterion. We call this transformation weighting models rank (WMR). Both kinds of transformation require the frame likelihoods from all (or a subset of all) reference models to be available. For this, every frame of the test utterance is input to the required reference models in parallel, and then the likelihood transformation is applied. The new scores are further accumulated over the whole test utterance in order to obtain an utterance-level score for a given speaker model. We have also found that normalizing these utterance scores is effective for speaker verification. Experiments on two databases, the TIMIT corpus and the NTT database for speaker recognition, showed better speaker identification rates and a significant reduction of the speaker verification equal error rate (EER) when the frame likelihood transformation was used.


Introduction

Speaker recognition has been a research topic for many years, and various types of speaker models have been studied. Hidden Markov models (HMM) have become the most popular statistical tool for this task. The best results have been obtained using continuous HMMs (CHMM) for modeling speaker characteristics (Savic and Gupta, 1990, Furui, 1991, Rosenberg et al., 1991, Rosenberg et al., 1994, Matsui and Furui, 1992). For the text-independent task, where the temporal sequence modeling capability of the HMM is not required, a one-state CHMM, also called a Gaussian mixture model (GMM), has been widely used as a speaker model (Tseng et al., 1992, Reynolds and Rose, 1995, Gish and Schmidt, 1994, Bimbot et al., 1995, Matsui and Furui, 1995). In accordance with (Matsui and Furui, 1992), our previous study (Markov and Nakagawa, 1995) showed that a GMM can perform even better than a multi-state CHMM.

The objective of speaker identification is to find the speaker model λ_i, given the set of reference models Λ = {λ_1, …, λ_N} and the sequence of test vectors (or frames) X = {x_1, …, x_T}, that gives the maximum a posteriori probability P(λ|X). This requires calculating all P(λ_j|X), j = 1, …, N, and finding the maximum among them. In speaker verification, only the claimant speaker's model λ_c is used, and P(λ_c|X) is compared with a predetermined threshold in order to accept or reject X as being uttered by the claimant speaker.

In most tasks, it is possible to use the likelihood p(X|λ) instead of P(λ|X), which does not require the prior probabilities P(λ) to be known. Another simplifying assumption is that the vectors of the sequence X are independent and identically distributed random variables. This allows us to express p(X|λ) as (Duda and Hart, 1973)

p(X|λ) = ∏_{t=1}^{T} p(x_t|λ),    (1)

where p(x_t|λ) is the likelihood of a single frame x_t given model λ. This is a fundamental equation of statistical theory and is widely used in speech recognition. Generally speaking, p(X|λ) is an utterance-level score of X given model λ, obtained from the frame-level scores p(x_t|λ) using Eq. (1). Obviously, other ways of defining such scores can exist.
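As a concrete illustration, Eq. (1) is typically evaluated in the log domain to avoid numerical underflow over long utterances. A minimal sketch (the function name is ours, not from the paper):

```python
import math

def utterance_log_likelihood(frame_likelihoods):
    """Eq. (1) in the log domain: log p(X|lambda) = sum_t log p(x_t|lambda).

    frame_likelihoods: iterable of per-frame likelihoods p(x_t|lambda).
    Summing logs avoids the underflow that the raw product of many
    small likelihoods would cause.
    """
    return sum(math.log(p) for p in frame_likelihoods)
```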

Our approach is based on the following definition of the utterance-level score:

Sc(X|λ) = ∏_{t=1}^{T} Sc(x_t|λ) = ∏_{t=1}^{T} f(p(x_t|λ)),    (2)

where f(·) is some function of the frame likelihoods p(x_t|λ) that transforms them into new scores Sc(x_t|λ). When this function is of the type f(x) = x, Eq. (2) becomes equivalent to Eq. (1). As will be discussed in Section 6.1, a linear f(·) does not lead to a reduction of the recognition errors. That is why we have considered non-linear likelihood transformations.
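The score of Eq. (2) can be sketched as follows; the per-frame transform f is applied before accumulation, and the identity transform recovers Eq. (1). This is our illustrative reading, not code from the paper:

```python
def utterance_score(frame_likelihoods, f=lambda p: p):
    """Eq. (2): Sc(X|lambda) = prod_t f(p(x_t|lambda)).

    With the default identity transform f(p) = p, this reduces to the
    plain likelihood product of Eq. (1).
    """
    score = 1.0
    for p in frame_likelihoods:
        score *= f(p)
    return score
```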

The first family of such functions we have experimented with essentially performs likelihood normalization, but now applied at the frame level. The likelihood normalization approach has been successfully used at the utterance level for speaker verification (Reynolds, 1995a; Rosenberg et al., 1992; Matsui and Furui, 1995; Higgins et al., 1991), but it is usually not used for speaker identification. This is simply because, as shown in Section 6.2, when it is applied only once, to the utterance-level likelihoods, it is a meaningless operation. Gish and Schmidt (1994) have shown that likelihood normalization may be successful when the speaker scores are computed over relatively short time intervals (segments of the utterance). In their system, each speaker is represented by multiple uni-modal Gaussian models (a special case of a GMM) trained on data from different sessions, and only the best model's score for each speaker over a given segment is taken into account. The segment scores are further normalized in order to obtain a meaningful comparison between segments. Our method, however, differs from this study in two main points: first, in our system each speaker is represented by only one GMM; second, likelihood normalization is done on each frame instead of over short time intervals.

The second family of likelihood transformations converts the frame likelihood p(x_t|λ) into one of a set of predetermined weights w_j, j = 1, …, N. This type of transformation requires the likelihoods p(x_t|λ_j) from all reference models given the current frame to be calculated and sorted. Here we introduce the variable r_λ, called the rank of the model, which corresponds to the position of the model's likelihood in the sorted list and is an integer ranging from 1 to N. The weights are a function of the ranks r_λ:

w(r_λ) = g(r_λ),

where g(·) is some function of an integer argument. Obviously, knowing the form of g(·) and the number of reference speakers N, we can calculate all possible weights w in advance. Since weights and model ranks are involved in this type of likelihood transformation, we call it the weighting models rank (WMR) technique.
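The WMR step for a single frame can be sketched as: sort the N model likelihoods, convert positions to ranks, and replace each likelihood by the weight g(rank). The function and variable names are ours, chosen for illustration:

```python
def wmr_transform(frame_likelihoods, g):
    """Replace each model's frame likelihood by a weight g(rank), where
    rank 1 is the model with the highest likelihood for this frame and
    rank N the lowest.

    frame_likelihoods: list of p(x_t|lambda_j), j = 1..N.
    g: function of an integer rank, e.g. g = lambda r: 1.0 / r.
    Returns the list of per-model frame scores Sc(x_t|lambda_j).
    """
    n = len(frame_likelihoods)
    # model indices sorted by descending likelihood
    order = sorted(range(n), key=lambda j: -frame_likelihoods[j])
    ranks = [0] * n
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    # in practice all weights g(1)..g(N) can be precomputed once;
    # here we simply apply g directly
    return [g(r) for r in ranks]
```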

The rest of the paper is organized as follows. Section 2 gives a brief description of the GMM we used. Section 3 provides details of the speaker identification and verification tasks. Section 4 explains our likelihood transformation approach in detail. Section 5 describes our speech databases and summarizes our experimental results. In Section 6 we present some discussion and analysis of our method. Finally, we draw conclusions in Section 7.


Gaussian mixture model

A GMM is a weighted sum of M component densities and is given by the form (Reynolds and Rose, 1995)

p(x|λ) = ∑_{i=1}^{M} c_i b_i(x),

where x is a d-dimensional random vector, b_i(x), i = 1, …, M, are the component densities and c_i, i = 1, …, M, are the mixture weights. Each component density is a d-variate Gaussian function of the form

b_i(x) = (1 / ((2π)^{d/2} |Σ_i|^{1/2})) exp(−(1/2)(x − μ_i)^T Σ_i^{−1} (x − μ_i)),

with mean vector μ_i and covariance matrix Σ_i. The mixture weights satisfy the constraint that ∑_{i=1}^{M} c_i = 1.
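To make the model concrete, here is a sketch of the GMM frame likelihood p(x|λ) for the diagonal-covariance case mentioned in the experiments. The names and the plain-list representation are our assumptions, not the paper's implementation:

```python
import math

def gmm_likelihood(x, weights, means, variances):
    """p(x|lambda) = sum_i c_i * b_i(x) for a diagonal-covariance GMM.

    x: frame, a list of d floats.
    weights: mixture weights c_i (summing to 1).
    means: list of M mean vectors (each a list of d floats).
    variances: list of M diagonal variance vectors.
    """
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        # log of the d-variate Gaussian density with diagonal covariance
        log_b = -0.5 * sum(
            math.log(2 * math.pi * v) + (xk - mk) ** 2 / v
            for xk, mk, v in zip(x, mu, var)
        )
        total += c * math.exp(log_b)
    return total
```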

The complete Gaussian mixture model is parameterized by the mixture weights, mean vectors and covariance matrices of all component densities: λ = {c_i, μ_i, Σ_i}, i = 1, …, M.

Speaker identification

Given a sample of a speech utterance, speaker identification is the task of deciding to which of a group of N known speakers this utterance belongs. In the closed-set problem, it is assumed that the utterance belongs to one of the registered speakers.

As mentioned in Section 1, in the identification task the aim is to find the speaker i whose model λ_i maximizes the a posteriori probability P(λ_i|X), 1 ⩽ i ⩽ N, which according to Bayes' rule is

P(λ_i|X) = p(X|λ_i) P(λ_i) / p(X).

Furthermore, due to the lack of prior knowledge, we assume that all speakers are equally likely, so maximizing P(λ_i|X) reduces to maximizing the likelihood p(X|λ_i).
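Under the equal-prior assumption, the identification decision reduces to picking the model with the largest accumulated likelihood. A minimal sketch (names are ours):

```python
import math

def identify_speaker(per_model_frame_likelihoods):
    """Closed-set identification: argmax_i sum_t log p(x_t|lambda_i).

    per_model_frame_likelihoods: list of N lists, one per reference
    model, each holding that model's frame likelihoods p(x_t|lambda_i).
    Returns the index of the winning model.
    """
    scores = [
        sum(math.log(p) for p in frames)
        for frames in per_model_frame_likelihoods
    ]
    return scores.index(max(scores))
```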

Likelihood normalization

As we stated in Section 1, the first family of frame likelihood transformation functions essentially performs likelihood normalization.

Given a single frame likelihood p(x_t|λ_i) from the ith speaker model, the likelihood transformation is done using the following general function form:

f(p(x_t|λ_i)) = p(x_t|λ_i) / ((1/B) ∑_{b=1}^{B} p(x_t|λ_b)),

where the p(x_t|λ_b) are the frame likelihoods from the B background speaker models given the same frame x_t. Different choices of the background speaker set give different variants of the normalization.
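A sketch of this frame-level normalization, taking all reference models as the background set (one possible choice of background; the function name is ours):

```python
def normalize_frame(frame_likelihoods):
    """Divide each model's frame likelihood p(x_t|lambda_i) by the
    arithmetic mean of the background likelihoods for the same frame.
    Here the background set is simply all B = N reference models.
    """
    background_mean = sum(frame_likelihoods) / len(frame_likelihoods)
    return [p / background_mean for p in frame_likelihoods]
```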

Experiments

We evaluated our speaker recognition system using several types of GMMs with both full and diagonal covariance matrices. As a baseline system, we used the conventional maximum likelihood testing approach based on Eq. (1) or Eq. (8).

Linear versus non-linear frame likelihood transformation

When considering the type of the likelihood transformation function f(·) of Eq. (2), it is very important to choose the right one. Since it is not quite obvious why a linear f(·) is not appropriate, below we prove that a linear transformation of the frame likelihoods does not change the recognition rate.

Consider the linear transformation function f(x) = ax + b and the frame likelihood p(x_t|λ_i) of the ith speaker model at time t. Then the transformed likelihood is

f(p(x_t|λ_i)) = a·p(x_t|λ_i) + b.
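For the scale-only special case b = 0, the invariance is immediate under the product accumulation of Eq. (2); the following is a sketch of that one step, not the paper's full proof:

```latex
Sc(X|\lambda_i)
  = \prod_{t=1}^{T} \bigl( a\,p(x_t|\lambda_i) \bigr)
  = a^{T} \prod_{t=1}^{T} p(x_t|\lambda_i)
  = a^{T}\, p(X|\lambda_i),
```

so every model's score is multiplied by the same constant a^T, and the argmax decision over models is unchanged.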

Conclusion

We have developed and experimented with a non-linear frame likelihood transformation method, which allowed us to successfully apply the likelihood normalization technique to the speaker identification task. For speaker verification, the combination of frame-level and utterance-level likelihood normalization was also successful. Another new technique, the WMR transformation, was experimented with as well. Both approaches showed better results in speaker identification and speaker verification compared to the conventional baseline system.


References (31)

  • Gish, H., Schmidt, M., 1994. Text-independent speaker identification. IEEE Signal Processing Magazine, October, pp. ...
  • Linde, Y., et al., 1980. An algorithm for vector quantizer design. IEEE Transactions on Communications.
  • Lleida, E., Rose, R., 1996. Efficient decoding and training procedures for utterance verification in continuous speech...
  • Markov, K., Nakagawa, S., 1995. Text-independent speaker identification on TIMIT database. In: Proceedings of the...
  • Markov, K., Nakagawa, S., 1996a. Text-independent speaker recognition system using frame level likelihood processing....