Text-independent speaker recognition using non-linear frame likelihood transformation
Introduction
Speaker recognition has been a research topic for many years and various types of speaker models have been studied. Hidden Markov models (HMM) have become the most popular statistical tool for this task. The best results have been obtained using continuous HMM (CHMM) for modeling the speaker characteristics (Savic and Gupta, 1990; Furui, 1991; Rosenberg et al., 1991; Rosenberg et al., 1994; Matsui and Furui, 1992). For the text-independent task, where the temporal sequence modeling capability of the HMM is not required, a one-state CHMM, also called a Gaussian mixture model (GMM), has been widely used as a speaker model (Tseng et al., 1992; Reynolds and Rose, 1995; Gish and Schmidt, 1994; Bimbot et al., 1995; Matsui and Furui, 1995). In accordance with (Matsui and Furui, 1992), our previous study (Markov and Nakagawa, 1995) showed that a GMM can perform even better than a multi-state CHMM.
The objective of speaker identification is to find, given the set of reference models Λ={λ1,…,λN} and a sequence of test vectors (or frames) X={x1,…,xT}, the speaker model λi that gives the maximum a posteriori probability P(λi|X). This requires calculating all N posterior probabilities P(λi|X) and finding the maximum among them. In speaker verification, only the claimant speaker's model λc is used and P(λc|X) is compared with a predetermined threshold in order to accept or reject X as being uttered by the claimant speaker.
In most tasks, it is possible to use the likelihood p(X|λ) instead of P(λ|X), which does not require the prior probabilities P(λ) to be known. Another simplifying assumption is that the vectors of the sequence X are independent and identically distributed random variables. This allows p(X|λ) to be expressed as (Duda and Hart, 1973)

p(X|λ) = ∏_{t=1}^{T} p(xt|λ),     (1)

where p(xt|λ) is the likelihood of the single frame xt given model λ. This is a fundamental equation of statistical decision theory and is widely used in speech recognition. Generally speaking, p(X|λ) is an utterance-level score of X given model λ, obtained from the frame-level scores p(xt|λ) using Eq. (1). Obviously, other ways of defining such scores can exist.
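In practice Eq. (1) is evaluated in the log domain, since the raw product over hundreds of frames underflows floating-point arithmetic. A minimal sketch (the frame likelihood values are illustrative, not from the paper):

```python
import numpy as np

def utterance_log_likelihood(frame_likelihoods):
    """Eq. (1) in the log domain: log p(X|lam) = sum_t log p(x_t|lam).

    Summing log-likelihoods avoids the numerical underflow that the raw
    product over a long utterance would cause.
    """
    return float(np.sum(np.log(frame_likelihoods)))

# Toy frame likelihoods p(x_t|lam) for a 4-frame utterance.
frame_probs = [0.12, 0.08, 0.20, 0.05]
score = utterance_log_likelihood(frame_probs)
```

Because log is monotonic, comparing log-likelihoods between speaker models gives the same decision as comparing the products themselves.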
Our approach is based on the following definition of the utterance-level score:

Sc(X|λ) = ∏_{t=1}^{T} Sc(xt|λ),  Sc(xt|λ) = f(p(xt|λ)),     (2)

where f(·) is some function of the frame likelihoods p(xt|λ) that transforms them into new scores Sc(xt|λ). When this function is of the type f(x)=x, Eq. (2) becomes equivalent to Eq. (1). As will be discussed in Section 6.1, no linear form of f(·) leads to a reduction of the recognition errors. That is why we have considered non-linear likelihood transformations.
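The framework of Eq. (2) can be sketched with a pluggable transformation function; the particular transformations tried below (identity and squaring) are only illustrations, not the paper's proposed ones:

```python
import math

def transformed_utterance_score(frame_likelihoods, f):
    """Eq. (2) in the log domain: sum_t log f(p(x_t|lam)).

    f is the frame likelihood transformation; f must return a
    positive score for every frame likelihood.
    """
    return sum(math.log(f(p)) for p in frame_likelihoods)

probs = [0.12, 0.08, 0.20]
# f(x) = x reduces Eq. (2) to Eq. (1).
identity_score = transformed_utterance_score(probs, lambda p: p)
# Any other (here: squaring, purely illustrative) transformation plugs in the same way.
squared_score = transformed_utterance_score(probs, lambda p: p * p)
```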
The first family of such functions we have experimented with essentially performs likelihood normalization, but now applied at the frame level. The likelihood normalization approach has been successfully used at the utterance level for speaker verification (Reynolds, 1995a; Rosenberg et al., 1992; Matsui and Furui, 1995; Higgins et al., 1991), but is usually not used for speaker identification purposes. This is simply because, as shown in Section 6.2, when applied only once to the utterance-level likelihoods, it is a meaningless operation. Gish and Schmidt (1994) have shown that when the speaker scores are computed over relatively short time intervals (segments of the utterances), likelihood normalization may be successful. In their system, each speaker is represented by multiple uni-modal Gaussian models (a special case of a GMM) trained on data from different sessions, and only the best model's score for each speaker over a given segment is taken into account. The segment scores are further normalized in order to obtain a meaningful comparison between segments. Our method, however, differs from this study in two main points. First, in our system each speaker is represented by only one GMM and, second, likelihood normalization is done on each frame instead of on short time intervals.
The second family of likelihood transformations converts the frame likelihood p(xt|λ) into one of a set of predetermined weights. This type of transformation requires the likelihoods p(xt|λj), j=1,…,N, from all reference models given the current frame to be calculated and sorted. Here we introduce the variable rλ, called the rank of the model, which corresponds to the position of its likelihood in the sorted list and is an integer ranging from 1 to N. The weights are a function of the ranks,

w = F(rλ),

where F(·) is some function of an integer argument. Obviously, all possible weights w can be calculated in advance, knowing the form of F(·) and the number of reference speakers N. Since weights and model ranks are involved in this type of likelihood transformation, we call it the weighting models rank (WMR) technique.
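The WMR idea can be sketched as follows; the decreasing weight function F(r) = 1/r and the toy likelihood values are assumptions for illustration, not the paper's actual choices:

```python
import numpy as np

def wmr_scores(frame_likelihoods_all_models, weights):
    """Weighting models rank (WMR) sketch.

    frame_likelihoods_all_models: (T, N) array of p(x_t|lam_j) for every
    frame t and reference model j. At each frame the models are ranked by
    likelihood, each likelihood is replaced by the weight attached to its
    rank (rank 1 = highest likelihood), and the weights are accumulated
    per model over the utterance.
    """
    L = np.asarray(frame_likelihoods_all_models, dtype=float)
    T, N = L.shape
    scores = np.zeros(N)
    for t in range(T):
        order = np.argsort(-L[t])        # model indices, best first
        ranks = np.empty(N, dtype=int)
        ranks[order] = np.arange(N)      # 0-based rank per model (0 = best)
        scores += weights[ranks]         # replace likelihood by rank weight
    return scores

# Hypothetical weight function F(r) = 1/r for N = 3 reference speakers.
N = 3
w = 1.0 / np.arange(1, N + 1)
L = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2]]
scores = wmr_scores(L, w)
best = int(np.argmax(scores))
```

Because only ranks enter the score, WMR discards the absolute likelihood values; all N possible weights are precomputed once from F and N.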
The rest of the paper is organized as follows. Section 2 gives a brief description of the GMM we used. Section 3 provides details of the speaker identification and verification tasks. Section 4 explains our likelihood transformation approach in detail. Section 5 describes our speech databases and summarizes our experimental results. In Section 6 we present some discussion and analysis of our method. Finally, we draw some conclusions in Section 7.
Gaussian mixture model
A GMM is a weighted sum of M component densities and is given by the form (Reynolds and Rose, 1995)

p(x|λ) = ∑_{i=1}^{M} ci bi(x),

where x is a d-dimensional random vector, bi(x), i=1,…,M, are the component densities and ci, i=1,…,M, are the mixture weights. Each component density is a d-variate Gaussian function of the form

bi(x) = (2π)^{−d/2} |Σi|^{−1/2} exp{−(1/2)(x−μi)ᵀ Σi^{−1} (x−μi)},

with mean vector μi and covariance matrix Σi. The mixture weights satisfy the constraint ∑_{i=1}^{M} ci = 1.
The complete Gaussian mixture model is parameterized by the mixture weights, mean vectors and covariance matrices of all component densities, λ = {ci, μi, Σi}, i=1,…,M.
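Evaluating p(x|λ) for a diagonal-covariance GMM (the diagonal restriction and all numbers below are illustrative simplifications; the paper also uses full covariance matrices) can be sketched as:

```python
import numpy as np

def gmm_density(x, c, mu, sigma):
    """p(x|lam) = sum_i c_i * b_i(x) for a diagonal-covariance GMM.

    x: (d,) input vector; c: (M,) mixture weights summing to 1;
    mu: (M, d) component means; sigma: (M, d) per-dimension variances.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    d = x.shape[0]
    # Gaussian normalization term (2*pi)^(d/2) * |Sigma_i|^(1/2) per component.
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.prod(sigma, axis=1))
    # Mahalanobis exponent, simplified for diagonal covariances.
    expo = np.exp(-0.5 * np.sum((x - mu) ** 2 / sigma, axis=1))
    return float(np.sum(np.asarray(c, dtype=float) * expo / norm))

# Two-component toy model in d = 2 dimensions.
c = [0.6, 0.4]
mu = [[0.0, 0.0], [3.0, 3.0]]
sigma = [[1.0, 1.0], [1.0, 1.0]]
p = gmm_density([0.0, 0.0], c, mu, sigma)
```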
Speaker identification
Given a sample of a speech utterance, speaker identification is the task of deciding to which of a group of N known speakers this utterance belongs. In the closed-set problem, it is assumed that the utterance belongs to one of the registered speakers.
As mentioned in Section 1, in the identification task the aim is to find the speaker whose model maximizes the a posteriori probability, which according to Bayes' rule is

P(λi|X) = p(X|λi) P(λi) / p(X).

Furthermore, due to lack of prior knowledge, we assume that all speakers are equally likely, i.e. P(λi) = 1/N. Since p(X) is the same for all speakers, maximizing P(λi|X) reduces to maximizing the likelihood p(X|λi).
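Under uniform priors, the closed-set decision rule is a simple argmax over log-likelihoods; a minimal sketch with toy frame likelihoods (not from the paper's experiments):

```python
import math

def identify_speaker(frame_likelihoods_per_model):
    """Closed-set identification with uniform priors P(lam_i) = 1/N.

    frame_likelihoods_per_model[i] holds p(x_t|lam_i) for each frame t of
    the test utterance; maximizing P(lam_i|X) reduces to maximizing the
    utterance log-likelihood log p(X|lam_i).
    """
    log_scores = [sum(math.log(p) for p in frames)
                  for frames in frame_likelihoods_per_model]
    return max(range(len(log_scores)), key=lambda i: log_scores[i])

# Frame likelihoods for N = 3 speaker models over a 3-frame utterance.
per_model = [[0.2, 0.1, 0.3],
             [0.4, 0.3, 0.2],
             [0.1, 0.2, 0.1]]
winner = identify_speaker(per_model)
```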
Likelihood normalization
As stated in Section 1, the first family of frame likelihood transformation functions essentially performs likelihood normalization.
Given a single frame likelihood p(xt|λi) from the ith speaker model, the likelihood transformation is done using the following general function form:

Sc(xt|λi) = p(xt|λi) / f(p(xt|λ1), …, p(xt|λB)),

where p(xt|λb), b=1,…,B, are the frame likelihoods from the background speaker models given the same frame xt. Different choices of the background speaker set give different variants of the normalization.
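One concrete variant can be sketched as follows; taking all reference speakers as the background set and their sum as the combining function is an assumption for illustration, since the paper considers several choices:

```python
import numpy as np

def normalized_frame_scores(likelihoods):
    """Frame-level likelihood normalization (one illustrative variant).

    likelihoods: (T, N) array of p(x_t|lam_j). Each frame likelihood is
    divided by the sum over all reference models at that frame, so the
    scores at every frame sum to one across speakers.
    """
    L = np.asarray(likelihoods, dtype=float)
    return L / L.sum(axis=1, keepdims=True)

L = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3]]
S = normalized_frame_scores(L)
# Utterance-level score per model: sum of log normalized frame scores.
scores = np.log(S).sum(axis=0)
```

Normalizing every frame, rather than the final utterance likelihood once, is what makes the operation meaningful for identification.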
Experiments
We evaluated our speaker recognition system using several types of GMMs with both full and diagonal covariance matrices. As a baseline system, we used the conventional maximum likelihood testing approach based on Eq. (1) or Eq. (8).
Linear versus non-linear frame likelihood transformation
When considering the type of the likelihood transformation function f(·) in Eq. (2), it is very important to choose the right one. Since it is not obvious why a linear form of f(·) is not appropriate, below we prove that a linear transformation of the frame likelihoods does not change the recognition rate.
Consider the linear transformation function f(x) = ax + b and the frame likelihood p(xt|λi) of the ith speaker model at time t. Then, the transformed likelihood is

Sc(xt|λi) = a·p(xt|λi) + b.
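The pure-scaling special case (b = 0) is easy to check numerically: multiplying every frame likelihood by the same constant a multiplies every model's utterance score by a^T, so the winning model cannot change. The paper's proof in Section 6.1 covers the general affine case; the sketch below, with toy likelihoods, only illustrates the scaling case:

```python
def utterance_score(frames, a=1.0, b=0.0):
    """Utterance score of Eq. (2) under a linear frame transform f(x) = a*x + b."""
    prod = 1.0
    for p in frames:
        prod *= a * p + b
    return prod

per_model = [[0.2, 0.1, 0.3],
             [0.4, 0.3, 0.2]]
plain  = [utterance_score(f) for f in per_model]
scaled = [utterance_score(f, a=5.0) for f in per_model]
# With T = 3 frames, every score is multiplied by the same factor a^T = 125,
# so the ranking of the models is unchanged.
same_winner = plain.index(max(plain)) == scaled.index(max(scaled))
```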
Conclusion
We have developed and experimented with a non-linear frame likelihood transformation method, which allowed us to successfully apply the likelihood normalization technique to the speaker identification task. For speaker verification, the combination of frame- and utterance-level likelihood normalization was also successful. Another new technique, the WMR transformation, was experimented with as well. Both approaches showed better results in speaker identification and speaker verification compared to the conventional maximum likelihood approach.
References (31)
- Bimbot, F. et al., 1995. Second-order statistical measures for text-independent speaker identification. Speech Communication.
- Furui, S., 1991. Speaker-dependent feature extraction, recognition and processing techniques. Speech Communication.
- Higgins, A. et al., 1991. Speaker verification using randomized phrase prompting. Digital Signal Processing.
- Matsui, T., Furui, S., 1995. Likelihood normalization for speaker verification using a phoneme- and speaker-independent model. Speech Communication.
- Reynolds, D.A., 1995. Speaker identification and verification using Gaussian mixture speaker models. Speech Communication.
- Dempster, A.P. et al., 1977. Maximum likelihood estimation from incomplete data. Journal of the Royal Statistical Society B.
- Doddington, G.R., 1985. Speaker recognition – Identifying people by their voices. Proceedings of the IEEE.
- Duda, R., Hart, P., 1973. Pattern Classification and Scene Analysis. Wiley, New York.
- Fukunaga, K., 1990. Introduction to Statistical Pattern Recognition. Academic Press.
- Furui, S., 1978. A study on personal characteristics in speech sound. Ph.D. Thesis, University of Tokyo (in Japanese).
- Linde, Y. et al., 1980. An algorithm for vector quantizer design. IEEE Transactions on Communications.