An alternative normalization scheme in HMM-based text-dependent speaker verification
Introduction
Recently, it has been proposed (Charlet, 1997) to interpret an HMM-based text-dependent speaker verification system as a two-step classifier. The first step is the alignment task, which is supposed to align each test frame with the part of the model (probability density function) that corresponds to the same speech event. The second step is the scoring task, which is supposed to measure what is specific to the speaker for a given acoustical event. Such a distinction is motivated by the fact that speaker recognition aims at finding speaker-related information, whereas speech-related information is predominant in the speech signal. Although this distinction between alignment and scoring might appear somewhat artificial, it provides an efficient framework for studying the acoustical parameterization of the speaker. Indeed, when the alignment task and the scoring task are separated, two distinct feature sets can be used, each optimized for its own task. The problem of finding a feature set optimized for the alignment belongs to speech recognition. The search for an optimized feature set for the scoring task has been treated in (Charlet, 1997), where it is shown that, among a set of potential acoustical features, a subset or a weighting can be found that optimizes a given criterion, e.g., the error rate for speaker verification. This principle rests on the assumption that the alignment is correct, meaning that the test speech portion and the model on which it is aligned correspond to the same speech event.
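The two-step view described above can be illustrated with a minimal Python sketch. It is not the system of (Charlet, 1997): the left-to-right topology, the single diagonal Gaussian per state, the fixed transition probabilities and all function names are assumptions chosen for illustration. The point is only that the alignment step and the scoring step can consume different feature streams for the same frames.

```python
import numpy as np

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian, one value per frame."""
    x = np.atleast_2d(x)
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=1)

def viterbi_left_to_right(frames, means, variances,
                          log_stay=np.log(0.5), log_next=np.log(0.5)):
    """Step 1 (alignment): best state path through a left-to-right HMM.

    Assumes the path starts in state 0, ends in the last state, and that
    there are at least as many frames as states.
    """
    T, S = len(frames), len(means)
    # emit[t, s] = log p(frame t | state s)
    emit = np.stack([log_gauss_diag(frames, means[s], variances[s])
                     for s in range(S)], axis=1)
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0, 0] = emit[0, 0]
    for t in range(1, T):
        for s in range(S):
            stay = delta[t - 1, s] + log_stay
            move = delta[t - 1, s - 1] + log_next if s > 0 else -np.inf
            if move > stay:
                delta[t, s], back[t, s] = move, s - 1
            else:
                delta[t, s], back[t, s] = stay, s
            delta[t, s] += emit[t, s]
    # Backtrack from the final state.
    path = np.empty(T, dtype=int)
    path[-1] = S - 1
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

def score_on_alignment(frames, path, means, variances):
    """Step 2 (scoring): mean per-frame log-likelihood along a fixed alignment."""
    ll = [log_gauss_diag(frames[t], means[s], variances[s])[0]
          for t, s in enumerate(path)]
    return float(np.mean(ll))

# Hypothetical toy data: one feature stream for alignment, another for scoring.
align_feats = np.array([[0.0], [0.1], [5.0], [5.2], [9.9], [10.0]])
align_means = np.array([[0.0], [5.0], [10.0]])
align_vars = np.ones((3, 1))
score_means = np.array([[1.0, -1.0], [2.0, 0.0], [3.0, 1.0]])
score_vars = np.ones((3, 2))

path = viterbi_left_to_right(align_feats, align_means, align_vars)
score_feats = score_means[path]  # frames that match the scoring model exactly
score = score_on_alignment(score_feats, path, score_means, score_vars)
```

Keeping the two functions separate mirrors the argument of the text: the alignment features can be tuned as in speech recognition, while the scoring features can be selected or weighted for speaker discrimination.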
Unfortunately, alignment on the claimed-speaker model is not always correct for test utterances in a practical application context, for instance in telephone applications. This is because the claimed-speaker model is often trained with very little data and might not be trained well enough to cope with varying call conditions and variations in pronunciation. Consequently, the alignment is often incorrect, which makes the acoustical score difficult to interpret. That is why we propose to perform the alignment on the speaker-independent model of the password: such a model is trained on many speakers and call conditions, and therefore enables a more robust alignment. This implies a particular training procedure for the claimed-speaker model, and we show that it enables a new way of obtaining a normalized score, which is of both practical and theoretical interest.
The paper is organized as follows. We first investigate the speaker-characteristic information that might be present in the alignment, and we show that such information is also contained in the alignment on the speaker-independent model. Then, the normalization scheme is proposed and its advantages are shown. The proposed system is then evaluated and compared to a classical HMM-based system. Finally, experiments on integrating the speaker-characteristic alignment information in the decision making are reported.
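As a rough illustration of scoring on a common alignment, the sketch below takes a state sequence that is assumed to come from the speaker-independent model and compares, frame by frame, a claimed-speaker model with a background model on that same sequence. The use of a mean log-likelihood ratio, the single diagonal Gaussian per state and the function names are assumptions made for this example, not the exact formulation of the paper.

```python
import numpy as np

def frame_loglik(x, mean, var):
    """Log-likelihood of one frame under a diagonal-covariance Gaussian."""
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

def normalized_score(frames, alignment, spk_means, spk_vars, bg_means, bg_vars):
    """Mean per-frame log-likelihood ratio on a shared alignment.

    alignment[t] is the state assigned to frame t by the (assumed)
    speaker-independent alignment; both models are evaluated on that state.
    """
    diffs = [
        frame_loglik(x, spk_means[s], spk_vars[s])
        - frame_loglik(x, bg_means[s], bg_vars[s])
        for x, s in zip(frames, alignment)
    ]
    return float(np.mean(diffs))

# Hypothetical two-state models with unit variances.
spk_means = np.array([[0.0], [4.0]])
bg_means = np.array([[1.0], [5.0]])
unit_vars = np.ones((2, 1))
alignment = [0, 0, 1]

client = normalized_score(np.array([[0.0], [0.0], [4.0]]), alignment,
                          spk_means, unit_vars, bg_means, unit_vars)
impostor = normalized_score(np.array([[1.0], [1.0], [5.0]]), alignment,
                            spk_means, unit_vars, bg_means, unit_vars)
```

Because both likelihoods are computed for the same state at every frame, each term of the average compares the two models on the same speech event, which is what makes the normalization consistent in this sketch.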
Speaker characteristic alignment
In this section, we study the speaker-characteristic information contained in the alignment. Previous work (Forsyth and Jack, 1993) has shown that speaker-specific duration information can be captured in the Viterbi alignment. We ask whether the alignment on the speaker-independent model still contains such information. To investigate this, we have developed an elementary duration model. As we consider that a small amount of training data (here three
Database description and experimental setup
For experimental evaluation, we use a telephone speech database collected over long-distance telephone lines, which contains a set of 55 true speakers (male and female) and a distinct set of 600 impostors. Participants in the collection made phone calls from any place of their choice (very often from home or the office) and were asked to try to call every week. The recordings span more than one year.
The speech data consist of five short sentences (average duration of each
Integrating alignment information in the decision making
We have seen in Section 2.1 that the alignment contains some speaker information, especially when using the distortion measure da. Here we evaluate the correlation between the alignment information captured in da(X) and the acoustical information captured in S(X), so as to determine whether there is a potential interest in integrating the alignment score in the decision making. The correlation coefficient is computed as follows: where Xj is
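A minimal sketch of such a correlation measurement, assuming the coefficient is the usual Pearson correlation taken over a set of test utterances Xj, is given below. The function name and the sample values are hypothetical; da(Xj) and S(Xj) are simply passed in as two lists of per-utterance scores.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length score lists."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    ac = a - a.mean()
    bc = b - b.mean()
    return float(np.sum(ac * bc) / np.sqrt(np.sum(ac ** 2) * np.sum(bc ** 2)))

# Hypothetical per-utterance values: alignment distortion and acoustical score.
d_a = [0.2, 0.5, 0.9, 0.4]
s_x = [1.0, 0.7, 0.1, 0.8]
rho = pearson(d_a, s_x)
```

A coefficient close to zero would suggest that the two scores carry complementary information, which is the case in which combining them in the decision is most likely to help.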
Conclusion
In this paper, we propose a normalization framework for HMM-based text-dependent speaker verification in which the claimed-speaker model score and the background model score are computed on a common alignment made on a speaker-independent model of the password. Such a system preserves part of the speaker-specific information contained in the alignment and makes the normalization score more consistent. This approach focuses on a frame-by-frame comparison between speaker model and background
References (10)
- et al. Optimizing feature set for speaker verification. Pattern Recognition Letters (1997)
- Discriminating observation probability (DOP) HMM for speaker verification. Speech Communication (1995)
- Charlet, D. 1999. Integrating time-alignment information in the decision making for text-dependent HMM-based speaker...
- Forsyth, M., Jack, M. 1993. Duration modelling and multiple codebooks in semi-continuous HMM for speaker-verification....
- et al. Speaker verification using randomized phrase prompting. Digital Signal Processing (1991)