An alternative normalization scheme in HMM-based text-dependent speaker verification

doi:10.1016/S0167-6393(99)00072-2

Speech Communication

Volume 31, Issues 2–3, June 2000, Pages 113-120

https://doi.org/10.1016/S0167-6393(99)00072-2

Abstract

This paper proposes a normalization scheme for HMM-based text-dependent speaker verification in which the claimed-speaker model score and the background model score are computed on a common alignment obtained with the speaker-independent model of the password. It is shown that this normalization preserves some of the speaker-specific information contained in the alignment and makes the normalized score more consistent by emphasizing the distinctive parts of the claimed-speaker model. A dedicated training procedure is proposed. Experiments on a large-scale, realistic telephone database are reported. Finally, first experiments on integrating alignment-based information into the decision making are presented. These results show the value of the method and encourage further investigation of speaker modeling within this approach.

Résumé

This article proposes a normalization principle for text-dependent speaker verification based on HMM modeling, in which the score on the claimed-speaker model and the score on the normalization model are computed for the same alignment, performed on the speaker-independent model of the password. It is shown that this normalization preserves part of the speaker-characteristic information contained in the alignment and increases the relevance of the normalized score by emphasizing the distinctive parts of the speaker model. A specific training procedure is proposed. Evaluations on a realistic telephone database are described. Finally, first experiments on integrating the information contained in the time alignment into the decision making are presented. All these results show the value of the approach and encourage further research on speaker characterization within it.

Introduction

Recently, it has been proposed (Charlet, 1997) to interpret an HMM-based text-dependent speaker verification system as a two-step classifier. The first step is the alignment task, which is meant to align each test frame with the part of the model (a probability density function) that corresponds to the same speech event. The second step is the scoring task, which is meant to measure what is specific to the speaker for a given acoustic event. This distinction is motivated by the fact that speaker recognition aims at finding speaker-related information, whereas speech-related information is predominant in the speech signal. Although the distinction between alignment and scoring might appear somewhat artificial, it provides an efficient framework for studying the acoustic parameterization of the speaker. Indeed, when the alignment and scoring tasks are separated, two distinct feature sets can be used, each optimized for its own task. Finding a feature set optimized for alignment is a speech recognition problem. The search for a feature set optimized for scoring was treated in (Charlet, 1997), where it is shown that, among a set of candidate acoustic features, a subset or a weighting can be found that optimizes a given criterion, e.g., the speaker verification error rate. This principle rests on the assumption that the alignment is correct, that is, that the test speech portion and the part of the model on which it is aligned correspond to the same speech event.

Unfortunately, alignment on the claimed-speaker model is not always correct for test utterances in practical applications, for instance over the telephone. The claimed-speaker model is often trained on very little data and may not be trained well enough to cope with varying call conditions and pronunciation variations. Consequently, the alignment is often incorrect, which makes the acoustic score difficult to interpret. We therefore propose to perform the alignment on the speaker-independent model of the password, because such a model is trained on many speakers and call conditions and enables a more robust alignment. This implies a particular training procedure for the claimed-speaker model, and we show that it offers a new way to obtain a normalized score, which is of both practical and theoretical interest.
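
The idea of scoring two models on one common alignment can be illustrated with a minimal Python sketch. It is not the paper's implementation. It assumes left-to-right HMMs with one diagonal-covariance Gaussian per state, ignores transition probabilities, and assumes that the claimed-speaker and background models share the state topology of the speaker-independent model. The names gaussian_loglik, viterbi_left_to_right and normalized_score are hypothetical, and each model is represented simply as a (means, variances) pair of arrays of shape (states, dimensions).

```python
import numpy as np


def gaussian_loglik(frames, means, variances):
    """Log-likelihood of every frame under every state's diagonal Gaussian.

    frames: (T, D); means, variances: (N, D). Returns an array of shape (T, N).
    """
    diff = frames[:, None, :] - means[None, :, :]
    return -0.5 * np.sum(diff ** 2 / variances + np.log(2 * np.pi * variances), axis=2)


def viterbi_left_to_right(loglik):
    """Best state path for a left-to-right HMM (stay, or advance by one state).

    Assumption: transition probabilities are ignored, the path starts in the
    first state and ends in the last one, and there are at least as many
    frames as states.
    """
    T, N = loglik.shape
    score = np.full((T, N), -np.inf)
    back = np.zeros((T, N), dtype=int)
    score[0, 0] = loglik[0, 0]
    for t in range(1, T):
        for s in range(N):
            stay = score[t - 1, s]
            advance = score[t - 1, s - 1] if s > 0 else -np.inf
            if advance > stay:
                score[t, s] = advance + loglik[t, s]
                back[t, s] = s - 1
            else:
                score[t, s] = stay + loglik[t, s]
                back[t, s] = s
    # Backtrack from the last state at the last frame.
    path = np.empty(T, dtype=int)
    path[-1] = N - 1
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path


def normalized_score(frames, si_model, speaker_model, background_model):
    """Average frame-level log-likelihood ratio on a common alignment.

    The alignment is computed once, on the speaker-independent model, and the
    claimed-speaker and background models are both scored on that same path.
    """
    path = viterbi_left_to_right(gaussian_loglik(frames, *si_model))
    t = np.arange(len(frames))
    spk = gaussian_loglik(frames, *speaker_model)[t, path]
    bkg = gaussian_loglik(frames, *background_model)[t, path]
    return float(np.mean(spk - bkg)), path
```

In this sketch the background model is passed separately, but the same (means, variances) pair could be used for both the alignment and the background score. Because both models are evaluated on identical frame-to-state assignments, each term of the average compares the two models on the same speech event.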

The paper is organized as follows. We first investigate the speaker-characteristic information that might be present in the alignment and show that such information is also contained in the alignment on the speaker-independent model. Then the normalization scheme is proposed and its advantages are shown. The proposed system is then evaluated and compared with a classical HMM-based system. Finally, experiments on integrating the speaker-characteristic alignment information into the decision making are reported.

Speaker characteristic alignment

In this section, we study the speaker-characteristic information contained in the alignment. Previous work (Forsyth and Jack, 1993) has shown that speaker-specific duration information can be captured in the Viterbi alignment. We ask whether the alignment on the speaker-independent model still contains such information. To investigate this, we have developed an elementary duration model. As we consider that a small amount of training data (here three

Database description and experimental setup

For experimental evaluation, we use a telephone speech database collected over long-distance telephone lines, which contains a set of 55 true speakers (male and female) and a distinct set of 600 impostors. Participants in the data collection made phone calls from any place of their choice (very often from home or the office) and were asked to try to call every week. The database recordings span more than one year.

The speech data consist of five short sentences (average duration of each

Integrating alignment information in the decision making

We have seen in Section 2.1 that the alignment contains some speaker information, especially when using the distortion measure d_a. Here we evaluate the correlation between the alignment information captured in d_a(X) and the acoustical information captured in S(X), in order to determine whether there is a potential interest in integrating the alignment score into the decision making. The correlation coefficient is computed as follows: $r = \frac{\sum_{j} (S(X_{j}) - \bar{S})(d_{a}(X_{j}) - \bar{d_{a}})}{\sqrt{\sum_{j} (S(X_{j}) - \bar{S})^{2} \sum_{j} (d_{a}(X_{j}) - \bar{d_{a}})^{2}}},$ where X_j is

Conclusion

In this paper, we propose a normalization framework for HMM-based text-dependent speaker verification in which the claimed-speaker model score and the background model score are computed for a common alignment made on a speaker-independent model of the password. Such a system preserves a part of speaker-specific information contained in the alignment and makes the normalization score more consistent. This approach focuses on a frame by frame comparison between speaker model and background

References (10)

Charlet, D., et al. 1997. Optimizing feature set for speaker verification. Pattern Recognition Letters.
Forsyth, M. 1995. Discriminating observation probability (DOP) HMM for speaker verification. Speech Communication.
Charlet, D. 1999. Integrating time-alignment information in the decision making for text-dependent HMM-based speaker...
Forsyth, M., Jack, M. 1993. Duration modelling and multiple codebooks in semi-continuous HMM for speaker-verification....
Higgins, H.L., et al. 1991. Speaker verification using randomized phrase prompting. Digital Signal Processing.

There are more references available in the full text version of this article.
