An alternative normalization scheme in HMM-based text-dependent speaker verification

https://doi.org/10.1016/S0167-6393(99)00072-2

Abstract

This paper proposes a normalization scheme for HMM-based text-dependent speaker verification in which the claimed-speaker model score and the background model score are computed for a common alignment made on the speaker-independent model of the password. It is shown that such a normalization preserves some of the speaker-specific information contained in the alignment and makes the normalized score more consistent by emphasizing the distinctive parts of the claimed-speaker model. A special training procedure is proposed. Experiments on a large-scale, realistic telephone database are reported. Finally, initial experiments on integrating alignment-based information into the decision making are presented. Together, these results demonstrate the value of the method and encourage further investigation of speaker modeling in such an approach.

Summary

This article proposes a normalization principle for text-dependent speaker verification based on HMM modeling, in which the score on the claimed-speaker model and the score on the normalization model are computed for a single alignment, performed on the speaker-independent model of the password. It is shown that this normalization preserves part of the speaker-specific information contained in the alignment, and makes the normalized score more relevant by emphasizing the distinctive parts of the speaker model. A specific training procedure is proposed. Evaluations on a realistic telephone database are described. Finally, first experiments on integrating the information contained in the temporal alignment into the decision making are presented. All these results show the value of the approach and encourage further research on speaker characterization within such an approach.

Introduction

Recently, it has been proposed (Charlet, 1997) to interpret an HMM-based text-dependent speaker verification system as a two-step classifier. The first step is the alignment task, which is supposed to align each test frame on the part of the model (probability density function) that corresponds to the same speech event. The second step is the scoring task, which is supposed to measure what is speaker-specific for a given acoustical event. Such a distinction is motivated by the fact that speaker recognition aims at finding speaker-related information, whereas speech-related information is predominant in the speech signal. Although this distinction between alignment and scoring might appear somewhat artificial, it provides an efficient framework for studying the acoustical parameterization of the speaker. Indeed, when the alignment task and the scoring task are separated, two distinct feature sets can be used, one optimized for the alignment and the other for the scoring. Finding a feature set optimized for the alignment is a speech recognition problem. The search for an optimized feature set for the scoring task has been treated in (Charlet, 1997), where it is shown that, among a set of potential acoustical features, a subset or a weighting can be found that optimizes a given criterion, e.g., the error rate for speaker verification. Such a principle rests on the assumption that the alignment is correct, that is, that the test speech portion and the part of the model on which it is aligned correspond to the same speech event.

Unfortunately, alignment on the claimed-speaker model is not always correct for test utterances in a practical application context, for instance in telephone applications. This is because the claimed-speaker model is often trained with very little data and might not be trained well enough to cope with varying call conditions and variations in pronunciation. Consequently, the alignment is often incorrect, which makes the acoustical score difficult to interpret. That is why we propose to perform the alignment on the speaker-independent model of the password: such a model is trained on many speakers and call conditions, and enables a more robust alignment. This implies a particular training procedure for the claimed-speaker model, and we show that it opens a new way to obtain a normalized score, which is of both practical and theoretical interest.
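As a rough illustration of scoring under a common alignment, the sketch below computes an average per-frame log-likelihood ratio between a claimed-speaker model and a background model along one shared state sequence. It is only a sketch under stated assumptions: each state is modeled by a single diagonal-covariance Gaussian, the state sequence `path` is assumed to have been obtained beforehand by alignment on the speaker-independent model, both models share the same state topology, and the function names and the per-frame averaging are hypothetical choices rather than the paper's implementation.

```python
import numpy as np

def diag_gauss_logpdf(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian, one value per frame."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

def common_alignment_score(frames, path, spk_means, spk_vars, bg_means, bg_vars):
    """Average per-frame log-likelihood ratio under one shared alignment.

    frames: (T, D) feature vectors of the test utterance
    path:   (T,) state index for each frame, assumed to come from an
            alignment on the speaker-independent model (not computed here)
    spk_*:  (N, D) per-state parameters of the claimed-speaker model
    bg_*:   (N, D) per-state parameters of the background model
    """
    # Both models are evaluated on the same state for every frame.
    spk_ll = diag_gauss_logpdf(frames, spk_means[path], spk_vars[path])
    bg_ll = diag_gauss_logpdf(frames, bg_means[path], bg_vars[path])
    return float(np.mean(spk_ll - bg_ll))
```

Because both models are evaluated on the same state for each frame, every term of the ratio compares the two densities for the same presumed speech event, which is the property the common alignment is meant to provide.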

The paper is organized as follows. We first investigate the speaker-characteristic information that might be present in the alignment, and we show that such information is also contained in the alignment on the speaker-independent model. Then, the normalization scheme is proposed and its advantages are shown. The proposed system is then evaluated and compared to a classical HMM-based system. Finally, experiments on the integration of the speaker-characteristic alignment information in the decision making are reported.

Section snippets

Speaker characteristic alignment

In this section, we study the speaker-characteristic information that is contained in the alignment. Previous work (Forsyth and Jack, 1993) has shown that speaker-specific duration information can be captured in the Viterbi alignment. We ask whether the alignment on the speaker-independent model still contains such information. To investigate this, we have developed an elementary duration model. As we consider that a small amount of training data (here three

Database description and experimental setup

For experimental evaluation, we use a telephone speech database collected over long-distance telephone lines, which contains a set of 55 true speakers (male and female) and a distinct set of 600 impostors. Participants in the data collection made phone calls from any place of their choice (very often from home or the office), and they were asked to try to call every week. The recordings span more than one year.

The speech data consist of five short sentences (average duration of each

Integrating alignment information in the decision making

We have seen in Section 2.1 that the alignment contains some speaker information, especially when using the distortion measure $d_a$. Here we evaluate the correlation between the alignment information captured in $d_a(X)$ and the acoustical information captured in $S(X)$, so as to determine whether there is a potential benefit in integrating the alignment score into the decision making. The correlation coefficient is computed as follows:

$$r = \frac{\sum_j \left(S(X_j)-\bar{S}\right)\left(d_a(X_j)-\bar{d_a}\right)}{\sqrt{\sum_j \left(S(X_j)-\bar{S}\right)^2 \, \sum_j \left(d_a(X_j)-\bar{d_a}\right)^2}},$$

where $X_j$ is
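For concreteness, a correlation coefficient of this kind can be computed as in the sketch below. It assumes the formula is the standard Pearson correlation (with the square-root normalization in the denominator) and takes two equal-length sequences holding one acoustical score and one alignment distortion per utterance; the function and argument names are hypothetical.

```python
import numpy as np

def correlation(scores, distortions):
    """Pearson correlation between acoustical scores and alignment distortions.

    scores:      one acoustical score S(X_j) per utterance
    distortions: one alignment distortion d_a(X_j) per utterance
    """
    s = np.asarray(scores, dtype=float)
    d = np.asarray(distortions, dtype=float)
    # Center both sequences on their means.
    s_c = s - s.mean()
    d_c = d - d.mean()
    # Covariance term divided by the product of the two spreads.
    return float(np.sum(s_c * d_c) / np.sqrt(np.sum(s_c ** 2) * np.sum(d_c ** 2)))
```

A value of r close to zero would suggest that the two measures carry largely complementary information, while a value close to 1 or -1 would suggest that one adds little to the other.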

Conclusion

In this paper, we propose a normalization framework for HMM-based text-dependent speaker verification in which the claimed-speaker model score and the background model score are computed for a common alignment made on a speaker-independent model of the password. Such a system preserves part of the speaker-specific information contained in the alignment and makes the normalized score more consistent. This approach focuses on a frame-by-frame comparison between speaker model and background


Cited by (8)

  • Text-dependent speaker verification: Classifiers, databases and RSR2015

    2014, Speech Communication
    Citation Excerpt :

    With HMM, granularity of models can be tailor-made to represent the temporal structure of the speech utterances. Systems based on phone models offer the finest granularity and thus can be used for any lexical content (Matsui and Furui, 1993; Che et al., 1996; Charlet and Jouvet, 1997; Nakagawa et al., 2004) while HMMs modeling words (Rosenberg et al., 1991; Yoma and Pegoraro, 2002; Kato and Shimizu, 2003) or entire utterances (Rosenberg et al., 2000; Forsyth, 1995; Subramanya et al., 2007; Charlet et al., 2000; Larcher et al., 2013b), which granularity is less, are restrained to limited lexicon. Research is also carried out to improve the robustness of such models to channel and speaker variability.

  • A Study of Voice Print Recognition Technology

    2021, 2021 International Wireless Communications and Mobile Computing, IWCMC 2021
  • Speaker Verification Systems: A Comprehensive Review

    2020, Advances in Intelligent Systems and Computing
  • Combining spectral and prosodic features in HMM-based single utterance speaker verification

    2015, BIOSIGNALS 2015 - 8th International Conference on Bio-Inspired Systems and Signal Processing, Proceedings; Part of 8th International Joint Conference on Biomedical Engineering Systems and Technologies, BIOSTEC 2015
  • Imposture classification for text-dependent speaker verification

    2014, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
  • Neural network based speaker verification for security systems

    2012, 2012 20th Telecommunications Forum, TELFOR 2012 - Proceedings