Elsevier

Speech Communication

Volume 22, Issue 4, September 1997, Pages 369-384

Telephone speech recognition based on Bayesian adaptation of hidden Markov models

https://doi.org/10.1016/S0167-6393(97)00033-2

Abstract

This paper presents a method for adapting speech hidden Markov models (HMMs) for telephone speech recognition. Our goal is to automatically adapt the HMM parameters so that they match the telephone environment. In this study, two kinds of transformation-based adaptation are investigated. One is the bias transformation and the other is the affine transformation. A Bayesian estimation technique which incorporates prior knowledge into the transformation is applied for estimating the transformation parameters. Experiments show that the proposed approach can be successfully employed for self adaptation as well as supervised adaptation. In addition, the performance of telephone speech recognition using Bayesian adaptation is shown to be superior to that using maximum-likelihood adaptation. The affine transformation is also demonstrated to be significantly better than the bias transformation.

Summary

This article presents a method for adapting hidden Markov models (HMMs) for telephone speech recognition. Our goal is to automatically adapt the HMM parameters to the telephone environment. In this article, two types of transformation-based adaptation are studied. One is the bias transformation and the other is the affine transformation. To estimate the transformation parameters, a Bayesian estimation technique that incorporates prior knowledge into the transformation is applied. The experiments show that the proposed approach can be applied successfully to both self adaptation and supervised adaptation. Moreover, the performance of telephone speech recognition using Bayesian adaptation is shown to be superior to that using maximum-likelihood adaptation. Finally, the affine transformation is also shown to be markedly more effective than the bias transformation.

Introduction

Because telecommunication technology has grown rapidly in recent years, people can easily request information or make reservations through telephone services. To automate telephone services, it is essential to develop speech recognition techniques that are robust in telephone environments (Juang, 1991; Gong, 1995; Lee, 1997). In telephone networks, a major problem for speech recognition is the acoustic mismatch between training and testing environments. The mismatch due to speaker, telephone handset, transmission line and ambient noise usually causes serious degradation of recognition performance. To improve performance, a large amount of telephone data covering all the acoustic variabilities of telephone environments would have to be collected to train robust speech models. However, this approach is impractical in real applications. Thus, several robust algorithms, such as codeword-dependent cepstral normalization (CDCN) (Acero and Stern, 1990), the relative spectral (RASTA) method (Hermansky and Morgan, 1994), and signal bias removal (SBR) (Rahim and Juang, 1996), were developed to reduce the acoustic mismatch between training and testing environments without model retraining. In addition, a practical approach to telephone speech recognition is to adapt a given set of speech hidden Markov models (HMMs) so that the adapted HMM parameters are acoustically close to the real telephone environment. Using the adapted HMM parameters, the recognition performance can be significantly improved. In practice, adaptation can be employed in two ways: (1) self adaptation, and (2) supervised adaptation. In self adaptation, the HMM parameters are adapted on the testing data in an unsupervised manner (Chien et al., 1995a; Sankar and Lee, 1996). In supervised adaptation, the HMM parameters are adapted to a new speaker and telephone channel by using adaptation data for which the true transcriptions are given (Takahashi and Sagayama, 1994; Takagi et al., 1995). Telephone speech is then recognized without further adaptation in the testing phase. Speaker adaptation techniques, which adapt existing speaker-independent (SI) HMM parameters to a new speaker, are also feasible for telephone speech recognition, because the difference between speakers (i.e., vocal tract systems) may be treated as equivalent to the channel mismatch of telephone utterances.

Under the assumption of a linear channel mismatch, the channel mismatch can be approximately characterized by a cepstral bias. Therefore, a bias transformation is usually applied, adapting the HMM parameters by adding a bias (Cox and Bridle, 1989). However, the bias transformation may be insufficient for modeling the variabilities of telephone environments. Thus, a more sophisticated adaptation using the affine transformation (also called the linear regression transformation) has been developed. Under the affine transformation, the HMM parameters are linearly scaled and then shifted by a bias. In (Digalakis et al., 1995), a constrained transformation of HMM parameters was presented for speaker adaptation. Their constrained transformation took the form of an affine transformation. In (Leggetter and Woodland, 1995), a maximum likelihood linear regression (MLLR) approach was proposed for adapting the continuous-density HMM (CDHMM) parameters. They employed maximum likelihood (ML) theory to calculate the linear regression transformation that adapted the HMM mean vectors. Furthermore, an ML-based stochastic matching method for decreasing the acoustic mismatch between testing features and given HMM parameters was presented (Sankar and Lee, 1996). In their study, the affine transformation was used as a feature transformation function for transforming the testing features to match the given HMM parameters.
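The difference between the two transformations described above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy mean vectors, and the values of the bias and scaling matrix are all hypothetical, and only the mean vectors are transformed here.

```python
import numpy as np

def bias_transform(means, b):
    """Shift every mean vector by the same bias: y = x + b."""
    return means + b

def affine_transform(means, A, b):
    """Scale each mean vector by A, then shift it by b: y = A x + b."""
    # means has one mean vector per row, so A is applied as means @ A.T.
    return means @ A.T + b

# Hypothetical toy values: three 2-dimensional HMM mean vectors.
means = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]])
b = np.array([0.5, -0.5])                 # assumed cepstral bias
A = np.array([[1.1, 0.0], [0.0, 0.9]])    # assumed scaling matrix

shifted = bias_transform(means, b)
scaled_and_shifted = affine_transform(means, A, b)
```

With A set to the identity matrix, affine_transform gives the same result as bias_transform, which is one way to see the bias transformation as a special case of the affine one.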

As mentioned above, the parameters of transformation-based adaptation in previous works were estimated via ML theory and have been successfully applied to model adaptation. However, if the prior knowledge of the transformation parameters can be adequately incorporated into the adaptation, the recognition performance may be further improved. Accordingly, we are motivated to apply maximum a posteriori (MAP) theory for estimating the transformation parameters and to use them for adapting the HMM parameters to a target telephone environment. Using MAP theory, the transformation parameters are optimally estimated by maximizing the posterior density, which consists of a likelihood function and a prior density. In this study, a bias transformation $y = x + b_1$ with parameter $\theta_b = b_1$ and an affine transformation $y = Ax + b_2$ with parameters $\theta_a = (A, b_2)$ are considered as the model transformation functions for transforming the original sampled data $x$ into its adapted version $y$. In general, the estimated scaling matrix $A$ should differ from the identity matrix, so that additional unknown effects appearing in telephone utterances can be compensated. In the experiments, we compare the performance of ML adaptation and MAP adaptation. The bias transformation and the affine transformation serve as the transformation functions. Results show that the best performance is achieved by applying MAP adaptation with the affine transformation. As the amount of adaptation data increases, ML adaptation and MAP adaptation achieve comparable performance. We also find that the proposed approach is applicable to both self adaptation and supervised adaptation. By performing self adaptation and supervised adaptation simultaneously, the performance can be further improved.
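The relation between ML and MAP estimation can be made concrete with a deliberately simplified sketch. It is not the derivation of Section 3, which is not reproduced here. It assumes that each adaptation frame yields a residual (observation minus the aligned model mean) that is Gaussian around the unknown bias with a known per-dimension variance, and that the prior on the bias is also Gaussian. The names ml_bias and map_bias and all numeric values are hypothetical.

```python
import numpy as np

def ml_bias(residuals):
    """ML estimate of the bias: the sample mean of the residuals."""
    return residuals.mean(axis=0)

def map_bias(residuals, noise_var, prior_mean, prior_var):
    """MAP estimate of the bias under the assumed Gaussian likelihood and prior.

    Maximizing likelihood times prior, per dimension, gives
    (prior_var * sum(r) + noise_var * prior_mean) / (n * prior_var + noise_var).
    """
    n = residuals.shape[0]
    return (prior_var * residuals.sum(axis=0) + noise_var * prior_mean) / (
        n * prior_var + noise_var
    )

prior_mean = np.array([0.0, 0.0])       # assumed prior mean of the bias
few = np.array([[2.0, 0.0]])            # a single adaptation frame
many = np.tile([2.0, 0.0], (1000, 1))   # 1000 identical frames

print(ml_bias(few))                          # ML follows the data alone
print(map_bias(few, 1.0, prior_mean, 1.0))   # MAP is pulled toward the prior mean
print(map_bias(many, 1.0, prior_mean, 1.0))  # with more data, MAP approaches ML
```

In this toy setting the MAP estimate is a weighted combination of the prior mean and the sample mean, so the influence of the prior fades as frames accumulate. This is consistent with the observation that ML and MAP adaptation become comparable when more adaptation data is available.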

This paper is organized as follows. In Section 2, the adaptation approaches using bias and affine transformations are described. In Section 3, the formulas for ML and MAP estimation of the various transformation parameters are derived. In Section 4, we describe the experimental setup and databases. In Section 5, the estimation of prior parameters and the histograms of transformation parameters are illustrated. Experiments on telephone speech recognition using three adaptation techniques are reported in Section 6. Finally, the conclusions are given in Section 7.

Section snippets

Transformation functions for model adaptation

When the testing data do not match a given set of HMM parameters, two approaches are available for reducing the mismatch. One is to remove the sources of mismatch from the testing data, e.g., by feature bias removal (Chien et al., 1995b; Rahim and Juang, 1996). The other is to adapt the given HMM parameters to the testing environment. In (Chien et al., 1996; Sankar and Lee, 1996), experiments were reported in which the performance of model adaptation was better than that of

Bayesian estimation technique

According to the MAP principle, the transformation parameters are estimated by maximizing the posterior density, which is composed of a likelihood function and a prior density. If the prior knowledge of the transformation parameters is deterministic or noninformative, i.e., the prior density equals a constant, MAP estimation reduces to ML estimation, which maximizes only the likelihood function. Theoretically, if the observed data are limited and the prior statistics are reliable,

Experimental setup and databases

Our task is to recognize Mandarin speech (Chien, 1997). Mandarin is a syllabic and tonal language. Each Chinese character corresponds to a Mandarin syllable. Without considering tonal information, the total number of Mandarin syllables is 408. In general, each Mandarin syllable is composed of an initial part and a final part. Some Mandarin syllables have only a final part. For these syllables, a null initial needs to be modeled. The initial part corresponds to a consonant and the

Prior statistics

In theory, the prior statistics can be empirically estimated from a large amount of speech data covering all the variabilities of model adaptation, which makes the resulting MAP estimation more reliable. However, it is not easy to collect enough training data in real applications. Thus, we sampled 80 telephone utterances from DB3 to estimate the prior statistics. To extensively cover the acoustical properties of speakers, each sampled utterance was spoken by a different speaker. These 80

Experimental results

A multispeaker speech recognition task for 250 Chinese names was conducted to demonstrate the performance of the proposed method. In the experiments, the clean speech models serve as the model parameters. The speech recognizer without model adaptation is referred to as the baseline system. The results of the cepstral mean normalization (CMN) method (Atal, 1974; Furui, 1981) and of the matched condition are also included for comparison. As shown in Table 1, the recognition rate improves from the baseline

Conclusion

We propose a transformation-based adaptation technique within the framework of MAP adaptation and successfully apply it to telephone speech recognition. In this study, the bias transformation and the affine transformation serve as the transformation functions. The parameters of the transformation functions are derived such that the posterior density is maximized. To demonstrate the performance of our method, we conduct a series of experiments for comparative studies. The results of ML

Acknowledgements

The authors acknowledge the substantial contribution of Dr. Chin-Hui Lee, Head of Dialogue Systems Research Department at Bell Laboratories, Murray Hill, USA. They also thank the anonymous reviewers for their critical comments. This work has been partially supported by the Telecommunication Laboratory, Chunghwa Telecom, Taiwan, ROC, under contract TL-85-5203.

References (33)

  • Chien, J.T., Lee, L.M., Wang, H.C., 1996. Estimation of channel bias for telephone speech recognition. In: Proc....
  • Chien, J.T., Wang, H.C., Lee, C.H., 1997. Bayesian affine transformation of HMM parameters for instantaneous and...
  • Dempster, A.P., et al., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B.
  • Cox, S.J., Bridle, J.S., 1989. Unsupervised speaker adaptation by probabilistic spectrum fitting. In: IEEE Proc....
  • Digalakis, V.V., et al., 1995. Speaker adaptation using constrained estimation of Gaussian mixtures. IEEE Trans. Speech Audio Process.
  • Furui, S., 1981. Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoust. Speech Signal Process.

    1

    Jen-Tzung Chien is now an assistant professor at Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, ROC.
