A unified framework of HMM adaptation with joint compensation of additive and convolutive distortions

https://doi.org/10.1016/j.csl.2009.02.001

Abstract

In this paper, we present our recent development of a model-domain environment-robust adaptation algorithm, which demonstrates high performance on the standard Aurora 2 speech recognition task. The algorithm consists of two main steps. First, the noise and channel parameters are estimated using multiple sources of information, including a nonlinear environment-distortion model in the cepstral domain, the posterior probabilities of all the Gaussians in the speech recognizer, and the truncated vector Taylor series (VTS) approximation. Second, the estimated noise and channel parameters are used to adapt the static and dynamic portions (delta and delta–delta) of the HMM means and variances. This two-step algorithm enables joint compensation of both additive and convolutive distortions (JAC). The hallmark of our new approach is the use of a nonlinear, phase-sensitive model of acoustic distortion that captures the phase asynchrony between the clean speech and the mixing noise.

In the experimental evaluation on the standard Aurora 2 task, the proposed Phase-JAC/VTS algorithm achieves 93.32% word accuracy using the clean-trained complex HMM backend as the baseline system for unsupervised model adaptation. This represents high recognition performance on this task without discriminative training of the HMM system. The experimental results show that the phase term, which was missing in all previous HMM adaptation work, contributes significantly to the high recognition accuracy achieved.

Introduction

Environment robustness in speech recognition remains an outstanding and difficult problem despite many years of research and investment (Peinado and Segura, 2006). The difficulty arises from the many possible types of distortion, including additive and convolutive distortions and their mixtures, which are not easy to predict accurately during recognizer development. As a result, a speech recognizer trained on clean speech often degrades significantly in noisy environments if no compensation is applied (Lee, 1998; Gong, 1995).

Many methodologies have been proposed for environment robustness in speech recognition over the past two decades. There are two main classes of approaches. In the first, feature-domain class, where no HMM information is exploited, the distorted speech features are enhanced with advanced signal processing methods. Spectral subtraction (SS) (Boll, 1979) is widely used as a simple technique to reduce additive noise in the spectral domain. Cepstral mean normalization (CMN) (Atal, 1974) removes the mean vector of the acoustic features of the utterance in order to reduce or eliminate the convolutive channel effect. As an extension of CMN, cepstral variance normalization (CVN) (Molau et al., 2003) also adjusts the feature variance to improve automatic speech recognition (ASR) robustness. Relative spectra (RASTA) (Hermansky and Morgan, 1994) employs a long span of the speech signal in order to remove or reduce the acoustic distortion. All these traditional feature-domain methods are relatively simple and have been shown to achieve a medium level of distortion reduction. In recent years, new feature-domain methods have been proposed that use more advanced signal processing techniques to achieve more significant performance improvements on noise-robust ASR tasks than the traditional methods. Examples include feature-space nonlinear transformation techniques (Molau et al., 2003; Padmanabhan and Dharanipragada, 2001), the ETSI advanced front end (AFE) (Macho et al., 2002), and stereo-based piecewise linear compensation for environments (SPLICE) (Deng et al., 2000). In Padmanabhan and Dharanipragada (2001), a piecewise-linear approximation to a nonlinear transformation is used to map the features in the training space to the testing space. This is extended in Molau et al. (2003), in combination with other normalization techniques such as feature-space rotation and vocal tract length normalization, to obtain satisfactory results. AFE (Macho et al., 2002) integrates several noise-robustness methods, removing additive noise with two-stage Mel-warped Wiener filtering (Agarwal and Cheng, 1999) and SNR-dependent waveform processing (Macho and Cheng, 2001), and mitigating the channel effect with blind equalization (Mauuary, 1998). SPLICE (Deng et al., 2000) assumes that the distorted cepstrum is distributed according to a mixture of Gaussians, and cleans it by removing a correction vector determined by the parameters of these Gaussians. Although these feature-based algorithms obtain satisfactory results, they usually perform worse than model-based algorithms, which exploit the power of the acoustic models.
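As a concrete illustration of the simplest of these feature-domain techniques, the following minimal sketch implements per-utterance CMN and CVN; the function and variable names are ours, not taken from any of the cited systems.

```python
import numpy as np

def cmn_cvn(cepstra: np.ndarray, apply_cvn: bool = True) -> np.ndarray:
    """Per-utterance cepstral mean (and variance) normalization.

    cepstra: (num_frames, num_coeffs) matrix of MFCC features.
    Subtracting the utterance mean removes a stationary convolutive
    (channel) offset; dividing by the standard deviation additionally
    equalizes the feature spread across environments.
    """
    normalized = cepstra - cepstra.mean(axis=0)      # CMN: remove channel offset
    if apply_cvn:
        std = cepstra.std(axis=0)
        normalized /= np.maximum(std, 1e-8)          # CVN: guard against zero std
    return normalized
```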

The other, model-based class of techniques operates in the model (HMM) domain, adapting or adjusting the model parameters so that the system becomes better matched to the distorted environment. The most straightforward way is to train models on the distorted speech, but it is usually expensive to acquire sufficient amounts of distorted speech signals. Hence, multi-style training (Lippmann et al., 1987) adds different kinds of distortion to clean speech signals and trains models on these artificially distorted signals. However, this method requires knowledge of all the distorted environments and needs model retraining. To overcome these difficulties, model-domain methods have been developed that directly adapt models trained on clean speech to the distorted environments. The signal bias removal method (Rahim and Juang, 1996) estimates the channel mean in a maximum likelihood estimation (MLE) manner and removes this channel mean from the Gaussian means in the HMMs. Maximum likelihood linear regression (MLLR) (Leggetter and Woodland, 1995; Cui and Alwan, 2005; Saon et al., 2001a) has also been used to adapt the clean-trained model to the distorted environments. However, to achieve better performance the MLLR method often requires significantly more than one transformation matrix, which inevitably results in demanding requirements on the amount of adaptation data. Further, the parallel model combination (PMC) method (Gales and Young, 1992) relies on one set of speech models and another set of noise models to achieve model adaptation using approximate log-normal distributions. Channel distortion is not considered in the basic PMC framework; as an extension, PMC was generalized to address both the noise and channel distortions in Gales (1995).
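For illustration, a single global MLLR mean transform is simply an affine map applied to every Gaussian mean. The sketch below (our own, with hypothetical names) shows only the application step; estimating A and b by maximum likelihood from adaptation data is where the data requirement discussed above arises.

```python
import numpy as np

def mllr_adapt_mean(mu: np.ndarray, A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Apply a global MLLR mean transform: mu_adapted = A @ mu + b.

    With multiple regression classes, each class of Gaussians gets
    its own (A, b) pair, improving accuracy at the cost of needing
    more adaptation data to estimate the extra parameters.
    """
    return A @ mu + b
```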

Differing from the model-domain adaptation methods discussed above, methods for joint compensation of additive and convolutive distortions (JAC) (Moreno, 1996; Kim et al., 1998; Acero et al., 2000; Gong, 2005) have shown their advantages by using a distortion model for noise and channel together with a linearized vector Taylor series (VTS) approximation. The JAC-based algorithm proposed in Moreno (1996) used VTS directly to estimate the noise and channel means but adapted the features instead of the models; no dynamic (delta and delta–delta) portions of the features were compensated either. The work in Acero et al. (2000), on the other hand, proposed a framework to adjust both the static and dynamic portions of the HMM parameters given known noise and channel parameters. However, while it was mentioned in Acero et al. (2000) that the iterative expectation maximization (EM) algorithm (Dempster et al., 1977) can be used to estimate the noise and channel parameters, no actual algorithm was developed or reported.
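For reference, the nonlinear distortion model underlying this line of JAC/VTS work, and its first-order VTS linearization, can be written as follows (our transcription of the standard formulation, e.g., Acero et al. (2000); x, n, and h denote the clean-speech, noise, and channel cepstra, C the truncated DCT matrix, and the exponential and logarithm are applied element-wise):

```latex
% Phase-insensitive distortion model in the cepstral domain
y = x + h + C \log\!\left(1 + e^{\,C^{-1}(n - x - h)}\right)
  \;\triangleq\; x + h + g(x, n, h)

% First-order VTS expansion around (\mu_x, \mu_n, \mu_h),
% with Jacobian G = \partial y / \partial x at the expansion point
y \approx \mu_x + \mu_h + g(\mu_x, \mu_n, \mu_h)
  + G\,(x - \mu_x) + G\,(h - \mu_h) + (I - G)\,(n - \mu_n)
```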

A similar JAC-based model adaptation method was proposed in Kim et al. (1998), where both the static mean and variance parameters in the cepstral domain are adjusted using the VTS approximation technique. In that work, however, noise was estimated on a frame-by-frame basis. This process is complex and computationally costly, and the resulting estimates may not be reliable. Furthermore, no adaptation was made for the dynamic (delta) portions of the HMM parameters, which is known to be important for high-performance robust speech recognition.

The JAC method developed in Gong (2005) directly estimates the noise and channel distortion parameters in the log-spectral domain, adjusts the acoustic HMM parameters in the same domain, and then converts the parameters to the cepstral domain. However, no strategy for HMM variance adaptation is given in Gong (2005), and the techniques for estimating the distortion parameters involve a number of approximations, as analyzed later in this paper.

Finally, the JAC method in Liao and Gales (2006) also adapts all the static and dynamic HMM parameters. A recent study on uncertainty decoding (Liao and Gales, 2007) likewise aimed to jointly compensate for the additive and convolutive distortions.

In all the previous JAC/VTS work on HMM adaptation, the environment-distortion model makes the simplifying assumption of instantaneous phase synchrony (i.e., it is phase-insensitive) between the clean speech and the mixing noise. This assumption was relaxed in the work reported in Deng et al. (2004), where a new phase term was introduced to account for the random nature of the phase asynchrony. It was further shown in Deng et al. (2004) that, when the noise magnitude is estimated accurately, the Gaussian-distributed phase term plays a key role in recovering clean speech features by removing the noise and the cross term between the noise and the speech.
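In our notation, the phase-sensitive model of Deng et al. (2004) can be summarized as follows (our transcription; α is the vector of per-filter-bank weighted cosines of the phase angle between speech and noise, treated as a random variable, and ⊙ denotes the element-wise product):

```latex
% Power-spectral domain, per frequency bin
% (\theta is the phase angle between speech and noise)
|Y|^2 = |X|^2 |H|^2 + |N|^2 + 2\,\alpha\,|X|\,|H|\,|N|,
\qquad \alpha = \cos\theta

% Equivalent cepstral-domain form; setting \alpha = 0
% recovers the phase-insensitive model above
y = x + h + C \log\!\left(1 + e^{\,C^{-1}(n - x - h)}
    + 2\,\alpha \odot e^{\,\frac{1}{2} C^{-1}(n - x - h)}\right)
```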

However, in contrast to the JAC/VTS approach, which implements robustness in the model (HMM) domain, the approach of Deng et al. (2004) was implemented in the feature domain (i.e., feature enhancement instead of HMM adaptation), producing inferior recognition results compared with the model-domain approach despite the use of a more accurate environment-distortion model (phase-sensitive versus phase-insensitive).

The research presented in this paper extends and integrates our earlier two lines of work: HMM adaptation with the phase-insensitive environment-distortion model (Acero et al., 2000; Li et al., 2007) and feature enhancement with the phase-sensitive environment-distortion model (Deng et al., 2004). The new algorithm developed and presented in this paper implements environment robustness via HMM adaptation while taking into account the phase asynchrony between the clean speech and the mixing noise. That is, it incorporates the same phase term as Deng et al. (2004) into the rigorous JAC/VTS formulation of Li et al. (2007). We hence name our new algorithm Phase-JAC/VTS. In this work, both the static and dynamic means and variances of the noise vector and the mean vector of the channel are rigorously estimated on an utterance-by-utterance basis using VTS. In addition to the novel phase-sensitive model adaptation, our algorithm differs from previous JAC methods in two respects: the dynamic noise mean estimation and the noise variance estimation.
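To make the adaptation step concrete, the following numpy sketch applies the static mean and variance updates for one Gaussian under the phase-insensitive special case (α = 0), assuming the Jacobian G has already been computed. This is our illustration of the published update equations, not the authors' code; off-diagonal covariance terms are discarded, as is common in this line of work.

```python
import numpy as np

def adapt_static_gaussian(mu_x, var_x, mu_n, var_n, mu_h, G, C, C_inv):
    """Adapt one Gaussian's static mean/variance under a VTS linearization.

    mu_x, var_x : clean-trained mean and diagonal covariance (vectors)
    mu_n, var_n : current noise mean and diagonal covariance (vectors)
    mu_h        : current channel mean (treated as deterministic)
    G           : Jacobian dy/dx at the expansion point (mu_x, mu_n, mu_h)
    C, C_inv    : truncated DCT matrix and its pseudo-inverse
    """
    # g(.) term of the distortion model evaluated at the expansion point
    g = C @ np.log1p(np.exp(C_inv @ (mu_n - mu_x - mu_h)))
    mu_y = mu_x + mu_h + g                          # adapted static mean
    I_minus_G = np.eye(G.shape[0]) - G
    # Adapted covariance: speech term mapped by G, noise term by (I - G)
    cov_y = (G @ np.diag(var_x) @ G.T
             + I_minus_G @ np.diag(var_n) @ I_minus_G.T)
    return mu_y, np.diag(cov_y)                     # keep diagonal only
```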

The rest of the paper is organized as follows. In Section 2, we present our new Phase-JAC/VTS algorithm and its implementation steps. Experimental evaluation of the algorithm is provided in Section 3, where we show that our new algorithm can achieve 93.32% word recognition accuracy averaged over all distortion conditions on the Aurora 2 task with the standard complex back-end, clean-trained model and standard MFCCs. We summarize our study and draw conclusions in Section 4.


JAC/VTS adaptation algorithm

In this section, we first derive the adaptation formulas for the HMM means and variances in the MFCC domain (both static and dynamic) using the VTS approximation, assuming that the estimates of the additive and convolutive distortion parameters are known. We then derive the algorithm that jointly estimates the additive and convolutive distortion parameters based on the VTS approximation. A summary then describes the implementation steps of the entire algorithm as used in our experiments.
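To preview the flavor of the joint estimation step (the exact expressions, including the phase term, are derived in this section), the EM re-estimation of the static noise mean in the earlier phase-insensitive formulation (Li et al., 2007) takes a weighted least-squares form over Gaussians m and frames t, with occupancy posteriors γ_t(m). The form below is our paraphrase; the precise Phase-JAC/VTS variant differs in its Jacobians:

```latex
\hat{\mu}_n = \mu_{n,0}
  + \Big[ \sum_{t,m} \gamma_t(m)\,(I - G_m)^{\!\top} \Sigma_{y,m}^{-1} (I - G_m) \Big]^{-1}
    \sum_{t,m} \gamma_t(m)\,(I - G_m)^{\!\top} \Sigma_{y,m}^{-1}\,\big(y_t - \mu_{y,m}\big)
```

where μ_{n,0} is the previous noise-mean estimate, and μ_{y,m}, Σ_{y,m} are the adapted mean and covariance of Gaussian m from the previous iteration.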

Experiments

The effectiveness of the Phase-JAC/VTS algorithm presented in Section 2 has been evaluated on the standard Aurora 2 task of recognizing digit strings in noise- and channel-distorted environments. The clean training set, which consists of 8440 clean utterances, is used to train the baseline MLE HMMs. The test material consists of three sets of distorted utterances. The data in set-A and set-B contain eight different types of additive noise, while set-C contains two different types of noise plus additional convolutive (channel) distortion.
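For the unsupervised, utterance-by-utterance adaptation used here, the distortion parameters need a starting point before the EM iterations. Below is a minimal sketch of a common initialization, assuming the leading and trailing frames of each utterance are noise-only; the specific frame count is our assumption, not a quotation from the paper.

```python
import numpy as np

def init_distortion_params(cepstra: np.ndarray, n_edge: int = 20):
    """Initialize noise mean/variance from the utterance edges.

    cepstra: (num_frames, num_coeffs) MFCC matrix for one utterance.
    The first and last n_edge frames are treated as noise-only; the
    channel mean is started at zero and re-estimated by EM afterwards.
    """
    edges = np.vstack([cepstra[:n_edge], cepstra[-n_edge:]])
    mu_n = edges.mean(axis=0)
    var_n = edges.var(axis=0)
    mu_h = np.zeros_like(mu_n)
    return mu_n, var_n, mu_h
```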

Conclusion

In this paper, we have presented our recent development of the Phase-JAC/VTS algorithm for HMM adaptation and demonstrated its effectiveness on the standard Aurora 2 environment-robust speech recognition task. The algorithm consists of two main steps. First, the noise and channel parameters are estimated using a nonlinear environment-distortion model in the cepstral domain, the speech recognizer's "feedback" information (the posterior probabilities of all the Gaussians in the speech recognizer), and the truncated VTS approximation. Second, the estimated noise and channel parameters are used to adapt both the static and dynamic portions of the HMM means and variances.

Acknowledgements

We would like to thank Dr. Jasha Droppo at Microsoft Research for his help in setting up the experimental platform. We also thank the anonymous reviewers for their suggestions, which improved the quality of the paper.

References (38)

  • Cui, X., Alwan, A., 2005. Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR. IEEE Trans. Speech Audio Process.
  • Dempster, A., et al., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B.
  • Deng, L., 2007. Roles of high-fidelity acoustic modeling in robust speech recognition. In: Proc. IEEE ASRU. pp. ...
  • Deng, L., Acero, A., Plumpe, M., Huang, X., 2000. Large vocabulary speech recognition under adverse acoustic...
  • Deng, L., et al., 2004. Enhancement of log-spectra of speech using a phase-sensitive model of the acoustic environment. IEEE Trans. Speech Audio Process.
  • Gales, M.J.F., 1995. Model-Based Techniques for Noise Robust Speech Recognition. Ph.D. Thesis, Cambridge...
  • Gales, M.J.F., Young, S., 1992. An improved approach to the hidden Markov model decomposition of speech and noise. In:...
  • Gong, Y., 2005. A method of joint compensation of additive and convolutive distortions for speaker-independent speech recognition. IEEE Trans. Speech Audio Process.
  • Gopinath, R.A., Gales, M.J.F., Gopalakrishnan, P.S., Balakrishnan-Aiyer, S., Picheny, M.A., 1995. Robust speech...