A unified framework of HMM adaptation with joint compensation of additive and convolutive distortions

https://doi.org/10.1016/j.csl.2009.02.001

Abstract

In this paper, we present our recent development of a model-domain environment-robust adaptation algorithm, which demonstrates high performance on the standard Aurora 2 speech recognition task. The algorithm consists of two main steps. First, the noise and channel parameters are estimated using multiple sources of information, including a nonlinear environment-distortion model in the cepstral domain, the posterior probabilities of all the Gaussians in the speech recognizer, and the truncated vector Taylor series (VTS) approximation. Second, the estimated noise and channel parameters are used to adapt the static and dynamic portions (delta and delta–delta) of the HMM means and variances. This two-step algorithm enables joint compensation of both additive and convolutive distortions (JAC). The hallmark of our new approach is the use of a nonlinear, phase-sensitive model of acoustic distortion that captures the phase asynchrony between the clean speech and the mixing noise.

In the experimental evaluation on the standard Aurora 2 task, the proposed Phase-JAC/VTS algorithm achieves 93.32% word accuracy using the clean-trained complex HMM backend as the baseline system for unsupervised model adaptation. This represents high recognition performance on this task without discriminative training of the HMM system. The experimental results show that the phase term, which was missing in all previous HMM adaptation work, contributes significantly to the high recognition accuracy achieved.

Introduction

Environment robustness in speech recognition remains an outstanding and difficult problem despite many years of research and investment (Peinado and Segura, 2006). The difficulty arises from the many possible types of distortion, including additive and convolutive distortions and their mixtures, which are not easy to predict accurately during recognizer development. As a result, a speech recognizer trained on clean speech often degrades significantly in noisy environments if no compensation is applied (Lee, 1998; Gong, 1995).

Many methodologies have been proposed for environment robustness in speech recognition over the past two decades. There are two main classes of approaches. In the first, feature-domain class, where no HMM information is exploited, the distorted speech features are enhanced with advanced signal processing methods. Spectral subtraction (SS) (Boll, 1979) is widely used as a simple technique to reduce additive noise in the spectral domain. Cepstral mean normalization (CMN) (Atal, 1974) removes the mean vector of the acoustic features of the utterance in order to reduce or eliminate the convolutive channel effect. As an extension of CMN, cepstral variance normalization (CVN) (Molau et al., 2003) also adjusts the feature variance to improve automatic speech recognition (ASR) robustness. Relative spectra (RASTA) (Hermansky and Morgan, 1994) employs a long span of the speech signal in order to remove or reduce the acoustic distortion. All these traditional feature-domain methods are relatively simple and have been shown to achieve a medium level of distortion reduction. In recent years, new feature-domain methods have been proposed that use more advanced signal processing techniques to achieve more significant performance improvements on noise-robust ASR tasks than the traditional methods. Examples include feature-space nonlinear transformation techniques (Molau et al., 2003; Padmanabhan and Dharanipragada, 2001), the ETSI advanced front end (AFE) (Macho et al., 2002), and stereo-based piecewise linear compensation for environments (SPLICE) (Deng et al., 2000). In Padmanabhan and Dharanipragada (2001), a piecewise-linear approximation to a nonlinear transformation is used to map the features in the training space to the testing space. This is extended in Molau et al. (2003), in combination with other normalization techniques such as feature-space rotation and vocal tract length normalization, to obtain satisfactory results. AFE (Macho et al., 2002) integrates several noise-robustness methods, removing additive noise with two-stage Mel-warped Wiener filtering (Agarwal and Cheng, 1999) and SNR-dependent waveform processing (Macho and Cheng, 2001), and mitigating the channel effect with blind equalization (Mauuary, 1998). SPLICE (Deng et al., 2000) assumes that the distorted cepstrum is distributed according to a mixture of Gaussians, and cleans it by removing a correction vector determined by the parameters of these Gaussians. Although these feature-based algorithms obtain satisfactory results, they usually perform worse than model-based algorithms, which exploit the power of the acoustic models.
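As a concrete illustration of the simplest of these feature-domain techniques, the following minimal sketch implements per-utterance CMN and CVN; the function and variable names are ours, not taken from any of the cited systems.

```python
import numpy as np

def cmn_cvn(cepstra: np.ndarray, apply_cvn: bool = True) -> np.ndarray:
    """Per-utterance cepstral mean (and variance) normalization.

    cepstra: (num_frames, num_coeffs) matrix of MFCC features.
    Subtracting the utterance mean removes a stationary convolutive
    (channel) offset; dividing by the standard deviation additionally
    equalizes the feature spread across environments.
    """
    normalized = cepstra - cepstra.mean(axis=0)      # CMN: remove channel offset
    if apply_cvn:
        std = cepstra.std(axis=0)
        normalized /= np.maximum(std, 1e-8)          # CVN: guard against zero std
    return normalized
```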

The other, model-based class of techniques operates in the model (HMM) domain, adapting or adjusting the model parameters so that the system becomes better matched to the distorted environment. The most straightforward way is to train models on the distorted speech, but it is usually expensive to acquire sufficient amounts of distorted speech signals. Hence, multi-style training (Lippmann et al., 1987) adds different kinds of distortion to clean speech signals and trains models on these artificially distorted signals. However, this method requires knowledge of all the distorted environments and needs model retraining. To overcome these difficulties, model-domain methods have been developed that directly adapt models trained on clean speech to the distorted environments. The signal bias removal method (Rahim and Juang, 1996) estimates the channel mean in a maximum likelihood estimation (MLE) manner and removes this channel mean from the Gaussian means in the HMMs. Maximum likelihood linear regression (MLLR) (Leggetter and Woodland, 1995; Cui and Alwan, 2005; Saon et al., 2001a) has also been used to adapt the clean-trained model to the distorted environments. However, to achieve better performance the MLLR method often requires significantly more than one transformation matrix, which inevitably results in demanding requirements on the amount of adaptation data. Further, the parallel model combination (PMC) method (Gales and Young, 1992) relies on one set of speech models and another set of noise models to achieve model adaptation using approximate log-normal distributions. Channel distortion is not considered in the basic PMC framework; as an extension, PMC was generalized to address both the noise and channel distortions in Gales (1995).
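For illustration, a single global MLLR mean transform is simply an affine map applied to every Gaussian mean. The sketch below (our own, with hypothetical names) shows only the application step; estimating A and b by maximum likelihood from adaptation data is where the data requirement discussed above arises.

```python
import numpy as np

def mllr_adapt_mean(mu: np.ndarray, A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Apply a global MLLR mean transform: mu_adapted = A @ mu + b.

    With multiple regression classes, each class of Gaussians gets
    its own (A, b) pair, improving accuracy at the cost of needing
    more adaptation data to estimate the extra parameters.
    """
    return A @ mu + b
```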

Differing from the model-domain adaptation methods discussed above, methods for joint compensation of additive and convolutive distortions (JAC) (Moreno, 1996; Kim et al., 1998; Acero et al., 2000; Gong, 2005) have shown their advantages by using a distortion model for noise and channel together with a linearized vector Taylor series (VTS) approximation. The JAC-based algorithm proposed in Moreno (1996) used VTS directly to estimate the noise and channel means but adapted the features instead of the models; no dynamic (delta and delta–delta) portions of the features were compensated either. The work in Acero et al. (2000), on the other hand, proposed a framework to adjust both the static and dynamic portions of the HMM parameters given known noise and channel parameters. However, while it was mentioned in Acero et al. (2000) that the iterative expectation maximization (EM) algorithm (Dempster et al., 1977) can be used to estimate the noise and channel parameters, no actual algorithm was developed or reported.
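For reference, the nonlinear distortion model underlying this line of JAC/VTS work, and its first-order VTS linearization, can be written as follows (our transcription of the standard formulation, e.g., Acero et al. (2000); x, n, and h denote the clean-speech, noise, and channel cepstra, C the truncated DCT matrix, and the exponential and logarithm are applied element-wise):

```latex
% Phase-insensitive distortion model in the cepstral domain
y = x + h + C \log\!\left(1 + e^{\,C^{-1}(n - x - h)}\right)
  \;\triangleq\; x + h + g(x, n, h)

% First-order VTS expansion around (\mu_x, \mu_n, \mu_h),
% with Jacobian G = \partial y / \partial x at the expansion point
y \approx \mu_x + \mu_h + g(\mu_x, \mu_n, \mu_h)
  + G\,(x - \mu_x) + G\,(h - \mu_h) + (I - G)\,(n - \mu_n)
```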

A similar JAC-based model adaptation method was proposed in Kim et al. (1998), where both the static mean and variance parameters in the cepstral domain are adjusted using the VTS approximation technique. In that work, however, noise was estimated on a frame-by-frame basis. This process is complex and computationally costly, and the resulting estimates may not be reliable. Furthermore, no adaptation was made for the dynamic (delta) portions of the HMM parameters, which is known to be important for high-performance robust speech recognition.

The JAC method developed in Gong (2005) directly estimates the noise and channel distortion parameters in the log-spectral domain, adjusts the acoustic HMM parameters in the same domain, and then converts the parameters to the cepstral domain. However, no strategy for HMM variance adaptation is given in Gong (2005), and the techniques for estimating the distortion parameters involve a number of approximations, as analyzed later in this paper.

Finally, the JAC method in Liao and Gales (2006) also adapts all the static and dynamic HMM parameters. A recent study on uncertainty decoding (Liao and Gales, 2007) likewise aimed to jointly compensate for the additive and convolutive distortions.

In all the previous JAC/VTS work on HMM adaptation, the environment-distortion model makes the simplifying assumption of instantaneous phase synchrony (i.e., it is phase-insensitive) between the clean speech and the mixing noise. This assumption was relaxed in the work reported in Deng et al. (2004), where a new phase term was introduced to account for the random nature of the phase asynchrony. It was further shown in Deng et al. (2004) that, when the noise magnitude is estimated accurately, the Gaussian-distributed phase term plays a key role in recovering clean speech features by removing the noise and the cross term between the noise and the speech.
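In our notation, the phase-sensitive model of Deng et al. (2004) can be summarized as follows (our transcription; α is the vector of per-filter-bank weighted cosines of the phase angle between speech and noise, treated as a random variable, and ⊙ denotes the element-wise product):

```latex
% Power-spectral domain, per frequency bin
% (\theta is the phase angle between speech and noise)
|Y|^2 = |X|^2 |H|^2 + |N|^2 + 2\,\alpha\,|X|\,|H|\,|N|,
\qquad \alpha = \cos\theta

% Equivalent cepstral-domain form; setting \alpha = 0
% recovers the phase-insensitive model above
y = x + h + C \log\!\left(1 + e^{\,C^{-1}(n - x - h)}
    + 2\,\alpha \odot e^{\,\frac{1}{2} C^{-1}(n - x - h)}\right)
```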

However, in contrast to the JAC/VTS approach, which implements robustness in the model (HMM) domain, the approach of Deng et al. (2004) was implemented in the feature domain (i.e., feature enhancement instead of HMM adaptation), producing inferior recognition results compared with the model-domain approach despite the use of a more accurate environment-distortion model (phase-sensitive versus phase-insensitive).

The research presented in this paper extends and integrates our earlier two lines of work: HMM adaptation with the phase-insensitive environment-distortion model (Acero et al., 2000; Li et al., 2007) and feature enhancement with the phase-sensitive environment-distortion model (Deng et al., 2004). The new algorithm developed and presented in this paper implements environment robustness via HMM adaptation while taking into account the phase asynchrony between the clean speech and the mixing noise. That is, it incorporates the same phase term as Deng et al. (2004) into the rigorous JAC/VTS formulation of Li et al. (2007). We hence name our new algorithm Phase-JAC/VTS. In this work, both the static and dynamic means and variances of the noise vector and the mean vector of the channel are rigorously estimated on an utterance-by-utterance basis using VTS. In addition to the novel phase-sensitive model adaptation, our algorithm differs from previous JAC methods in two respects: the dynamic noise mean estimation and the noise variance estimation.
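To make the adaptation step concrete, the following numpy sketch applies the static mean and variance updates for one Gaussian under the phase-insensitive special case (α = 0), assuming the Jacobian G has already been computed. This is our illustration of the published update equations, not the authors' code; off-diagonal covariance terms are discarded, as is common in this line of work.

```python
import numpy as np

def adapt_static_gaussian(mu_x, var_x, mu_n, var_n, mu_h, G, C, C_inv):
    """Adapt one Gaussian's static mean/variance under a VTS linearization.

    mu_x, var_x : clean-trained mean and diagonal covariance (vectors)
    mu_n, var_n : current noise mean and diagonal covariance (vectors)
    mu_h        : current channel mean (treated as deterministic)
    G           : Jacobian dy/dx at the expansion point (mu_x, mu_n, mu_h)
    C, C_inv    : truncated DCT matrix and its pseudo-inverse
    """
    # g(.) term of the distortion model evaluated at the expansion point
    g = C @ np.log1p(np.exp(C_inv @ (mu_n - mu_x - mu_h)))
    mu_y = mu_x + mu_h + g                          # adapted static mean
    I_minus_G = np.eye(G.shape[0]) - G
    # Adapted covariance: speech term mapped by G, noise term by (I - G)
    cov_y = (G @ np.diag(var_x) @ G.T
             + I_minus_G @ np.diag(var_n) @ I_minus_G.T)
    return mu_y, np.diag(cov_y)                     # keep diagonal only
```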

The rest of the paper is organized as follows. In Section 2, we present our new Phase-JAC/VTS algorithm and its implementation steps. Experimental evaluation of the algorithm is provided in Section 3, where we show that our new algorithm can achieve 93.32% word recognition accuracy averaged over all distortion conditions on the Aurora 2 task with the standard complex back-end, clean-trained model and standard MFCCs. We summarize our study and draw conclusions in Section 4.


JAC/VTS adaptation algorithm

In this section, we first derive the adaptation formulas for the HMM means and variances in the MFCC domain (both static and dynamic) using the VTS approximation, assuming that the estimates of the additive and convolutive distortion parameters are known. We then derive the algorithm that jointly estimates the additive and convolutive distortion parameters based on the VTS approximation. A summary then describes the implementation steps of the entire algorithm as used in our experiments.
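To preview the flavor of the joint estimation step (the exact expressions, including the phase term, are derived in this section), the EM re-estimation of the static noise mean in the earlier phase-insensitive formulation (Li et al., 2007) takes a weighted least-squares form over Gaussians m and frames t, with occupancy posteriors γ_t(m). The form below is our paraphrase; the precise Phase-JAC/VTS variant differs in its Jacobians:

```latex
\hat{\mu}_n = \mu_{n,0}
  + \Big[ \sum_{t,m} \gamma_t(m)\,(I - G_m)^{\!\top} \Sigma_{y,m}^{-1} (I - G_m) \Big]^{-1}
    \sum_{t,m} \gamma_t(m)\,(I - G_m)^{\!\top} \Sigma_{y,m}^{-1}\,\big(y_t - \mu_{y,m}\big)
```

where μ_{n,0} is the previous noise-mean estimate, and μ_{y,m}, Σ_{y,m} are the adapted mean and covariance of Gaussian m from the previous iteration.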

Experiments

The effectiveness of the Phase-JAC/VTS algorithm presented in Section 2 has been evaluated on the standard Aurora 2 task of recognizing digit strings in noise- and channel-distorted environments. The clean training set, which consists of 8440 clean utterances, is used to train the baseline MLE HMMs. The test material consists of three sets of distorted utterances. The data in set-A and set-B contain eight different types of additive noise, while set-C contains two different types of noise plus additional convolutive (channel) distortion.
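For the unsupervised, utterance-by-utterance adaptation used here, the distortion parameters need a starting point before the EM iterations. Below is a minimal sketch of a common initialization, assuming the leading and trailing frames of each utterance are noise-only; the specific frame count is our assumption, not a quotation from the paper.

```python
import numpy as np

def init_distortion_params(cepstra: np.ndarray, n_edge: int = 20):
    """Initialize noise mean/variance from the utterance edges.

    cepstra: (num_frames, num_coeffs) MFCC matrix for one utterance.
    The first and last n_edge frames are treated as noise-only; the
    channel mean is started at zero and re-estimated by EM afterwards.
    """
    edges = np.vstack([cepstra[:n_edge], cepstra[-n_edge:]])
    mu_n = edges.mean(axis=0)
    var_n = edges.var(axis=0)
    mu_h = np.zeros_like(mu_n)
    return mu_n, var_n, mu_h
```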

Conclusion

In this paper, we have presented our recent development of the Phase-JAC/VTS algorithm for HMM adaptation and demonstrated its effectiveness on the standard Aurora 2 environment-robust speech recognition task. The algorithm consists of two main steps. First, the noise and channel parameters are estimated using a nonlinear environment-distortion model in the cepstral domain, the speech recognizer's "feedback" information (the posterior probabilities of all the Gaussians in the speech recognizer), and the truncated VTS approximation. Second, the estimated noise and channel parameters are used to adapt both the static and dynamic portions of the HMM means and variances.

Acknowledgements

We would like to thank Dr. Jasha Droppo at Microsoft Research for his help in setting up the experimental platform. We also thank the anonymous reviewers for their suggestions, which improved the quality of the paper.

References (38)

  • Cui, X., Alwan, A., 2005. Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR. IEEE Trans. Speech Audio Process.
  • Dempster, A., et al., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B.
  • Deng, L., 2007. Roles of high-fidelity acoustic modeling in robust speech recognition. In: Proc. IEEE ASRU. pp. ...
  • Deng, L., Acero, A., Plumpe, M., Huang, X., 2000. Large vocabulary speech recognition under adverse acoustic...
  • Deng, L., et al., 2004. Enhancement of log-spectra of speech using a phase-sensitive model of the acoustic environment. IEEE Trans. Speech Audio Process.
  • Gales, M.J.F., 1995. Model-Based Techniques for Noise Robust Speech Recognition. Ph.D. Thesis, Cambridge...
  • Gales, M.J.F., Young, S., 1992. An improved approach to the hidden Markov model decomposition of speech and noise. In:...
  • Gong, Y., 2005. A method of joint compensation of additive and convolutive distortions for speaker-independent speech recognition. IEEE Trans. Speech Audio Process.
  • Gopinath, R.A., Gales, M.J.F., Gopalakrishnan, P.S., Balakrishnan-Aiyer, S., Picheny, M.A., 1995. Robust speech...