A unified framework of HMM adaptation with joint compensation of additive and convolutive distortions
Introduction
Environment robustness in speech recognition remains an outstanding and difficult problem despite many years of research and investment (Peinado and Segura, 2006). The difficulty arises from the many possible types of distortion, including additive and convolutive distortions and their mixtures, which are hard to predict accurately during recognizer development. As a result, a speech recognizer trained on clean speech often degrades significantly in noisy environments if no compensation is applied (Lee, 1998; Gong, 1995).
Many methodologies for environment robustness in speech recognition have been proposed over the past two decades. They fall into two main classes. In the first, feature-domain class, no HMM information is exploited and the distorted speech features are enhanced with signal processing methods. Spectral subtraction (SS) (Boll, 1979) is widely used as a simple technique to reduce additive noise in the spectral domain. Cepstral mean normalization (CMN) (Atal, 1974) removes the mean vector of the utterance's acoustic features in order to reduce or eliminate the convolutive channel effect. As an extension of CMN, cepstral variance normalization (CVN) (Molau et al., 2003) also normalizes the feature variance to improve automatic speech recognition (ASR) robustness. Relative spectra (RASTA) (Hermansky and Morgan, 1994) filters a long span of the speech signal in order to remove or reduce the acoustic distortion. These traditional feature-domain methods are relatively simple and achieve a moderate level of distortion reduction. In recent years, new feature-domain methods based on more advanced signal processing have been proposed that deliver larger improvements on noise-robust ASR tasks than the traditional methods. Examples include feature-space nonlinear transformation techniques (Molau et al., 2003; Padmanabhan and Dharanipragada, 2001), the ETSI advanced front end (AFE) (Macho et al., 2002), and stereo-based piecewise linear compensation for environments (SPLICE) (Deng et al., 2000). In Padmanabhan and Dharanipragada (2001), a piecewise-linear approximation to a nonlinear transformation is used to map the features in the training space to the testing space. This is extended in Molau et al. (2003), where it is combined with other normalization technologies such as feature-space rotation and vocal tract length normalization.
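As a concrete illustration of the normalization idea (a minimal sketch, not code from any of the cited works; the function name and the epsilon guard are our own choices), per-utterance CMN and CVN can be written in a few lines:

```python
import numpy as np

def cmvn(cepstra, eps=1e-8):
    """Cepstral mean and variance normalization (CMN + CVN).

    cepstra: (num_frames, num_coeffs) array of, e.g., MFCCs for one utterance.
    Subtracting the per-utterance mean removes a stationary convolutive
    (channel) offset; dividing by the standard deviation equalizes the
    feature variance across environments.
    """
    mean = cepstra.mean(axis=0)
    std = cepstra.std(axis=0)
    return (cepstra - mean) / (std + eps)
```

Dropping the division yields plain CMN; the division is what CVN adds.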
AFE (Macho et al., 2002) integrates several noise robustness methods: it removes additive noise with two-stage Mel-warped Wiener filtering (Agarwal and Cheng, 1999) and SNR-dependent waveform processing (Macho and Cheng, 2001), and mitigates the channel effect with blind equalization (Mauuary, 1998). SPLICE (Deng et al., 2000) assumes the distorted cepstrum is distributed according to a mixture of Gaussians, and enhances it by applying a correction vector determined by the parameters of these Gaussians. Although these feature-based algorithms achieve satisfactory results, they usually perform worse than model-based algorithms, which exploit the modeling power of the recognizer itself.
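The posterior-weighted correction at the heart of SPLICE can be sketched as follows (a hypothetical minimal version for a single frame with diagonal-covariance Gaussians; in the cited work the correction vectors are trained from stereo data, which we simply assume as inputs here):

```python
import numpy as np

def splice_enhance(y, means, variances, priors, corrections):
    """Minimal SPLICE-style enhancement for one distorted cepstral frame.

    y:           (D,) distorted cepstral vector
    means, variances: (K, D) diagonal-Gaussian parameters of the GMM
                 modeling distorted speech
    priors:      (K,) mixture weights
    corrections: (K, D) per-Gaussian correction vectors (from stereo data)

    The clean estimate is y plus the posterior-weighted sum of the
    per-Gaussian correction vectors.
    """
    # log N(y; mu_k, Sigma_k) for each diagonal Gaussian k
    log_lik = -0.5 * (np.log(2 * np.pi * variances)
                      + (y - means) ** 2 / variances).sum(axis=1)
    log_post = np.log(priors) + log_lik
    log_post -= log_post.max()          # numerical stability
    post = np.exp(log_post)
    post /= post.sum()                  # p(k | y)
    return y + post @ corrections
```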
The other, model-based class of techniques operates in the model (HMM) domain, adapting or adjusting the model parameters so that the system becomes better matched to the distorted environment. The most straightforward way is to train models directly on distorted speech, but it is usually expensive to acquire sufficient amounts of distorted speech signals. Multi-style training (Lippmann et al., 1987) therefore adds different kinds of distortion to clean speech signals and trains models on these artificially distorted signals. However, this method requires knowledge of all the distortion environments and needs to retrain the models. To overcome these difficulties, model-domain methods have been developed that directly adapt models trained on clean speech to the distorted environment. The signal bias removal method (Rahim and Juang, 1996) estimates the channel mean by maximum likelihood estimation (MLE) and subtracts it from the Gaussian means of the HMMs. Maximum likelihood linear regression (MLLR) (Leggetter and Woodland, 1995; Cui and Alwan, 2005; Saon et al., 2001a) has also been used to adapt clean-trained models to distorted environments. However, to achieve good performance MLLR often requires significantly more than one transformation matrix, which inevitably increases the amount of adaptation data required. Furthermore, the parallel model combination (PMC) method (Gales and Young, 1992) combines a set of speech models with a set of noise models to achieve model adaptation using an approximate log-normal distribution. Channel distortion is not considered in the basic PMC framework, but PMC was extended in Gales (1995) to address both noise and channel distortions.
In contrast to the model-domain adaptation methods discussed above, methods for joint compensation of additive and convolutive distortions (JAC) (Moreno, 1996; Kim et al., 1998; Acero et al., 2000; Gong, 2005) have shown their advantages by using an explicit distortion model for noise and channel together with a linearized vector Taylor series (VTS) approximation. The JAC-based algorithm proposed in Moreno (1996) used VTS directly to estimate the noise and channel means, but adapted the features rather than the models, and did not compensate the dynamic (delta and delta–delta) portions of the features. The work in Acero et al. (2000), on the other hand, proposed a framework to adjust both the static and dynamic portions of the HMM parameters given known noise and channel parameters. However, while Acero et al. (2000) mentioned that the iterative expectation maximization (EM) algorithm (Dempster et al., 1977) can be used to estimate the noise and channel parameters, no actual estimation algorithm was developed or reported.
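For reference, the nonlinear distortion model underlying these JAC/VTS methods (written here in a standard form; the paper's own notation may differ slightly) relates the cepstra of distorted speech $\mathbf{y}$, clean speech $\mathbf{x}$, additive noise $\mathbf{n}$, and channel $\mathbf{h}$ as

$$\mathbf{y} = \mathbf{x} + \mathbf{h} + \mathbf{C}\,\log\!\left(\mathbf{1} + \exp\!\big(\mathbf{C}^{-1}(\mathbf{n} - \mathbf{x} - \mathbf{h})\big)\right),$$

where $\mathbf{C}$ is the (truncated) DCT matrix and $\mathbf{C}^{-1}$ its pseudo-inverse. VTS linearizes this nonlinearity around the current estimates of the Gaussian means and the distortion parameters, making both model adaptation and EM-style distortion estimation tractable.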
A similar JAC-based model adaptation method was proposed in Kim et al. (1998), where both the static mean and variance parameters in the cepstral domain are adjusted using the VTS approximation. In that work, however, noise was estimated on a frame-by-frame basis. This process is complex and computationally costly, and the resulting estimate may not be reliable. Furthermore, no adaptation was made for the dynamic (delta) portions of the HMM parameters, which are known to be important for high-performance robust speech recognition.
The JAC method developed in Gong (2005) directly estimates the noise and channel distortion parameters in the log-spectral domain, adjusts the acoustic HMM parameters in the same domain, and then converts the parameters to the cepstral domain. However, no strategy for HMM variance adaptation is given in Gong (2005), and the techniques for estimating the distortion parameters involve a number of approximations, as analyzed later in this paper.
Finally, the JAC method in Liao and Gales (2006) also adapts all the static and dynamic HMM parameters, and the recent study on uncertainty decoding (Liao and Gales, 2007) likewise aims to jointly compensate for additive and convolutive distortions.
In all the previous JAC/VTS work on HMM adaptation, the environment-distortion model makes the simplifying, phase-insensitive assumption of instantaneous phase synchrony between the clean speech and the mixing noise. This assumption was relaxed in Deng et al. (2004), where a new phase term was introduced to account for the random nature of the phase asynchrony. It was shown in Deng et al. (2004) that when the noise magnitude is estimated accurately, the Gaussian-distributed phase term plays a key role in recovering clean speech features by removing both the noise and the cross term between noise and speech.
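In the log-filterbank domain, the phase-sensitive model of Deng et al. (2004) can be written per channel (again in a standard form, which may differ cosmetically from the paper's notation) as

$$e^{y_l} = e^{x_l + h_l} + e^{n_l} + 2\,\alpha_l\, e^{(x_l + h_l + n_l)/2},$$

or equivalently $y_l = x_l + h_l + \log\!\big(1 + e^{n_l - x_l - h_l} + 2\alpha_l\, e^{(n_l - x_l - h_l)/2}\big)$, where $\alpha_l = \cos\theta_l$ is the random phase factor between speech and noise in filterbank channel $l$. Setting $\alpha_l = 0$ recovers the phase-insensitive model; the extra cross term is exactly what the phase-insensitive assumption discards.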
However, in contrast to the JAC/VTS approach, which implements robustness in the model (HMM) domain, the approach of Deng et al. (2004) was implemented in the feature domain (i.e., feature enhancement instead of HMM adaptation), producing inferior recognition results compared with the model-domain approach despite using a more accurate, phase-sensitive environment-distortion model.
The research presented in this paper extends and integrates two earlier lines of our work: HMM adaptation with a phase-insensitive environment-distortion model (Acero et al., 2000; Li et al., 2007) and feature enhancement with a phase-sensitive environment-distortion model (Deng et al., 2004). The new algorithm developed in this paper implements environment robustness via HMM adaptation while taking into account the phase asynchrony between clean speech and the mixing noise; that is, it incorporates the phase term of Deng et al. (2004) into the rigorous JAC/VTS formulation of Li et al. (2007). We hence name the new algorithm Phase-JAC/VTS. In this work, the static and dynamic means and variances of the noise vector, together with the mean vector of the channel, are rigorously estimated on an utterance-by-utterance basis using VTS. In addition to the novel phase-sensitive model adaptation, our algorithm differs from previous JAC methods in two respects: the estimation of the dynamic noise mean and the estimation of the noise variance.
The rest of the paper is organized as follows. In Section 2, we present the new Phase-JAC/VTS algorithm and its implementation steps. Experimental evaluation is provided in Section 3, where we show that the new algorithm achieves 93.32% word recognition accuracy averaged over all distortion conditions on the Aurora 2 task with the standard complex back-end, a clean-trained model, and standard MFCCs. We summarize the study and draw conclusions in Section 4.
JAC/VTS adaptation algorithm
In this section, we first derive the adaptation formulas for the HMM means and variances in the MFCC domain (both static and dynamic) using the VTS approximation, assuming that estimates of the additive and convolutive distortion parameters are known. We then derive the algorithm that jointly estimates the additive and convolutive distortion parameters, also based on the VTS approximation. Finally, we summarize the implementation steps of the entire algorithm as used in our experiments.
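The mean-adaptation step can be sketched numerically as follows. This is a hypothetical illustration of the phase-insensitive VTS update for a single static cepstral Gaussian, not the paper's implementation; the DCT construction and all names are our own, and the phase term and dynamic-parameter updates of Section 2 are omitted:

```python
import numpy as np

def dct_matrix(n_cep, n_fb):
    """Truncated type-II DCT matrix C (n_cep x n_fb) as used for MFCCs."""
    k = np.arange(n_cep)[:, None]
    m = np.arange(n_fb)[None, :]
    return np.cos(np.pi * k * (m + 0.5) / n_fb) * np.sqrt(2.0 / n_fb)

def vts_adapt_mean(mu_x, mu_n, mu_h, C, C_pinv):
    """Adapt a static cepstral Gaussian mean under the phase-insensitive
    model y = x + h + C log(1 + exp(C_pinv (n - x - h))).

    Returns the adapted mean mu_y and the Jacobian G = dy/dx evaluated at
    the means, which VTS also uses to adapt the covariance, roughly
    Sigma_y ~ G Sigma_x G^T + (I - G) Sigma_n (I - G)^T.
    """
    d = C_pinv @ (mu_n - mu_x - mu_h)     # mismatch in log-filterbank domain
    g = C @ np.log1p(np.exp(d))           # nonlinear bias term
    mu_y = mu_x + mu_h + g
    # Jacobian of y w.r.t. x: G = C diag(1 / (1 + exp(d))) C_pinv
    G = C @ np.diag(1.0 / (1.0 + np.exp(d))) @ C_pinv
    return mu_y, G
```

When speech dominates the noise, `d` is very negative, so `mu_y` tends to `mu_x + mu_h` and `G` tends to the identity; when noise dominates, `mu_y` tends to `mu_n` and `G` tends to zero, matching the intuition that heavily masked Gaussians carry noise statistics.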
Experiments
The effectiveness of the Phase-JAC/VTS algorithm presented in Section 2 has been evaluated on the standard Aurora 2 task of recognizing digit strings in noise- and channel-distorted environments. The clean training set, which consists of 8440 clean utterances, is used to train the baseline MLE HMMs. The test material consists of three sets of distorted utterances. The data in set-A and set-B contain eight different types of additive noise, while set-C contains two different types of noise plus
Conclusion
In this paper, we have presented our recent development of the Phase-JAC/VTS algorithm for HMM adaptation and demonstrated its effectiveness on the standard Aurora 2 environment-robust speech recognition task. The algorithm consists of two main steps. First, the noise and channel parameters are estimated using a nonlinear environment-distortion model in the cepstral domain, the speech recognizer's "feedback" information (the posterior probabilities of all the Gaussians in the speech recognizer),
Acknowledgements
We would like to thank Dr. Jasha Droppo at Microsoft Research for his help in setting up the experimental platform. We also thank the anonymous reviewers for their suggestions, which improved the quality of the paper.
References

- Acero, A., 1993. Acoustical and Environmental Robustness in Automatic Speech Recognition.
- Acero, A., Deng, L., Kristjansson, T., Zhang, J., 2000. HMM adaptation using vector Taylor series for noisy speech...
- Agarwal, A., Cheng, Y.M., 1999. Two-stage Mel-warped Wiener filter for robust speech recognition. In: Proc. ASRU. pp....
- Atal, B.S., 1974. Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. J. Acoust. Soc. Am.
- Boll, S.F., 1979. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process.
- Cui, X., Alwan, A., 2005. Noise robust speech recognition using feature compensation based on polynomial regression of utterance SNR. IEEE Trans. Speech Audio Process.
- Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B.
- Deng, L., Droppo, J., Acero, A., 2004. Enhancement of log-spectra of speech using a phase-sensitive model of the acoustic environment. IEEE Trans. Speech Audio Process.
- Gales, M.J.F., 1998. Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang.
- Gong, Y., 1995. Speech recognition in noisy environments: a survey. Speech Commun.
- Gong, Y., 2005. A method of joint compensation of additive and convolutive distortions for speaker-independent speech recognition. IEEE Trans. Speech Audio Process.
- Kim, D.Y., Un, C.K., Kim, N.S., 1998. Speech recognition in noisy environments using first order vector Taylor series. Speech Commun.
- Lee, C.-H., 1998. On stochastic feature and model compensation approaches to robust speech recognition. Speech Commun.
- Leggetter, C.J., Woodland, P.C., 1995. Maximum likelihood linear regression for speaker adaptation of continuous density HMMs. Comput. Speech Lang.