Joint evaluation of multiple speech patterns for speech recognition and training

https://doi.org/10.1016/j.csl.2009.05.001

Abstract

We address the novel problem of jointly evaluating multiple speech patterns for automatic speech recognition and training. We propose solutions based on both the non-parametric dynamic time warping (DTW) algorithm and the parametric hidden Markov model (HMM). We show that a hybrid approach is quite effective for the application of noisy speech recognition. We extend the concept to HMM training wherein some patterns may be noisy or distorted. Utilizing the concept of a “virtual pattern” developed for joint evaluation, we propose selective iterative training of HMMs. Evaluating these algorithms on an isolated word recognition task with burst/transient noisy speech, we obtain significant improvements in recognition accuracy over algorithms which do not utilize the joint evaluation strategy.

Introduction

Improving speech recognition performance in the presence of noise and interference continues to be a challenging problem. Automatic speech recognition (ASR) systems work well when the test and training conditions match, but in real-world environments there is often a mismatch between testing and training conditions. Various factors such as additive noise, acoustic echo, and speaker accent affect speech recognition performance. Since ASR is a statistical pattern recognition problem, if the test patterns are unlike anything used to train the models, errors are bound to occur due to feature vector mismatch. Various approaches to robustness have been proposed in the ASR literature, contributing mainly to two topics: (i) reducing the variability in the feature vectors, or (ii) modifying the statistical model parameters to suit the noisy condition. While some of these techniques are quite effective, we would like to examine robustness from a different perspective. Consider the analogy of human communication over the telephone: it is quite common to ask the person speaking to us to repeat certain portions of their speech because we do not understand them. This happens more often in the presence of background noise, where the intelligibility of speech is affected significantly. Although the exact mechanism by which humans decode multiple repetitions of speech is not known, it is quite possible that we use the combined knowledge of the multiple utterances to decode the unclear part of the speech. The majority of ASR algorithms do not address this issue, except in very specific contexts such as pronunciation modeling. We recognize that under very high noise conditions or bursty error channels, such as in packet communication where packets get dropped, it would be beneficial to take the approach of repeated utterances for robust ASR. We have formulated a set of algorithms both for joint evaluation/decoding to recognize noisy test utterances and for selective training of hidden Markov models (HMMs), again for robust performance. Evaluating the algorithms on a speaker-independent isolated word recognition (IWR) task with confusable words under noisy conditions has shown significant improvement in performance over baseline systems which do not utilize such a joint evaluation strategy.

A simultaneous decoding algorithm using multiple utterances to derive one or more allophonic transcriptions for each word was proposed in Wu and Gupta (1999). The goal of simultaneous decoding is to find the single optimal allophone sequence $W^*$ for all input utterances $U_1, U_2, \ldots, U_n$. Assuming independence among the $U_i$, according to the Bayes criterion, $W^*$ can be computed as

$$W^* = \arg\max_W P(W / U_1, U_2, \ldots, U_n) = \arg\max_W P(U_1, U_2, \ldots, U_n / W)\,P(W) = \arg\max_W P(U_1/W)\,P(U_2/W)\cdots P(U_n/W)\,P(W),$$

where $P(X)$ stands for the probability of the event $X$ occurring.
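As a minimal sketch of this decision rule in the log domain (the per-utterance scorer score_utterance and word prior log_prior are hypothetical callables, e.g. HMM log-likelihood evaluators, and are not part of the original paper):

```python
import math

def joint_decode(utterances, vocabulary, score_utterance, log_prior):
    """Bayes decision over multiple utterances under the independence
    assumption: pick W maximizing sum_i log P(U_i / W) + log P(W).
    score_utterance(u, w) and log_prior(w) are hypothetical callables
    returning log P(u / w) and log P(w), respectively."""
    best_word, best_score = None, -math.inf
    for w in vocabulary:
        score = log_prior(w) + sum(score_utterance(u, w) for u in utterances)
        if score > best_score:
            best_word, best_score = w, score
    return best_word
```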

From an information theoretic viewpoint, consider two speech sequences $U_1$ and $U_2$. The joint entropy $H(U_1,U_2)$ is at least as high as either of the individual entropies $H(U_1)$ or $H(U_2)$ (Shannon, 1948). If $U_1$ and $U_2$ are completely independent of each other, the joint entropy $H(U_1,U_2)$ equals $H(U_1)+H(U_2)$; if they are completely dependent, $H(U_1,U_2)=H(U_1)=H(U_2)$. When $U_1$ and $U_2$ come from the same class, there is a high degree of correlation between them. In particular, when parts of $U_1$ or $U_2$ are corrupted, the joint entropy exceeds either individual entropy by a larger margin, because the noise is random and uncorrelated with the speech signal. These properties extend to more than two sequences as well. The goal of the present pattern recognition task is to exploit this higher information content for better speech recognition. We utilize the HMM framework and maximum-likelihood (ML) decoding to develop new algorithms to achieve this result.
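These relations follow from standard information-theoretic identities (they are general facts, not results of this paper):

```latex
H(U_1,U_2) = H(U_1) + H(U_2 \mid U_1) = H(U_1) + H(U_2) - I(U_1;U_2),
\qquad
\max\{H(U_1),\,H(U_2)\} \;\le\; H(U_1,U_2) \;\le\; H(U_1) + H(U_2).
```

The upper bound is attained exactly when $U_1$ and $U_2$ are independent ($I(U_1;U_2)=0$), and the lower bound when one sequence completely determines the other.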

One direct approach to simultaneous decoding is to use the N-best criterion (Nilsson, 1971, Schwartz and Chow, 1990, Soong and Hung, 1991). Here, an individual N-best list for each input utterance is generated independently using the N-best search algorithm of statistical decoding. These individual N-best lists are then merged and re-scored using all the input utterances (Haeb-Umbach et al., 1995), and the transcriptions are re-ordered based on their joint likelihoods. However, this solution is suboptimal unless N is very large (Wu and Gupta, 1999). Simultaneous decoding for multiple input utterances can also be done using a modified version of the tree-trellis search algorithm (Soong and Hung, 1991) (the same algorithm was used in Holter and Svendsen (1998)): a forward Viterbi beam search for each utterance is performed independently, and then a combined backward A* search (Bahl et al., 1983) over all the utterances is applied simultaneously. A word-network-based algorithm has also been developed for simultaneous decoding and has been shown to be computationally very efficient (Wu and Gupta, 1999).
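The following sketch illustrates only the merge-and-rescore step under the independence assumption of the equation above; it is not the tree-trellis or word-network algorithm, and the scorer and prior functions are hypothetical placeholders:

```python
from itertools import chain

def merge_and_rescore(nbest_lists, utterances, score, log_prior):
    """Merge per-utterance N-best lists and re-rank the pooled hypotheses
    by their joint log-likelihood over all utterances.
    nbest_lists: one list of hypotheses per utterance (hypotheses hashable,
    e.g. tuples of words); score(u, hyp) is a hypothetical scorer returning
    log P(u / hyp); log_prior(hyp) returns log P(hyp)."""
    # Union of hypotheses from all the individual N-best lists.
    candidates = set(chain.from_iterable(nbest_lists))
    # Joint score: sum of per-utterance log-likelihoods plus the prior.
    rescored = [
        (hyp, log_prior(hyp) + sum(score(u, hyp) for u in utterances))
        for hyp in candidates
    ]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored  # best hypothesis first
```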

Multiple utterances of the same speech unit have typically been used in pronunciation estimation. A pronunciation determined from only one recording of a word can be very unreliable, so multiple recordings of the word are modeled for greater reliability. However, commonly used decoding algorithms are not suited to discovering a phoneme sequence that jointly maximizes the likelihood of all the inputs. Various alternative techniques have been proposed to arrive at such a solution. One method produces recognition lattices individually from each of the inputs and identifies the most likely path in the intersection of these lattices. Another generates N-best hypotheses from each of the audio inputs and re-scores the cumulative set jointly against all the recordings (Singh et al., 2002, Svendsen, 2004). Alternatively, the pronunciation may be derived by voting amongst the recognition outputs from the individual recordings (Fiscus, 1997). While all of these procedures produce outputs superior to what would be obtained from a single recorded instance of the word, they do not truly identify the most likely pronunciation for the given set of recordings, and thus remain suboptimal. It is therefore important to jointly estimate the pronunciation from multiple recordings.
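As a toy illustration of the voting alternative (a simplified stand-in for ROVER-style voting, Fiscus, 1997; real systems must first align hypotheses of different lengths, which is omitted here):

```python
from collections import Counter

def vote_pronunciation(aligned_transcriptions):
    """Position-by-position majority vote over phoneme sequences that have
    already been aligned to a common length, with "-" marking an
    insertion/deletion slot. A simplified stand-in for ROVER-style voting."""
    assert len({len(t) for t in aligned_transcriptions}) == 1, "sequences must be aligned"
    voted = []
    for position in zip(*aligned_transcriptions):
        phone, _count = Counter(position).most_common(1)[0]
        if phone != "-":
            voted.append(phone)
    return voted

# Example: three decodings of the same word
print(vote_pronunciation([["r", "aa", "k"], ["r", "ao", "k"], ["r", "aa", "k"]]))
# -> ['r', 'aa', 'k']
```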

Dealing with multiple speech patterns occurs naturally during the training stage. In most approaches, the patterns are treated simply as independent exemplars of a random process whose parameters are being determined. There is some work in the literature on making the ML training of statistical models, such as the HMM (Rabiner, 1989), more robust or more discriminative. For example, it is more difficult to discriminate between the words “rock” and “rack” than between the words “rock” and “elephant”. To address such issues, there have been attempts to increase the separability among similar, confusable classes using multiple training patterns.

In discriminative training, the focus is on increasing the separation between the models, generally their means; the model is therefore changed. In selective training, the models are not forced to fit the training data; instead, data which does not fit the models well is de-emphasized. In Arslan and Hansen (1996, 1999), each training pattern is selectively weighted by a confidence measure in order to control the influence of outliers, for accent and language identification applications. Adaptation methods for selective training, where the training speakers close to the test speaker are chosen based on the likelihood of speaker Gaussian mixture models (GMMs) given the adaptation data, are proposed in Yoshizawa et al. (2001); the adapted model is constructed by combining precomputed HMM sufficient statistics for the training data of the selected speakers. In Huang et al. (2004), cohort models close to the test speaker are selected, transformed and combined linearly. Using the methods of Yoshizawa et al. (2001) and Huang et al. (2004), it is not possible to select data from a large data pool if the speaker label of each utterance is unknown or if there are only a few utterances per speaker. This can be the case when data is collected automatically, e.g., in dialogue systems for public use such as Takemaru-kun (Nishimura et al., 2003). Selective training of acoustic models by temporarily deleting single patterns from a data pool, or by alternating between successive deletion and addition of patterns, has been proposed in Cincarek et al. (2005).
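The sketch below illustrates the general idea of confidence-based selective weighting: each training pattern receives a weight derived from its length-normalized log-likelihood under the current model, and the weights then scale the sufficient statistics. The softmax weighting and the helper names are illustrative choices only, not the specific schemes of the cited papers:

```python
import numpy as np

def pattern_weights(patterns, loglik, temperature=1.0):
    """Weight each training pattern by its length-normalised log-likelihood
    under the current model, so poorly fitting outliers contribute less to
    re-estimation. loglik(x) is a hypothetical scorer returning
    log P(x / lambda); softmax weighting is one simple, illustrative choice."""
    scores = np.array([loglik(x) / len(x) for x in patterns])
    w = np.exp((scores - scores.max()) / temperature)
    return w / w.sum()

def weighted_mean(patterns, weights):
    """Weighted global mean of the feature vectors, as would be used when
    re-estimating a Gaussian mean from selectively weighted data."""
    num = sum(wi * np.asarray(x).sum(axis=0) for wi, x in zip(weights, patterns))
    den = sum(wi * len(x) for wi, x in zip(weights, patterns))
    return num / den
```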

In this paper, we formulate the problem of increasing ASR performance given multiple utterances (patterns) of the same word. Given K test patterns (K ≥ 2) of a word, we would like to improve the speech recognition accuracy over that of a single test pattern, for both clean and noisy speech. We try to jointly recognize multiple speech patterns such that the unreliable or corrupt portions of speech are given less weight during recognition, while the clean portions of speech are given a higher weight. We also find the HMM state sequence which best represents the K patterns. Although the work is carried out for isolated word recognition, it can be extended to connected word and continuous speech recognition. To the best of our knowledge, the problem that we are formulating has not been addressed before in speech recognition.

Next, we propose a new method to selectively train HMMs by jointly evaluating multiple training patterns. In the selective training literature, outlier patterns are considered unreliable and are given a very low (or zero) weighting. But it is possible that only some portions of these outlier data are unreliable. For example, if some training patterns are affected by burst/transient noise (e.g., a bird call), then it makes sense to give a lower weighting only to the affected portion. Using the above joint formulation, we propose a new method to train HMMs by selectively weighting regions of speech such that the unreliable regions in the patterns are given a lower weight. We introduce the concept of “virtual training patterns”, and the HMM is trained using the virtual training patterns instead of the original training data. We thus address all three main tasks of HMMs by jointly evaluating multiple speech patterns.

The outline of the paper is as follows: Sections 2 (Multi pattern dynamic time warping, MPDTW) and 3 (Joint likelihood of multiple speech patterns) present different approaches to the problem of jointly recognizing multiple speech patterns. In Section 4, the new method for selectively training HMMs using multiple speech patterns jointly is proposed. Section 5 gives the experimental evaluations of the proposed algorithms, followed by a study of their computational complexity differences in Section 6. Conclusions are given in Section 7.

Section snippets

Multi pattern dynamic time warping (MPDTW)

The dynamic time warping (DTW) algorithm (Rabiner and Juang, 1993, Myers et al., 1980, Sakoe and Chiba, 1978) finds a warping function that provides the least distortion between any two given patterns; the optimum solution is determined through the dynamic programming methodology. DTW can be viewed as a pattern dissimilarity measure with embedded time normalization and alignment. We extend this formulation to more than two patterns, resulting in the multi pattern dynamic time warping (MPDTW) algorithm.
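For reference, a plain two-pattern DTW in the above sense can be sketched as follows; the MPDTW of this section generalizes this grid search to K patterns, which is not shown here:

```python
import numpy as np

def dtw(x, y):
    """Standard two-pattern DTW with Euclidean local distance and
    symmetric (diagonal/horizontal/vertical) transitions, solved by
    dynamic programming. x and y are sequences of feature vectors."""
    T1, T2 = len(x), len(y)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            d = np.linalg.norm(np.asarray(x[i - 1]) - np.asarray(y[j - 1]))
            D[i, j] = d + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[T1, T2]  # total accumulated distortion along the optimal warp
```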

Joint likelihood of multiple speech patterns

Considering left-to-right stochastic models of speech patterns, we now propose a new method to recognize K patterns jointly by finding their joint multi pattern likelihood, i.e., $P(O^1_{1:T_1}, O^2_{1:T_2}, \ldots, O^K_{1:T_K} / \lambda)$, where $\lambda$ is the HMM. We assume that the stochastic model is good, but some or all of the test patterns may be distorted due to burst/transient noise or even badly pronounced. We would like to jointly recognize them in an “intelligent” way such that the noisy or unreliable portions of speech are given less weight.
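For contrast, the naive joint likelihood obtained by treating the K patterns as independent (the baseline implied by the equation in Section 1) is simply the sum of per-pattern forward log-likelihoods; a discrete-observation sketch follows. The proposed method goes beyond this by de-emphasizing unreliable portions of each pattern:

```python
import numpy as np

def forward_loglik(obs, log_A, log_B, log_pi):
    """log P(O / lambda) for one discrete-observation pattern via the
    forward algorithm in the log domain. obs: sequence of symbol indices;
    log_A: (N, N) log transition matrix; log_B: (N, M) log emission matrix;
    log_pi: (N,) log initial state distribution."""
    alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        alpha = log_B[:, o] + np.logaddexp.reduce(alpha[:, None] + log_A, axis=0)
    return np.logaddexp.reduce(alpha)

def naive_joint_loglik(patterns, log_A, log_B, log_pi):
    """Baseline joint likelihood sum_k log P(O^k / lambda) under the
    independence assumption; no per-region weighting is applied here."""
    return sum(forward_loglik(o, log_A, log_B, log_pi) for o in patterns)
```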

Selective HMM training (SHT)

This is the next part of our joint multi-pattern formulation for robust ASR. We have first addressed the evaluation and decoding tasks of the HMM for multiple patterns. Now we consider the benefits of joint multi-pattern likelihood in HMM training. Thus, we will have addressed all three main tasks of the HMM, so as to utilize the availability of multiple patterns belonging to the same class. In usual HMM training, all the training data is utilized to arrive at the best possible parametric model. But, it

MPDTW experiments

We carried out the experiments (based on the formulation in Section 2) using the IISc-BPL database

Computational complexity differences

We study the computational complexity of the following algorithms.

Conclusions

We have formulated new algorithms for joint evaluation of the likelihood of multiple speech patterns, extending both the DTW and the HMM framework. This is achieved both using DTW alone and using a combination of DTW and HMM. We also show that this joint formulation is useful for selective iterative training of HMMs, providing better performance when the training patterns are noisy or distorted. These algorithms are evaluated in the context of IWR under burst noise conditions and are shown to

References (42)

  • Cooke, M., et al., 2001. Robust automatic speech recognition with missing and unreliable acoustic data. Speech Commun.
  • Arslan, L.M., Hansen, J.H.L., 1996. Improved HMM training and scoring strategies with application to accent...
  • Arslan, L.M., et al., 1999. Selective training for hidden Markov models with applications to speech classification. IEEE Trans. Speech Audio Proc.
  • Bahl, L.R., et al., 1983. A maximum likelihood approach to continuous speech recognition. IEEE Trans. PAMI, PAMI-5.
  • Baum, L.E., et al., 1966. Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Stat.
  • Baum, L.E., et al., 1967. An inequality with applications to statistical estimation for probabilistic functions of a Markov process and to a model for ecology. Bull. Am. Math. Soc.
  • Baum, L.E., et al., 1968. Growth functions for transformations on manifolds. Pac. J. Math.
  • Baum, L.E., et al., 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann. Math. Stat.
  • Baum, L.E., 1972. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities.
  • Bansal, D., Nair, N., Singh, R., Raj, B., 2009. A joint decoding algorithm for multi-example-based addition of words to...
  • Cincarek, T., Toda, T., Saruwatari, H., Shikano, K., 2005. Selective EM training of acoustic models based on sufficient...
  • Cooke, M.P., Green, P.G., Crawford, M.D., 1994. Handling missing data in speech recognition. Proc. Int. Conf. Spoken...
  • Fiscus, J.G., 1997. A post-processing system to yield reduced word error rates: recognizer output voting error...
  • Gersho, A., et al., 1992. Vector Quantization and Signal Compression.
  • Gillick, L., et al., 1989. Some statistical issues in the comparison of speech recognition algorithms. Proc. IEEE Int. Conf. Acoustics, Speech, Signal Proc.
  • Haeb-Umbach, R., Beyerlein, P., Thelen, E., 1995. Automatic transcription of unknown words in a speech recognition...
  • Holter, T., Svendsen, T., 1998. Maximum likelihood modeling of pronunciation variation. In: Proceedings of ESCA...
  • Huang, C., Chen, T., Chang, E., 2004. Transformation and combination of hidden Markov models for speaker selection...
  • Itakura, F., Saito, S., 1968. An analysis–synthesis telephony based on maximum likelihood method. In: Proceedings of...
  • Juang, B.-H., et al., 1990. The segmental K-means algorithm for estimating parameters of hidden Markov models. IEEE Trans. Acoust., Speech, Signal Proc.
  • Lleida, E., et al., 2000. Utterance verification in continuous speech recognition: decoding and training procedures. IEEE Trans. Speech Audio Proc.