Regular Article
Structural maximum a posteriori linear regression for fast HMM adaptation

https://doi.org/10.1006/csla.2001.0181

Abstract

Transformation-based model adaptation techniques have been used for many years to improve the robustness of speech recognition systems. While transformation parameters have traditionally been estimated under the maximum likelihood estimation (MLE) criterion, Bayesian versions of some of the most popular transformation-based adaptation methods have recently been introduced, such as MAPLR, a maximum a posteriori (MAP)-based version of the well-known maximum likelihood linear regression (MLLR) algorithm. These methods constrain parameter estimation so as to obtain reliable estimates from a limited amount of data, not only preventing overfitting of the adaptation data but also allowing prior knowledge to be integrated into transformation-based adaptation techniques. Since such techniques require the estimation of a large number of transformation parameters when the amount of adaptation data is large, a correspondingly large number of prior densities must be defined for these parameters. Robust estimation of these prior densities is therefore a crucial issue that directly affects the efficiency and effectiveness of the Bayesian techniques. This paper proposes to estimate these priors using the notion of hierarchical priors, embedded in the tree structure used to control transformation complexity. The proposed algorithm, called structural MAPLR (SMAPLR), has been evaluated on the Spoke3 1993 test set of the WSJ task. SMAPLR is shown to reduce the risk of overtraining and to exploit the adaptation data much more efficiently than MLLR, leading to a significant reduction of the word error rate for any amount of adaptation data.
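The core idea described above can be illustrated with a minimal sketch. This is not the paper's actual implementation: it assumes a simple isotropic Gaussian prior on a linear mean-transformation matrix and a toy regression tree, and the names `map_transform`, `smaplr`, and the dictionary-based tree encoding are hypothetical. It shows only the two ingredients the abstract names: a MAP estimate of a transform that shrinks toward a prior mean, and a hierarchical scheme in which each tree node's prior is centred on its parent's estimate.

```python
import numpy as np

def map_transform(X, Y, W_prior, tau):
    # MAP estimate of a linear transform W (so that Y ~ W X) under an
    # isotropic Gaussian prior centred on W_prior; tau weights the prior
    # against the data. Minimizing ||Y - W X||^2 + tau ||W - W_prior||^2
    # gives the closed form W = (Y X^T + tau W_prior)(X X^T + tau I)^-1.
    d = X.shape[0]
    return (Y @ X.T + tau * W_prior) @ np.linalg.inv(X @ X.T + tau * np.eye(d))

def smaplr(node, X, Y, W_parent, tau):
    # Hierarchical (structural) MAP down a regression tree: each node's
    # prior mean is its parent's MAP estimate, so nodes with little data
    # back off toward the parent instead of overfitting.
    W = map_transform(X, Y, W_parent, tau)
    transforms = {node["name"]: W}
    for child in node.get("children", []):
        cols = child["frames"]  # frame indices assigned to this child node
        transforms.update(smaplr(child, X[:, cols], Y[:, cols], W, tau))
    return transforms
```

With `tau = 0` the estimate reduces to ordinary least squares (the ML solution); as `tau` grows, the transform is pulled toward the prior mean, which is how limited adaptation data is kept from producing unreliable estimates.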
