
Computer Speech & Language

Volume 17, Issues 2–3, April–July 2003, Pages 233–262

Parameter reduction schemes for loosely coupled HMMs

https://doi.org/10.1016/S0885-2308(03)00009-3

Abstract

While Hidden Markov Models (HMMs) have been successful in many speech recognition tasks, they are somewhat less successful on conversational speech, arguably due in part to the greater variation in the timing of articulatory events. Loosely coupled or Factorial HMMs (FHMMs) represent a family of models that have more flexibility for modelling such variation in speech, but there are trade-offs to be studied in terms of computation and potential added confusability. This paper investigates two specific instances – Mixed-Memory and Parameter-Tied FHMMs – both of which can be thought of as loosely coupled HMMs for modelling multiple time series. The Parameter-Tied FHMM, introduced here, has a potential advantage for speech modelling since it allows a left-to-right topology across the product state space. Experimental results on the ISOLET task show that both models are feasible for speech recognition; TI-DIGITS recognition results show that the Parameter-Tied FHMM is competitive with multiband models. State occupancy and pruning analyses show trends related to asynchrony that hold across the different models.

Introduction

Hidden Markov Model (HMM)-based automatic speech recognition (ASR) has been successfully applied to dictated speech tasks, but the approach is less successful when confronted with more conversational speech (Weintraub, Stolcke, & Sankar, 1995). The experiment in Saraclar, Nock, and Khudanpur (2000) suggests that at least part of the problem may be related to the increased pronunciation variability in conversational speech. It is often hypothesised that the lack of robustness to pronunciation variability is related to the construction of word models by concatenating sequences of phoneme models as specified by the pronunciation dictionary.

One hypothesised source of difficulty is the limited nature of existing pronunciation dictionaries: typical pronunciation dictionaries contain only a few pronunciations for each word. For example, both the LIMSI (Lamel & Adda, 1996; Hain, Woodland, Evermann, & Povey, 2000) and Pronlex (Kingsbury, Strassel, & McLemore, 1997) recognition dictionaries have a single pronunciation for over 90% of words, and fewer than 1% of words have more than two pronunciations. Most recognition dictionaries contain even fewer pronunciations per word. The source of dictionary pronunciations varies, but is rarely conversational speech. Given such a pronunciation dictionary, it is assumed that the subword acoustic modelling scheme represents all remaining pronunciation variability. Context conditioning does model the influence of context on the realisation of sounds, and mixture-of-Gaussian output distributions in HMMs can capture variability in segment realisations. However, it can be argued that neither technique is an efficient model of these types of pronunciation change. Whilst reliance on the acoustic models to characterise pronunciation variability proved sufficient for dictated speech, it may not be adequate for capturing the increased range of pronunciations in conversational speech (Keating, 1997; Ostendorf, 2000; Weintraub et al., 1996). Firstly, inappropriate or inadequate pronunciations can lead to recognition errors. Secondly, broad-variance models (resulting from a training scheme in which each subword model is potentially trained on data from other subword classes) can increase error rates due to increased acoustic confusability, and also tend to increase decoding costs.

Inadequacy of the dictionary motivates explicit pronunciation modelling schemes: these augment the dictionary with one or more pronunciations per word which are more representative of the target style or accent (e.g. Humphries & Woodland, 1997; Riley, 1991). Whilst there is evidence to support research into explicit pronunciation modelling (Saraclar, 2000; Saraclar et al., 2000), the gains achieved in practice have been less than spectacular. Use of an expanded recognition-time dictionary yields improvements of 1–2% absolute on Switchboard (e.g. Byrne et al., 1998; Finke & Waibel, 1997); use of an expanded dictionary during acoustic model training as well as in recognition has not resulted in performance gains (Saraclar et al., 2000; Saraclar, 2000). Difficulties arise from lexical confusability – many new word pronunciations overlap with those of other words, increasing the difficulty of mapping back to word strings from phone sequences – and because pronunciation change often occurs at levels below the segment, rather than simply complete changes of phoneme identity.

The latter observation brings us to a second weakness of current designs: the assumption that speech can be segmented into a linear sequence of (usually phone-like) segments, sometimes referred to as the “beads-on-a-string” model. Speech scientists, linguists and engineers agree that the notion of a speech segment is not a realistic one (e.g. Huckvale, 1994; Deng & Erler, 1992; King & Taylor, 2000). Speech is produced by loosely coupled articulators, and speech production studies show that the amplitude of and phase between these gestures vary with changes in speaking rate, manner and style (e.g. Vaxelaire, Sock, & Perrier, 2000). The changes in relative timing can have extreme effects on the resulting acoustic signal: it often appears that there has been colouring and merging of the underlying ‘segments’, or even ‘segment-like’ insertions, due to interaction between articulatory gestures. Examples include feature spreading, e.g. CAN’T /k ae n t/ → [ k ae_n t ], where the vowel /ae/ becomes nasalised due to the “deleted” segment /n/, and asynchronous articulatory gestures causing stop insertions, e.g. WARMTH /w ao m th/ → [ w ao m p th ]. The beads-on-a-string scheme was adequate for dictated speech recognition since the amplitude and timing of gestures are fairly consistent. But as speech becomes more conversational, relative timing effects become more significant (e.g. Vaxelaire et al., 2000), and this type of variability may not be sufficiently well modelled by the beads-on-a-string approach.
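
To make this alternative view concrete, the sketch below (Python) represents the CAN’T example as parallel feature tiers whose boundaries need not align. It is an illustration only: the tier inventory, feature values and frame indices are our own assumptions, not a representation used in this paper.

    # Illustrative only: a hypothetical multi-tier representation of CAN'T.
    # In the beads-on-a-string view, /k ae n t/ is one linear segment
    # sequence; here each articulatory feature is its own time series, and
    # tiers may change value asynchronously. Frame indices are arbitrary.
    tiers = {
        # Oral constriction gestures roughly follow the segment sequence.
        "constriction": [("velar-stop", 0, 5), ("open-vowel", 5, 20),
                         ("alveolar-closure", 20, 28), ("release", 28, 30)],
        # The velum opens before the alveolar closure for /n/, so nasality
        # "spreads" onto the vowel: [ k ae_n t ].
        "velum": [("closed", 0, 12), ("open", 12, 24), ("closed", 24, 30)],
        # Voicing switches off ahead of the final stop.
        "voicing": [("unvoiced", 0, 5), ("voiced", 5, 22), ("unvoiced", 22, 30)],
    }

    def values_at(frame):
        """Read the value of every tier at a given frame."""
        return {name: next(v for v, s, e in ivals if s <= frame < e)
                for name, ivals in tiers.items()}

    # Around frame 15 the vowel gesture is still active but the velum is
    # already open: the nasalised vowel arises from tier asynchrony, not
    # from deleting a /n/ "bead".
    print(values_at(15))

Under such a representation, the nasalised vowel and the inserted [p] in WARMTH are both re-timings of independent tiers rather than edits to a segment string.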

Attempts to better model relative timing effects are mostly implicit schemes, incorporated at the level of the acoustic model. One such approach introduces more flexible state-level parameter sharing schemes, perhaps incorporating more knowledge of phonology or measures of speaking rate and style (e.g. Hain & Woodland, 1999; Hain & Woodland, 2000; Ostendorf, 2000; Finke, Fritsch, Koll, & Waibel, 1999; Saraclar, 2000). A more speculative direction of research investigates schemes for extracting and modelling intermediate articulatory or phonetic representations of speech, which may be a simpler domain in which to model the phonological effects in conversational speech. Rather than model speech as a linear sequence of segments, it is represented as a structured arrangement of phonetic or articulatory features between which there may be some degree of variation in the relative timing of phonetic events. Thus, for example, when nasality from phoneme /n/ partially colours a neighbouring vowel /ae/, this is modelled by asynchrony in the feature changes. There is considerable work on extracting appropriate intermediate representations of speech (e.g. Richmond, 2001; Kirchhoff, 1999). Fewer papers consider schemes for incorporating these ideas within a statistical framework; exceptions include Kirchhoff (1999) and King, Stephenson, Isard, Taylor, and Strachan (1998). The latter problem is considered in this paper.

The problem of modelling asynchronous articulatory, phonological or acoustic feature streams is a problem of modelling multiple, loosely coupled time series. Section 2 reviews conventional speech models that have been applied to modelling loosely coupled time series, particularly with respect to the degree of asynchrony allowed. Section 3 outlines the theory of Factorial Hidden Markov Models (FHMMs), a more general family of models which is potentially applicable to this modelling problem. Two specific instances of the FHMM are then described: one from the machine learning literature (Saul & Jordan, 1999) and another designed to reflect the left-to-right nature of speech. Section 4 presents some experimental results; Section 5 summarises key results and discusses some open questions.

Section snippets

Existing models for asynchronous data

This section briefly surveys models of parallel time series data that have been investigated in a speech recognition context. The survey is not intended as a general review of techniques for modelling stochastic processes, nor is it a survey of techniques for incorporating phonological or articulatory information into speech models. The discussion focuses only on model assumptions and not the issues that must be addressed when incorporating such models into large vocabulary speech recognition

Factorial HMMs

All of the approaches above attempt to extend existing conventional HMMs to allow modeling of asynchrony. This section discusses a family of models that also allows varying degrees of coupling between the different time series. With the exception of the State-Coupled Model, all the models discussed above may be considered special cases.

As discussed above, combining the $K$ observations at each time $t$ into a single observation vector $O_t = (o_t^1, \ldots, o_t^K)$ and then modelling the combined observation
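
To see what factoring buys, consider the mixed-memory parameterisation of Saul and Jordan (1999) used by the MM-FHMM, in which each chain’s next state depends on all chains’ previous states through a convex combination of cross-chain transition matrices: $P(s_t^k = j \mid s_{t-1}) = \sum_l \psi^k(l)\, a^{k,l}(j \mid s_{t-1}^l)$. The Python sketch below contrasts its parameter count with a full product-state transition matrix; the chain and state counts and all variable names are our own choices, and this is a sketch of the factorisation rather than of the paper’s implementation.

    import numpy as np
    from itertools import product

    # K loosely coupled chains, each with N states (sizes are assumptions).
    K, N = 3, 4
    rng = np.random.default_rng(0)

    def random_stochastic(shape):
        """Random array whose last axis sums to one."""
        m = rng.random(shape)
        return m / m.sum(axis=-1, keepdims=True)

    # Unfactored product-state HMM: one transition matrix over N**K states,
    # i.e. N**K * (N**K - 1) = 4032 free transition parameters here.
    A_full = random_stochastic((N**K, N**K))

    # Mixed-memory FHMM: mixing weights psi[k, l] over source chains plus
    # K*K cross-chain N x N matrices a[k, l]; K*K*N*(N-1) + K*(K-1) = 114
    # free parameters here, quadratic rather than exponential in K.
    psi = random_stochastic((K, K))
    a = random_stochastic((K, K, N, N))

    def transition_prob(prev, cur):
        """P(cur | prev) for product-state tuples under the factorisation
        P(s_t^k = j | s_{t-1}) = sum_l psi[k, l] * a[k, l, prev[l], j]."""
        p = 1.0
        for k in range(K):
            p *= sum(psi[k, l] * a[k, l, prev[l], cur[k]] for l in range(K))
        return p

    # The factored transition probabilities still normalise over successors:
    total = sum(transition_prob((0, 1, 2), cur)
                for cur in product(range(N), repeat=K))
    print(round(total, 6))  # 1.0

The factored count grows quadratically rather than exponentially in the number of chains, which is the sense in which such schemes reduce parameters relative to modelling the full product state space directly.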

Experimental evaluation

Two issues are addressed in this experimental study:

  • Comparison of MM-FHMM and PT-FHMM parameter reduction schemes on a classification task;

  • Comparison of PT-FHMM with more conventional speech models on a small vocabulary recognition task.

The representation of speech used in the classification and recognition tasks is cepstra derived from frequency subbands (e.g. Mirghafori, 1999; Tomlinson et al., 1997; McMahon, McCourt, & Vaseghi, 1998), rather than a more speculative articulatory or
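
For readers unfamiliar with subband front ends, the sketch below shows one common way of deriving cepstra from frequency subbands: split a log filterbank output into contiguous bands and apply a DCT to each band independently, giving one cepstral vector per band (cf. Tomlinson et al., 1997). The filterbank construction, band split and dimensionalities are assumptions for illustration, not the configuration used in these experiments.

    import numpy as np

    def subband_cepstra(frame, n_filters=20, n_bands=2, n_ceps=6):
        """Cepstra computed independently per frequency subband."""
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        # Crude filterbank stand-in: mean power in equal-width bins
        # (a real front end would use e.g. mel-spaced triangular filters).
        edges = np.linspace(0, len(spectrum), n_filters + 1, dtype=int)
        log_fbank = np.log(1e-10 + np.array(
            [spectrum[a:b].mean() for a, b in zip(edges[:-1], edges[1:])]))
        ceps = []
        for band in np.array_split(log_fbank, n_bands):
            m = len(band)
            # Explicit type-II DCT basis keeps the sketch dependency-free.
            n = np.arange(m)[:, None]
            basis = np.cos(np.pi / m * (n + 0.5) * np.arange(n_ceps)[None, :])
            ceps.append(band @ basis)
        return ceps  # one n_ceps-dimensional vector per subband

    frame = np.random.randn(200)          # one 25 ms frame at 8 kHz
    print([c.shape for c in subband_cepstra(frame)])  # [(6,), (6,)]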

Discussion

This paper has discussed and empirically compared one existing and one novel parameter reduction scheme for loosely coupled HMMs, a general class of models which are potentially appropriate for modelling loosely coupled time series data such as articulatory or phonological representations of speech. The new PT-FHMM was shown to give performance comparable to the existing MM-FHMM on the ISOLET task; it was then shown that the PT-FHMM scales to continuous digit recognition, giving performance

Acknowledgements

The idea for the parameter-tied factorial HMM arose during a discussion with Dr. Mark Gales; the authors would also like to thank Professor Steve Young, Dr. Martin Russell, members of the SSLI Lab at the University of Washington and two anonymous reviewers for their assistance. This work was supported by DARPA Grant No. N660019928924.

References (56)

  • Dempster, A.P., et al., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society.
  • Deng, L., et al., 1992. Structural design of a hidden Markov model based speech recognizer using multi-valued phonetic features: comparison with segmental speech units. Journal of the Acoustical Society of America.
  • Finke, M., Fritsch, J., Koll, D., Waibel, A., 1999. Modeling and efficient decoding of large vocabulary conversational...
  • Finke, M., Waibel, A., 1997. Speaking mode dependent pronunciation modeling in large vocabulary conversational speech...
  • Gales, M.J.F., 1995. Model-based techniques for noise robust speech recognition. PhD Thesis, Cambridge University...
  • Ghahramani, Z., et al., 1997. Factorial hidden Markov models. Machine Learning.
  • Gillick, L., Cox, S.J., 1989. Some statistical issues in the comparison of speech recognition algorithms. In:...
  • Hain, T., Woodland, P.C., 1999. Dynamic HMM selection for continuous speech recognition. In: Proceedings of Eurospeech,...
  • Hain, T., Woodland, P.C., 2000. Modelling sub-phone insertions and deletions in continuous speech recognition. In:...
  • Hain, T., Woodland, P.C., Evermann, G., Povey, D., 2000. The CU-HTK march 2000 Hub5e transcription system. In:...
  • Hermansky, H., Tibrewala, S., Pavel, M., 1996. Towards ASR on partially corrupted speech. In: Proceedings of ICSLP, pp....
  • Huckvale, M.A., 1994. Word recognition from tiered phonological models. In: Proceedings of Institute of Acoustics...
  • Humphries, J.J., Woodland, P.C., 1997. Using accent-specific pronunciation modelling for improved large vocabulary...
  • Kapadia, S., 1998. Discriminative training of hidden Markov models. PhD Thesis, Cambridge University Engineering Dept.,...
  • Keating, P., 1997. Word-level phonetic variation in large speech corpora. In: Pompino-Marschal, B. (Ed.), ZAS Working...
  • King, S., Stephenson, T., Isard, S., Taylor, P., Strachan, A., 1998. Speech recognition via phonetically featured...
  • Kingsbury, P., Strassel, S., McLemore, C., 1997. COMLEX pronouncing lexicon (renamed in 1997 release as CALLHOME...
  • Kirchhoff, K., 1999. Robust speech recognition using articulatory information. PhD Thesis, University of Bielefeld,...