
Computer Speech & Language

Volume 17, Issues 2–3, April–July 2003, Pages 233–262

Parameter reduction schemes for loosely coupled HMMs

https://doi.org/10.1016/S0885-2308(03)00009-3

Abstract

While Hidden Markov Models (HMMs) have been successful in many speech recognition tasks, they are somewhat less successful on conversational speech, arguably due in part to the greater variation in the timing of articulatory events. Loosely coupled or Factorial HMMs (FHMMs) represent a family of models that have more flexibility for modelling such variation in speech, but there are trade-offs to be studied in terms of computation and potential added confusability. This paper investigates two specific instances – Mixed-Memory and Parameter-Tied FHMMs – both of which can be thought of as loosely coupled HMMs for modelling multiple time series. The Parameter-Tied FHMM, introduced here, has a potential advantage for speech modelling since it allows a left-to-right topology across the product state space. Experimental results on the ISOLET task show that both models are feasible for speech recognition; TI-DIGITS recognition results show that the Parameter-Tied FHMM is competitive with multiband models. State occupancy and pruning analyses show trends related to asynchrony that hold across the different models.

Introduction

Hidden Markov Model (HMM)-based automatic speech recognition (ASR) has been successfully applied to dictated speech tasks, but the approach is less successful when confronted with more conversational speech (Weintraub, Stolcke, & Sankar, 1995). The experiment in Saraclar, Nock, and Khudanpur (2000) suggests that at least part of the problem may be related to the increased pronunciation variability in conversational speech. It is often hypothesised that the lack of robustness to pronunciation variability is related to the construction of word models by concatenating sequences of phoneme models as specified by the pronunciation dictionary.

One hypothesised source of difficulty is the limited nature of existing pronunciation dictionaries: typical pronunciation dictionaries contain only a few pronunciations for each word. For example, both the LIMSI (Lamel & Adda, 1996; Hain, Woodland, Evermann, & Povey, 2000) and Pronlex (Kingsbury, Strassel, & McLemore, 1997) recognition dictionaries have a single pronunciation for over 90% of words, and fewer than 1% of words have more than two pronunciations. Most recognition dictionaries contain even fewer pronunciations per word. The source of dictionary pronunciations varies, but is rarely conversational speech. Given such a pronunciation dictionary, it is assumed that the subword acoustic modelling scheme represents all remaining pronunciation variability. Context conditioning does model the influence of context on the realisation of sounds, and mixture-of-Gaussian output distributions in HMMs can capture variability in segment realisations. However, it can be argued that neither technique is an efficient model of these types of pronunciation change. Whilst reliance on the acoustic models to characterise pronunciation variability proved sufficient for dictated speech, it may not be adequate for capturing the increased range of pronunciations in conversational speech (Keating, 1997; Ostendorf, 2000; Weintraub et al., 1996). Firstly, inappropriate or inadequate pronunciations can lead to recognition errors. Secondly, broad-variance models (resulting from a training scheme in which each subword model is potentially trained on data from other subword classes) can increase error rates due to increased acoustic confusability, and also tend to increase decoding costs.

Inadequacy of the dictionary motivates explicit pronunciation modelling schemes: these augment the dictionary with one or more pronunciations per word which are more representative of the target style or accent (e.g. Humphries & Woodland, 1997; Riley, 1991). Whilst there is evidence to support research into explicit pronunciation modelling (Saraclar, 2000; Saraclar et al., 2000), the gains achieved in practice have been less than spectacular. Use of an expanded recognition-time dictionary yields improvements of 1–2% absolute on Switchboard (e.g. Byrne et al., 1998; Finke & Waibel, 1997); use of an expanded dictionary during acoustic model training as well as in recognition has not resulted in performance gains (Saraclar et al., 2000; Saraclar, 2000). Difficulties arise from lexical confusability – many new word pronunciations overlap with those of other words, increasing the difficulty of mapping back to word strings from phone sequences – and because pronunciation change often occurs at levels below the segment, rather than simply complete changes of phoneme identity.

The latter observation brings us to a second weakness of current designs: the assumption that speech can be segmented into a linear sequence of (usually phone-like) segments, sometimes referred to as the “beads-on-a-string” model. Speech scientists, linguists and engineers agree that the notion of a speech segment is not a realistic one (e.g. Huckvale, 1994; Deng & Erler, 1992; King & Taylor, 2000). Speech is produced by loosely coupled articulators, and speech production studies show that the amplitude of and phase between these gestures vary with changes in speaking rate, manner and style (e.g. Vaxelaire, Sock, & Perrier, 2000). The changes in relative timing can have extreme effects on the resulting acoustic signal: it often appears that there has been colouring and merging of the underlying ‘segments’, or even ‘segment-like’ insertions, due to interaction between articulatory gestures. Examples include feature spreading, e.g. CAN’T /k ae n t/ → [ k ae_n t ], where the vowel /ae/ becomes nasalised due to the “deleted” segment /n/, and asynchronous articulatory gestures causing stop insertions, e.g. WARMTH /w ao m th/ → [ w ao m p th ]. The beads-on-a-string scheme was adequate for dictated speech recognition since the amplitude and timing of gestures are fairly consistent. But as speech becomes more conversational, relative timing effects become more significant (e.g. Vaxelaire et al., 2000), and this type of variability may not be sufficiently well modelled by the beads-on-a-string approach.
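
To make this alternative view concrete, the sketch below (Python) represents the CAN’T example as parallel feature tiers whose boundaries need not align. It is an illustration only: the tier inventory, feature values and frame indices are our own assumptions, not a representation used in this paper.

    # Illustrative only: a hypothetical multi-tier representation of CAN'T.
    # In the beads-on-a-string view, /k ae n t/ is one linear segment
    # sequence; here each articulatory feature is its own time series, and
    # tiers may change value asynchronously. Frame indices are arbitrary.
    tiers = {
        # Oral constriction gestures roughly follow the segment sequence.
        "constriction": [("velar-stop", 0, 5), ("open-vowel", 5, 20),
                         ("alveolar-closure", 20, 28), ("release", 28, 30)],
        # The velum opens before the alveolar closure for /n/, so nasality
        # "spreads" onto the vowel: [ k ae_n t ].
        "velum": [("closed", 0, 12), ("open", 12, 24), ("closed", 24, 30)],
        # Voicing switches off ahead of the final stop.
        "voicing": [("unvoiced", 0, 5), ("voiced", 5, 22), ("unvoiced", 22, 30)],
    }

    def values_at(frame):
        """Read the value of every tier at a given frame."""
        return {name: next(v for v, s, e in ivals if s <= frame < e)
                for name, ivals in tiers.items()}

    # Around frame 15 the vowel gesture is still active but the velum is
    # already open: the nasalised vowel arises from tier asynchrony, not
    # from deleting a /n/ "bead".
    print(values_at(15))

Under such a representation, the nasalised vowel and the inserted [p] in WARMTH are both re-timings of independent tiers rather than edits to a segment string.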

Attempts to better model relative timing effects are mostly implicit schemes, incorporated at the level of the acoustic model. One such approach introduces more flexible state-level parameter sharing schemes, perhaps incorporating more knowledge of phonology or measures of speaking rate and style (e.g. Hain & Woodland, 1999; Hain & Woodland, 2000; Ostendorf, 2000; Finke, Fritsch, Koll, & Waibel, 1999; Saraclar, 2000). A more speculative direction of research investigates schemes for extracting and modelling intermediate articulatory or phonetic representations of speech, which may be a simpler domain in which to model the phonological effects in conversational speech. Rather than model speech as a linear sequence of segments, it is represented as a structured arrangement of phonetic or articulatory features between which there may be some degree of variation in the relative timing of phonetic events. Thus, for example, when nasality from phoneme /n/ partially colours a neighbouring vowel /ae/, this is modelled by asynchrony in the feature changes. There is considerable work on extracting appropriate intermediate representations of speech (e.g. Richmond, 2001; Kirchhoff, 1999). Fewer papers consider schemes for incorporating these ideas within a statistical framework; exceptions include Kirchhoff (1999) and King, Stephenson, Isard, Taylor, and Strachan (1998). The latter problem is considered in this paper.

The problem of modelling asynchronous articulatory, phonological or acoustic feature streams is a problem of modelling multiple, loosely coupled time series. Section 2 reviews conventional speech models that have been applied to modelling loosely coupled time series, particularly with respect to the degree of asynchrony allowed. Section 3 outlines the theory of Factorial Hidden Markov Models (FHMMs), a more general family of models which is potentially applicable to this modelling problem. Two specific instances of the FHMM are then described: one from the machine learning literature (Saul & Jordan, 1999) and another designed to reflect the left-to-right nature of speech. Section 4 presents some experimental results; Section 5 summarises key results and discusses some open questions.

Section snippets

Existing models for asynchronous data

This section briefly surveys models of parallel time series data that have been investigated in a speech recognition context. The survey is not intended as a general review of techniques for modelling stochastic processes, nor is it a survey of techniques for incorporating phonological or articulatory information into speech models. The discussion focuses only on model assumptions and not the issues that must be addressed when incorporating such models into large vocabulary speech recognition

Factorial HMMs

All of the approaches above attempt to extend existing conventional HMMs to allow modeling of asynchrony. This section discusses a family of models that also allows varying degrees of coupling between the different time series. With the exception of the State-Coupled Model, all the models discussed above may be considered special cases.

As discussed above, combining the $K$ observations at each time $t$ into a single observation vector $O_t = (o_t^1, \ldots, o_t^K)$ and then modelling the combined observation
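
To see what factoring buys, consider the mixed-memory parameterisation of Saul and Jordan (1999) used by the MM-FHMM, in which each chain’s next state depends on all chains’ previous states through a convex combination of cross-chain transition matrices: $P(s_t^k = j \mid s_{t-1}) = \sum_l \psi^k(l)\, a^{k,l}(j \mid s_{t-1}^l)$. The Python sketch below contrasts its parameter count with a full product-state transition matrix; the chain and state counts and all variable names are our own choices, and this is a sketch of the factorisation rather than of the paper’s implementation.

    import numpy as np
    from itertools import product

    # K loosely coupled chains, each with N states (sizes are assumptions).
    K, N = 3, 4
    rng = np.random.default_rng(0)

    def random_stochastic(shape):
        """Random array whose last axis sums to one."""
        m = rng.random(shape)
        return m / m.sum(axis=-1, keepdims=True)

    # Unfactored product-state HMM: one transition matrix over N**K states,
    # i.e. N**K * (N**K - 1) = 4032 free transition parameters here.
    A_full = random_stochastic((N**K, N**K))

    # Mixed-memory FHMM: mixing weights psi[k, l] over source chains plus
    # K*K cross-chain N x N matrices a[k, l]; K*K*N*(N-1) + K*(K-1) = 114
    # free parameters here, quadratic rather than exponential in K.
    psi = random_stochastic((K, K))
    a = random_stochastic((K, K, N, N))

    def transition_prob(prev, cur):
        """P(cur | prev) for product-state tuples under the factorisation
        P(s_t^k = j | s_{t-1}) = sum_l psi[k, l] * a[k, l, prev[l], j]."""
        p = 1.0
        for k in range(K):
            p *= sum(psi[k, l] * a[k, l, prev[l], cur[k]] for l in range(K))
        return p

    # The factored transition probabilities still normalise over successors:
    total = sum(transition_prob((0, 1, 2), cur)
                for cur in product(range(N), repeat=K))
    print(round(total, 6))  # 1.0

The factored count grows quadratically rather than exponentially in the number of chains, which is the sense in which such schemes reduce parameters relative to modelling the full product state space directly.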

Experimental evaluation

Two issues are addressed in this experimental study:

  • Comparison of MM-FHMM and PT-FHMM parameter reduction schemes on a classification task;

  • Comparison of PT-FHMM with more conventional speech models on a small vocabulary recognition task.

The representation of speech used in the classification and recognition tasks is cepstra derived from frequency subbands (e.g. Mirghafori, 1999; Tomlinson et al., 1997; McMahon, McCourt, & Vaseghi, 1998), rather than a more speculative articulatory or
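
For readers unfamiliar with subband front ends, the sketch below shows one common way of deriving cepstra from frequency subbands: split a log filterbank output into contiguous bands and apply a DCT to each band independently, giving one cepstral vector per band (cf. Tomlinson et al., 1997). The filterbank construction, band split and dimensionalities are assumptions for illustration, not the configuration used in these experiments.

    import numpy as np

    def subband_cepstra(frame, n_filters=20, n_bands=2, n_ceps=6):
        """Cepstra computed independently per frequency subband."""
        spectrum = np.abs(np.fft.rfft(frame)) ** 2
        # Crude filterbank stand-in: mean power in equal-width bins
        # (a real front end would use e.g. mel-spaced triangular filters).
        edges = np.linspace(0, len(spectrum), n_filters + 1, dtype=int)
        log_fbank = np.log(1e-10 + np.array(
            [spectrum[a:b].mean() for a, b in zip(edges[:-1], edges[1:])]))
        ceps = []
        for band in np.array_split(log_fbank, n_bands):
            m = len(band)
            # Explicit type-II DCT basis keeps the sketch dependency-free.
            n = np.arange(m)[:, None]
            basis = np.cos(np.pi / m * (n + 0.5) * np.arange(n_ceps)[None, :])
            ceps.append(band @ basis)
        return ceps  # one n_ceps-dimensional vector per subband

    frame = np.random.randn(200)          # one 25 ms frame at 8 kHz
    print([c.shape for c in subband_cepstra(frame)])  # [(6,), (6,)]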

Discussion

This paper has discussed and empirically compared one existing and one novel parameter reduction scheme for loosely coupled HMMs, a general class of models which are potentially appropriate for modelling loosely coupled time series data such as articulatory or phonological representations of speech. The new PT-FHMM was shown to give performance comparable to the existing MM-FHMM on the ISOLET task; it was then shown that the PT-FHMM scales to continuous digit recognition, giving performance

Acknowledgements

The idea for the parameter-tied factorial HMM arose during a discussion with Dr. Mark Gales; the authors would also like to thank Professor Steve Young, Dr. Martin Russell, members of the SSLI Lab at the University of Washington and two anonymous reviewers for their assistance. This work was supported by DARPA Grant No. N660019928924.

References (56)

  • Dempster, A.P., et al., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society.
  • Deng, L., et al., 1992. Structural design of a hidden Markov model based speech recognizer using multi-valued phonetic features: comparison with segmental speech units. Journal of the Acoustical Society of America.
  • Finke, M., Fritsch, J., Koll, D., Waibel, A., 1999. Modeling and efficient decoding of large vocabulary conversational...
  • Finke, M., Waibel, A., 1997. Speaking mode dependent pronunciation modeling in large vocabulary conversational speech...
  • Gales, M.J.F., 1995. Model-based techniques for noise robust speech recognition. PhD Thesis, Cambridge University...
  • Ghahramani, Z., et al., 1997. Factorial hidden Markov models. Machine Learning.
  • Gillick, L., Cox, S.J., 1989. Some statistical issues in the comparison of speech recognition algorithms. In:...
  • Hain, T., Woodland, P.C., 1999. Dynamic HMM selection for continuous speech recognition. In: Proceedings of Eurospeech,...
  • Hain, T., Woodland, P.C., 2000. Modelling sub-phone insertions and deletions in continuous speech recognition. In:...
  • Hain, T., Woodland, P.C., Evermann, G., Povey, D., 2000. The CU-HTK march 2000 Hub5e transcription system. In:...
  • Hermansky, H., Tibrewala, S., Pavel, M., 1996. Towards ASR on partially corrupted speech. In: Proceedings of ICSLP, pp....
  • Huckvale, M.A., 1994. Word recognition from tiered phonological models. In: Proceedings of Institute of Acoustics...
  • Humphries, J.J., Woodland, P.C., 1997. Using accent-specific pronunciation modelling for improved large vocabulary...
  • Kapadia, S., 1998. Discriminative training of hidden Markov models. PhD Thesis, Cambridge University Engineering Dept.,...
  • Keating, P., 1997. Word-level phonetic variation in large speech corpora. In: Pompino-Marschal, B. (Ed.), ZAS Working...
  • King, S., Stephenson, T., Isard, S., Taylor, P., Strachan, A., 1998. Speech recognition via phonetically featured...
  • Kingsbury, P., Strassel, S., McLemore, C., 1997. COMLEX pronouncing lexicon (renamed in 1997 release as CALLHOME...
  • Kirchhoff, K., 1999. Robust speech recognition using articulatory information. PhD Thesis, University of Bielefeld,...