Parameter reduction schemes for loosely coupled HMMs
Introduction
Hidden Markov Model (HMM)-based automatic speech recognition (ASR) has been successfully applied to dictated speech tasks, but the approach is less successful when confronted with more conversational speech (Weintraub, Stolcke, & Sankar, 1995). The experiment in Saraclar, Nock, and Khudanpur (2000) suggests that at least part of the problem may be related to the increased pronunciation variability in conversational speech. It is often hypothesised that the lack of robustness to pronunciation variability is related to the construction of word models by concatenating sequences of phoneme models as specified by the pronunciation dictionary.
One hypothesised source of difficulty is the limited nature of existing pronunciation dictionaries: typical pronunciation dictionaries contain only a few pronunciations for each word. For example, both the LIMSI (Lamel & Adda, 1996; Hain, Woodland, Evermann, & Povey, 2000) and Pronlex (Kingsbury, Strassel, & McLemore, 1997) recognition dictionaries have a single pronunciation for over 90% of the words, and fewer than 1% of words have more than two pronunciations. Most recognition dictionaries contain even fewer pronunciations per word. The source of dictionary pronunciations varies, but is rarely conversational speech. Given such a pronunciation dictionary, it is assumed that the subword acoustic modelling scheme represents all remaining pronunciation variability. Context conditioning does model the influence of context on the realisation of sounds, and mixtures of Gaussians as HMM output distributions can capture variability in segment realisations. However, it can be argued that neither technique is an efficient model of these types of pronunciation change. Whilst reliance on the acoustic models to characterise pronunciation variability proved sufficient for dictated speech, it may not be adequate for capturing the increased range of pronunciations in conversational speech (Keating, 1997; Ostendorf, 2000; Weintraub et al., 1996). Firstly, inappropriate or inadequate pronunciations can lead to recognition errors. Secondly, broad variance models (resulting from a training scheme in which each subword model is potentially trained on data from other subword classes) can increase error rates due to increased acoustic confusability and also tend to increase decoding costs.
Inadequacy of the dictionary motivates explicit pronunciation modelling schemes: these augment the dictionary with one or more pronunciations per word which are more representative of the target style or accent (e.g. Humphries & Woodland, 1997; Riley, 1991). Whilst there is evidence to support research into explicit pronunciation modelling (Saraclar, 2000; Saraclar et al., 2000), the gains achieved in practice have been less than spectacular. Use of an expanded recognition-time dictionary yields improvements of 1–2% absolute on Switchboard (e.g. Byrne et al., 1998; Finke & Waibel, 1997); use of an expanded dictionary during acoustic model training as well as in recognition has not resulted in performance gains (Saraclar et al., 2000; Saraclar, 2000). Difficulties arise from lexical confusability – many new word pronunciations overlap with those of other words, increasing the difficulty of mapping back to word strings from phone sequences – and because pronunciation change often occurs at levels below the segment, rather than simply complete changes of phoneme identity.
The latter observation brings us to a second weakness of current designs: the assumption that speech can be segmented into a linear sequence of (usually phone-like) segments, which is sometimes referred to as the “beads-on-a-string” model. Speech scientists, linguists and engineers agree that the notion of a speech segment is not a realistic one (e.g. Huckvale, 1994; Deng & Erler, 1992; King & Taylor, 2000). Speech is produced by loosely coupled articulators, and speech production studies show that the relative amplitude and phase of these gestures vary with changes in speaking rate, manner and style (e.g. Vaxelaire, Sock, & Perrier, 2000). The changes in relative timing can have extreme effects on the resulting acoustic signal: it often appears that there has been colouring and merging of the underlying ‘segments’, or even ‘segment-like’ insertions due to interaction between articulatory gestures. Examples include feature spreading, e.g. CAN’T /k ae n t/ → [ k ae_n t ], where vowel /ae/ becomes nasalised due to the “deleted” segment /n/, and asynchronous articulatory gestures causing stop insertions, e.g. WARMTH /w ao m th/ → [ w ao m p th ]. The beads-on-a-string scheme was adequate for dictated speech recognition since the amplitude and timing of gestures are fairly consistent in that style. But as speech becomes more conversational, relative timing effects become more significant (e.g. Vaxelaire et al., 2000), and this type of variability may not be sufficiently well modelled by the beads-on-a-string approach.
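The feature-spreading example above can be made concrete with a toy sketch (the frame spans and tier names here are invented for illustration, not taken from the paper): CAN’T is represented as two asynchronous tiers, a phone tier in which the /n/ segment has been deleted and a nasality tier whose boundaries do not line up with the phone boundaries, so the vowel surfaces as nasalised [ae_n].

```python
# Toy illustration (invented frame spans): CAN'T as two asynchronous
# feature tiers instead of a single segment string.  The phone tier has
# no separate /n/ segment; the nasality tier spreads into the vowel.
phone_tier = [("k", 0, 3), ("ae", 3, 10), ("t", 10, 13)]          # (value, start, end)
nasal_tier = [("-nasal", 0, 6), ("+nasal", 6, 11), ("-nasal", 11, 13)]

def to_frames(tier, n_frames):
    """Expand (value, start, end) spans into one value per frame."""
    frames = [None] * n_frames
    for value, start, end in tier:
        for t in range(start, end):
            frames[t] = value
    return frames

n_frames = 13
surface = list(zip(to_frames(phone_tier, n_frames),
                   to_frames(nasal_tier, n_frames)))

# Frames 6-9 pair "ae" with "+nasal": the nasalised vowel of CAN'T,
# even though the phone tier never contains an /n/ segment.  Nasality
# also overlaps the following stop (frame 10), showing tier asynchrony.
```

A beads-on-a-string model would have to encode the same surface form as a wholesale change of segment identities; the tier view localises the change to one asynchronous boundary.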
Attempts to better model relative timing effects are mostly implicit schemes, incorporated at the level of the acoustic model. One such approach introduces more flexible state-level parameter sharing schemes, perhaps incorporating more knowledge of phonology or measures of speaking rate and style (e.g. Hain & Woodland, 1999; Hain & Woodland, 2000; Ostendorf, 2000; Finke, Fritsch, Koll, & Waibel, 1999; Saraclar, 2000). A more speculative direction of research investigates schemes for extracting and modelling intermediate articulatory or phonetic representations of speech, which may be a simpler domain in which to model the phonological effects in conversational speech. Rather than modelling speech as a linear sequence of segments, it is represented as a structured arrangement of phonetic or articulatory features between which there may be some degree of variation in the relative timing of phonetic events. Thus, for example, when nasality from phoneme /n/ partially colours a neighbouring vowel /ae/, this is modelled by asynchrony in the feature changes. There is considerable work on extracting appropriate intermediate representations of speech (e.g. Richmond, 2001; Kirchhoff, 1999). Fewer papers consider schemes for incorporating these ideas within a statistical framework; exceptions include Kirchhoff (1999) and King, Stephenson, Isard, Taylor, and Strachan (1998). The latter problem is considered in this paper.
The problem of modelling asynchronous articulatory, phonological or acoustic feature streams is a problem of modelling multiple, loosely coupled time series. Section 2 reviews conventional speech models that have been applied to modelling loosely coupled time series, particularly with respect to the degree of asynchrony allowed. Section 3 outlines the theory of Factorial Hidden Markov Models (FHMMs), a more general family of models which is potentially applicable to this modelling problem. Two specific instances of the FHMM are then described: one from the machine learning literature (Saul & Jordan, 1999) and another designed to reflect the left-to-right nature of speech. Section 4 presents some experimental results; Section 5 summarises key results and discusses some open questions.
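As background for the FHMM material outlined above, a minimal numerical sketch of the fully uncoupled case (all parameter values are illustrative; the paper's MM-FHMM and PT-FHMM impose different couplings and parameter ties): an FHMM with K = 2 independent left-to-right chains is equivalent to a single HMM over the product state space, whose transition matrix is the Kronecker product of the per-chain matrices.

```python
import numpy as np

# Two independent 2-state left-to-right chains (illustrative parameters).
A = [np.array([[0.9, 0.1],
               [0.0, 1.0]]),
     np.array([[0.8, 0.2],
               [0.0, 1.0]])]
pi = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]

# With no coupling, the equivalent product HMM over the 4 joint states
# (i, j) has Kronecker-product transitions and initial probabilities.
A_prod = np.kron(A[0], A[1])        # shape (4, 4)
pi_prod = np.kron(pi[0], pi[1])

# One illustrative 1-D Gaussian output per joint state.
means = np.array([0.0, 1.0, 2.0, 3.0])
var = 0.5

def emission(x):
    """Gaussian density of scalar x under each joint state's output model."""
    return np.exp(-(x - means) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def forward_loglik(obs):
    """Scaled forward algorithm on the product HMM."""
    alpha = pi_prod * emission(obs[0])
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for x in obs[1:]:
        alpha = (alpha @ A_prod) * emission(x)
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik
```

Coupled FHMM variants replace `A_prod` with a constrained joint transition matrix; the parameter reduction schemes compared in this paper differ in how that joint matrix and the output distributions are tied.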
Section snippets
Existing models for asynchronous data
This section briefly surveys models of parallel time series data that have been investigated in a speech recognition context. The survey is not intended as a general review of techniques for modelling stochastic processes, nor is it a survey of techniques for incorporating phonological or articulatory information into speech models. The discussion focuses only on model assumptions and not the issues that must be addressed when incorporating such models into large vocabulary speech recognition
Factorial HMMs
All of the approaches above attempt to extend existing conventional HMMs to allow modelling of asynchrony. This section discusses a family of models that also allows varying degrees of coupling between the different time series. With the exception of the State-Coupled Model, all the models discussed above may be considered special cases of this family.
As discussed above, combining the K observations at each time t into a single observation vector and then modelling the combined observation
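To make the parameter-reduction motivation concrete, a toy count of transition parameters (the state and chain counts are invented for illustration): an unconstrained joint chain over the N^K product states needs on the order of N^(2K) transition parameters, while K independent N-state chains need only K·N·(N−1).

```python
# Transition-parameter counts: fully coupled joint chain vs. K factored
# chains.  Each row of a transition matrix over S states carries S - 1
# free parameters (the last entry is fixed by normalisation).
def full_coupling_params(n_states, n_chains):
    joint = n_states ** n_chains          # size of the product state space
    return joint * (joint - 1)

def factorial_params(n_states, n_chains):
    return n_chains * n_states * (n_states - 1)

for n_states, n_chains in [(3, 2), (3, 4), (5, 3)]:
    print(n_states, n_chains,
          full_coupling_params(n_states, n_chains),
          factorial_params(n_states, n_chains))
```

The gap widens rapidly with K, which is why intermediate schemes that tie or constrain the joint parameters are attractive.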
Experimental evaluation
Two issues are addressed in this experimental study:
- Comparison of MM-FHMM and PT-FHMM parameter reduction schemes on a classification task;
- Comparison of PT-FHMM with more conventional speech models on a small vocabulary recognition task.
The representation of speech used in the classification and recognition tasks is cepstra derived from frequency subbands (e.g. Mirghafori, 1999; Tomlinson et al., 1997; McMahon, McCourt, & Vaseghi, 1998), rather than a more speculative articulatory or
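As a rough illustration of that subband representation (the band split, frame length and cepstral order here are placeholders, not the paper's actual front end), one can compute a separate cepstral stream per frequency band and feed one stream to each chain of the loosely coupled model:

```python
import numpy as np

def dct2(x):
    """Unnormalised DCT-II, used to turn a log spectrum into cepstra."""
    n = len(x)
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    return (x * np.cos(np.pi * k * (2 * t + 1) / (2 * n))).sum(axis=1)

def subband_cepstra(frame, n_bands=2, n_ceps=6):
    """One cepstral vector per frequency subband of a windowed frame.

    Band edges, cepstral order and windowing are illustrative choices.
    """
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    bands = np.array_split(np.log(spectrum + 1e-10), n_bands)
    return [dct2(band)[:n_ceps] for band in bands]

# Noise stands in for a real speech frame in this sketch.
streams = subband_cepstra(np.random.default_rng(0).normal(size=256))
```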
Discussion
This paper has discussed and empirically compared one existing and one novel parameter reduction scheme for loosely coupled HMMs, a general class of models which are potentially appropriate for modelling loosely coupled time series data such as articulatory or phonological representations of speech. The new PT-FHMM was shown to give performance comparable to the existing MM-FHMM on the ISOLET task; it was then shown that the PT-FHMM scales to continuous digit recognition, giving performance
Acknowledgements
The idea for the parameter-tied factorial HMM arose during a discussion with Dr. Mark Gales; the authors would also like to thank Professor Steve Young, Dr. Martin Russell, members of the SSLI Lab at the University of Washington and two anonymous reviewers for their assistance. This work was supported by DARPA Grant No. N660019928924.
References (56)
- King, S., & Taylor, P. (2000). Detection of phonological features in continuous speech using neural networks. Computer Speech and Language.
- Nock, H. J., & Young, S. J. (2002). Modelling asynchrony in automatic speech recognition using loosely coupled hidden Markov models. Cognitive Science.
- Ostendorf, M., & Singer, H. (1997). HMM topology design using maximum likelihood successive state splitting. Computer Speech and Language.
- Saraclar, M., Nock, H., & Khudanpur, S. (2000). Pronunciation modeling by sharing Gaussian densities across phonetic models. Computer Speech and Language.
- Baum, L. E., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics.
- Bourlard, H., Dupont, S., & Ris, C. (1996). Multi-stream speech recognition. Technical Report IDIAP-RR 96-07, ...
- Brand, M., Oliver, N., & Pentland, A. (1997). Coupled hidden Markov models for complex action recognition. In: Proceedings ...
- Byrne, W., Finke, M., Khudanpur, S., McDonough, J., Nock, H., Riley, M., Saraclar, M., Wooters, C., Zavaliagkos, G., ...
- Cole, R., Muthusamy, Y., & Fanty, M. (1990). The ISOLET spoken letter database. Technical Report CSE 90-004, ...
- Daoudi, K., Fohr, D., & Antoine, C. (2000). A new approach for multi-band speech recognition based on probabilistic ...
- Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society.
- Deng, L., & Erler, K. (1992). Structural design of a hidden Markov model based speech recognizer using multi-valued phonetic features: comparison with segmental speech units. Journal of the Acoustical Society of America.
- Ghahramani, Z., & Jordan, M. I. (1997). Factorial hidden Markov models. Machine Learning.
Cited by (8)
- Articulatory feature-based pronunciation modeling. Computer Speech and Language, 2016. Excerpt: “Some authors have explored DBN models for the task of recognition of asynchronous articulatory features (Wester et al., 2004a). Other work has explored the use of multiple asynchronous streams of variables other than sub-phonetic features, such as different streams of acoustic observations or acoustic observations with the addition of an auxiliary variable (Nock and Ostendorf, 2003; Stephenson et al., 2004; Zweig, 1998; Zhang et al., 2003). Finally, in linguistics and speech science there have been several efforts to formalize models of multiple asynchronous tiers (Huckvale, 1994; Wiebe, 1992) and a simulation of articulatory phonology itself has now been implemented in a toolkit (Nam et al., 2004).”
- Point process models for event-based speech recognition. Speech Communication, 2009.
- Exact or approximate inference in graphical models: why the choice is dictated by the treewidth, and how variable elimination can be exploited. Australian and New Zealand Journal of Statistics, 2019.
- Variational Inference for Coupled Hidden Markov Models Applied to the Joint Detection of Copy Number Variations. International Journal of Biostatistics, 2019.
- A multimedia English learning system using HMMs to improve phonemic awareness for English learning. Educational Technology and Society, 2009.