
Speech Communication

Volume 27, Issue 1, February 1999, Pages 19-42

Parametric subspace modeling of speech transitions

https://doi.org/10.1016/S0167-6393(98)00067-3

Abstract

This paper describes an attempt at capturing segmental transition information for speech recognition tasks. The slowly varying dynamics of spectral trajectories carries much discriminant information that is only very crudely modelled by traditional approaches such as HMMs. In approaches such as recurrent neural networks there is the hope, but not the convincing demonstration, that such transitional information could be captured. The method presented here starts from the very different position of explicitly capturing the trajectory of short-time spectral parameter vectors on a subspace in which the temporal sequence information is preserved. This was approached by introducing a temporal constraint into the well-known technique of Principal Component Analysis (PCA). On this subspace, the trajectory was modelled parametrically and a distance metric was computed to perform classification of diphones. Using the Principal Curves method of Hastie and Stuetzle and the Generative Topographic Map (GTM) technique of Bishop, Svensen and Williams, a description of the temporal evolution in terms of latent variables was obtained. On the difficult problem of /bee/, /dee/, /gee/ it was possible to retain discriminatory information with a small number of parameters. Experimental illustrations present results on the ISOLET and TIMIT databases.


Introduction

One important feature of the complex temporal structure of speech signals is the systematic variation in the realisation of phones in different acoustic contexts. Smooth, continuous movement of the articulators towards and away from notional target positions produces an acoustic signal that is rich in structure. It encodes not just the underlying linguistic content to be conveyed, but also much more information relating to the context in which it is spoken. In fluent speech the target positions towards which the articulators move may often not be realised, because the movement towards the subsequent position may already have begun in parts of the system. In the corresponding acoustic signal it is then hard to isolate steady-state regions that can be uniquely identified with phones. The discriminatory information enabling the decoding process is not localised in the steady states, but is likely to be smoothly distributed over the sequence of transitions in the signal.

The popular hidden Markov model (HMM) of speech signals, in its simplest form, approximates the signal as a sequence of statistically stationary regions. Once a segmentation is assigned, either at some stage of the iterative training process or at the Viterbi alignment in the test stage, the probabilistic score of the model is insensitive to the temporal ordering of the acoustic vectors assigned to a particular state. This is clearly a poor approximation to the dynamics of the vocal tract. The hard segmentation imposed by a finite number of states is also a poor model of the complex generation process.
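This order-insensitivity is easy to check numerically. A minimal sketch (Python; a single state with an identity-covariance Gaussian output density is assumed, and all names are illustrative) shows that the per-state log-likelihood is a sum over frames and hence invariant to their ordering:

    # Check: an i.i.d. per-state score ignores frame order.
    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(0)
    frames = rng.normal(size=(10, 13))        # 10 frames of 13-dim cepstra
    state = multivariate_normal(mean=np.zeros(13), cov=np.eye(13))

    ll_ordered = state.logpdf(frames).sum()         # frames in temporal order
    ll_reversed = state.logpdf(frames[::-1]).sum()  # time order reversed
    assert np.isclose(ll_ordered, ll_reversed)      # identical score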

Such weaknesses of the HMM approach have long been recognised. Techniques to deal with these include building models for context sensitive phones, and expanding the feature vector to include derivatives of spectral parameters.

These refinements have resulted in highly successful speech recognition systems (Young et al., 1994, Young et al., 1995) that can produce impressive accuracies on very large tasks. However, both specialised models for dealing with context and dimensionality expansion to capture ordering result in an explosion in the number of parameters. Robust estimation of a very large number of parameters then becomes the challenging task, requiring techniques such as tied mixtures.

Approaches such as neural networks (Robinson et al., 1996; Bourlard and Morgan, 1994) attempt to optimise a non-linear discriminative function that assigns a phone class membership probability to each spectral frame. Here too the slowly varying temporal dynamics are essentially ignored. Techniques to capture some of the dynamics include the use of a moving window (TDNN; Waibel et al., 1989) and the recurrent neural network with state feedback (Robinson and Fallside, 1991).

In the speech research community it is well known that an adequate description of the temporal evolution of speech parameters is essential for robust and efficient synthesis and recognition. Work emphasising the importance of spectral transitions includes that of Ahlbom et al. (1987), who showed, using resynthesis experiments, that segmental transitions may be used in reconstructing speech with minimal coarticulatory effect. Marteau et al. (1988) showed that dynamic information is of great importance for the recognition of high-speed or heavily coarticulated transitions, where it is difficult to detect any targets. They had already suggested diphone-like segments with trajectory concepts.

Other attempts to model slow time variations in spectral parameters include the Temporal Decomposition of Atal (1983) where a sequence of target vectors and corresponding interpolation functions are used. This model has also been studied by Marcus and van Lieshout (1984) and Niranjan and Fallside (1987). The primary motivation in this model is to attempt to capture how the movement of articulators is reflected as slow changes in the short time spectrum. As mentioned earlier, HMM based recognisers cope with such dynamics by appending derivative information to the feature vector. Our motivation in this work is to go beyond this simplistic method and look for explicit modeling of the trajectory in the feature space.
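For reference, the derivative (delta) information mentioned above is conventionally obtained by a linear regression over a short window of frames. A minimal sketch of that standard computation (Python; the window length and function name are illustrative):

    import numpy as np

    def delta(features, window=2):
        # Regression-based delta coefficients over a (T, D) frame array.
        T = len(features)
        denom = 2 * sum(th * th for th in range(1, window + 1))
        padded = np.pad(features, ((window, window), (0, 0)), mode='edge')
        out = np.zeros_like(features, dtype=float)
        for th in range(1, window + 1):
            # difference between frames th steps ahead and th steps behind
            out += th * (padded[window + th:window + th + T]
                         - padded[window - th:window - th + T])
        return out / denom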

Recently there has been much interest in the use of segmental models (Ostendorf et al., 1996; Afify et al., 1995). These attempt to model the time variation of a particular feature within a segment, and most approaches use phones as segments. Stochastic trajectory models have been used for modeling phone-based speech units as clusters of trajectories in parameter space. The trajectories are modeled by mixtures of state sequences of multivariate Gaussian density functions to explain inter-frame dependencies within a segment. Similar methods for phone segments have been reported to be successful. Afify et al. (1994), Afify et al. (1995) and Gong et al. (Gong and Haton, 1994; Gong et al., 1996) focused on trajectories which are sampled at n points within a segment and represented by a mean and covariance vector at each point. Fukada et al. (1997) represented the mean and covariance matrix by a polynomial fit within a segment. All of them found the mean and covariance matrix by applying a k-means algorithm in the representation space. In contrast, Gish and Ng (1996), Goldenthal (1994) and Holmes and Russell (1997) modelled each feature dimension directly, using additional delta coefficients. Gish and Ng modelled the mean vectors within a segment as a quadratic function, with only a limited covariance variation available per segment. Holmes and Russell modelled the trajectories by slope and mean, forming a linear model within a segment, with only a Gaussian-mixture-specific covariance matrix to represent the segmental variance. Goldenthal, aware of the statistical dependencies between coefficients, used the error component to enhance recognition results. Deng et al. (1994) showed that the stationary-state assumption appears reasonable when a state represents a short segment of a sonorant or fricative sound; in continuously spoken sentences, however, even vowels contain virtually no stationary portions (Zue, 1991). They showed the importance of transitional acoustic trajectories for word segments, reporting superior results over traditional HMMs on a limited task of recognising 36 CVC words. A dynamical system segment model was proposed by Digalakis et al. (Digalakis, 1992; Digalakis et al., 1991, 1993), which resulted in significant improvements over the independent-frame model for phone recognition.
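As a concrete instance of the polynomial trajectory idea, the sketch below fits a quadratic mean trajectory to a segment and summarises the scatter about it with a residual covariance, in the spirit of Gish and Ng (1996); it is an illustration under stated assumptions, not their exact estimator.

    import numpy as np

    def fit_quadratic_trajectory(segment):
        # segment: (T, D) array of feature vectors for one segment.
        T = segment.shape[0]
        t = np.linspace(0.0, 1.0, T)                  # normalised time
        X = np.stack([np.ones(T), t, t * t], axis=1)  # (T, 3) design matrix
        coef, *_ = np.linalg.lstsq(X, segment, rcond=None)  # (3, D) fit
        resid = segment - X @ coef                    # scatter about the fit
        cov = resid.T @ resid / T                     # residual covariance
        return coef, cov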

Although all of these approaches try to circumvent the frame independence assumption within a segment, and report improved results in comparison to frame-independent models, the correlation between segments is still treated under a statistical independence assumption. This does not hold, in particular, for phones as segments, where the acoustic transitions are located at the segment boundaries rather than at the segment centers. The spectral trajectory of, say, the vowel [i:] is quite different in the CV syllable /bee/ from that in the syllable /gee/. Clearly, a model for the phoneme [i:] derived from occurrences of [i:] in all contexts would be noisy due to co-articulation. This work focuses on diphones as units of speech carrying transitional information between acoustic targets. The motivation is partly due to the work of Ghitza and Sondhi (1993), who also used diphones to represent non-stationary acoustic information. They used diphone units as states in a hidden Markov model framework to circumvent the independent and identically distributed assumption for successive observations within a state. Furthermore, diphones as units of concatenation have been very effective in producing synthetic speech (Salza et al., 1996).

In a parametric space (e.g., cepstral space) a speech signal can be represented as a point which moves as the articulatory configuration changes. The sequence of moving points is called a trajectory of speech. The problem of acoustic modeling of speech is addressed here at the diphone level. The model is motivated by the following ideas:

  • 1.

    Context affects the trajectory of speech signals. Models for speech recognition should rely on the trajectory of speech vectors rather than on the geometrical position of observations in the parameter space, since a given point can belong to different trajectories.

  • 2.

    The realisations of a diphone's trajectories form characteristic transitions that relate to the acoustic context.

  • 3.

    If diphones are modelled as a sequence of states, then, due to contextual variability, the distribution variance at the boundaries of a speech model is smaller than that of the center part of the model. Joining models together will make the inter-model independence assumption less important. A weighting giving more importance to the extremities of the model in the recognition decision would thus improve the accuracy.

  • 4.

    Diphones as speech models imply an inherent syntactic constraint on possible state sequences, quite apart from any additional grammatical constraints that might be imposed.

This paper describes an attempt to capture segmental transition information of diphones in a speech recognition context, looking for trajectories of the spectral parameter vector projected onto a subspace. On this subspace a parametric model of the trajectory was imposed. Transitions corresponding to different diphones result in different representations in the subspace. The discriminative information retained in the subspace is quantified on a small-scale speech recognition task using the ISOLET and TIMIT databases. A method of modeling diphone transitions in a subspace framework is presented that is simple to build and requires little training data. The hypothesis is illustrated on a typical problem in speech recognition, the discrimination of /b/, /d/ and /g/ in the context of /ee/, an ambiguous problem in phone classification. This method shares the idea of modeling dynamic transitions of speech with many of the methods developed in recent years mentioned above (Goldenthal, 1994; Sun, 1997; Digalakis, 1992; Kannan and Ostendorf, 1997). However, because the model is derived in a low-dimensional space, this approach does not increase model complexity in order to model the dynamics of speech. The incorporation of this information in existing recognition systems could be made with an N-best rescoring scheme, as proposed by Schmid and Barnard (Schmid, 1996; Schmid and Barnard, 1997) and Rayner et al. (1994), to improve recognition results. This paper is organised as follows. Section 2 starts with the basic details of the front-end parameterisation. In Section 2.2 the subspace projection technique is discussed, introducing a simple idea for enforcing temporal ordering in the principal component projection of the data. In Section 3 three approaches to modelling trajectories are described: a simple representation in terms of an average, the principal curves idea of Hastie and Stuetzle, and the Generative Topographic Map of Bishop. Section 4 describes the distance computations required for classification, and Section 5 describes the experimental illustration of these ideas.

Section snippets

Subspace model

In this section acoustic transitions are shown to be representable in a low-dimensional space. A clue to how one can expect trajectory availability in a low-dimensional space is given by the spectrogram, a two-dimensional representation of the short-time Fourier transform, with frequency on the vertical axis, time on the horizontal axis and amplitude represented by a gray or colour scale. Within vocalised sounds, and in particular at the boundaries between them, the spectrogram is characterised by …
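The temporal constraint itself is defined in Section 2.2, which this snippet does not reproduce. Purely as an illustrative sketch of the intent, not the paper's exact formulation, one simple construction is to append a scaled time index to each frame before PCA, so that the leading components cannot discard temporal ordering; time_weight is a hypothetical tuning parameter.

    import numpy as np

    def temporal_pca(frames, n_components=2, time_weight=1.0):
        # frames: (T, D) sequence of short-time spectral vectors.
        T = len(frames)
        t = time_weight * np.linspace(-1.0, 1.0, T)[:, None]
        aug = np.hstack([frames, t])       # augment each frame with time
        aug = aug - aug.mean(axis=0)       # centre the data
        _, _, vt = np.linalg.svd(aug, full_matrices=False)  # PCA via SVD
        return aug @ vt[:n_components].T   # (T, n_components) trajectory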

Trajectory models

The constrained projection outlined in the previous section leads to a sequence of points in the subspace. The next stage is to characterise the evolution of these points in a manner that enables one to extract a distance metric with which these sounds can be classified. Three attempts at implementing such a characterisation are described in this section; experimental comparisons are shown later in the paper.
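As an illustration of the simplest of the three characterisations, an average trajectory can be formed by resampling each training trajectory in the subspace to a fixed number of points and averaging across examples; classification then assigns a test trajectory to the nearest class model. A minimal sketch, with illustrative names and parameters:

    import numpy as np

    def resample(traj, n_points=20):
        # Linear resampling of a (T, d) trajectory onto a fixed time grid.
        t = np.linspace(0.0, 1.0, len(traj))
        grid = np.linspace(0.0, 1.0, n_points)
        return np.stack([np.interp(grid, t, traj[:, j])
                         for j in range(traj.shape[1])], axis=1)

    def average_trajectory(examples, n_points=20):
        # Mean of resampled training trajectories: the class model.
        return np.mean([resample(e, n_points) for e in examples], axis=0)

    def trajectory_distance(test_traj, model):
        # Euclidean distance after resampling to the model's grid.
        return float(np.linalg.norm(resample(test_traj, len(model)) - model))

A test token would then be labelled with the class whose model minimises trajectory_distance.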

Trajectory mapping

To use the ideas discussed above in a classification setting, the test data were smoothed by fitting a constrained natural spline through the sequence of test points before projecting onto the subspace. This section describes the smoothing spline algorithm and the computation of distances in the projected space. This trajectory comparison method is motivated by the observation that the speech signal tends to follow certain paths corresponding to the underlying phonemic units.
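A minimal sketch of such smoothing, using per-dimension cubic smoothing splines from SciPy (splrep/splev); the constrained natural spline used in the paper may differ in detail, and the smoothing factor here is illustrative:

    import numpy as np
    from scipy.interpolate import splrep, splev

    def smooth_trajectory(points, n_samples=50, smooth=1.0):
        # points: (T, d) sequence of test points; returns a smoothed,
        # uniformly resampled (n_samples, d) trajectory.
        T, d = points.shape
        t = np.linspace(0.0, 1.0, T)
        grid = np.linspace(0.0, 1.0, n_samples)
        out = np.empty((n_samples, d))
        for j in range(d):
            tck = splrep(t, points[:, j], s=smooth)  # s > 0: smoothing
            out[:, j] = splev(grid, tck)
        return out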

ISOLET database

A subset of the ISOLET (Cole et al., 1994) database is used to illustrate the idea, using the isolated spoken letters /B/, /D/ and /G/ to obtain the diphones /bee/, /dee/ and /gee/. The complete database is an isolated-speech alphabet database and consists of two tokens of each letter produced by 150 American English speakers, 75 female and 75 male. Hence there were in total 240 training tokens and 60 test tokens for each diphone, which can be split into 120 training tokens and 30 test tokens …

ISOLET

These experiments demonstrate that the very simple representation adopted retains a reasonable amount of discrimination. The most accurate model is the principal curve model under its most flexible interpretation. It represents the underlying data distribution most adequately because its number of latent points allows the principal curve to adjust accurately, resulting in lower error rates than the other methods. The slowly moving kernel along the principal curve allows precise …

Conclusions

In this study, a new method of modeling speech transitions with a subspace model was proposed. It was shown that temporal transitions in speech can be visualised and modeled in a low-dimensional space. This approach has the advantage of reduced memory requirements in comparison with models involving context-dependent speech units. In addition, the subspace models require relatively little data compared to HMMs. The results suggest that discriminant information is preserved in the subspace …

Acknowledgements

We thank the Neural Computing Research Group at Aston University for making the Matlab code of the GTM algorithm available in the public domain. KR acknowledges financial support from Girton College, Cambridge European Trust and the EPSRC.

References (52)

  • Robinson, T., et al., 1991. A recurrent error propagation network speech recognition system. Computer Speech and Language.
  • Afify, M., Gong, Y., Haton, J.-P., 1994. Non-linear time alignment in stochastic trajectory models for speech...
  • Afify, M., Gong, Y., Haton, J.-P., 1995. Stochastic trajectory models for speech recognition: An extension to modelling...
  • Ahlbom, G., Bimbot, F., Chollet, G., 1987. Modeling spectral speech transitions using temporal decomposition...
  • Atal, B., 1983. Efficient coding of LPC parameters by temporal decomposition. In: Internat. Conf. in Acoustics, Speech...
  • Bishop, C., 1995. Neural Networks for Pattern Recognition. Oxford University Press,...
  • Bishop, C., Svensen, M., Williams, C., 1996. GTM: The Generative Topographic Mapping. NCRG/96/015, Neural Computing...
  • Bishop, C., Hinton, G., Strachan, I., 1997a. GTM through time. In: IEE International Conference on Artificial Neural...
  • Bishop, C., Svensen M., Williams, C., 1997b. GTM: A principled alternative to the self-organizing map, Advances in...
  • Bourlard, H., Morgan, N., 1994. Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers,...
  • Cleveland, W., 1979. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association.
  • Cole, R., Muthusamy, Y., Fanty, M., 1994. The ISOLET spoken letter database, Technical Report CSE 90-004, Oregon...
  • Deng, L., et al., 1994. Speech recognition using hidden Markov models with polynomial regression functions as nonstationary states. IEEE Transactions on Speech and Audio Processing.
  • Digalakis, V., 1992. Segment-based stochastic models of spectral dynamics for continuous speech recognition. Ph.D....
  • Digalakis, V., Rohlicek, R., Ostendorf, M., 1991. A dynamical system approach to continuous speech recognition. In:...
  • Digalakis, V., et al., 1993. ML estimation of a stochastic linear system with the EM algorithm and its application to speech recognition. IEEE Transactions on Speech and Audio Processing.
  • Duda, R., Hart, P., 1973. Pattern Classification and Scene Analysis. Wiley, New...
  • Fisher, W., Doddington, G., Goudie-Marshall, K., 1986. The DARPA speech recognition research database: Specification...
  • Friedman, J., 1987. Exploratory projection pursuit. Journal of the American Statistical Association.
  • Fukada, T., Sagisaka, Y., Paliwal, K., 1997. Model parameter estimation for mixture density polynomial segment models....
  • Garofolo, J., 1988. Getting started with the DARPA TIMIT CD-ROM: an acoustic phonetic continuous speech database....
  • Ghitza, O., Sondhi, M., 1993. Hidden Markov models with templates as non-stationary states: An application to speech recognition. Computer Speech and Language.
  • Gish, H., Ng, K., 1996. Parametric trajectory models for speech recognition. In: Internat. Conf. in Spoken Language...
  • Goldenthal, W., 1994. Statistical trajectory models for phonetic recognition. Ph.D. Thesis, Department of Aeronautics...
  • Gong, Y., Haton, J.-P. 1994. Stochastic trajectory modeling for speech recognition. In: Internat. Conf. in Acoustics,...
  • Gong, Y., Illina, I., Haton, J.-P., 1996. Modeling long term variability information in mixture stochastic trajectory...