Parametric subspace modeling of speech transitions
Introduction
One important feature of the complex temporal structure of speech signals is the systematic variation in the realisation of phones in different acoustic contexts. Smooth, continuous movement of the articulators towards and away from notional target positions produces an acoustic signal that is rich in structure. It encodes not just the underlying linguistic content to be conveyed, but also much more information relating to the context in which it is spoken. In fluent speech the target positions towards which the articulators move may often not be realised, because movement towards the subsequent position may already have begun in parts of the system. In the corresponding acoustic signal it is then hard to isolate steady-state regions that can be uniquely identified with phones. Discriminatory information enabling the decoding process is not localised in the steady states, but is likely to be smoothly distributed over the transitions of the signal.
The popular hidden Markov model (HMM) of speech signals, in its simplest form, approximates the signal as a sequence of statistically stationary regions. Once a segmentation is assigned, either at some stage of the iterative training process or at the Viterbi alignment in the test stage, the probabilistic score of the model is insensitive to the temporal ordering of the acoustic vectors that get assigned to a particular state. This clearly is a poor approximation to the dynamics of the vocal tract. The hard segmentation imposed by a finite number of states is also a poor model of the complex generation process.
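To make this ordering insensitivity concrete: under the usual conditional-independence assumption, the likelihood a state s assigns to the n frames aligned to it factorises into per-frame terms, and the product is unchanged by any reordering of those frames:

\[ p(x_1,\ldots,x_n \mid s) \;=\; \prod_{i=1}^{n} p(x_i \mid s) \;=\; p(x_{\pi(1)},\ldots,x_{\pi(n)} \mid s) \quad \text{for every permutation } \pi. \]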
Such weaknesses of the HMM approach have long been recognised. Techniques to deal with them include building models for context-sensitive phones and expanding the feature vector to include derivatives of spectral parameters.
These refinements have resulted in highly successful speech recognition systems (Young et al., 1994, 1995) that produce impressive accuracies on very large tasks. However, both specialised models for dealing with context and dimensionality expansion to capture ordering result in an explosion in the number of parameters. Robust estimation of a very large number of parameters then becomes a challenging task, requiring techniques such as tied mixtures.
Approaches based on neural networks (Robinson et al., 1996; Bourlard and Morgan, 1994) attempt to optimise a non-linear discriminative function that assigns a phone class membership probability to each spectral frame. Here too the slowly varying temporal dynamics are essentially ignored. Techniques to capture some of the dynamics include the use of a moving window (TDNN; Waibel et al., 1989) and the recurrent neural network with state feedback (Robinson and Fallside, 1991).
It is well known in the speech research community that an adequate description of the temporal evolution of speech parameters is essential for robust and efficient synthesis and recognition. Work emphasising the importance of spectral transitions includes that of Ahlbom et al. (1987), who showed through resynthesis experiments that segmental transitions may be used to reconstruct speech with minimal coarticulatory effect. Marteau et al. (1988) showed that dynamic information is of great importance for the recognition of fast or heavily coarticulated transitions in which it is difficult to detect any targets; they already suggested diphone-like segments with trajectory concepts.
Other attempts to model slow time variations in spectral parameters include the temporal decomposition of Atal (1983), in which a sequence of target vectors and corresponding interpolation functions is used. This model has also been studied by Marcus and van Lieshout (1984) and Niranjan and Fallside (1987). Its primary motivation is to capture how the movement of the articulators is reflected as slow changes in the short-time spectrum. As mentioned earlier, HMM-based recognisers cope with such dynamics by appending derivative information to the feature vector. Our motivation in this work is to go beyond this simplistic method and model the trajectory in the feature space explicitly.
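In Atal's formulation the short-time parameter track is approximated by a small set of target vectors weighted by slowly varying, temporally localised interpolation functions (the notation below is ours, a sketch of the model rather than Atal's exact presentation):

\[ \hat{\mathbf{y}}(t) \;=\; \sum_{k=1}^{K} \mathbf{a}_k\,\phi_k(t), \]

where the a_k are the spectral targets and each phi_k(t) is non-zero only around the k-th event, so that the articulatory movement between targets appears as a smooth interpolation of the spectral parameters.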
Recently there has been much interest in the use of segmental models (Ostendorf et al., 1996; Afify et al., 1995). These attempt to model the time variation of features within a segment, with most approaches taking phones as segments. Stochastic trajectory models represent phone-based speech units as clusters of trajectories in parameter space; the trajectories are modelled by mixtures of state sequences of multivariate Gaussian density functions to capture inter-frame dependencies within a segment, and similar methods for phone segments have been reported to be successful. Afify et al. (1994, 1995) and Gong et al. (Gong and Haton, 1994; Gong et al., 1996) focused on trajectories that are sampled at n points within a segment and represented by a mean vector and covariance matrix at each point. Fukada et al. (1997) represented the mean and covariance matrix by a polynomial fit within a segment. All of these estimated the mean and covariance matrix by applying a k-means algorithm in the representation space. In contrast, Gish and Ng (1996), Goldenthal (1994) and Holmes and Russell (1997) modelled each feature dimension directly, using additional delta coefficients. Gish and Ng modelled the mean vectors within a segment as a quadratic function, with only limited covariance matrix variation available per segment. Holmes and Russell modelled the trajectories using slope and mean to form a linear model within a segment, with only a Gaussian-mixture-specific covariance matrix to represent the segmental variance. Goldenthal, aware of the statistical dependencies between coefficients, used the error component to enhance recognition results. Deng et al. (1994) showed that the stationary-state assumption appears reasonable when a state represents a short segment of a sonorant or fricative speech sound, but in continuously spoken sentences even vowels contain virtually no stationary portions (Zue, 1991); they demonstrated the importance of transitional acoustic trajectories for word segments, reporting superior results over traditional HMMs on a limited task of recognising 36 CVC words. A dynamical system segment model was proposed by Digalakis et al. (Digalakis, 1992; Digalakis et al., 1991, 1993), which resulted in a significant improvement over the independent-frame model for phone recognition.
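As a concrete illustration of the polynomial trajectory family, the sketch below fits a quadratic mean trajectory with a single residual covariance per segment, in the spirit of Gish and Ng (1996); the function names, normalised-time parametrisation and scoring are our own assumptions, not reconstructions of any cited implementation.

```python
import numpy as np

def fit_segment_trajectory(frames, order=2):
    """Fit a polynomial mean trajectory to the frames of one segment.

    frames: (T, d) array of feature vectors (e.g. cepstra) for the segment.
    Returns polynomial coefficients (order+1, d) and the residual covariance.
    """
    T, _ = frames.shape
    t = np.linspace(0.0, 1.0, T)                # normalised time within the segment
    V = np.vander(t, order + 1)                 # design matrix [t^2, t, 1] for order 2
    coeffs, *_ = np.linalg.lstsq(V, frames, rcond=None)
    resid = frames - V @ coeffs                 # deviations from the mean trajectory
    return coeffs, np.cov(resid, rowvar=False)  # one shared covariance per segment

def segment_log_likelihood(frames, coeffs, cov):
    """Score a segment against a fitted trajectory with Gaussian residuals."""
    T, d = frames.shape
    t = np.linspace(0.0, 1.0, T)
    resid = frames - np.vander(t, coeffs.shape[0]) @ coeffs
    inv, logdet = np.linalg.inv(cov), np.linalg.slogdet(cov)[1]
    return -0.5 * (np.einsum('ti,ij,tj->', resid, inv, resid)
                   + T * (logdet + d * np.log(2.0 * np.pi)))
```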
Although all these approaches try to circumvent the frame-independence assumption within a segment and report improved results in comparison to frame-independent models, the correlation between segments is still handled under a statistical independence assumption. This in particular does not hold for phones as segments, where the acoustic transitions are located at the segment boundaries rather than at the segment centres. The spectral trajectory of, say, the vowel [i:] is quite different in the CV syllable /bee/ from that in the syllable /gee/. Clearly, a model for the phoneme [i:] derived from occurrences of [i:] in all contexts would be noisy due to co-articulation. This work will focus on diphones as units of speech carrying transitional information between acoustic targets. The motivation is partly due to the work of Ghitza and Sondhi (1993), who also used diphones to represent non-stationary acoustic information. They used diphone units as states in a hidden Markov model framework to circumvent the independent and identically distributed assumption for successive observations within a state. Furthermore, diphones as units of concatenation have been very effective in producing synthetic speech (Salza et al., 1996).
In a parametric space (e.g., cepstral space) a speech signal can be represented as a point that moves as the articulatory configuration changes. The sequence of moving points is called the trajectory of the speech; a minimal extraction sketch follows the list below. The problem of acoustic modelling of speech is addressed here at the diphone level. The model is motivated by the following ideas:
1. Context affects the trajectory of speech signals. Models for speech recognition should rely on the trajectory of speech vectors rather than on the geometrical position of observations in the parameter space, since a given point can belong to different trajectories.
2. The realisations of the trajectories of a diphone form characteristic transitions that relate to the acoustic context.
3. If diphones are modelled as a sequence of states then, due to contextual variability, the distribution variance at the boundaries of a speech model is smaller than that of the centre part of the model. Joining models together will make the inter-model independence assumption less important. A weighting giving more importance to the extremities of the model in the recognition decision would thus improve the accuracy.
4. Diphones as speech models imply a certain inherent syntactic constraint on possible state sequences, quite apart from any additional grammatical constraints that might be imposed.
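As flagged above, here is a minimal sketch of extracting such a cepstral trajectory; the librosa front end is our assumed MFCC extractor, and any other cepstral analysis would serve equally well.

```python
import librosa  # assumed front end; any MFCC implementation would do

def cepstral_trajectory(wav_path, n_mfcc=13):
    """Return a speech signal as a sequence of points in cepstral space.

    Each row is one frame's MFCC vector; the row ordering is the trajectory.
    """
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, T)
    return mfcc.T                                           # (T, n_mfcc)
```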
Subspace model
In this section acoustic transitions are shown to be representable in a low-dimensional space. A clue that trajectories may be recoverable in a low-dimensional space is given by the spectrogram, a two-dimensional representation of the short-time Fourier transform, with frequency on the vertical axis, time on the horizontal axis and amplitude represented by a grey or colour scale. Within vocalised sounds, and in particular at the boundaries between them, the spectrogram is characterised by
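The constrained projection itself is only excerpted above; as one concrete way to obtain such a low-dimensional representation, the sketch below fits a plain PCA subspace to pooled training frames and projects each token onto it. This is a stand-in under our own assumptions, not necessarily the constrained projection used in the paper.

```python
import numpy as np

def fit_subspace(frames, dim=2):
    """Fit a linear subspace to pooled training frames via PCA.

    frames: (N, d) pooled feature frames. Returns the mean and a (d, dim)
    orthonormal basis spanning the directions of largest variance.
    """
    mean = frames.mean(axis=0)
    _, _, Vt = np.linalg.svd(frames - mean, full_matrices=False)
    return mean, Vt[:dim].T

def project(frames, mean, basis):
    """Project a token's frames onto the subspace; a diphone token then
    becomes a short (T, dim) trajectory of points."""
    return (frames - mean) @ basis
```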
Trajectory models
The constrained projection outlined in the previous section leads to a sequence of points in the subspace. The next stage is to characterise the evolution of these points in a manner that enables one to extract a distance metric with which these sounds can be classified. Three attempts at implementing such a characterisation are described in this section; experimental comparisons are shown later in the paper.
Trajectory mapping
To use the ideas discussed above in a classification setting, the test data were smoothed by fitting a constrained natural spline through the sequence of test points before projecting onto the subspace. This section describes the smoothing spline algorithm and the computation of distances in the projected space. This trajectory comparison method is motivated by the observation that the speech signal tends to follow certain paths corresponding to the underlying phonemic units.
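A hedged sketch of this pipeline follows. The paper fits a constrained natural spline; here SciPy's generic cubic smoothing spline is substituted as an approximation, with uniform resampling so that a simple pointwise distance between two trajectories of different lengths is well defined.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def smooth_trajectory(points, n_samples=50, smoothing=1.0):
    """Smooth a projected trajectory and resample it uniformly.

    points: (T, dim) sequence of projected test points, T >= 4. Each
    dimension is smoothed by a cubic spline over normalised time, then
    evaluated at n_samples equally spaced instants.
    """
    T, dim = points.shape
    t = np.linspace(0.0, 1.0, T)
    t_new = np.linspace(0.0, 1.0, n_samples)
    return np.stack([UnivariateSpline(t, points[:, j], s=smoothing)(t_new)
                     for j in range(dim)], axis=1)

def trajectory_distance(a, b):
    """Mean Euclidean distance between two equally resampled trajectories."""
    return float(np.linalg.norm(a - b, axis=1).mean())
```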
ISOLET database
A subset of the ISOLET database (Cole et al., 1994) is used to illustrate the idea, with the isolated spoken letters /B/, /D/ and /G/ providing the diphones /bee/, /dee/ and /gee/. The complete database is an isolated-speech alphabet database consisting of two tokens of each letter produced by 150 American English speakers, 75 female and 75 male. Hence there were in total 240 training tokens and 60 test tokens for each diphone, which can be split into 120 training tokens and 30 test tokens
ISOLET
These experiments demonstrate that the very simple representation adopted retains a reasonable amount of discrimination. The most accurate model is the principal curve model in its most flexible interpretation. It represents the underlying data distribution most adequately because its number of latent points allows the principal curve to adjust accurately, which results in superior error rates in comparison to the other methods. The slowly moving kernel along the principal curve allows precise
Conclusions
In this study, a new method of modelling speech transitions with a subspace model was proposed. It was shown that temporal transitions in speech can be visualised and modelled in a low-dimensional space. This approach has the advantage of reduced memory requirements in comparison with models involving context-dependent speech units. In addition, the subspace models require relatively little data compared to HMMs. The results suggest that discriminant information is preserved in the subspace
Acknowledgements
We thank the Neural Computing Research Group at Aston University for making the Matlab code of the GTM algorithm available in the public domain. KR acknowledges financial support from Girton College, Cambridge European Trust and the EPSRC.
References
- Robinson, T., Fallside, F., 1991. A recurrent error propagation network speech recognition system. Computer Speech and Language.
- Afify, M., Gong, Y., Haton, J.-P., 1994. Non-linear time alignment in stochastic trajectory models for speech...
- Afify, M., Gong, Y., Haton, J.-P., 1995. Stochastic trajectory models for speech recognition: An extension to modelling...
- Ahlbom, G., Bimbot, F., Chollet, G., 1987. Modeling spectral speech transitions using temporal decomposition...
- Atal, B., 1983. Efficient coding of LPC parameters by temporal decomposition. In: Internat. Conf. in Acoustics, Speech...
- Bishop, C., 1995. Neural Networks for Pattern Recognition. Oxford University Press,...
- Bishop, C., Svensen, M., Williams, C., 1996. GTM: The Generative Topographic Mapping. NCRG/96/015, Neural Computing...
- Bishop, C., Hinton, G., Strachan, I., 1997a. GTM through time. In: IEE International Conference on Artificial Neural...
- Bishop, C., Svensen, M., Williams, C., 1997b. GTM: A principled alternative to the self-organizing map. Advances in...
- Bourlard, H., Morgan, N., 1994. Connectionist Speech Recognition: A Hybrid Approach. Kluwer Academic Publishers,...