
Journal of Phonetics

Volume 35, Issue 1, January 2007, Pages 20-39

Principal components of vocal-tract area functions and inversion of vowels by linear regression of cepstrum coefficients

https://doi.org/10.1016/j.wocn.2006.01.001

Abstract

This paper addresses the following two hypotheses: (i) vocal-tract area functions of Japanese vowels can be accurately represented by a linear combination of only a few principal components which, furthermore, are similar to those reported in the literature for different languages; and (ii) the principal components’ weights can be predicted and area functions thereby accurately estimated from acoustics by linear regression of cepstrum parameters. To test these hypotheses, synchronized acoustic and vocal-tract 3D MRI data were recorded from an adult male Japanese speaker for both sustained and dynamic vowel utterances. The first two principal components explained covariations in vocal-tract shape and length, accounting for 94–97% of the total variance, and indeed provided a cross-linguistic validation of the two underlying components of vowel production emergent from the literature. Multiple linear regression models were then evaluated for their accuracy in reconstructing the area functions of the dynamic utterance by predicting the first two PC coefficients, using either carefully measured formants or cepstral coefficients defined in various frequency bands. The best formant-based regression model required all four formants, with a mean adjusted correlation of 0.93 and mean absolute errors of 0.187 cm2 in area and 0.131 cm in vocal-tract length. The best cepstrum-based regression model used 24 cepstral coefficients defined in the frequency band 0–4 kHz, with a mean adjusted correlation of 0.92 and mean absolute errors of 0.102 cm2 in area and 0.082 cm in vocal-tract length. These results suggest that vowel production features, properly constrained by PCA modeling, can be mapped with sufficient accuracy from easily measured cepstrum parameters.
More work is required to reduce the dependence on MRI data, to extend the applicability of these methods to different voice qualities and different speakers, and to select a smaller subset of acoustic parameters for more robust, real-time inversion.

Introduction

Two questions of long-standing importance to phonetics concern the fundamental dimensions of vowel production, acoustics, and perception, and the relations by which those underlying dimensions are mapped between the three domains. For example, in the acoustic–phonetic domain, despite persistent problems in robust measurement, the most physically interpretable parameters are still the formants or vocal-tract resonance frequencies (F1, F2…). On the other hand, much of the success of state-of-the-art automatic speech recognition comes from the use of cepstrum parameters, which characterize the shape of the whole spectrum and are easily measured. Helping to bridge the gap between these two important sets of acoustic parameters, Broad and Clermont (1989) showed empirically that the first three formants of vowels can be robustly predicted by linear transformations of the cepstrum. Bridging acoustics and perception, Pfitzinger (2005) empirically derived linear regression formulae directly relating F1, F2, and the fundamental frequency F0 with the principal dimensions of perceived vowel quality, height and backness (albeit traditionally described in articulatory terms). In this paper we approach the problem from the viewpoint of speech production, and ask whether similar empirical relations can be found to map directly from unproblematic acoustic parameters such as the cepstrum to fundamental dimensions underlying vowel production.
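
The cepstrum parameters referred to here are, in essence, the low-order coefficients of the inverse transform of the log magnitude spectrum. As a minimal sketch (not the authors' exact analysis procedure; the frame length, windowing, and the count of 24 coefficients are illustrative choices), the real cepstrum of a speech frame can be computed with NumPy as follows:

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=24):
    """Real cepstrum: inverse DFT of the log magnitude spectrum.
    The low-order coefficients summarize the spectral envelope."""
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # guard against log(0)
    cepstrum = np.fft.irfft(log_mag)
    return cepstrum[:n_coeffs]

# Illustrative frame: two sinusoids standing in for spectral peaks
fs = 10000
t = np.arange(512) / fs
frame = np.sin(2 * np.pi * 500 * t) + 0.5 * np.sin(2 * np.pi * 1500 * t)
c = real_cepstrum(frame * np.hanning(512))
print(c.shape)  # (24,)
```

Unlike formant tracking, this computation involves no peak-picking decisions, which is why such parameters are considered easy to measure robustly.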

Perhaps the most revealing approach to defining fundamental dimensions of vowel production has been statistical reduction of dimensionality of measured articulatory data. There is now a substantial body of evidence in the literature pointing to the general validity of two orthogonal dimensions, linear combinations of which can form the vocal-tract shape for any vowel of a particular speaker. For speakers of American English, comparable results were reported by Shirai and Honda (1976) who applied principal components analysis (PCA) to tongue contours defined on 234 lateral X-ray images (Perkell, 1969), Harshman, Ladefoged, and Goldstein (1977) who applied a three-way factor analysis (PARAFAC) to vocal-tract cross-dimensions on 50 lateral X-ray images, and Story and Titze (1998) who applied PCA to vocal-tract area functions (the cross-sectional area of the vocal-tract airway as a function of the distance from the glottis) measured by 3D volumetric magnetic resonance imaging (MRI). In all three studies, about 90% of the variance in the original data was accounted for by just two principal components. Although the details of these components vary between the studies, their overall shapes are consistent with what Ladefoged, Harshman, Goldstein, and Rice (1978) named “front-raising” (with a range of variation between an /i/-like vowel and an /o/- or /ɑ/-like vowel) and “back-raising” (involving a gesture towards an /u/-like vowel), respectively.
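
The dimensionality-reduction procedure shared by these studies can be sketched in a few lines. The following is a toy illustration (the data are synthetic, and the two generating components are arbitrary shapes, not the actual "front-raising" and "back-raising" factors): a matrix of area-function-like vectors built from two underlying components is mean-centred and decomposed by SVD, and the first two principal components recover nearly all of the variance, mirroring the roughly 90% figure reported in the studies above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 50 "area functions" sampled at 40 sections,
# generated from two underlying shape components plus noise.
n_sections = 40
x = np.linspace(0, 1, n_sections)
comp1 = np.sin(np.pi * x)        # toy component 1
comp2 = np.sin(2 * np.pi * x)    # toy component 2
weights = rng.normal(size=(50, 2))
data = (3.0 + weights @ np.vstack([comp1, comp2])
        + 0.05 * rng.normal(size=(50, n_sections)))

# PCA via SVD of the mean-centred data matrix
centred = data - data.mean(axis=0)
_, s, vt = np.linalg.svd(centred, full_matrices=False)
explained = s**2 / np.sum(s**2)   # proportion of variance per component

print(f"variance explained by PC1+PC2: {explained[:2].sum():.3f}")
```

The rows of `vt[:2]` are the principal components; projecting any centred area function onto them yields its two weights.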

Supporting evidence from X-ray cineradiography was reported by Iskarous (2000) who qualitatively observed mainly two patterns of midsagittal tongue-contour change in dynamic vowel-to-vowel sequences produced by Canadian French and Canadian English speakers: a pivot pattern contrasting mainly an oral and a pharyngeal place of constriction, and an arching pattern involving mainly a change in tongue height and affecting a single broad place of constriction near the velum. Further supporting evidence from a different domain of measurement appeared in Maeda and Honda's (1994) electromyographic (EMG) study of an American English speaker, where it was shown that certain pairs of the extrinsic tongue muscles act agonistically/antagonistically to produce movements of the tongue in directions similar to the two underlying dimensions found in the imaging studies.

There is also growing evidence of the cross-linguistic validity of these results. For example, Nix, Papcun, Hogden, and Zlokarnik's (1996) PARAFAC reanalysis of midsagittal vocal-tract shapes of 16 Icelandic vowels yielded two factors resembling those reported by Harshman et al. (1977). Despite the less complete description of vocal-tract shapes afforded by just four sensors placed along the tongue midline in electromagnetic midsagittal articulography (EMMA), Hoole (1999) found two PARAFAC factors to be sufficient in describing the 15 German vowels, and interpreted the factors as capturing a range of variation between /i:/ and /o:/, and between /ε:/ and /u:/, respectively. Although such interpretations were not so clearly revealed by Yehia, Takeda, and Itakura (1996) who applied PCA to 519 log-area functions inferred from lateral cineradiography of a female speaker of French (Bothorel, Simon, Wioland, & Zerling, 1986), two potentially confounding issues may have been the assumptions necessary to transform 2D X-ray images to 3D area functions, and the fact that their data comprised 10 sentences containing an uncontrolled mix of vowels and consonants. However, an elaborate biomechanical model of the vocal tract in the midsagittal plane applied by Sanguineti, Laboissière, and Ostry (1998) on the same set of French X-ray data, yielded three factors for tongue motion including a front-back component, a tongue dorsum arching-flattening component, and a tongue-tip raising-lowering component. This accords with Story and Titze's (1998) finding that in comparison with a two-component model for American English vowels, a combination of vowels and consonants would require three or four components to achieve a comparable representational accuracy.

The overwhelming consensus in these studies on the German 15-vowel system, the Icelandic 16-vowel system, the Canadian or American English 10-vowel system, and even a combined vowel-and-consonant system for French, is that an individual speaker's vocal-tract shape for any vowel can be succinctly described by a mean or neutral shape on which is superimposed a linear combination of two underlying components—an asymmetric movement contrasting the oral and pharyngeal cavities, and a more symmetric movement towards a velar constriction (presumably concomitant with lip rounding, although most of the studies were limited to only the tongue contour).
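
This consensus description, a neutral shape plus a weighted sum of two components, amounts to a very small generative model. A minimal sketch (the neutral tract and the two component shapes below are purely illustrative placeholders, not measured PCs):

```python
import numpy as np

def reconstruct_area(mean_area, pc1, pc2, w1, w2):
    """Vowel area function as a neutral shape plus a weighted sum of
    two components; areas are clipped to stay physically positive."""
    area = mean_area + w1 * pc1 + w2 * pc2
    return np.maximum(area, 0.05)

# Illustrative neutral tract (uniform 3 cm^2 over 40 sections) and
# toy components loosely evoking the two consensus dimensions.
n = 40
x = np.linspace(0, 1, n)
mean_area = np.full(n, 3.0)
pc1 = np.sin(np.pi * x)        # oral vs. pharyngeal contrast (toy)
pc2 = -np.sin(2 * np.pi * x)   # velar-constriction pattern (toy)

# One hypothetical vowel: a point in the two-dimensional weight space
area_i_like = reconstruct_area(mean_area, pc1, pc2, -2.0, 0.5)
print(area_i_like.shape)  # (40,)
```

Under this view, an individual speaker's entire vowel space collapses to trajectories in a two-dimensional weight space.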

In light of such evidence, Perrier et al. (2000) then proposed the interesting hypothesis that these two underlying components are not only language independent, but may represent two degrees of freedom inherent to the anatomical and biomechanical properties of the human tongue. To test this hypothesis, they applied PCA to 1800 tongue contours generated by activation of tongue-muscle control parameters of a 2-D midsagittal model (with the jaw-height kept constant). They found that the first two principal components together accounted for about 85% of the total variance and were indeed very similar in shape to the two components found in the earlier studies working with direct measurements.

An advantage of vocal-tract simulation is that a statistically significant body of data can be generated to test such hypotheses; in this vein, it is also interesting to note that the earlier studies used data from languages containing 10 or more vowels. An equally valid and complementary approach, taken in this paper, is to ask whether similar underlying components can be obtained by analysis of real human data for a language such as Japanese which has only five vowels. An affirmative result would not only support the hypotheses of language independence and biomechanical dependence, but would also indicate the sufficiency of a sparsely populated vowel space in revealing this basic phenomenon.

The second question addressed in this paper is whether a speaker's vocalic area functions, represented by such principal components, can be accurately estimated from robust acoustic measurements. While it is well known that acoustic-to-articulatory mapping (or inversion of speech) is generally non-linear and one-to-many (e.g., Atal, Chang, Mathews, & Tukey, 1978), it is also known that the non-linearities and the one-to-many ambiguities depend to a great extent on the choice of articulatory and acoustic parameters, and that with proper constraints, such problems may be overcome (e.g., Boë, Perrier, & Bailly, 1992).

Indeed, as stated by Shirai and Honda (1976), such constraints are reflected in the articulatory model, which in their study included two principal components of tongue shapes. Sampling the parameters around values typical of the five Japanese vowels, they created 300 articulatory configurations, then used a transmission-line model of the vocal tract to obtain the corresponding values of the first two formant frequencies. These data were then used to train a regression model to estimate each of the vocal-tract shape parameters from a linear combination of non-linear terms (up to the second order) in the formant frequencies. Similarly, Yehia et al. (1996) created a large number of area functions by a representative sampling of their five principal components, and obtained the first three formant frequencies using a transmission-line model of the vocal tract. The resulting 7285 samples of matching area function and acoustic data were then separately subjected to independent component analysis (ICA), and a linear mapping between the two domains was determined by singular value decomposition (SVD). Story and Titze (1998) also created a large number of area functions by fine sampling of their two principal components (or “orthogonal modes”), then used a transmission-line model of the vocal tract to compute the first two formant frequencies. Their comparison of the regular modal-coefficient grid and the corresponding, warped formant grid showed visually the form and extent of non-linearities between the two domains. Nevertheless, the constraints on area functions imposed by allowing variations in only the first two modes helped to ensure a largely one-to-one mapping between F1, F2 pairs and the modal coefficients, which they implemented by a table-lookup procedure.
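
The regression form described for Shirai and Honda (1976), a linear combination of non-linear terms up to second order in the formants, can be sketched as follows. The data here are entirely synthetic (a single toy shape parameter generated with made-up coefficients), so this only illustrates the fitting machinery, not their model:

```python
import numpy as np

rng = np.random.default_rng(1)

def quadratic_features(F1, F2):
    """Design matrix of second-order polynomial terms in the formants."""
    return np.column_stack([np.ones_like(F1), F1, F2,
                            F1**2, F2**2, F1 * F2])

# Toy training set: 300 configurations, one shape parameter assumed
# (for illustration only) to depend quadratically on (F1, F2).
F1 = rng.uniform(250, 800, 300)
F2 = rng.uniform(800, 2500, 300)
w_true = np.array([0.5, 1e-3, -2e-4, 1e-7, 5e-8, -3e-8])  # made up
X = quadratic_features(F1, F2)
shape_param = X @ w_true + 0.01 * rng.normal(size=300)

# Least-squares estimate of the regression weights
w_hat, *_ = np.linalg.lstsq(X, shape_param, rcond=None)

pred = X @ w_hat
r = np.corrcoef(pred, shape_param)[0, 1]
print(f"fitted correlation r = {r:.3f}")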

Although selected examples of qualitatively reasonable inversion were reported, the use of simulated acoustic data in all three of these studies implies that their inversion methods successfully learned the acoustic characteristics of the transmission-line model, rather than a human speaker. By contrast, Ladefoged et al. (1978) measured the first three formant frequencies (although not without difficulties) from acoustic recordings made at the time of the cinefluorograms. Stepwise multiple regression analyses on the 50 samples of matching articulatory and acoustic data then yielded correlations of 0.935 and 0.902 in predicting the weights on the first and second tongue-shape factors, respectively, by the best linear combinations of three non-linear terms in the formants.

Despite the non-linearities and ambiguities known to be potential problems for inversion, these studies support the feasibility of estimating the underlying components of a suitably constrained vowel production model by relatively simple mathematical relations with acoustic parameters. However, as noted earlier, while the formants are often the acoustic parameters of choice because ideally they represent resonances of the vocal tract, they are also notoriously difficult to measure robustly with unsupervised algorithms. Avoiding the formants, Meyer, Wilhelms, and Strube (1989) used a Kalman filtering algorithm with spectral-envelope matching for analysis–resynthesis of German words by a quasi-articulatory model that included two principal components of the combined vocal-tract data of Harshman et al. (1977) and Fant (1960). Many other studies in the literature have similarly recognized the importance of using easily measured acoustic parameters for inversion (e.g., Dusan & Deng, 1998; Flanagan, Ishizaka, & Shipley, 1980; Hogden et al., 1996; Meyer, Schroeter, & Sondhi, 1991; Papcun, Hochberg, Thomas, Laroche, & Zacks, 1992), but no other such attempt seems to have appeared in connection with the underlying principal components reviewed earlier.
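
A cepstrum-based alternative to formant regression can be sketched as multiple linear regression from a vector of cepstral coefficients to the two PC weights. The data below are synthetic (the linear dependence is assumed for illustration, and the dimensions merely echo the 24-coefficient setting of this paper's best model); the sketch shows only the form of the mapping:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: 200 frames of 24 cepstral coefficients, and two PC weights
# assumed (for illustration) to depend linearly on the cepstrum.
n_frames, n_ceps = 200, 24
cepstra = rng.normal(size=(n_frames, n_ceps))
A_true = 0.3 * rng.normal(size=(n_ceps, 2))          # made-up mapping
pc_weights = cepstra @ A_true + 0.05 * rng.normal(size=(n_frames, 2))

# Multiple linear regression (with intercept) from cepstra to PC weights
X = np.column_stack([np.ones(n_frames), cepstra])
B, *_ = np.linalg.lstsq(X, pc_weights, rcond=None)   # B: (25, 2)

pred = X @ B
r1 = np.corrcoef(pred[:, 0], pc_weights[:, 0])[0, 1]
print(f"fitted correlation for PC1 weight: r = {r1:.3f}")
```

Because every step is a fixed linear operation on easily measured parameters, such a mapping avoids the formant-tracking failures mentioned above.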

In the present study we seek first to confirm and extend the cross-linguistic validity of the two underlying dimensions, by applying PCA to the relatively sparse, Japanese 5-vowel system. Overcoming many of the limitations of previous methods while combining their strengths, we measure complete area functions of the vocal tract from just above the glottis to the radiating plane at the lips using volumetric MRI (as in Story and Titze, 1998), and account for covariations in the vocal-tract length by including it explicitly in the statistical analyses (as in Yehia et al., 1996). In Section 2.1 we describe the MRI data measured for both sustained and dynamic vowel utterances, and in Section 3 we present results of the dimensional analyses. Secondly, we seek to determine how accurately the identified principal dimensions can be predicted from robust acoustic measurements in real speech. Section 2.2 describes the measurement of formants and cepstra from acoustic recordings synchronized with the MRI data; and in Section 4 we describe and evaluate a multiple linear regression method of predicting area functions from acoustic parameters, where the performance of the model using manually corrected formants is compared with various combinations of cepstral coefficients defined within various frequency bands. In Section 5 we conclude with discussions pointing out the strengths and limitations of this study, and scope for future work.

Section snippets

Vocal-tract area functions measured by volumetric MRI

The articulatory domain is represented in this study in terms of the area function of the vocal tract, measured by 3D MRI of an adult male native speaker of Tokyo dialect Japanese. MRI scans were acquired with the Shimadzu-Marconi ECLIPSE 1.5 T PowerDrive 250 installed at the ATR Brain Activity Imaging Center (ATR BAIC). Two sets of data were acquired in separate MRI sessions. The first set (henceforth referred to as the “still” data) comprises a single frame measured in each of the five

Modeling area functions by principal components analysis

Our first aim was to further test the language independence of the underlying components of vowel production as discussed in the Introduction. To this end, in Section 3.1 we outline the methods of area-function pre-processing and dimensional analysis; in Section 3.2 we present the results of PCA applied to the Japanese vowel area-functions, with particular emphasis on the robustness of the principal components; and in Section 3.3 we evaluate the representational accuracy of the obtained

Predicting area functions from acoustics by linear regression

In view of the robustness of the two principal components of vowel production reported in the literature and validated for Japanese in the previous section, a compelling question is whether these components would be suitable parameters for speech inversion. Numerous studies over the past four or five decades have considered the problem of mapping from the acoustic to the articulatory domain, and the prevailing view is that without proper constraints (whether in the choice of model parameters or

Concluding discussion

This study aimed to investigate two main hypotheses, concerning the underlying dimensions of vowel production and their robust prediction from acoustic measurements.

Combining the strengths of previous approaches reported in the literature, volumetric MRI was used to obtain vocal-tract area functions whose shapes (section areas) and lengths were jointly subjected to PCA. In accord with several previous studies using different methods and for different languages, 94–97% of the total variance was

Acknowledgements

This research was conducted as part of ‘Research on Human Communication’ with funding from the National Institute of Information and Communications Technology.

References (39)

  • Hogden, J., et al. (1996). Accurate recovery of articulator positions from acoustics: New conclusions based on human data. Journal of the Acoustical Society of America.
  • Hoole, P. (1999). On the lingual organization of the German vowel system. Journal of the Acoustical Society of America.
  • Iskarous, K. (2000). The articulatory meaning of dynamic formant patterns. In Proceedings of the fifth seminar on...
  • Jobson, J. D. (1991).
  • Ladefoged, P., et al. (1978). Generating vocal tract shapes from formant frequencies. Journal of the Acoustical Society of America.
  • Laver, J. (1980). The phonetic description of voice quality.
  • Maeda, S. (1979). An articulatory model of the tongue based on a statistical analysis. In speech communication papers...
  • Maeda, S. Compensatory articulation during speech: Evidence from the analysis and synthesis of vocal-tract shapes using an articulatory model.
  • Maeda, S., et al. (1994). From EMG to formant patterns of vowels: The implication of vowel spaces. Phonetica.