Linear scaling of vowel-formant ensembles (VFEs) in consonantal contexts

https://doi.org/10.1016/S0167-6393(01)00010-3

Abstract

There are familiar terms such as “contour” and “trajectory” to refer to a vowel formant frequency as a function defined on the time axis, but there is no readily understood term for the analogous idea of how a formant behaves on the “vowel axis”. For this we introduce the concept of a vowel-formant ensemble (VFE) as the set of values realized for a given formant (e.g., F2) in going from vowel to vowel among a speaker's vowel phonemes for a fixed time frame in a fixed CVC context. The VFE affords a simple description of our development: we observe that D.J. Broad and F. Clermont's [J. Acoust. Soc. Am. 81 (1987) 155] formant-contour model is a linear function of its vowel target and that as a consequence all its VFEs for a given speaker and formant number are linearly scaled copies of one another. Are VFEs in actual speech also linearly scaled? To show how this question can be addressed, we use F1 and F2 data on one male speaker's productions of 7 Australian English vowels in 7 CVd contexts, with each CVd repeated 5 times. Our hypothesized scaling relation gives a remarkably good fit to these data, with a residual rms error of only about 14 Hz for either formant after discounting random variations among repetitions. The linear scaling implies a type of normalization for context which shrinks the intra-vowel scatter in the F1F2 plane. VFE scaling is also a new tool which should be useful for showing how contextual effects vary over the duration of the syllable's vocalic nucleus.

Résumé

“Contour” and “trajectory” have become familiar terms for describing, for any vowel, how the frequency of each formant evolves along the time axis. There is, however, a need for analogous terms describing the profile of these frequencies along the “vowel axis”. We therefore introduce the concept of the vowel-formant ensemble (and the resulting acronym VFE) to group together, from vowel to vowel, the frequencies of a formant (e.g., F2) obtained at a fixed instant on the time axis, for the same speaker and in the same CVC syllabic context. The VFE concept by itself captures the whole approach taken here: our earlier model of formant trajectories (Broad and Clermont, J. Acoust. Soc. Am. 81, 1987, 155–165) rests on a linear function of the vowel target and therefore suggests the hypothesis that linear relations should also characterize the VFEs specific to a given speaker and a given formant. Can such relations be verified on real speech samples? We address this question for the F1 and F2 formant frequencies of 7 Australian English vowels produced by one male speaker, 5 times in succession, in 7 syllabic contexts of the type CVd. Apart from the random variations inherent in the 5 repetitions, applying our hypothesis to these vowels yields a root-mean-square error that does not exceed the remarkably low value of 14 Hz for F1 and F2. The linear relations thus obtained lend themselves to a normalization with respect to context, which we demonstrate by a reduction of the intra-vowel scatter in the F1F2 plane. The relations derived from the VFE concept also provide a new tool that should help reveal the effects of different contexts across the vocalic nuclei of syllables.

Introduction

A goal in the study of speech communication is to better understand the link between the continuous stream of physical events in speech and the corresponding discrete sequence of phonetic units. One of the problems we face in establishing this link at the acoustic level is that formant contours for vowels in context consist mainly of transitions and have steady states that are only fleetingly realized. At any instant the formants are functions not only of the current vowel, but also of its preceding and following contexts.

Fig. 1 illustrates what happens to a formant (such as F2) for a set of monophthongal vowels in some fixed consonant–vowel–consonant (CVC) context (bVd, for example). Just as the contour for each individual vowel makes its transition from the initial consonant to the syllable center and then its transition to the following consonant, the set of contours taken as a whole starts from a relatively compressed pattern at the initial-consonant boundary, moves to a more widely spaced pattern in the syllable center and then becomes more compressed again as it approaches the final consonant.

This systematic variation in the intervowel spacing for vowels in CVC context is the topic of this paper. Our hypothesis for it will be more easily stated if we first define a new concept, that of the vowel-formant ensemble.

As just described, Fig. 1 shows variations of a given formant along the two dimensions of time and vowel category for a fixed CVC context. To look at the variations with time while keeping the vowel fixed amounts to selecting a vowel in the figure and following the course of its formant trajectory from beginning to end. For such variation along the time axis we have readily understood terms such as “contour”, “trajectory” and “transition”, terms for which the notion of the time axis is already implicit.

But as illustrated by the vertical line in the figure, we can also look at how the different vowels are distributed for some fixed frame (relative time position in the syllable). There is no readily understood term for this idea of how the formant varies on the “vowel axis”. We therefore introduce the concept of the vowel-formant ensemble (VFE), by which we mean the set of formant frequencies realized for the set of vowels for the given frame and context. In the figure the vertical slice represents a vowel-formant ensemble.

In this paper our focus will be on the vowel-formant ensemble rather than on individual vowel contours, particularly on how the ensembles for different time frames and contexts are scaled in relation to one another.

The hypothesis we explore is the simplest scaling relation we can imagine, namely, that for a fixed formant (such as F2) all a speaker's VFEs will be linearly scaled copies of one another across CVC contexts and relative time frames in the syllable, i.e., that all these VFEs will be geometrically similar to one another.

Our approach is to first motivate the hypothesis by showing how our earlier time-domain model for formant contours (Broad and Clermont, 1987; cited below as BC87) predicts the linear scaling of VFEs and then to show how this prediction can be tested.

The hypothesis itself is illustrated in Fig. 2, where the top two panels show families of contours for the same formant and the same set of vowels but in two different CVC contexts. A frame is selected from each context and its corresponding vowel-formant ensemble is marked by a vertical line. These ensembles from the two contexts are transferred to the bottom panel, which has the same vertical scale (representing formant frequency) as the top two plots. The horizontal arrangement of the ensembles in the bottom panel is arbitrary and is chosen simply to give the picture reasonable proportions. That the vowel placements subdivide the two ensembles in the same proportions, i.e., that the ensembles are geometrically similar to each other, is shown by the lines connecting identical vowels in the two ensembles all intersecting at the same point. (The location of this point is in itself meaningless, since it can be moved around by adjusting the arbitrary horizontal placement of the two ensembles.) Because the ensemble from Context A and the one from Context B are selected arbitrarily, the figure illustrates our hypothesis: all pairs of a speaker's vowel-formant ensembles for a given formant will be similar to each other, i.e., all these VFEs will be linearly scaled copies of one another across contexts and time frames.
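The intersection property of the connecting lines can be checked numerically. The following is a minimal Python sketch, not taken from the paper: the formant values, the scale factor `slope`, the `offset` and the horizontal positions `x_a` and `x_b` are all made-up illustrative assumptions. It builds Context B as an exactly linearly scaled copy of Context A and confirms that the lines joining identical vowels meet at a single point.

```python
# Hypothetical F2 values (Hz) for five vowels at one frame in Context A.
ensemble_a = [2200.0, 1900.0, 1600.0, 1300.0, 1000.0]

# Assumed scaling: Context B is built as B = slope * A + offset.
slope, offset = 0.6, 700.0
ensemble_b = [slope * f + offset for f in ensemble_a]


def intersection(a1, b1, a2, b2, x_a=0.0, x_b=1.0):
    """Intersect the line (x_a, a1)-(x_b, b1) with the line (x_a, a2)-(x_b, b2)."""
    m1 = (b1 - a1) / (x_b - x_a)  # slope of the first connecting line
    m2 = (b2 - a2) / (x_b - x_a)  # slope of the second connecting line
    x = x_a + (a2 - a1) / (m1 - m2)
    y = a1 + m1 * (x - x_a)
    return x, y


# Intersect the first vowel's connecting line with every other vowel's line.
points = [
    intersection(ensemble_a[0], ensemble_b[0], a, b)
    for a, b in zip(ensemble_a[1:], ensemble_b[1:])
]

# Under exact linear scaling all intersection points coincide.
x0, y0 = points[0]
common_point = all(
    abs(x - x0) < 1e-6 and abs(y - y0) < 1e-6 for x, y in points
)
print(common_point, points[0])
```

If `slope` were exactly 1 the connecting lines would be parallel and this sketch would divide by zero; that case corresponds to two ensembles that differ only by a shift.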

The vowel-formant contours in the different contexts in Fig. 2 have different shapes and different displacements. As the diagram suggests, however, these complexities in the individual contours may give way to a simpler pattern when our point of view shifts from the time axis to the vowel-formant ensemble.

In Section 2 we start from one of the time-domain models we developed in BC87. It embodies the simple properties of (1) additivity of effects from initial and final consonants, (2) per-consonant similarity of transition shapes, and (3) scaling of transitions by the differences between vowel targets and consonant loci. The model involves parameters and functions which are indexed by the vowel and by the initial and final consonants. The only element of the model that depends on the vowel category is the vowel target, which enters the model only as a first-order factor. These properties imply that within a given formant number (such as F2) the vowel-formant ensemble from any frame and any context in the model is a linearly scaled version of the model's target ensemble and that as a result all the VFEs for the given formant in the model are linearly scaled versions of one another.
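The step from the three listed properties to linear scaling can be illustrated with a small Python sketch. The functional form below is an assumption chosen only to satisfy those properties (additive consonant effects, one transition shape per consonant, scaling by locus minus target); it is not claimed to be the exact BC87 equation. The exponential shapes, the loci, the time constants and the targets are all hypothetical.

```python
import math


def contour(target, t, locus_i, locus_f, tau_i, tau_f):
    """Formant value at relative time t in [0, 1] for one vowel target.

    Assumed form: target plus two additive consonant transitions, each a
    per-consonant shape scaled by (locus - target).
    """
    shape_i = math.exp(-t / tau_i)          # initial-consonant shape (assumed)
    shape_f = math.exp(-(1.0 - t) / tau_f)  # final-consonant shape (assumed)
    return target + (locus_i - target) * shape_i + (locus_f - target) * shape_f


# Hypothetical F2 targets (Hz) for four vowels.
targets = [2300.0, 1800.0, 1400.0, 900.0]

# Two hypothetical contexts with different loci and time constants.
context_a = dict(locus_i=1100.0, locus_f=1700.0, tau_i=0.15, tau_f=0.20)
context_b = dict(locus_i=1800.0, locus_f=1700.0, tau_i=0.25, tau_f=0.20)


def vfe(t, context):
    """Vowel-formant ensemble: one formant value per vowel at frame t."""
    return [contour(target, t, **context) for target in targets]


def scale_factors(ensemble):
    """Spacing of each vowel from the first, relative to the target spacing."""
    return [
        (ensemble[i] - ensemble[0]) / (targets[i] - targets[0])
        for i in range(1, len(targets))
    ]


# Within each frame and context, every vowel pair gives the same factor.
for t, context in [(0.1, context_a), (0.5, context_a), (0.8, context_b)]:
    print(t, [round(k, 6) for k in scale_factors(vfe(t, context))])
```

Because the target enters only to first order, each ensemble equals the target ensemble multiplied by `1 - shape_i - shape_f` plus a vowel-independent offset, so the factors printed for a given frame and context are identical across vowels.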

We finish Section 2 by showing how this theoretical result can be connected to data through some simple numerical operations. F1 and F2 data on a single speaker's vowels spoken in different CVd contexts are used in Section 3 to illustrate the linear scaling relation and test it statistically using interrepetition variation as a baseline. For this example dataset the hypothesized linear scaling of its VFEs cannot be confirmed in the strictest statistical sense, but its implied rms departure from linearity is only about 14 Hz for each formant. The F1 and F2 VFEs for this dataset are therefore only approximately linearly scaled; although not strictly true, our hypothesis of linear scaling will be seen to be a remarkably good approximation for this dataset.
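One way to picture such a comparison is sketched below in Python. It is an illustrative assumption, not the statistical procedure of Section 3, which this excerpt does not spell out: one ensemble (averaged over repetitions) is fitted to a reference ensemble by ordinary least squares, and the rms residual is set against the rms standard error expected from repetition scatter alone. The helper names and all data values are hypothetical.

```python
import math
import statistics


def fit_line(x, y):
    """Ordinary least-squares fit of y ~ slope * x + offset."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def scaling_residual(reference, repetitions):
    """Fit one VFE (mean over repetitions) to a reference ensemble.

    reference:   one formant value per vowel
    repetitions: per vowel, a list of repeated measurements
    Returns (slope, offset, rms residual, rms standard error of the means).
    """
    means = [statistics.fmean(reps) for reps in repetitions]
    slope, offset = fit_line(reference, means)
    residuals = [m - (slope * r + offset) for r, m in zip(reference, means)]
    rms_residual = math.sqrt(statistics.fmean([e * e for e in residuals]))
    # Variance of each vowel mean attributable to repetition scatter alone.
    mean_variances = [statistics.variance(reps) / len(reps) for reps in repetitions]
    rms_sem = math.sqrt(statistics.fmean(mean_variances))
    return slope, offset, rms_residual, rms_sem


# Hypothetical F2 data (Hz): 4 vowels, 3 repetitions each.
reference = [2200.0, 1800.0, 1400.0, 1000.0]
repetitions = [
    [2010.0, 2030.0, 2020.0],
    [1790.0, 1775.0, 1780.0],
    [1530.0, 1545.0, 1540.0],
    [1300.0, 1310.0, 1290.0],
]

slope, offset, rms_residual, rms_sem = scaling_residual(reference, repetitions)
# Crude "excess" departure from linearity after removing repetition noise.
excess = math.sqrt(max(0.0, rms_residual ** 2 - rms_sem ** 2))
print(slope, offset, rms_residual, rms_sem, excess)
```

Subtracting the two quantities in quadrature is only a rough stand-in for "discounting" repetition variation; it ignores the degrees of freedom used by the fit.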

In Section 3 we also adapt the linear scaling relation to a form of normalization of vowel-formant ensembles for context and show how vowels become better separated on an F1F2 plot. Limitations and applications of the scaling relation are discussed in Section 4 while our conclusions are summarized in Section 5.
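The idea behind such a normalization can be sketched in Python as follows. This is a hedged illustration rather than the procedure of Section 3: it assumes that each context's ensemble is related to a reference ensemble by a known `(slope, offset)` pair per formant and simply inverts that map. The context labels, scaling parameters and observed values are all made up, and they were chosen so that the observations lie close to a common reference, which is why the spread shrinks.

```python
import statistics

# Hypothetical (slope, offset) of each context's VFE relative to a reference
# ensemble, given separately for F1 and F2. All values are made up.
context_maps = {
    "bVd": {"F1": (0.90, 20.0), "F2": (0.70, 350.0)},
    "dVd": {"F1": (0.85, 40.0), "F2": (0.55, 800.0)},
    "gVd": {"F1": (0.80, 60.0), "F2": (0.65, 650.0)},
}

# Hypothetical observed (F1, F2) in Hz for one vowel in each context.
observed = {
    "bVd": {"F1": 562.0, "F2": 1545.0},
    "dVd": {"F1": 548.0, "F2": 1740.0},
    "gVd": {"F1": 542.0, "F2": 1752.0},
}


def normalize(value, slope, offset):
    """Undo the context's linear scaling: map a value back to the reference."""
    return (value - offset) / slope


normalized = {
    context: {
        formant: normalize(value, *context_maps[context][formant])
        for formant, value in formants.items()
    }
    for context, formants in observed.items()
}

# Spread of the vowel across contexts, before and after normalization.
spread = {}
for formant in ("F1", "F2"):
    before = statistics.pstdev([observed[c][formant] for c in observed])
    after = statistics.pstdev([normalized[c][formant] for c in normalized])
    spread[formant] = (before, after)
    print(formant, round(before, 1), round(after, 1))
```

With real data the scaling parameters would have to be estimated from the ensembles themselves, and the reduction in scatter would depend on how well the linear relation holds.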

It is noteworthy that none of the data operations for analyzing the linear scaling relation require the estimation of any of the parameters or functions that make up the original formant-contour model. This follows from the fact that the result arises from the linear structure of the model and not from any specific implementation of it. As a consequence, the linear scaling of VFEs can be studied as a phenomenon in its own right without reference to the model which predicted it.

Section snippets

Development of the scaling hypothesis

In this section we derive the linear scaling of vowel-formant ensembles as a prediction from one of our models in BC87. In Section 2.1 we introduce the model in its time-axis formulation and discuss some of its properties and their roots in earlier work. In particular, we note the linear structure of the model which leads to its prediction of the linear scaling of VFEs. In Section 2.2 a “vowel-axis” reformulation of the model provides an explicit characterization of the linear scaling relation

Utterances and recording

The data we use are from a General Australian English speaker's productions of N_V = 7 undiphthongized vowels in 7 CVd contexts (C = /h, b, d, g, p, t, k/, V = /i, ε, æ, a, ɒ, ʌ, ɜ/, C = /d/). We follow Bernard (1970) for the phonetic symbolization of the vowels. Key words for their pronunciations are, respectively, “hid”, “head”, “had”, “hard”, “hod”, “hudd” and “herd”. Each combination of vowel and context is represented in the dataset by N_rep = 5 repetitions.

The vowels included in the dataset are monophthongs in

Systematic errors in the model

We know ahead of time that Eq. (1) will almost certainly not be exact even for the mean statistical trend of phonetic data. As already shown by Broad and Fertig (1970) there exists a systematic departure from additivity which, though numerically small, is highly significant statistically. This is similar to our scaling result where the departures from Eq. (8) are greater than those attributable to interrepetition variation to a high (>99.5%) level of significance, even though numerically the

Conclusion

We began by noting that although contextual effects on the formant contour of a single vowel in CVC context will show up most obviously in its transitions, the overall pattern of a context's effects is more clearly exhibited by the family of formant contours for the context's range of vowels. Such a family may have a fairly close intervowel spacing near a consonantal boundary where the contours tend to converge toward a consonant locus and a more dilated one toward the syllable center where

References (23)

  • Clermont, F., 1993. Spectro-temporal description of diphthongs in F1–F2–F3 space. Speech Communication.
  • Van Bergem, D.R., 1994. A model of coarticulatory effects on the schwa. Speech Communication.
  • Yang, H.H., et al., 2000. Relevance of time-frequency features for phonetic and speaker-channel classification. Speech Communication.
  • Bernard, J.R.L., 1970. Toward the acoustic specification of Australian English. Z. Phonetik Sprachwiss. Komm. Forsch.
  • Broad, D.J., 1976. Toward defining acoustic phonetic equivalence for vowels. Phonetica.
  • Broad, D.J., et al., 1984. A superposition model for coarticulation in certain CVC utterances. J. Acoust. Soc. Am.
  • Broad, D.J., et al., 1987. A methodology for modeling vowel formant contours in CVC context. J. Acoust. Soc. Am.
  • Broad, D.J., et al., 1970. Formant-frequency trajectories in selected CVC utterances. J. Acoust. Soc. Am.
  • Clermont, F., 1991. Formant contour models of diphthongs: a study in acoustic phonetics and computer modelling of...
  • Delattre, P.C., et al., 1952. Acoustic loci and transitional cues for consonants. J. Acoust. Soc. Am.
  • Fant, C.G.M., 1960. The Acoustic Theory of Speech Production.