Computer Speech & Language

Volume 45, September 2017, Pages 123-136

Hierarchical representation and estimation of prosody using continuous wavelet transform

https://doi.org/10.1016/j.csl.2016.11.001

Highlights

  • We introduce a wavelet based representation system for speech prosody.

  • Emergent hierarchy from f0, intensity and duration.

  • Prominences and boundaries are represented in one framework.

  • System allows for efficient analysis and annotation of prosodic events.

  • The unsupervised prosodic labelling scheme is comparable with supervised methods.

Abstract

Prominences and boundaries are the essential constituents of prosodic structure in speech. They provide a means to chunk the speech stream into linguistically relevant units by assigning those units relative saliences and demarcating them within utterance structures. Both prominences and boundaries have been widely used in basic research on prosody as well as in text-to-speech synthesis. However, there is no representation scheme that allows both to be estimated and modelled in a unified fashion. Here we present an unsupervised, unified account for estimating and representing prosodic prominences and boundaries using a scale-space analysis based on the continuous wavelet transform. The methods are evaluated and compared to earlier work using the Boston University Radio News corpus. The results show that the proposed method is comparable with the best published supervised annotation methods.

Introduction

Two of the most fundamental aspects of speech prosody concern the chunking of speech into linguistically relevant units above the segment and the relative salience of those units; that is, boundaries and prominences, respectively. These two aspects are present in every utterance and are central to any representation of speech prosody. The arrangement of prominence patterns and the placement of boundaries reflect the hierarchical structure of speech, i.e., the gradual nesting of units: segments within syllables, syllables within (prosodic) words, words within phrases, phrases within utterances, and beyond (Tseng et al., 2005). Borders between adjoining units of higher order – words, phrases – present affordances for prosodic breaks of different types and strengths. The listener’s attention can be selectively drawn to individual units within the hierarchy; prominent syllables mark lexical stress, prominent words signal focus, etc.

In speech, boundaries are usually signalled by a local reduction in one or more signal characteristics (such as intensity or pitch) at a border spanning several hierarchical levels. In a complementary fashion, prominence is typically associated with an increase in some or all of these signal properties, typically associated with a particular hierarchical level.

This simple insight suggests that these prosodic constituents could be represented within a uniform methodology that identifies both prominences and boundaries as complementary phenomena manifested in speech signals. Such a methodology would benefit both basic speech research and speech technology, especially speech synthesis and recognition. At the same time, to be useful for data-oriented research and technology, the annotation system should strive to be unsupervised, as opposed to systems that rely on humans either labelling speech data directly (Silverman et al., 1992) or providing a manually labelled training set used to train the system.

Ideally, the system should approach human-like performance but without the variability of human labellers caused by complex interactions between the top-down and bottom-up influences. In order to achieve that we propose here a system based on Continuous Wavelet Transform (CWT) that (1) approximates human processing of a complex signal relevant for identifying prominence and boundaries, and (2) is capable of representing the speech signal in a manner that captures the hierarchical nature of prosodic signalling.

In this paper we present a hierarchical, time-frequency scale-space analysis of prosodic signals (e.g., fundamental frequency, energy, duration) based on the CWT. The presented algorithms can be used to analyse and annotate speech signals in an entirely unsupervised fashion. The work stems from the need to annotate speech corpora automatically for text-to-speech synthesis (TTS) (s4a, 2014) and the subject matter is partly examined from that point of view. However, the presented representations should be of interest to anyone working on speech prosody.

Wavelets extend classical Fourier theory by replacing a fixed window with a family of scaled windows, resulting in a scalogram that resembles the spectrogram commonly used for analysing speech signals. The most interesting aspect of wavelet analysis with respect to speech is that it resembles the perceptual hierarchical structures related to prosody. In scalograms, speech sounds, syllables, (phonological) words, and phrases can be localised precisely in both time and frequency (scale). This would be considerably more difficult to achieve with traditional spectrograms. Furthermore, wavelets provide a natural means to discretise and operationalise continuous prosodic signals.

Fig. 1 shows how the hierarchical nature of speech can be captured in a time-frequency scale-space by the CWT of a composite prosodic signal of an English utterance. The scalogram is shown as a heat map in the top part of the figure, above the signal contour in blue. The scalogram is constructed from multiple scale functions (see also Fig. 2). Each scale function is a convolution of the original signal with a dilated, i.e., scaled, version of the mother wavelet (the Mexican hat wavelet in this case). Three examples of the scaled wavelets are shown to the left of the scalogram; as can be seen, the scalogram results from convolution with progressively more dilated – wider and taller – wavelets. The convolution measures the similarity between the two convolved functions, the signal and the wavelet. As the highlighted area in the figure illustrates, a local similarity in shape between the signal and the dilated wavelet leads to a high value of the scale function (red area in the heat map); the most dissimilar portions of the signal (valleys, compared to the peak-like shape of the wavelets) yield negative values of the scale function, shown in blue.
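The construction just described – convolving a signal with progressively dilated copies of a Mexican hat wavelet – can be sketched in a few lines of Python. The toy contour, the dyadic scale set, and the 1/√s normalisation below are illustrative choices for exposition, not the exact settings used in this work:

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet: second derivative of a Gaussian."""
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def cwt_scalogram(signal, scales):
    """CWT: convolve the signal with dilated copies of the mother
    wavelet; one scalogram row per scale."""
    rows = np.zeros((len(scales), len(signal)))
    for i, s in enumerate(scales):
        half = int(np.ceil(5 * s))              # wavelet support ~ +/- 5 s
        t = np.arange(-half, half + 1) / s
        wavelet = mexican_hat(t) / np.sqrt(s)   # keep energy comparable across scales
        rows[i] = np.convolve(signal, wavelet, mode="same")
    return rows

# Toy "prosodic" contour: slow phrase-level rise-fall plus two accent-like bumps.
t = np.linspace(0.0, 1.0, 400)
contour = (np.sin(np.pi * t)
           + 0.3 * np.exp(-((t - 0.3) / 0.05) ** 2)
           + 0.3 * np.exp(-((t - 0.7) / 0.05) ** 2))
scales = 2.0 ** np.arange(2, 6)                 # dyadic scales: 4, 8, 16, 32
S = cwt_scalogram(contour, scales)
print(S.shape)                                  # (4, 400)
```

Plotting S as an image, with scale on the vertical axis, yields a heat map analogous to the scalogram of Fig. 1: the narrow bumps light up at fine scales and the slow rise-fall at coarse scales.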

The tree structure superimposed in black over the scalogram in Fig. 1 joins the red areas of high similarity with differently scaled wavelets, i.e., it depicts the hierarchy of portions of the signal that are “prominent” at various scales. This hierarchical utterance structure has served as a basis for modelling the prosody – e.g., speech melody, timing, lexical stress, and prominence structure – of synthetic speech.

Controlling prosody in synthesis has been based on a number of different theoretical approaches stemming from both phonological and phonetic considerations. The phonologically based approaches stem from the so-called Autosegmental-Metrical theory (Goldsmith, 1990), which is based on the three-dimensional phonology developed in Halle et al. (1978) and Halle and Vergnaud (1980), as noted in Klatt (1987). These models are sequential in nature, though a hierarchical structure is explicitly referred to for certain features of the models (e.g., break indices in ToBI; Silverman et al., 1992). The more phonetically oriented hierarchical models are based on the assumption that prosody – especially intonation – is truly hierarchical in a superpositional and parallel fashion.

Models capturing the superpositional nature of intonation were first proposed by Öhman (1967), whose model was further developed by Fujisaki and Sudo (1971) and Fujisaki and Hirose (1984) into the so-called command-response model, which assumes two separate types of articulatory commands: accent commands associated with stressed syllables, superposed on phrases with their own commands. The accent commands produce faster changes which are superposed on slowly varying phrase contours. Several superpositional models with varying numbers of levels have been proposed since Fujisaki (Anumanchipalli, Oliveira, Black, 2011, Bailly, Holm, 2005, Kochanski, Shih, 2000, Kochanski, Shih, 2003). Superpositional models attempt to capture both the chunking of speech into phrases as well as the highlighting of words within an utterance. Typically, smaller-scale changes – caused by, e.g., the modulation of the airflow (and consequently the f0) by the closing of the vocal tract during certain consonants – are not modelled.
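As a rough illustration of the command-response idea, the sketch below superposes one phrase component and two accent components on a log-F0 baseline. The functional forms follow the Fujisaki model (a critically damped second-order phrase response and a clipped accent step response), but all parameter values – baseline, time constants, command times and amplitudes – are arbitrary choices for illustration only:

```python
import numpy as np

def phrase_component(t, alpha=2.0):
    """Phrase control: impulse response of a critically damped
    second-order system, Gp(t) = alpha^2 * t * exp(-alpha*t) for t >= 0."""
    tt = np.maximum(t, 0.0)
    return np.where(t >= 0, alpha**2 * tt * np.exp(-alpha * tt), 0.0)

def accent_component(t, beta=20.0, gamma=0.9):
    """Accent control: clipped step response,
    Ga(t) = min(1 - (1 + beta*t) * exp(-beta*t), gamma) for t >= 0."""
    tt = np.maximum(t, 0.0)                   # avoid overflow for t << 0
    g = np.minimum(1.0 - (1.0 + beta * tt) * np.exp(-beta * tt), gamma)
    return np.where(t >= 0, g, 0.0)

t = np.linspace(0.0, 3.0, 600)
base = 80.0                                   # speaker baseline F0 (Hz)

# One phrase command at t = 0 s; two accents given as onset/offset pairs.
log_f0 = np.log(base) + 0.5 * phrase_component(t)
for t1, t2, amp in [(0.5, 0.9, 0.4), (1.6, 2.0, 0.3)]:
    log_f0 += amp * (accent_component(t - t1) - accent_component(t - t2))

f0 = np.exp(log_f0)
print(round(float(f0[0]), 1))                 # 80.0: both components are zero at t = 0
```

The final exponentiation reflects that the superposition takes place in the log-frequency domain, so the fast accent movements ride multiplicatively on the slowly varying phrase contour.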

Prominence is a functional phonological phenomenon that signals syntagmatic relations of units within an utterance by highlighting some parts of the speech signal while attenuating others. Thus, for instance, some of the syllables within a word stand out as stressed (Eriksson et al., 2001). At the level of words, prominence relations can signal how important the speaker considers each word in relation to others in the same utterance. These often information-based relations range from simple phrasal structures (e.g., prime minister, yellow car) to relating utterances to each other in discourse, as in the case of contrastive focus (e.g., “Where did you leave your car? No, we WALKED here.”). Although prominence impressions might be continuous, they may serve categorical functions. Thus, prominence can be categorised (Arnold, Wagner, Möbius, 2012, Cole, Mo, Hasegawa-Johnson, 2010) into, e.g., four levels, ranging from words that are not prosodically stressed in any fashion, through moderately stressed and stressed words, to words that are emphasised (as the word WALKED in the example above). These four categories are fairly easily and consistently labelled even by non-expert listeners (Vainio et al., 2009). In sum, prominence functions to structure utterances in a hierarchical fashion that directs the listener’s attention in a way that enables an optimal understanding of the message. However, prominent units – be they words or syllables – do not by themselves demarcate the speech signal; they are accompanied by boundaries that chunk the prominent and non-prominent units into larger ones: syllables into (phonological) words, words into phrases, and so forth. Prominence and boundary estimation have nevertheless been treated as separate problems stemming from different sources in the speech signals.

As functional – rather than formal, purely signal-based – prosodic phenomena, prominences and boundaries lend themselves optimally to statistical modelling (traditionally by supervised methods). The actual signalling of prosody in terms of speech parameters is extremely complex and context sensitive: form follows function in a complex fashion. Capturing prominences and boundaries in terms of one-dimensional values reduces the representational complexity of speech annotations in an advantageous way. In a synthesis system this reduction occurs at a juncture that is relevant in terms of both representations and data scarcity. The complex feature set that is known to affect the prosody of speech can be narrowed from dozens of context-sensitive features – such as part-of-speech and whatever else can be computed from the input text – to a few categories or a single continuum. Viewed this way, both prominences and boundaries can be seen as abstract phonological functions that impact the phonetic realisation of the speech signal predictably, despite possibly considerable phonetic variation.

The perceived prominence of a given word in an utterance is a product of many separate sources of information; these are mostly signal-based, although other linguistic, top-down factors have been shown to modulate the perception (Cole, Mo, Hasegawa-Johnson, 2010, Eriksson, Grabe, Traunmüller, 2002, Eriksson, Thunberg, Traunmüller, 2001, Vainio, Järvikivi, 2006, Vainio, Suni, Raitio, Nurminen, Järvikivi, Alku, 2009, Wagner et al., 2015). Typically, a prominent word is accompanied by a clearly audible f0 movement, its stressed syllable is longer in duration, and its intensity is higher. However, because of the combination of bottom-up and top-down influences, and their manifestation in the hierarchical character of speech discussed above, estimating prominences automatically is not straightforward, and a multitude of different estimation algorithms have been suggested (see Section 3 for more detail).

In what follows we present recently developed methods for automatic prominence estimation and boundary detection based on the CWT (Section 2), which provide fully automatic and unsupervised means to estimate both (word) prominences and boundary values from a hierarchical representation of speech (see Suni et al., 2013, Vainio et al., 2013, Vainio et al., 2015 for earlier work). The main insight in this methodology is that both prominences and boundaries can be treated as arising from the same sources in the (prosodic) speech signals and estimated with exactly the same methods. These methods, then, provide a uniform representation of prosody that is useful in both speech synthesis and basic phonetic research. The representations are purely computational and thus objective. It is, however, interesting to see how the proposed hierarchical method relates to annotations provided by humans as well as to earlier attempts at the problem (Section 3).
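The claimed complementarity can be illustrated with a toy computation (deliberately simpler than the method of Section 2): accumulating the peak-like, positive values of a scalogram over scales behaves like a prominence signal, while accumulating the valley-like, negative values behaves like a boundary signal. Everything below – the inline mini-CWT, the two-“word” test contour, the simple column sums – is a hypothetical simplification for exposition:

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat mother wavelet."""
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def scalogram(x, scales):
    """Minimal CWT: one row per scale, each row the convolution of the
    signal with a dilated Mexican hat wavelet."""
    rows = []
    for s in scales:
        k = np.arange(-int(4 * s), int(4 * s) + 1) / s
        rows.append(np.convolve(x, mexican_hat(k) / np.sqrt(s), mode="same"))
    return np.array(rows)

def prominence_and_boundary(S):
    """Peaks (positive values) accumulated over scales act as a
    prominence signal; valleys (negative values) as a boundary signal."""
    return np.clip(S, 0, None).sum(axis=0), np.clip(-S, 0, None).sum(axis=0)

# Two "words" (bumps) separated by a dip acting as a boundary.
t = np.linspace(0.0, 1.0, 300)
x = np.exp(-((t - 0.25) / 0.08) ** 2) + np.exp(-((t - 0.75) / 0.08) ** 2)
S = scalogram(x - x.mean(), [4, 8, 16, 32])
prom, bound = prominence_and_boundary(S)

# Prominence is largest over the words; boundary strength over the dip.
print(prom[75] > prom[150], bound[150] > bound[75])    # True True
```

In the actual CWT-LoMA method, continuous lines of maximum (and minimum) amplitude are tracked across scales rather than summed column-wise, which is what yields the hierarchical tree of Fig. 1; the sketch only shows that a single scalogram carries both kinds of information.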

Methods

Wavelets are used in a great variety of applications: for effectively compressing and denoising signals, for representing the hierarchical properties of multidimensional signals such as polychromatic visual patterns in image retrieval, and for modelling optical signal processing in visual neural fields (ter Haar Romeny, 2014, Russ, Woods, 1995). In speech and auditory research, too, they have a long history going back to the 1970s (Altosaar, Karjalainen, 1988, Giraud, Poeppel, 2012, Ramachandran, Mammone,

Experimental evaluation

As stated in the introduction, a solid method for annotating prosody would be very welcome in the field of speech synthesis, where recent development has concentrated on acoustic modelling (Zen et al., 2007). The motivation is particularly acute when building speech synthesisers for low-resourced languages, where neither linguistically nor prosodically annotated corpora are available (s4a, 2014). In this section, we assess the utility of the proposed CWT-LoMA representation of prosody on the tasks of

Discussion and conclusions

The results show that prominences and boundaries can be viewed as manifestations of the same underlying speech production process. This has, of course, many theoretical implications. As foremost is the fact that the suprasegmental variables used (f0, energy envelope, duration) seem to work seamlessly to the same end, which is to signal the hierarchical and parallel structure of the linguistic signals. The role of signal energy as a reliable determinant of prosodic structure is interesting, but

Acknowledgements

The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007–2013) under Grant agreement no. 287678 (Simple4All) and the Academy of Finland (project no. 1265610 (the MIND programme)) as well as project no. 293346 (the Digital Humanities programme). We are indebted to Petra Wagner and two anonymous reviewers for their insightful comments and suggestions.

References (64)

  • G. Bailly et al.

    SFC: a trainable prosodic model

    Speech Commun.

    (2005)
  • G. Kochanski et al.

    Prosody modeling with soft templates

    Speech Commun.

    (2003)
  • M. Vainio et al.

    Tonal features, intensity, and word order in the perception of prominence

    J. Phon.

    (2006)
  • Vainio, M., Suni, A., Aalto, D., 2013. Continuous wavelet transform for analysis of speech prosody. Proceedings of...
  • T. Altosaar et al.

    Event-based multiple-resolution analysis of speech signals

    Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP-88

    (1988)
  • S. Ananthakrishnan et al.

    Automatic prosodic event detection using acoustic, lexical, and syntactic evidence

    IEEE Trans. Audio Speech Lang. Process.

    (2008)
  • S. Ananthakrishnan et al.

    Combining acoustic, lexical, and syntactic evidence for automatic unsupervised prosody labeling

    Proceedings of InterSpeech

    (2006)
  • G.K. Anumanchipalli et al.

    A statistical phrase/accent model for intonation modeling

    Proceedings of InterSpeech

    (2011)
  • D. Arnold et al.

    Obtaining prominence judgments from naïve listeners–influence of rating scales, linguistic levels and normalisation

    Proceedings of Interspeech 2012

    (2012)
  • J. Barnes et al.

    Voiceless intervals and perceptual completion in f0 contours: evidence from scaling perception in American English

    Proceedings of 16th International Congress of Phonetic Sciences, ICPhS

    (2011)
  • J. Cole et al.

    Signal-based and expectation-based factors in the perception of prosodic prominence

    Lab. Phonol.

    (2010)
  • I. Daubechies

    Ten Lectures on Wavelets

    (1992)
  • A. Eriksson et al.

    Perception of syllable prominence by listeners with and without competence in the tested language

    Proceedings of International Conference on Speech Prosody 2002

    (2002)
  • A. Eriksson et al.

    Syllable prominence: a matter of vocal effort, phonetic distinctness and top-down processing

    Proceedings of European Conference on Speech Communication and Technology Aalborg, September 2001

    (2001)
  • M.H. Farouk

    Application of Wavelets in Speech Processing

    (2014)
  • H. Fujisaki et al.

    Analysis of voice fundamental frequency contours for declarative sentences of Japanese

    J. Acoust. Soc. Jpn. (E)

    (1984)
  • H. Fujisaki et al.

    A Generative Model for the Prosody of Connected Speech in Japanese, Annual Report

    (1971)
  • A.-L. Giraud et al.

    Cortical oscillations and speech processing: emerging computational principles and operations

    Nat. Neurosci.

    (2012)
  • B.R. Glasberg et al.

    Gap detection and masking in hearing-impaired and normal-hearing subjects

    J. Acoust. Soc. Am.

    (1987)
  • J.A. Goldsmith

    Autosegmental and Metrical Phonology

    (1990)
  • A. Grossman et al.

    Decomposition of functions into wavelets of constant shape, and related transforms

    Mathematics and Physics:Lectures on Recent Results

    (1985)
  • B.M. ter Haar Romeny

    A geometric model for the functional circuits of the visual front-end

    Brain-Inspired Computing

    (2014)
  • M. Halle et al.

    Three dimensional phonology

    J. Linguist. Res.

    (1980)
  • M. Halle et al.

    Metrical Structures in Phonology

    (1978)
  • G.H. Hardy et al.

    A maximal theorem with function-theoretic applications

    Acta Math.

    (1930)
  • O. Kalinli et al.

    A saliency-based auditory attention model with applications to unsupervised prominent syllable detection in speech

    Proceedings of the 8th Annual Conference of the International Speech Communication Association, InterSpeech, Antwerp, Belgium, August 27–31, 2007

    (2007)
  • O. Kalinli et al.

    Prominence detection using auditory attention cues and task-dependent high level information

    IEEE Trans. Audio Speech Lang. Process.

    (2009)
  • D.H. Klatt

    Review of text-to-speech conversion for English

    J. Acoust. Soc. Am.

    (1987)
  • G. Kochanski et al.

    Loudness predicts prominence: fundamental frequency lends little

    J. Acoust. Soc. Am.

    (2005)
  • G. Kochanski et al.

    Stem-ml: language-independent prosody description

    Proceedings of InterSpeech

    (2000)
  • H. Kruschke et al.

    Estimation of the parameters of the quantitative intonation model with continuous wavelet analysis

    Proceedings of Eighth European Conference on Speech Communication and Technology

    (2003)
  • M. Lei et al.

    A hierarchical f0 modeling method for HMM-based speech synthesis

    Proceedings of InterSpeech

    (2010)