Computer Speech & Language

Volume 45, September 2017, Pages 123-136

Hierarchical representation and estimation of prosody using continuous wavelet transform

https://doi.org/10.1016/j.csl.2016.11.001

Highlights

  • We introduce a wavelet based representation system for speech prosody.

  • Emergent hierarchy from f0, intensity and duration.

  • Prominences and boundaries are represented in one framework.

  • System allows for efficient analysis and annotation of prosodic events.

  • The unsupervised prosodic labelling scheme is comparable with supervised methods.

Abstract

Prominences and boundaries are the essential constituents of prosodic structure in speech. They provide a means to chunk the speech stream into linguistically relevant units by assigning those units relative saliences and demarcating them within utterance structures. Both prominences and boundaries have been widely used in basic research on prosody as well as in text-to-speech synthesis. However, there is no representation scheme that allows both to be estimated and modelled in a unified fashion. Here we present an unsupervised, unified account for estimating and representing prosodic prominences and boundaries using a scale-space analysis based on the continuous wavelet transform. The methods are evaluated and compared to earlier work using the Boston University Radio News corpus. The results show that the proposed method is comparable with the best published supervised annotation methods.

Introduction

Two of the most fundamental aspects of speech prosody concern the chunking of speech into linguistically relevant units above the segment and the relative salience of those units; that is, boundaries and prominences, respectively. These two aspects are present in every utterance and are central to any representation of speech prosody. The arrangement of prominence patterns and the placement of boundaries reflect the hierarchical structure of speech, i.e., the gradual nesting of units: segments within syllables, syllables within (prosodic) words, words within phrases, phrases within utterances, and beyond (Tseng et al., 2005). Borders between adjoining units of higher order – words, phrases – present affordances for prosodic breaks of different types and strengths. The listener’s attention can be selectively drawn to individual units within the hierarchy; prominent syllables mark lexical stress, prominent words signal focus, etc.

In speech, boundaries are usually signalled by a local reduction in one or more signal characteristics (such as intensity or pitch) at a border spanning several hierarchical levels. In a complementary fashion, prominence is typically associated with an increase in some or all of these signal properties, typically associated with a particular hierarchical level.

This simple insight suggests that these prosodic constituents could be represented within a uniform methodology that identifies both prominences and boundaries as complementary phenomena manifested in speech signals. Such a methodology would benefit both basic speech research and speech technology, especially speech synthesis and recognition. At the same time, to be useful for data-oriented research and technology, the annotation system should strive to be unsupervised, as opposed to systems that rely on humans either labelling speech data directly (Silverman et al., 1992) or providing a manually labelled training set used to train the system.

Ideally, the system should approach human-like performance but without the variability of human labellers caused by complex interactions between the top-down and bottom-up influences. In order to achieve that we propose here a system based on Continuous Wavelet Transform (CWT) that (1) approximates human processing of a complex signal relevant for identifying prominence and boundaries, and (2) is capable of representing the speech signal in a manner that captures the hierarchical nature of prosodic signalling.

In this paper we present a hierarchical, time-frequency scale-space analysis of prosodic signals (e.g., fundamental frequency, energy, duration) based on the CWT. The presented algorithms can be used to analyse and annotate speech signals in an entirely unsupervised fashion. The work stems from the need to annotate speech corpora automatically for text-to-speech synthesis (TTS) (s4a, 2014) and the subject matter is partly examined from that point of view. However, the presented representations should be of interest to anyone working on speech prosody.

Wavelets extend classical Fourier theory by replacing a fixed window with a family of scaled windows, resulting in a scalogram that resembles the spectrogram commonly used for analysing speech signals. The most interesting aspect of wavelet analysis with respect to speech is that it resembles the perceptual hierarchical structures related to prosody. In scalograms, speech sounds, syllables, (phonological) words, and phrases can be localised precisely in both time and frequency (scale). This would be considerably more difficult to achieve with traditional spectrograms. Furthermore, wavelets provide a natural means to discretise and operationalise continuous prosodic signals.

Fig. 1 shows how the hierarchical nature of speech can be captured in a time-frequency scale-space by the CWT of a composite prosodic signal of an English utterance. The scalogram is shown as a heat map in the top part of the figure, above the signal contour in blue. The scalogram is constructed from multiple scale functions (see also Fig. 2). Each scale function is a convolution of the original signal with a dilated, i.e., scaled, version of the mother wavelet (the Mexican hat wavelet in this case). Three examples of the scaled wavelets are shown to the left of the scalogram; as can be seen, the scalogram results from convolution with progressively more dilated – wider and taller – wavelets. The convolution measures the similarity between the two convolved functions, the signal and the wavelet. As the highlighted area in the figure illustrates, a local similarity in shape between the signal and the dilated wavelet leads to a high value of the scale function (red area in the heat map); the most dissimilar portions of the signal (valleys, compared to the peak-like shape of the wavelets) yield negative values of the scale function, shown in blue.
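The construction just described – convolving a signal with progressively dilated copies of a Mexican hat wavelet – can be sketched in a few lines of Python. The toy contour, the dyadic scale set, and the 1/√s normalisation below are illustrative choices for exposition, not the exact settings used in this work:

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet: second derivative of a Gaussian."""
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def cwt_scalogram(signal, scales):
    """CWT: convolve the signal with dilated copies of the mother
    wavelet; one scalogram row per scale."""
    rows = np.zeros((len(scales), len(signal)))
    for i, s in enumerate(scales):
        half = int(np.ceil(5 * s))              # wavelet support ~ +/- 5 s
        t = np.arange(-half, half + 1) / s
        wavelet = mexican_hat(t) / np.sqrt(s)   # keep energy comparable across scales
        rows[i] = np.convolve(signal, wavelet, mode="same")
    return rows

# Toy "prosodic" contour: slow phrase-level rise-fall plus two accent-like bumps.
t = np.linspace(0.0, 1.0, 400)
contour = (np.sin(np.pi * t)
           + 0.3 * np.exp(-((t - 0.3) / 0.05) ** 2)
           + 0.3 * np.exp(-((t - 0.7) / 0.05) ** 2))
scales = 2.0 ** np.arange(2, 6)                 # dyadic scales: 4, 8, 16, 32
S = cwt_scalogram(contour, scales)
print(S.shape)                                  # (4, 400)
```

Plotting S as an image, with scale on the vertical axis, yields a heat map analogous to the scalogram of Fig. 1: the narrow bumps light up at fine scales and the slow rise-fall at coarse scales.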

The tree structure superimposed in black over the scalogram in Fig. 1 joins the red areas of high similarity with differently scaled wavelets, i.e., it depicts the hierarchy of portions of the signal that are “prominent” at various scales. This hierarchical utterance structure has served as a basis for modelling the prosody – e.g., speech melody, timing, lexical stress, and prominence structure – of synthetic speech.

Controlling prosody in synthesis has been based on a number of different theoretical approaches stemming from both phonological and phonetic considerations. The phonologically based approaches stem from the so-called Autosegmental-Metrical theory (Goldsmith, 1990), which is based on the three-dimensional phonology developed in Halle et al. (1978) and Halle and Vergnaud (1980), as noted in Klatt (1987). These models are sequential in nature, though a hierarchical structure is explicitly referred to for certain features of the models (e.g., break indices in ToBI; Silverman et al., 1992). The more phonetically oriented hierarchical models are based on the assumption that prosody – especially intonation – is truly hierarchical in a superpositional and parallel fashion.

Models capturing the superpositional nature of intonation were first proposed by Öhman (1967), whose model was further developed by Fujisaki and Sudo (1971) and Fujisaki and Hirose (1984) into the so-called command-response model, which assumes two separate types of articulatory commands: accent commands associated with stressed syllables, superposed on phrases with their own commands. The accent commands produce faster changes which are superposed on slowly varying phrase contours. Several superpositional models with varying numbers of levels have been proposed since Fujisaki (Anumanchipalli, Oliveira, Black, 2011, Bailly, Holm, 2005, Kochanski, Shih, 2000, Kochanski, Shih, 2003). Superpositional models attempt to capture both the chunking of speech into phrases as well as the highlighting of words within an utterance. Typically, smaller-scale changes – caused by, e.g., the modulation of the airflow (and consequently the f0) by the closing of the vocal tract during certain consonants – are not modelled.
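As a rough illustration of the command-response idea, the sketch below superposes one phrase component and two accent components on a log-F0 baseline. The functional forms follow the Fujisaki model (a critically damped second-order phrase response and a clipped accent step response), but all parameter values – baseline, time constants, command times and amplitudes – are arbitrary choices for illustration only:

```python
import numpy as np

def phrase_component(t, alpha=2.0):
    """Phrase control: impulse response of a critically damped
    second-order system, Gp(t) = alpha^2 * t * exp(-alpha*t) for t >= 0."""
    tt = np.maximum(t, 0.0)
    return np.where(t >= 0, alpha**2 * tt * np.exp(-alpha * tt), 0.0)

def accent_component(t, beta=20.0, gamma=0.9):
    """Accent control: clipped step response,
    Ga(t) = min(1 - (1 + beta*t) * exp(-beta*t), gamma) for t >= 0."""
    tt = np.maximum(t, 0.0)                   # avoid overflow for t << 0
    g = np.minimum(1.0 - (1.0 + beta * tt) * np.exp(-beta * tt), gamma)
    return np.where(t >= 0, g, 0.0)

t = np.linspace(0.0, 3.0, 600)
base = 80.0                                   # speaker baseline F0 (Hz)

# One phrase command at t = 0 s; two accents given as onset/offset pairs.
log_f0 = np.log(base) + 0.5 * phrase_component(t)
for t1, t2, amp in [(0.5, 0.9, 0.4), (1.6, 2.0, 0.3)]:
    log_f0 += amp * (accent_component(t - t1) - accent_component(t - t2))

f0 = np.exp(log_f0)
print(round(float(f0[0]), 1))                 # 80.0: both components are zero at t = 0
```

The final exponentiation reflects that the superposition takes place in the log-frequency domain, so the fast accent movements ride multiplicatively on the slowly varying phrase contour.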

Prominence is a functional phonological phenomenon that signals syntagmatic relations of units within an utterance by highlighting some parts of the speech signal while attenuating others. Thus, for instance, some of the syllables within a word stand out as stressed (Eriksson et al., 2001). At the level of words, prominence relations can signal how important the speaker considers each word in relation to others in the same utterance. These often information-based relations range from simple phrasal structures (e.g., prime minister, yellow car) to relating utterances to each other in discourse, as in the case of contrastive focus (e.g., “Where did you leave your car? No, we WALKED here.”). Although prominence impressions might be continuous, they may serve categorical functions. Thus, prominence can be categorised (Arnold, Wagner, Möbius, 2012, Cole, Mo, Hasegawa-Johnson, 2010) into, e.g., four levels, ranging from words that are not prosodically stressed in any fashion, through moderately stressed and stressed words, to words that are emphasised (as the word WALKED in the example above). These four categories are fairly easily and consistently labelled even by non-expert listeners (Vainio et al., 2009). In sum, prominence functions to structure utterances in a hierarchical fashion that directs the listener’s attention in a way that enables an optimal understanding of the message. However, prominent units – be they words or syllables – do not by themselves demarcate the speech signal; they are accompanied by boundaries that chunk the prominent and non-prominent units into larger ones: syllables into (phonological) words, words into phrases, and so forth. Prominence and boundary estimation have nevertheless been treated as separate problems stemming from different sources in the speech signals.

As functional – rather than formal, purely signal-based – prosodic phenomena, prominences and boundaries lend themselves optimally to statistical modelling (traditionally by supervised methods). The actual signalling of prosody in terms of speech parameters is extremely complex and context sensitive: form follows function in a complex fashion. Capturing prominences and boundaries in terms of one-dimensional values reduces the representational complexity of speech annotations in an advantageous way. In a synthesis system this reduction occurs at a juncture that is relevant in terms of both representations and data scarcity. The complex feature set that is known to affect the prosody of speech can be narrowed from dozens of context-sensitive features – such as part-of-speech and whatever else can be computed from the input text – to a few categories or a single continuum. Viewed this way, both prominences and boundaries can be seen as abstract phonological functions that impact the phonetic realisation of the speech signal predictably, despite possibly considerable phonetic variation.

The perceived prominence of a given word in an utterance is a product of many separate sources of information; these are mostly signal-based, although other linguistic, top-down factors have been shown to modulate the perception (Cole, Mo, Hasegawa-Johnson, 2010, Eriksson, Grabe, Traunmüller, 2002, Eriksson, Thunberg, Traunmüller, 2001, Vainio, Järvikivi, 2006, Vainio, Suni, Raitio, Nurminen, Järvikivi, Alku, 2009, Wagner et al., 2015). Typically, a prominent word is accompanied by a clearly audible f0 movement, its stressed syllable is longer in duration, and its intensity is higher. However, because of the combination of bottom-up and top-down influences, and their manifestation in the hierarchical character of speech discussed above, estimating prominences automatically is not straightforward, and a multitude of different estimation algorithms have been suggested (see Section 3 for more detail).

In what follows we present recently developed methods for automatic prominence estimation and boundary detection based on the CWT (Section 2), which provide fully automatic and unsupervised means to estimate both (word) prominences and boundary values from a hierarchical representation of speech (see Suni et al., 2013, Vainio et al., 2013, Vainio et al., 2015 for earlier work). The main insight in this methodology is that both prominences and boundaries can be treated as arising from the same sources in the (prosodic) speech signals and estimated with exactly the same methods. These methods, then, provide a uniform representation of prosody that is useful in both speech synthesis and basic phonetic research. The representations are purely computational and thus objective. It is, however, interesting to see how the proposed hierarchical method relates to annotations provided by humans as well as to earlier attempts at the problem (Section 3).
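The claimed complementarity can be illustrated with a toy computation (deliberately simpler than the method of Section 2): accumulating the peak-like, positive values of a scalogram over scales behaves like a prominence signal, while accumulating the valley-like, negative values behaves like a boundary signal. Everything below – the inline mini-CWT, the two-“word” test contour, the simple column sums – is a hypothetical simplification for exposition:

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat mother wavelet."""
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def scalogram(x, scales):
    """Minimal CWT: one row per scale, each row the convolution of the
    signal with a dilated Mexican hat wavelet."""
    rows = []
    for s in scales:
        k = np.arange(-int(4 * s), int(4 * s) + 1) / s
        rows.append(np.convolve(x, mexican_hat(k) / np.sqrt(s), mode="same"))
    return np.array(rows)

def prominence_and_boundary(S):
    """Peaks (positive values) accumulated over scales act as a
    prominence signal; valleys (negative values) as a boundary signal."""
    return np.clip(S, 0, None).sum(axis=0), np.clip(-S, 0, None).sum(axis=0)

# Two "words" (bumps) separated by a dip acting as a boundary.
t = np.linspace(0.0, 1.0, 300)
x = np.exp(-((t - 0.25) / 0.08) ** 2) + np.exp(-((t - 0.75) / 0.08) ** 2)
S = scalogram(x - x.mean(), [4, 8, 16, 32])
prom, bound = prominence_and_boundary(S)

# Prominence is largest over the words; boundary strength over the dip.
print(prom[75] > prom[150], bound[150] > bound[75])    # True True
```

In the actual CWT-LoMA method, continuous lines of maximum (and minimum) amplitude are tracked across scales rather than summed column-wise, which is what yields the hierarchical tree of Fig. 1; the sketch only shows that a single scalogram carries both kinds of information.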

Methods

Wavelets are used in a great variety of applications: for effectively compressing and denoising signals, for representing the hierarchical properties of multidimensional signals such as polychromatic visual patterns in image retrieval, and for modelling optical signal processing in visual neural fields (ter Haar Romeny, 2014, Russ, Woods, 1995). In speech and auditory research, too, they have a long history going back to the 1970s (Altosaar, Karjalainen, 1988, Giraud, Poeppel, 2012, Ramachandran, Mammone,

Experimental evaluation

As stated in the introduction, a solid method for annotating prosody would be very welcome in the field of speech synthesis, where recent development has concentrated on acoustic modelling (Zen et al., 2007). The motivation is particularly acute when building speech synthesisers for low-resourced languages, where neither linguistically nor prosodically annotated corpora are available (s4a, 2014). In this section, we assess the utility of the proposed CWT-LoMA representation of prosody on the tasks of

Discussion and conclusions

The results show that prominences and boundaries can be viewed as manifestations of the same underlying speech production process. This has, of course, many theoretical implications. As foremost is the fact that the suprasegmental variables used (f0, energy envelope, duration) seem to work seamlessly to the same end, which is to signal the hierarchical and parallel structure of the linguistic signals. The role of signal energy as a reliable determinant of prosodic structure is interesting, but

Acknowledgements

The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007–2013) under Grant agreement no. 287678 (Simple4All) and the Academy of Finland (project no. 1265610 (the MIND programme)) as well as project no. 293346 (the Digital Humanities programme). We are indebted to Petra Wagner and two anonymous reviewers for their insightful comments and suggestions.

References (64)

  • G. Bailly et al.

    SFC: a trainable prosodic model

    Speech Commun.

    (2005)
  • G. Kochanski et al.

    Prosody modeling with soft templates

    Speech Commun.

    (2003)
  • M. Vainio et al.

    Tonal features, intensity, and word order in the perception of prominence

    J. Phon.

    (2006)
  • Vainio, M., Suni, A., Aalto, D., 2013. Continuous wavelet transform for analysis of speech prosody. Proceedings of...
  • T. Altosaar et al.

    Event-based multiple-resolution analysis of speech signals

    Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP-88

    (1988)
  • S. Ananthakrishnan et al.

    Automatic prosodic event detection using acoustic, lexical, and syntactic evidence

    IEEE Trans. Audio Speech Lang. Process.

    (2008)
  • S. Ananthakrishnan et al.

    Combining acoustic, lexical, and syntactic evidence for automatic unsupervised prosody labeling

    Proceedings of InterSpeech

    (2006)
  • G.K. Anumanchipalli et al.

    A statistical phrase/accent model for intonation modeling

    Proceedings of InterSpeech

    (2011)
  • D. Arnold et al.

    Obtaining prominence judgments from naïve listeners–influence of rating scales, linguistic levels and normalisation

    Proceedings of Interspeech 2012

    (2012)
  • J. Barnes et al.

    Voiceless intervals and perceptual completion in f0 contours: evidence from scaling perception in American English

    Proceedings of 16th International Congress of Phonetic Sciences, ICPhS

    (2011)
  • J. Cole et al.

    Signal-based and expectation-based factors in the perception of prosodic prominence

    Lab. Phonol.

    (2010)
  • I. Daubechies

    Ten Lectures on Wavelets

    (1992)
  • A. Eriksson et al.

    Perception of syllable prominence by listeners with and without competence in the tested language

    Proceedings of International Conference on Speech Prosody 2002

    (2002)
  • A. Eriksson et al.

    Syllable prominence: a matter of vocal effort, phonetic distinctness and top-down processing

    Proceedings of European Conference on Speech Communication and Technology Aalborg, September 2001

    (2001)
  • M.H. Farouk

    Application of Wavelets in Speech Processing

    (2014)
  • H. Fujisaki et al.

    Analysis of voice fundamental frequency contours for declarative sentences of Japanese

    J. Acoust. Soc. Jpn. (E)

    (1984)
  • H. Fujisaki et al.

    A Generative Model for the Prosody of Connected Speech in Japanese, Annual Report

    (1971)
  • A.-L. Giraud et al.

    Cortical oscillations and speech processing: emerging computational principles and operations

    Nat. Neurosci.

    (2012)
  • B.R. Glasberg et al.

    Gap detection and masking in hearing-impaired and normal-hearing subjects

    J. Acoust. Soc. Am.

    (1987)
  • J.A. Goldsmith

    Autosegmental and Metrical Phonology

    (1990)
  • A. Grossman et al.

    Decomposition of functions into wavelets of constant shape, and related transforms

    Mathematics and Physics:Lectures on Recent Results

    (1985)
  • B.M. ter Haar Romeny

    A geometric model for the functional circuits of the visual front-end

    Brain-Inspired Computing

    (2014)
  • M. Halle et al.

    Three dimensional phonology

    J. Linguist. Res.

    (1980)
  • M. Halle et al.

    Metrical Structures in Phonology

    (1978)
  • G.H. Hardy et al.

    A maximal theorem with function-theoretic applications

    Acta Math.

    (1930)
  • O. Kalinli et al.

    A saliency-based auditory attention model with applications to unsupervised prominent syllable detection in speech

    Proceedings of the 8th Annual Conference of the International Speech Communication Association, InterSpeech, Antwerp, Belgium, August 27–31, 2007

    (2007)
  • O. Kalinli et al.

    Prominence detection using auditory attention cues and task-dependent high level information

    IEEE Trans. Audio Speech Lang. Process.

    (2009)
  • D.H. Klatt

    Review of text-to-speech conversion for English

    J. Acoust. Soc. Am.

    (1987)
  • G. Kochanski et al.

    Loudness predicts prominence: fundamental frequency lends little

    J. Acoust. Soc. Am.

    (2005)
  • G. Kochanski et al.

    Stem-ml: language-independent prosody description

    Proceedings of InterSpeech

    (2000)
  • H. Kruschke et al.

    Estimation of the parameters of the quantitative intonation model with continuous wavelet analysis

    Proceedings of Eighth European Conference on Speech Communication and Technology

    (2003)
  • M. Lei et al.

    A hierarchical f0 modeling method for HMM-based speech synthesis

    Proceedings of InterSpeech

    (2010)