Stochastic and syntactic techniques for predicting phrase breaks
Introduction
The goal of a universal text-to-speech (TTS) synthesizer is to take any given passage of text and automatically generate speech that is indistinguishable from that of a human reading the passage. Although considerable progress has been made towards this goal, synthesizers are still poor at the supra-segmental features of speech, including intonation, stress, rhythm and phrasing. This should not be surprising, as these features are typically determined by the semantics as well as the syntax of the text and are hence difficult to estimate. If speech synthesizers are to mimic human speech, they must be able to produce natural sounding prosody. In this work we focus on the prediction of one of the most fundamental aspects of prosody: the position of phrase breaks within a sentence. Natural sounding synthesis also requires accurate intonation (Hirschberg, 1993) and durations (Campbell and Isard, 1991, van Santen, 1998); as algorithms for predicting these features commonly use phrase breaks as input features, errors introduced in phrase break prediction can adversely affect the whole prosody prediction phase. Because of this, we have focused on the crucial problem of identifying breaks in a sentence before investigating the problem of intonation.
When a sentence is being read, some words naturally group together to form phrases. This leads to the theory that a sentence can be described as a hierarchical structure of prosodic phrases (Pierrehumbert, 1980). Previous techniques for predicting the prosodic phrasing of a sentence have mainly focused on using a set of features derived from a window centred on a juncture between two words (Taylor and Black, 1998, Busser et al., 2001, Hirschberg and Prieto, 1995, Wang and Hirschberg, 1992). As prosody applies to a whole sentence, we argue that when making predictions, we need to consider the sentence as a complete unit. To illustrate this point, consider the phrase “a thousand people were led to safety” in the context of the following sentences:
- a. A thousand people were led to safety after being trapped by a fire in the London Underground last night.
- b. A senior fire brigade officer estimated that as many as a thousand people were led to safety from the trains and from platforms.
Both these examples come from the prosodically annotated MARSEC corpus (see Section 4). The sequence “a thousand people were led to safety” has a different prosodic phrase structure in the two sentences. In the first example, the sequence of words does not contain any prosodic phrase boundaries. However, in the second example, there is a phrase break between “people” and “were”. Numerous syntactic features contribute to the differences between these two examples; in the second example, the break after “people” corresponds to the end of a six-word noun phrase, whereas in the first example the same juncture marks the end of a noun phrase of only three words. The position of the phrases within the sentence and the surrounding phrases also affect the different prosodic renderings. These two sentences can therefore be differentiated using syntactic analysis. However, consider examples (c) and (d):
- c. John doesn’t play cards because he’s bored.
- d. John doesn’t play cards because he’s bored – he plays them because he is an addict.
Most readers would divide sentence (c) into two phrases, with a break between “cards” and “because”, as the implication of the sentence is that John is bored and, as a result, does not play cards. In example (d), most readers would not place a break between “cards” and “because”, but would pause after “bored”. The implication here is that John does play cards, not due to boredom, but because he has an addiction to them.
These two sentences are examples where it is necessary to perform a semantic analysis of the sentence before a correct prosodic rendering of it can be made. The pair are also unusual in that the addition of the second phrase to the first sentence alters its semantics in such a way that its prosody is also altered.
Such sentences present a formidable challenge to language processing, but informal analysis of the data used in the experiments reported here (transcripts of news broadcasts) showed that very few of them required either semantic or pragmatic considerations to enable correct prosodic phrasing. Instead we have focused on using syntactic features that operate over the whole sentence so we can model how human speakers plan the phrasing of a sentence as a whole unit. Sentences (c) and (d) above demonstrate that it is unlikely that techniques focusing on individual junctures can succeed completely.
This paper describes some techniques for break prediction that are based on using the whole sentence rather than features gathered from a local area around a word juncture. Section 2 begins with an overview of the theory of prosodic phrasing, followed in Section 3 by a review of previous research on predicting prosody. The different corpora and criteria used for evaluating our algorithms are presented in Section 4. Section 5 explores optimizing part-of-speech (POS) tags for use in phrase break prediction algorithms. A dynamic programming model is presented in Section 6 that makes predictions for unseen sentences by aligning the breaks from the most similar sentence in a set of annotated examples. Section 7 proposes a method using Hidden Markov Models (HMMs) to segment sentences into phrases. Finally, we present research using features extracted solely from a syntactic parse tree to predict prosodic phrase breaks in Section 8.
Section snippets
Prosodic phrasing
A spoken utterance can be broken down into a hierarchical structure of prosodic phrases. An example of the prosodic phrase structure of the sentence “Today eighteen people asked Judge John Owen to excuse them from the jury pool.” is shown in Fig. 1. The words in this sentence divide into four main phrases. These are known as intonational phrases (Beckman and Pierrehumbert, 1986). When read aloud, there are noticeable pauses between words on the boundaries of two intonational phrases, for
Previous prosody prediction techniques
We will first examine the merits and limitations of previous techniques for predicting the prosodic phrasing of a given text. Previous work on this problem falls into two broad approaches – rule-based methods and data-driven methods.
Corpora
For the purposes of these experiments, two separate data sets were used. The first was the Boston Radio News Corpus (Ostendorf et al., 1995) annotated to the full ToBI specification (Silverman et al., 1992). This was divided into a training set of 13,754 words (3437 breaks), with an additional 15,333 words (3894 breaks) reserved for testing. The Machine Readable Spoken English Corpus (MARSEC) (Roach et al., 1993) consists of transcripts from news stories, talk shows and weather reports from BBC
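Phrase-break predictors are conventionally scored per word juncture, marking each juncture as break or no-break. A minimal sketch of such juncture-level metrics (the function and variable names are ours; the paper's exact evaluation criteria are those defined in Section 4):

```python
def break_metrics(predicted, reference):
    """Score phrase-break predictions at the word-juncture level.

    `predicted` and `reference` are equal-length sequences of 0/1 values,
    one entry per word juncture (1 = break).
    Returns (precision, recall, F, proportion of junctures correct).
    """
    assert len(predicted) == len(reference)
    tp = sum(1 for p, r in zip(predicted, reference) if p == 1 and r == 1)
    fp = sum(1 for p, r in zip(predicted, reference) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(predicted, reference) if p == 0 and r == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    junctures_correct = sum(p == r for p, r in zip(predicted, reference)) / len(reference)
    return precision, recall, f, junctures_correct
```

Because breaks occur at only about a quarter of junctures (roughly 3400 breaks per 14,000 words in the Boston training set), junctures-correct alone rewards the trivial "no breaks" predictor, which is why the F measure over breaks is also reported.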
Reducing the part-of-speech tagset
Part-of-speech (POS) tags have been shown to be a good feature for predicting phrase breaks (Taylor and Black, 1998, Busser et al., 2001), and the accuracy of POS taggers is in excess of 96% (Daelemans et al., 1996); Taylor and Black (1998) noted that automatically tagged data “performs as well as (if not better)” than hand-annotated data. Most POS taggers use at least 40 different tags. However, these tagsets were developed for descriptive use, and it is not clear that such a large set of tags
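As a concrete illustration of tagset reduction, the sketch below collapses a Penn-style tagset into a handful of broad classes. The groupings here are our own, for illustration only; the reduced tagsets explored in Section 5 are optimised for break prediction rather than taken from a fixed table.

```python
# Hypothetical many-to-one mapping from a 40+ tag descriptive tagset
# to a few broad classes (illustrative groupings, not the paper's).
REDUCED = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "VBP": "VERB", "VBZ": "VERB", "MD": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "DT": "FUNC", "IN": "FUNC", "CC": "FUNC", "TO": "FUNC",
    "PRP": "FUNC", "PRP$": "FUNC",
}

def reduce_tags(tags):
    """Map a sequence of fine-grained POS tags to the reduced classes."""
    # Anything outside the table falls into a catch-all class.
    return [REDUCED.get(t, "OTHER") for t in tags]
```

A smaller tagset means fewer distinct tag n-grams, so each n-gram is seen more often in training and its break statistics are estimated more reliably.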
Prediction-by-example
Our contention is that, in contrast to techniques that use features local to each word juncture to predict breaks, the whole sentence should be analysed to plan the position of the breaks. A simple way of performing such an analysis is by comparison with sentences in the training set. Given a sufficiently large training set of annotated examples, it may be possible to predict the breaks in an unseen sentence by analogy, i.e. by finding the most similar sentence in the training set and using the
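The idea of prediction by analogy can be sketched as follows, assuming sentences are represented by their POS-tag sequences. Here `difflib` stands in for the similarity measure and alignment; the paper's own alignment is a dynamic programming procedure, so this is analogous rather than identical, and all names are ours.

```python
from difflib import SequenceMatcher

def predict_by_example(unseen_tags, training):
    """Find the training sentence whose POS sequence is most similar to
    the unseen one, align the two, and transfer its break positions.

    `training` is a list of (tags, breaks) pairs, where `breaks` has one
    0/1 entry per word juncture (len(tags) - 1 entries).
    """
    best_tags, best_breaks = max(
        training,
        key=lambda ex: SequenceMatcher(None, unseen_tags, ex[0]).ratio())
    predicted = [0] * (len(unseen_tags) - 1)
    matcher = SequenceMatcher(None, unseen_tags, best_tags)
    for a, b, size in matcher.get_matching_blocks():
        # Copy breaks at junctures internal to each aligned block.
        for offset in range(size - 1):
            j_train = b + offset
            if j_train < len(best_breaks) and best_breaks[j_train]:
                predicted[a + offset] = 1
    return predicted
```

Junctures that fall outside any aligned region simply receive no break here; a fuller implementation would need a back-off strategy for poorly matched sentences.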
Prediction by phrase modelling
A sentence can be modelled at two levels within the prosodic hierarchy: as a sequence of words and a sequence of phrases. Hence we can construct sets of models for both of these levels, and by combining them, make break predictions on the sentence as a single unit. This can be accomplished by estimating the best sequence of phrase models that “explains” an unseen sentence. Breaks are then postulated as occurring at the junctions of these phrase models. Hidden Markov models (HMMs) (Young et al.,
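The segmentation view can be sketched as a dynamic programme over candidate phrase spans. This is not the paper's actual HMM decoder: `phrase_logprob` stands in for the trained phrase models, and the names are ours; it illustrates only how breaks fall out at the junctions of the best-scoring sequence of phrases.

```python
import math

def segment_into_phrases(words, phrase_logprob, max_len=8):
    """Viterbi-style segmentation of a sentence into phrases.

    `phrase_logprob(words_slice)` scores a candidate phrase (a stand-in
    for a set of trained phrase models). Returns one 0/1 value per word
    juncture: 1 wherever two phrases meet.
    """
    n = len(words)
    best = [(-math.inf, -1)] * (n + 1)   # (score, backpointer) per prefix
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start][0] + phrase_logprob(words[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Trace back the phrase boundaries from the best full segmentation.
    breaks = [0] * (n - 1)
    end = n
    while end > 0:
        start = best[end][1]
        if start > 0:                    # sentence edges are not junctures
            breaks[start - 1] = 1
        end = start
    return breaks
```

With a toy scorer that prefers three-word phrases, a six-word sentence is split into two phrases with a single break after the third word.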
Syntactic parsing
There is clearly a strong relationship between phrase breaks and syntactic structure. For instance, consider this sentence taken from the MARSEC corpus: “The little girl and the lion went into the classroom just as the teacher was calling the register.” The automatically derived syntactic parse of this sentence is shown in Fig. 11. The phrase breaks in this case occur at exactly the junctures between the major phrases of the parse, and this behaviour is typical of fairly simple sentences.
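One simple feature that can be read off such a parse is, for each word juncture, how many constituents close there: junctures where several constituents end at once tend to coincide with phrase breaks. A minimal sketch over a nested-tuple parse (the representation and names are ours, not those of the parser used in Section 8):

```python
def juncture_closings(tree):
    """For each word juncture, count the constituents ending at the word
    to its left.

    `tree` is a nested tuple (LABEL, child, ...) whose leaves are words.
    """
    words, closings = [], []

    def walk(node):
        if isinstance(node, str):
            words.append(node)
            return
        _label, *children = node
        for child in children:
            walk(child)
        # This constituent closes after the last word seen so far.
        closings.append(len(words) - 1)

    walk(tree)
    # One count per juncture i (between words[i] and words[i+1]).
    return [closings.count(i) for i in range(len(words) - 1)]
```

On a toy parse such as `("S", ("NP", "the", "girl"), ("VP", "went", ("PP", "into", ("NP", "the", "room"))))`, the only non-zero count falls after “girl”, the end of the subject noun phrase, which is where a break would typically be placed.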
As
Summary of results
Table 14 summarises the performance of all the algorithms presented in this paper. The use of syntactic features has been shown to significantly outperform all the other algorithms.
Summary and discussion
This paper has examined some different techniques for the prediction of phrase breaks in sentences. In contrast to previous approaches that used information gathered locally from the neighbourhood of a juncture, our philosophy has been to develop techniques that consider the whole sentence when estimating the break positions. This led us to investigate first an approach based on using the similarity between the unseen sentence and stored sentences. Although results obtained on this technique
References (46)
- Church, K.W., Gale, W.A. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language (1991)
- Gee, J.P., Grosjean, F. Performance structures: a psycholinguistic and linguistic appraisal. Cognitive Psychology (1983)
- Grosjean, F., Grosjean, L., Lane, H. The patterns of silence: performance structures in sentence production. Cognitive Psychology (1979)
- Taylor, P., Black, A.W. Assigning phrase breaks from part-of-speech sequences. Computer Speech and Language (1998)
- Allen, J., Hunnicutt, S., Carlson, R., Granstrom, B., 1979. MITalk-79: The 1979 MIT text-to-speech system. In: Wolf,...
- Bachenko, J., Fitzpatrick, E. A computational grammar of discourse-neutral prosodic phrasing in English. Speech Communication (1990)
- Beckman, M.E., Pierrehumbert, J.B. Intonational structures in English and Japanese. Phonology Yearbook (1986)
- Bellman, R. Dynamic programming treatment of the travelling salesman problem. Journal of the ACM (1962)
- Breiman, L., Friedman, J., Olshen, R., Stone, C. Classification and Regression Trees (1984)
- Brill, E. Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Computational Linguistics (1995)
- Campbell, W.N., Isard, S.D. Segment durations in a syllable frame. Journal of Phonetics
- Cover, T.M., Hart, P.E. Nearest neighbor pattern classification. IEEE Transactions on Information Theory
- DeCoste, D., Schölkopf, B. Training invariant support vector machines. Machine Learning
- Performance structures in the recall of sentences. Memory and Cognition
- Duda, R.O., Hart, P.E., Stork, D.G. Pattern Classification
- Forney, G.D. The Viterbi algorithm. Proceedings of the IEEE