Stochastic and syntactic techniques for predicting phrase breaks
Introduction
The goal of a universal text-to-speech (TTS) synthesizer is to take any given passage of text and automatically generate speech that is indistinguishable from that of a human reading the passage. Although considerable progress has been made towards this goal, synthesizers are still poor at the supra-segmental features of speech, including intonation, stress, rhythm and phrasing. This should not be surprising, as these features are typically determined by the semantics as well as the syntax of the text and are hence difficult to estimate. If speech synthesizers are to mimic human speech, they must be able to produce natural sounding prosody. In this work we focus on the prediction of one of the most fundamental aspects of prosody: the position of phrase breaks within a sentence. Natural sounding synthesis also requires accurate intonation (Hirschberg, 1993) and durations (Campbell and Isard, 1991, van Santen, 1998); as algorithms for predicting these features commonly use phrase breaks as input features, errors introduced in phrase break prediction can adversely affect the whole prosody prediction phase. Because of this, we have focused on the crucial problem of identifying breaks in a sentence before investigating the problem of intonation.
When a sentence is being read, some words naturally group together to form phrases. This leads to the theory that a sentence can be described as a hierarchical structure of prosodic phrases (Pierrehumbert, 1980). Previous techniques for predicting the prosodic phrasing of a sentence have mainly focused on using a set of features derived from a window centred on a juncture between two words (Taylor and Black, 1998, Busser et al., 2001, Hirschberg and Prieto, 1995, Wang and Hirschberg, 1992). As prosody applies to a whole sentence, we argue that when making predictions, we need to consider the sentence as a complete unit. To illustrate this point, consider the phrase “a thousand people were led to safety” in the context of the following sentences:
- a. A thousand people were led to safety after being trapped by a fire in the London Underground last night.
- b. A senior fire brigade officer estimated that as many as a thousand people were led to safety from the trains and from platforms.
Both these examples come from the prosodically annotated MARSEC corpus (see Section 4). The sequence “a thousand people were led to safety” has a different prosodic phrase structure in the two sentences. In the first example, the sequence of words does not contain any prosodic phrase boundaries. However, in the second example, there is a phrase break between “people” and “were”. Numerous syntactic features contribute to the differences between these two examples; in the second example, the break after “people” corresponds to the end of a six-word noun phrase, whereas in the first example the same juncture marks the end of a noun phrase of only three words. The position of the phrases within the sentence and the surrounding phrases also affect the different prosodic renderings. These two sentences can therefore be differentiated using syntactic analysis. However, consider examples (c) and (d):
- c. John doesn’t play cards because he’s bored.
- d. John doesn’t play cards because he’s bored – he plays them because he is an addict.
Most readers would divide sentence (c) into two phrases, with a break between “cards” and “because”, as the implication of the sentence is that John is bored and, as a result, does not play cards. In example (d), most readers would not place a break between “cards” and “because”, but would pause after “bored”. The implication here is that John does play cards, not due to boredom, but because he has an addiction to them.
These two sentences are examples where it is necessary to perform a semantic analysis of the sentence before a correct prosodic rendering of it can be made. The pair are also unusual in that the addition of the second phrase to the first sentence alters its semantics in such a way that its prosody is also altered.
Such sentences present a formidable challenge to language processing, but informal analysis of the data used in the experiments reported here (transcripts of news broadcasts) showed that very few of them required either semantic or pragmatic considerations to enable correct prosodic phrasing. Instead we have focused on using syntactic features that operate over the whole sentence so we can model how human speakers plan the phrasing of a sentence as a whole unit. Sentences (c) and (d) above demonstrate that it is unlikely that techniques focusing on individual junctures can succeed completely.
This paper describes some techniques for break prediction that are based on using the whole sentence rather than features gathered from a local area around a word juncture. Section 2 begins with an overview of the theory of prosodic phrasing, followed in Section 3 by a review of previous research on predicting prosody. The different corpora and criteria used for evaluating our algorithms are presented in Section 4. Section 5 explores optimizing part-of-speech (POS) tags for use in phrase break prediction algorithms. A dynamic programming model is presented in Section 6 that makes predictions for unseen sentences by aligning the breaks from the most similar sentence in a set of annotated examples. Section 7 proposes a method using Hidden Markov Models (HMMs) to segment sentences into phrases. Finally, we present research using features extracted solely from a syntactic parse tree to predict prosodic phrase breaks in Section 8.
Section snippets
Prosodic phrasing
A spoken utterance can be broken down into a hierarchical structure of prosodic phrases. An example of the prosodic phrase structure of the sentence “Today eighteen people asked Judge John Owen to excuse them from the jury pool.” is shown in Fig. 1. The words in this sentence divide into four main phrases. These are known as intonational phrases (Beckman and Pierrehumbert, 1986). When read aloud, there are noticeable pauses between words on the boundaries of two intonational phrases, for
Previous prosody prediction techniques
We will first examine the merits and limitations of previous techniques for predicting the prosodic phrasing of a given text. Previous work on this problem falls into two broad approaches – rule-based methods and data-driven methods.
Corpora
For the purposes of these experiments, two separate data sets were used. The first was the Boston Radio News Corpus (Ostendorf et al., 1995) annotated to the full ToBI specification (Silverman et al., 1992). This was divided into a training set of 13,754 words (3437 breaks), with an additional 15,333 words (3894 breaks) reserved for testing. The Machine Readable Spoken English Corpus (MARSEC) (Roach et al., 1993) consists of transcripts from news stories, talk shows and weather reports from BBC
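Phrase-break predictors are conventionally scored per word juncture, marking each juncture as break or no-break. A minimal sketch of such juncture-level metrics (the function and variable names are ours; the paper's exact evaluation criteria are those defined in Section 4):

```python
def break_metrics(predicted, reference):
    """Score phrase-break predictions at the word-juncture level.

    `predicted` and `reference` are equal-length sequences of 0/1 values,
    one entry per word juncture (1 = break).
    Returns (precision, recall, F, proportion of junctures correct).
    """
    assert len(predicted) == len(reference)
    tp = sum(1 for p, r in zip(predicted, reference) if p == 1 and r == 1)
    fp = sum(1 for p, r in zip(predicted, reference) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(predicted, reference) if p == 0 and r == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    junctures_correct = sum(p == r for p, r in zip(predicted, reference)) / len(reference)
    return precision, recall, f, junctures_correct
```

Because breaks occur at only about a quarter of junctures (roughly 3400 breaks per 14,000 words in the Boston training set), junctures-correct alone rewards the trivial "no breaks" predictor, which is why the F measure over breaks is also reported.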
Reducing the part-of-speech tagset
Part-of-speech (POS) tags have been shown to be a good feature for predicting phrase breaks (Taylor and Black, 1998, Busser et al., 2001), and the accuracy of POS taggers is in excess of 96% (Daelemans et al., 1996); Taylor and Black (1998) noted that automatically tagged data “performs as well as (if not better)” than hand-annotated data. Most POS taggers use at least 40 different tags. However, these tagsets were developed for descriptive use, and it is not clear that such a large set of tags
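As a concrete illustration of tagset reduction, the sketch below collapses a Penn-style tagset into a handful of broad classes. The groupings here are our own, for illustration only; the reduced tagsets explored in Section 5 are optimised for break prediction rather than taken from a fixed table.

```python
# Hypothetical many-to-one mapping from a 40+ tag descriptive tagset
# to a few broad classes (illustrative groupings, not the paper's).
REDUCED = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "VBP": "VERB", "VBZ": "VERB", "MD": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "DT": "FUNC", "IN": "FUNC", "CC": "FUNC", "TO": "FUNC",
    "PRP": "FUNC", "PRP$": "FUNC",
}

def reduce_tags(tags):
    """Map a sequence of fine-grained POS tags to the reduced classes."""
    # Anything outside the table falls into a catch-all class.
    return [REDUCED.get(t, "OTHER") for t in tags]
```

A smaller tagset means fewer distinct tag n-grams, so each n-gram is seen more often in training and its break statistics are estimated more reliably.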
Prediction-by-example
Our contention is that, in contrast to techniques that use features local to each word juncture to predict breaks, the whole sentence should be analysed to plan the position of the breaks. A simple way of performing such an analysis is by comparison with sentences in the training set. Given a sufficiently large training set of annotated examples, it may be possible to predict the breaks in an unseen sentence by analogy, i.e. by finding the most similar sentence in the training set and using the
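The idea of prediction by analogy can be sketched as follows, assuming sentences are represented by their POS-tag sequences. Here `difflib` stands in for the similarity measure and alignment; the paper's own alignment is a dynamic programming procedure, so this is analogous rather than identical, and all names are ours.

```python
from difflib import SequenceMatcher

def predict_by_example(unseen_tags, training):
    """Find the training sentence whose POS sequence is most similar to
    the unseen one, align the two, and transfer its break positions.

    `training` is a list of (tags, breaks) pairs, where `breaks` has one
    0/1 entry per word juncture (len(tags) - 1 entries).
    """
    best_tags, best_breaks = max(
        training,
        key=lambda ex: SequenceMatcher(None, unseen_tags, ex[0]).ratio())
    predicted = [0] * (len(unseen_tags) - 1)
    matcher = SequenceMatcher(None, unseen_tags, best_tags)
    for a, b, size in matcher.get_matching_blocks():
        # Copy breaks at junctures internal to each aligned block.
        for offset in range(size - 1):
            j_train = b + offset
            if j_train < len(best_breaks) and best_breaks[j_train]:
                predicted[a + offset] = 1
    return predicted
```

Junctures that fall outside any aligned region simply receive no break here; a fuller implementation would need a back-off strategy for poorly matched sentences.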
Prediction by phrase modelling
A sentence can be modelled at two levels within the prosodic hierarchy: as a sequence of words and a sequence of phrases. Hence we can construct sets of models for both of these levels, and by combining them, make break predictions on the sentence as a single unit. This can be accomplished by estimating the best sequence of phrase models that “explains” an unseen sentence. Breaks are then postulated as occurring at the junctions of these phrase models. Hidden Markov models (HMMs) (Young et al.,
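The segmentation view can be sketched as a dynamic programme over candidate phrase spans. This is not the paper's actual HMM decoder: `phrase_logprob` stands in for the trained phrase models, and the names are ours; it illustrates only how breaks fall out at the junctions of the best-scoring sequence of phrases.

```python
import math

def segment_into_phrases(words, phrase_logprob, max_len=8):
    """Viterbi-style segmentation of a sentence into phrases.

    `phrase_logprob(words_slice)` scores a candidate phrase (a stand-in
    for a set of trained phrase models). Returns one 0/1 value per word
    juncture: 1 wherever two phrases meet.
    """
    n = len(words)
    best = [(-math.inf, -1)] * (n + 1)   # (score, backpointer) per prefix
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start][0] + phrase_logprob(words[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Trace back the phrase boundaries from the best full segmentation.
    breaks = [0] * (n - 1)
    end = n
    while end > 0:
        start = best[end][1]
        if start > 0:                    # sentence edges are not junctures
            breaks[start - 1] = 1
        end = start
    return breaks
```

With a toy scorer that prefers three-word phrases, a six-word sentence is split into two phrases with a single break after the third word.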
Syntactic parsing
There is clearly a strong relationship between phrase breaks and syntactic structure. For instance, consider this sentence taken from the MARSEC corpus: “The little girl and the lion went into the classroom just as the teacher was calling the register.” The automatically derived syntactic parse of this sentence is shown in Fig. 11. The phrase breaks in this case occur at exactly the junctures between the major phrases of the parse, and this behaviour is typical of fairly simple sentences.
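One simple feature that can be read off such a parse is, for each word juncture, how many constituents close there: junctures where several constituents end at once tend to coincide with phrase breaks. A minimal sketch over a nested-tuple parse (the representation and names are ours, not those of the parser used in Section 8):

```python
def juncture_closings(tree):
    """For each word juncture, count the constituents ending at the word
    to its left.

    `tree` is a nested tuple (LABEL, child, ...) whose leaves are words.
    """
    words, closings = [], []

    def walk(node):
        if isinstance(node, str):
            words.append(node)
            return
        _label, *children = node
        for child in children:
            walk(child)
        # This constituent closes after the last word seen so far.
        closings.append(len(words) - 1)

    walk(tree)
    # One count per juncture i (between words[i] and words[i+1]).
    return [closings.count(i) for i in range(len(words) - 1)]
```

On a toy parse such as `("S", ("NP", "the", "girl"), ("VP", "went", ("PP", "into", ("NP", "the", "room"))))`, the only non-zero count falls after “girl”, the end of the subject noun phrase, which is where a break would typically be placed.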
As
Summary of results
Table 14 summarises the performance of all the algorithms presented in this paper. The use of syntactic features has been shown to significantly outperform all the other algorithms.
Summary and discussion
This paper has examined some different techniques for the prediction of phrase breaks in sentences. In contrast to previous approaches that used information gathered locally from the neighbourhood of a juncture, our philosophy has been to develop techniques that consider the whole sentence when estimating the break positions. This led us to investigate first an approach based on using the similarity between the unseen sentence and stored sentences. Although results obtained on this technique
References (46)
- Church, K.W., Gale, W.A. A comparison of the enhanced Good-Turing and deleted estimation methods for estimating probabilities of English bigrams. Computer Speech and Language (1991)
- Gee, J.P., Grosjean, F. Performance structures: a psycholinguistic and linguistic appraisal. Cognitive Psychology (1983)
- Grosjean, F., Grosjean, L., Lane, H. The patterns of silence: performance structures in sentence production. Cognitive Psychology (1979)
- Taylor, P., Black, A.W. Assigning phrase breaks from part-of-speech sequences. Computer Speech and Language (1998)
- Allen, J., Hunnicutt, S., Carlson, R., Granstrom, B., 1979. MITalk-79: The 1979 MIT text-to-speech system. In: Wolf,...
- Bachenko, J., Fitzpatrick, E. A computational grammar of discourse-neutral prosodic phrasing in English. Speech Communication (1990)
- Beckman, M.E., Pierrehumbert, J.B. Intonational structures in English and Japanese. Phonology Yearbook (1986)
- Bellman, R. Dynamic programming treatment of the travelling salesman problem. Journal of the ACM (1962)
- Breiman, L., Friedman, J., Olshen, R., Stone, C. Classification and Regression Trees (1984)
- Brill, E. Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Computational Linguistics (1995)
- Campbell, W.N., Isard, S.D. Segment durations in a syllable frame. Journal of Phonetics
- Cover, T.M., Hart, P.E. Nearest neighbor pattern classification. IEEE Transactions on Information Theory
- DeCoste, D., Schölkopf, B. Training invariant support vector machines. Machine Learning
- Performance structures in the recall of sentences. Memory and Cognition
- Duda, R.O., Hart, P.E., Stork, D.G. Pattern Classification
- Forney, G.D. The Viterbi algorithm. Proceedings of the IEEE