Integrated recognition of words and prosodic phrase boundaries

https://doi.org/10.1016/S0167-6393(01)00027-9

Abstract

In this paper, we present an integrated approach for recognizing both the word sequence and the syntactic–prosodic structure of a spontaneous utterance. The approach aims at improving the performance of the understanding component of speech understanding systems by exploiting not only acoustic–phonetic and syntactic information, but also prosodic information directly within the speech recognition process. Whereas spoken utterances are typically modelled as unstructured word sequences in the speech recognizer, our approach includes phrase boundary information in the language model and provides HMMs to model the acoustic and prosodic characteristics of phrase boundaries. This methodology has two major advantages compared to purely word-based speech recognizers. First, additional syntactic–prosodic boundaries are determined by the speech recognizer, which facilitates parsing and helps resolve syntactic and semantic ambiguities. Second – after having removed the boundary information from the result of the recognizer – the integrated model yields a 4% relative word error rate (WER) reduction compared to a traditional word recognizer. The boundary classification performance is equal to that of a separate prosodic classifier operating on the word recognizer output, making a separate classifier unnecessary for this task and saving the computation time involved. Compared to the baseline word recognizer, the integrated word-and-boundary recognizer does not involve any computational overhead.

Zusammenfassung

In diesem Artikel stellen wir einen integrierten Ansatz zur Erkennung der Wortkette und der syntaktisch–prosodischen Struktur einer spontansprachlichen Äußerung vor. Ziel des Ansatzes ist es, die Leistungsfähigkeit von sprachverstehenden Systemen dadurch zu verbessern, dass nicht nur akustisch–phonetische und syntaktische Information, sondern auch prosodische Information direkt im Rahmen des Spracherkennungsprozesses genutzt wird. Üblicherweise werden gesprochene Äußerungen innerhalb des Spracherkenners als unstrukturierte Wortfolgen modelliert. In unserem Ansatz werden Phrasengrenzen dagegen direkt in das Sprachmodell integriert, und HMMs werden zur Modellierung der akustischen und prosodischen Eigenschaften von Phrasengrenzen herangezogen. Hierdurch ergeben sich zwei wesentliche Vorteile gegenüber rein wortbasierten Spracherkennern: Zum einen werden zusätzliche syntaktisch–prosodische Grenzen durch den Spracherkenner bestimmt, die von einem nachgeschalteten Parser zur Beschleunigung und Disambiguierung genutzt werden können. Zum anderen konnte – auch ohne Berücksichtigung der erkannten Grenzen – mit dem integrierten Ansatz eine relative Verbesserung der Wortfehlerrate um 4% erzielt werden, verglichen mit dem traditionellen wortbasierten Ansatz. Die Güte der Grenzklassifikation entspricht dabei der eines separaten prosodischen Klassifikators, der auf dem Worterkennungsergebnis aufsetzt; ein solcher wird also für diese Aufgabe nicht mehr benötigt und die hierfür benötigte Rechenzeit wird eingespart. Verglichen mit dem reinen Worterkenner benötigt der integrierte Erkenner für Wörter und Grenzen keinerlei zusätzliche Rechenzeit.

Introduction

Today, there appears to be a general consensus in the speech recognition community that the area of speech recognition is only concerned with the problem of finding the sequence of words associated with a given acoustic observation. Accordingly, the term `speech recognition' is usually defined in the following manner (similar definitions can be found, for example, in (Jelinek, 1997; Schukat-Talamazzini, 1995)):

The automatic speech recognition problem consists of finding the sequence of words W associated to a given acoustic sequence X. (Becchetti and Ricotti, 1999, p. 8)

It is well known, however, that the word sequence associated with an utterance does not always contain all the information that is necessary for understanding its meaning, because an important source of information is usually not captured by the word sequence: prosody. The problem of classifying prosodic phenomena, such as phrase boundaries, sentence mood and accentuation, has therefore received a lot of attention in recent years. As a result, the first speech understanding systems have emerged that take into account prosodic information (Kompe, 1997). The classification of prosodic phenomena, however, has typically been regarded as a task which could be treated either independently from the problem of recognizing the spoken word sequence, or as a subsequent step.

In this paper, we investigate an integrated approach that combines the recognition of the spoken word sequence (i.e. speech recognition in the above sense) and the classification of prosodic phrase boundaries in a single search procedure. Instead of regarding speech as an unstructured sequence of words, speech is modelled as a sequence of words and phrase boundaries. The resulting recognizer for words and prosodic phrase boundaries is still a speech recognizer according to the following, more general, definition:

Speech recognition can be generally defined as the process of transforming a continuous speech signal into discrete representations which may be assigned proper meanings and which, when comprehended, may be used to affect responsive behaviour. (Lea, 1980b, p. 40)

In spoken language, especially in spontaneous speech, prosodic boundaries are of similar importance for understanding an utterance as punctuation marks are in written language. Words which `belong together' from the point of view of meaning are grouped into prosodic phrases, and it is widely agreed that there is a high correspondence between prosodic and syntactic phrase boundaries (Wightman et al., 1992; Kompe, 1997).

Prosodic boundaries are often marked by silence periods, and sometimes by filled pauses, such as `uh', and they are usually indicated by specific energy and fundamental frequency (F0) contours and by durational variations of the surrounding syllables (Kießling, 1997). Also, like punctuation marks in written language, they are often predictable from the surrounding word context.

In automatic speech understanding, this information may be important even in the context of a comparatively simple application, such as an automatic train timetable information system. Consider, for example, the following user utterances:

  • U1: Of course not on Monday.

  • U2: Of course not. On Monday!

The question whether a prosodic phrase boundary occurred after the word `not' is crucial for the semantic interpretation of the word sequence and for determining the next system utterance. Depending on the phrasing, one of the following two utterances may be appropriate:

  • S1: What day would you like to travel?

  • S2: You would like to travel on Monday?

Selecting the wrong response (S1 for U2, or S2 for U1) will most certainly annoy the caller and will probably make her/him hang up.

It might be argued that the correct interpretation of the word sequence could also be determined without prosodic information, if the dialogue history is taken into account. Depending on the previous system utterance, at least one of the two above interpretations could be declared illogical. This involves a considerable amount of higher-level knowledge and “intelligent” processing, however, whereas prosodic information in the speech signal can directly resolve the ambiguity. Furthermore, there is no reason to ignore information that may without a doubt contribute to finding the correct semantic interpretation, even if a sufficiently intelligent dialogue module is available (Kompe, 1997, Section 8.4).

The first speech understanding system to really integrate prosodic information into the understanding process is the German Verbmobil speech-to-speech translation system for appointment scheduling dialogues (Wahlster, 1993; Bub and Schwinn, 1996). In the Verbmobil prototype, prosodic information is calculated on the basis of the speech signal and the word recognition result. This information is used in various system modules, mainly for resolving syntactic and semantic ambiguities, and has been shown to significantly improve the total system performance (Kompe, 1997). For example, Verbmobil is able to provide different English translations for German utterances that contain the same word sequence but are prosodically distinct (Kompe, 1997):

  • Ja zur Not geht's auch am Samstag.

  • (Well, if necessary, Saturday is also possible.)

  • Ja. Zur Not. Geht's auch am Samstag?

  • (Okay. If necessary. Is Saturday possible as well?)

Speech recognition and prosodic analysis are performed in two separate modules, however, and the speech recognizer itself is only concerned with finding an optimal sequence of words (or a word graph, a graph of competing word hypotheses) that covers the whole speech signal. That is, the prosodic and the syntactic structures of the utterance are neither determined nor taken into account by the speech recognizer. The basic structure of the word recognition and prosody classification modules in Verbmobil is depicted in Fig. 1(a). A similar sequential architecture where prosody is classified on the basis of the word recognition result has also been used by other groups, e.g. (Stolcke et al., 1998).

We believe that syntactic–prosodic boundary information is also useful in an earlier stage of spontaneous speech processing. State-of-the-art speech recognizers are typically based on two sources of knowledge: acoustic information and language model information. Statistical language models as used in most speech recognizers provide the probability of a given word sequence based on a rather simple model: it is assumed that a spoken utterance is an unstructured sequence w1,w2,…,wn of words. This assumption does not hold, however. It is intuitively clear that words at the beginning of a new phrase correlate less strongly with the last word of the preceding phrase than words within the same phrase. This can easily be demonstrated empirically. On Verbmobil utterances from a sub-corpus which is not part of the training corpus, for example, the proportion of unseen word pairs is almost three times as high for word pairs across phrase boundaries as for word pairs within phrases. Whereas only 14% of the word pairs within phrases have not been observed in the training corpus, the same ratio for word pairs across phrase boundaries is 38%. Any n-gram language model will provide lower probabilities for word transitions that have not been observed in the training data. That is, language model probabilities across phrase boundaries are systematically underestimated by traditional, word-based language models.
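This coverage gap is easy to reproduce on toy data. The following sketch uses a hypothetical miniature corpus (the token `<B>` is an invented boundary marker, not the paper's notation) and measures how many test bigrams were never seen in training, split by whether the transition crosses a phrase boundary:

```python
def transitions(tokens):
    """Yield (prev_word, next_word, crosses_boundary) triples for one
    utterance, where the pseudo-token '<B>' marks a phrase boundary."""
    pairs, prev, crossed = [], None, False
    for t in tokens:
        if t == "<B>":
            crossed = True
        else:
            if prev is not None:
                pairs.append((prev, t, crossed))
            prev, crossed = t, False
    return pairs

# Hypothetical training and test utterances.
train = ["of course not <B> on monday".split(),
         "i want to go <B> on tuesday".split()]
test = ["i want to go <B> of course".split()]

seen = {(a, b) for s in train for a, b, _ in transitions(s)}

def unseen_rate(sentences, seen, across):
    """Fraction of test bigrams (within phrases, or across boundaries)
    that never occurred in the training data."""
    pairs = [(a, b) for s in sentences
             for a, b, x in transitions(s) if x == across]
    return sum(p not in seen for p in pairs) / len(pairs)
```

On this toy set, every within-phrase test bigram was observed in training, while the across-boundary transition ('go', 'of') was not – the same asymmetry the paper reports as 14% versus 38% on Verbmobil data.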

A similar effect has also been found in the neighbourhood of filled pauses (Shriberg and Stolcke, 1996). As a consequence, a language model for spontaneous speech is proposed in (Stolcke and Shriberg, 1996), where different types of disfluencies (filled pauses, repetitions and deletions) are predicted, and probabilities of following words are estimated on the basis of the fluent word sequence that was supposedly intended by the speaker. This approach, however, did not have a significant impact on the recognition accuracy. One of the reasons for this result is noted in (Stolcke and Shriberg, 1996): phrase (or clause) boundaries grossly violate the assumptions of the proposed model, because filled pauses strongly correlate with boundaries of linguistic segments. Thus, `cleaning up' the surrounding words to remove the disfluency can be counterproductive.

In our integrated approach which is depicted in Fig. 1(b), phrase boundaries are directly integrated into the language model, and silent and filled pauses are allowed to occur in two different functions: either they are syntactically insignificant and thus ignored in the language model (`clean-up'), or they occur at phrase boundaries. These two different functions of pauses have been described earlier in (O'Shaughnessy, 1992), where the first type of pause was referred to as ungrammatical (or unintentional), and the second type as grammatical (or intentional). Although a more detailed discrimination of different pause functions may be possible, we found this two-class model especially suitable for an integration into the recognizer search procedure.

Furthermore, in our model, phrase boundaries are also allowed to occur at fluently spoken word–word transitions. The fact that a word is separated from its predecessor by a phrase boundary should contribute a great amount of information when language model probabilities are calculated, while the preceding word is less significant. By integrating models for syntactic–prosodic phrase boundaries into the word recognizer and into the statistical language model, the word recognizer can incorporate information about the structure of the utterance. An integrated model of sequences of words and boundaries allows for a distinction between word transitions across phrase boundaries and transitions within a phrase, which is an obvious advantage.

An entertaining but representative example that clearly shows the advantages of an integrated processing of word information and prosodic information as proposed in this paper is given in (Lea, 1980a, p. 167):

  • A: What is that in the road ahead?

  • B: What is that in the road? A head?

Here, not just the semantic interpretation, but also the word sequence depends on the prosodic structure of the utterance. That is, if prosodic information is taken into account in this example, it will be considerably more helpful if it is integrated into the word recognition process.

In phrase boundary recognition experiments based on word recognizer results, it has been shown that prosodic features can significantly improve the detection accuracy of syntactic phrase boundaries compared to a pure language model-based approach (Kompe, 1997). This is especially the case with syntactically ambiguous boundaries, as in the above example utterances. In this paper, we also investigate how additional prosodic information can be incorporated into our integrated approach to recognize words and syntactic–prosodic boundaries.

As the feature set used for our separate boundary classifier is not suitable for the system architecture of the integrated word-and-boundary recognizer, we developed new frame-based prosodic feature sets that incorporate information on the fundamental frequency and energy contours as well as durational information. These features are used as input to an ANN in order to calculate the prosodic probability of a phrase boundary for each time frame. The resulting probabilities are then utilized as a second input stream to the HMM-based recognizer, in addition to the acoustic–phonetic probabilities that are based on a cepstral feature vector and a Gaussian codebook. Thus, the integrated recognizer combines three sources of information: acoustic–phonetic information, prosodic information, and language model information.
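The paper does not spell out the combination rule at this point, but a common way to fuse such streams in hybrid HMM systems is a log-linear (weighted log-likelihood) combination per frame. The following is only a minimal sketch under that assumption; the function name and the fixed stream weight are illustrative, not taken from the paper:

```python
import math

def frame_score(p_acoustic, p_boundary, is_boundary_state, weight=0.3):
    """Log-linear fusion of two per-frame streams:
    - p_acoustic: acoustic-phonetic likelihood of the frame given the
      HMM state (e.g. from a Gaussian codebook),
    - p_boundary: ANN estimate that this frame lies on a phrase boundary.
    Boundary HMM states score the ANN output directly; all other states
    score its complement. `weight` balances the two streams."""
    p_pros = p_boundary if is_boundary_state else 1.0 - p_boundary
    return (1.0 - weight) * math.log(p_acoustic) + weight * math.log(p_pros)
```

A frame for which the ANN outputs a high boundary probability then favours boundary states over word-internal states during decoding, all else being equal.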

The research presented in this paper is described in more detail in (Gallwitz, 2001). This thesis also addresses other problems of spontaneous speech recognition which are not discussed in this paper, such as the integrated detection and classification of out-of-vocabulary words.

The remainder of this paper is structured as follows. In Section 2, we review some important publications which are related to the work described in this paper. In Section 3, we briefly describe the phrase boundary labelling system that was used as a basis of our experiments. In Section 4, the treatment of phrase boundaries during training and recognition in our approach is described. In Section 5, a hybrid HMM–MLP system architecture is presented that incorporates prosodic features into the recognition process. The prosodic feature sets employed in our experiments are described in Section 6. The training procedure of the hybrid speech recognizer is then discussed in Section 7. Finally, experimental results are given in Section 8. The paper closes with a brief summary of the main results.

Section snippets

Related work

To our knowledge, no previously published approach has directly integrated syntactic–prosodic structure into the speech recognition process. A number of studies have been performed, however, in which information about the syntactic–prosodic structure of utterances was used for rescoring the n-best sentence hypotheses, or for rescoring word graphs.

In (Veilleux and Ostendorf, 1993; Ostendorf, 1994), two models which predict prosodic phrase boundaries and accents were employed

Syntactic–prosodic boundaries

The labelling scheme used for our experiments was originally designed for the purpose of improving the syntactic and semantic analysis of word graphs with the help of prosodic information. The starting point for the annotation of our material with syntactic–prosodic labels was the assumption that there is a strong – albeit not perfect – correlation between syntactic phrasing and prosodic phrasing (cf. Lea, 1980a; Vaissière, 1988; Price et al., 1991). This assumption could be corroborated earlier in

Basic approach

The basic idea behind our approach is that phrase boundaries should be treated in the language model (LM) in a similar fashion as words. Thus, we provide a language model category (or word class) for phrase boundaries in the n-gram LM, and we provide HMMs to model the acoustic and prosodic characteristics of phrase boundaries.
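As a toy illustration of treating boundaries like words in the LM, the following sketch trains an add-alpha-smoothed bigram model over token sequences in which a hypothetical `<B>` symbol stands for the boundary class; the class name, the smoothing scheme, and the miniature corpus are assumptions for illustration, not the paper's actual LM:

```python
import math
from collections import Counter

class BigramLM:
    """Toy add-alpha-smoothed bigram model; '<B>' is an ordinary token,
    so word-to-boundary and boundary-to-word transitions get their own
    counts instead of being folded into underestimated word-word pairs."""
    def __init__(self, sentences, alpha=1.0):
        self.uni, self.bi, self.vocab = Counter(), Counter(), {"</s>"}
        for s in sentences:
            toks = ["<s>"] + s + ["</s>"]
            self.vocab.update(toks)
            for a, b in zip(toks, toks[1:]):
                self.uni[a] += 1
                self.bi[(a, b)] += 1
        self.alpha = alpha

    def prob(self, prev, word):
        v = len(self.vocab)
        return (self.bi[(prev, word)] + self.alpha) / \
               (self.uni[prev] + self.alpha * v)

    def logprob(self, sentence):
        toks = ["<s>"] + sentence + ["</s>"]
        return sum(math.log(self.prob(a, b))
                   for a, b in zip(toks, toks[1:]))

lm = BigramLM(["of course not <B> on monday".split(),
               "of course not <B> on tuesday".split()])
```

An utterance segmented as in the earlier example (`Of course not. On Monday!') then receives its probability via the boundary token rather than via a direct word–word transition that the training data rarely contains.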

In (Kompe, 1997), it has been shown that the syntactic–prosodic boundaries often happen to occur in combination with non-verbal noises, pauses or filled pauses. This makes

System architecture

The proposed approach can be used with any state-of-the-art HMM-based speech recognizer, irrespective of the specifics of the HMM topology, the type of density, or the decoding algorithm. In particular, it can also be used within single-pass recognizer architectures. Only some slight modifications to the decoding algorithm might be necessary, to allow for the treatment of syntactically irrelevant silence-periods and non-verbals as described above. Even without additional prosodic information,

Prosodic features

The following acoustic parameters are considered to be the most valuable for the classification of prosodic information in ASU (Kießling, 1997, p. 67):

  • energy (the acoustic correlate of loudness),

  • the fundamental frequency F0 (the acoustic correlate of pitch),

  • pause-length, and

  • phone duration.

Although there are obviously strong interdependencies between acoustic–phonetic and acoustic–prosodic information, we find it helpful to use the terms acoustic–phonetic feature and acoustic–prosodic feature.
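A hypothetical sketch of a frame-based feature vector along these lines: per frame, F0 and energy values from a small context window plus a voicing flag. The window size and padding scheme are illustrative assumptions; the paper's actual feature sets (Section 6) are richer and also encode durational information:

```python
import numpy as np

def prosodic_frame_features(f0, energy, context=2):
    """For each frame, stack F0 and energy from +/-`context` neighbouring
    frames (edge frames are repeated as padding) and append a voicing
    flag (F0 > 0, i.e. 0 in unvoiced or pause regions)."""
    f0 = np.asarray(f0, dtype=float)
    energy = np.asarray(energy, dtype=float)
    n = len(f0)
    feats = np.empty((n, 2 * (2 * context + 1) + 1))
    for t in range(n):
        row = []
        for d in range(-context, context + 1):
            i = min(max(t + d, 0), n - 1)  # clamp at utterance edges
            row.extend((f0[i], energy[i]))
        row.append(float(f0[t] > 0))
        feats[t] = row
    return feats
```

Such a vector gives the ANN access to local F0 and energy movement, which is where boundary cues like final lengthening and F0 resets show up.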

MLP and HMM training

It is not straightforward to define the optimal output of the MLP in the hybrid architecture described above. Ideally, it should provide the phrase boundary probability 1.0 for frames that are associated with a boundary HMM, and 0 for non-boundary frames. This is not feasible, however, because prosodic boundaries cannot realistically be associated to one single time frame. Instead, indications for a prosodic boundary should also be expected in the surrounding frames. Restricting the MLP
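One way to realize this – a hypothetical scheme, not necessarily the one used in the paper – is to train the MLP against soft targets that decay linearly over a few frames around each labelled boundary instead of a single spike:

```python
def soft_boundary_targets(n_frames, boundary_frames, width=5):
    """Target boundary probability per frame: 1.0 at a labelled boundary
    frame, decaying linearly to 0 over `width` frames on either side, so
    that neighbouring frames also carry (weaker) boundary evidence."""
    targets = [0.0] * n_frames
    for b in boundary_frames:
        for d in range(-width, width + 1):
            t = b + d
            if 0 <= t < n_frames:
                targets[t] = max(targets[t], 1.0 - abs(d) / (width + 1.0))
    return targets
```

Overlapping windows from nearby boundaries are resolved by taking the maximum, so targets stay within [0, 1].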

Experiments and results

The experiments reported in this paper have been performed on a subset of the German Verbmobil corpus. The training, validation and test samples are shown in Table 2 (the figures for phrase boundaries do not contain the trivial boundaries at the beginning or end of a turn).

We used a speaker independent SCHMM word recognizer with a codebook size of 512 classes. No speaker adaptation was performed and only intra-word subword models (polyphones) were used. The 24D acoustic–phonetic feature set

Summary and conclusion

In this paper, we presented the first integrated approach for the recognition of words and prosodic phrase boundaries. Whereas speech recognizers typically use language models that regard spoken utterances as unstructured sequences of words, our approach uses a more sophisticated model that regards utterances as sequences of words and phrase boundaries. This approach has two main advantages compared to the traditional, word-based model. First, additional syntactic–prosodic information is

Acknowledgements

This work was funded by the DFG (German Research Foundation) under contract number 810 939-9 and by the German Federal Ministry of Education, Science, Research and Technology (BMBF) in the framework of the Verbmobil Project under the grant 01 IV 701 K5. The responsibility for the contents of this study lies with the authors.

References (28)

  • A. Batliner et al.

M = Syntax + Prosody: A syntactic–prosodic labelling scheme for large spontaneous speech databases

    Speech Communication

    (1998)
  • C. Becchetti et al.

    Speech Recognition – Theory and C++ Implementation

    (1999)
  • Bub, T., Schwinn, J., 1996. Verbmobil: The evolution of a complex large speech-to-speech translation system. In: Proc....
  • Gallwitz, F., 2001. Integrated stochastic models for spontaneous speech recognition. Dissertation, Technische Fakultät...
  • Gallwitz, F., Batliner, A., Buckow, J., Huber, R., Niemann, H., Nöth, E., 1998. Integrated recognition of words and...
  • Heeman, P.A., 1999. Modeling speech repairs and intonational phrasing to improve speech recognition. In: Proc. IEEE...
  • P.A. Heeman et al.

    Modeling speaker's utterances in spoken dialog

    Computational Linguistics

    (1999)
  • Iwano, K., Hirose, K., 1999. Prosodic word boundary detection using statistical modeling of Moraic fundamental...
  • F. Jelinek

    Statistical Methods for Speech Recognition

    (1997)
  • Kießling, A., 1997. Extraktion und Klassifikation prosodischer Merkmale in der automatischen Sprachverarbeitung,...
  • Kompe, R., 1997. Prosody in Speech Understanding Systems, Lecture Notes for Artificial Intelligence. Springer,...
  • Kompe, R., Batliner, A., Kießling, A., Kilian, U., Niemann, H., Nöth, E., Regel-Brietzmann, P., 1994. Automatic...
  • Kompe, R., Kießling, A., Niemann, H., Nöth, E., Schukat-Talamazzini, E., Zottmann, A., Batliner, A., 1995. Prosodic...
  • W. Lea

    Prosodic aids to speech recognition


1 Now with Sympalog Speech Technologies AG, Erlangen.
