Speech Communication

Volume 54, Issue 3, March 2012, Pages 445-458

Automatic prosodic event detection using a novel labeling and selection method in co-training

https://doi.org/10.1016/j.specom.2011.10.008

Abstract

Most previous approaches to automatic prosodic event detection are based on supervised learning, relying on the availability of a corpus annotated with the prosodic labels of interest to train the classification models. However, creating such resources is an expensive and time-consuming task. In this paper, we exploit semi-supervised learning with the co-training algorithm for automatic detection of a coarse-level representation of prosodic events: pitch accent, intonational phrase boundaries, and break indices. Since co-training works on the condition that the views are compatible and uncorrelated, and real data often do not satisfy these conditions, we propose a method to label and select examples in co-training. In our experiments on the Boston University Radio News corpus, when using only a small amount of labeled data as the initial training set, our proposed labeling method can effectively use unlabeled data to improve performance, ultimately approaching the results of the supervised method trained on more labeled data. We perform a thorough analysis of various factors impacting the learning curves, including the labeling error rate and informativeness of the added examples, the performance of the individual classifiers and their difference, and the initial and added data sizes.

Highlights

► This study investigates the co-training algorithm for prosodic event labeling.
► We propose a novel labeling and selection scheme to address the co-training algorithm's assumptions.
► We perform a thorough analysis of various factors impacting the performance.

Introduction

Prosody represents suprasegmental information in speech, since the acoustic correlates of a prosodic event extend over more than one phoneme segment. Prosodic phenomena manifest themselves in different ways, including changes in relative intensity to emphasize specific syllables or words, variations of the fundamental frequency (pitch) range and contour, and subtle timing variations, such as syllable lengthening and the insertion of pauses. In spoken language, prosody conveys linguistic and paralinguistic information such as emphasis, intent, attitude, and emotion of a speaker. Listeners rely on prosodic information to aid the interpretation of speech.

In many spoken language processing tasks, prosody also plays an important role because it carries higher-level information that is not completely revealed by segmental acoustics or lexical content. Below we list a few applications where prosody can augment the abilities of spoken language systems.

  • Speech recognition: the correlation between acoustic and lexical-prosodic evidence of pitch accent patterns can be used to reduce word error rate in speech recognition (Chen and Hasegawa-Johnson, 2006, Ananthakrishnan and Narayanan, 2007, Jeon et al., 2011).

  • Dialog act detection: intonation patterns at the end of a sentence are useful indications of specific dialog acts or sentence categories (question, statement, exclamation, etc.) (Shriberg et al., 1998).

  • Lexical and syntactic disambiguation: knowledge of syllable pitch accent patterns and boundary information can help resolve lexical or syntactic ambiguity (Price et al., 1991).

  • Natural speech synthesis: one of the challenges in natural-sounding speech synthesis systems is to generate human-like prosody to accompany the segmental acoustic properties. This includes local effects (such as syllable accents), properly timed boundaries reflecting the syntactic structure of the sentence, and modulation of pitch at a global level to produce appropriate intonation patterns.

For some applications such as speech recognition, raw prosodic features capturing pitch, energy, and duration might be used directly in statistical classifiers. However, these low-level prosodic features are not appropriate for other applications (e.g., they cannot be directly combined with other information sources in the system); instead, symbolic representations of prosody are more useful, for example, knowing whether a syllable carries a pitch accent, or whether a phrase boundary occurs at a given position. Automatic labeling of such prosodic events has received a lot of attention over the past decades because it is important for the speech understanding tasks mentioned above, as well as for the scientific understanding of prosody and its relationship with the lexical, syntactic, and semantic structure of sentences. Many previous efforts on prosodic event detection adopt supervised learning approaches using acoustic, lexical, and syntactic cues. However, the major drawback of these methods is that they require a large hand-labeled training corpus, and system performance is highly dependent on the specific corpus used for model training. It is very expensive and time-consuming to create a reasonable amount of training data annotated with prosodic information.

Limited research has been conducted using unsupervised and semi-supervised methods for automatic prosodic event labeling. In this paper, we exploit semi-supervised learning with the co-training algorithm (Blum and Mitchell, 1998) for this task. Two different views corresponding to the acoustic and lexical/syntactic knowledge sources are used in the co-training framework. We propose a novel labeling and selection scheme to address the problem that natural data do not meet the compatibility and uncorrelatedness conditions required by the co-training algorithm. Our experiments on the Boston University Radio News corpus (Ostendorf et al., 1995) show that our proposed approach works well: using unlabeled data leads to significant improvement for prosodic event detection compared to using the original small training set, yielding results comparable to those from supervised learning with a similar amount of labeled training data. Our analysis shows that reducing the labeling error rate of the examples added for the next iteration does not always improve performance, and that increasing the informativeness of the examples is more useful for performance gain, even if some erroneous examples are included. We also find that the performance gain is affected by the status of the two classifiers, such as the similarity of their views.

The remainder of this paper is organized as follows. In the next section, we provide details of the corpus and the prosodic event detection tasks. Section 3 reviews previous work briefly. In Section 4, we describe the basic classification method for prosodic event detection, including the acoustic and lexical/syntactic prosodic models, and the features used. Section 5 introduces the co-training algorithm we use. Section 6 presents our experimental results and analysis. The final section gives a brief summary along with future directions.

Section snippets

Data and task

Annotation of prosodic events requires appropriate representation schemes that can characterize prosody in a standardized manner. One of the most popular labeling schemes is the Tones and Break Indices (ToBI) framework (Silverman et al., 1992). The most important prosodic phenomena captured within the ToBI framework include pitch accent and prosodic phrase boundaries. A pitch accent can be broadly thought of as a prominence or a stress mark, and prosodic phrasing refers to the perceived …

Previous work

Many previous efforts on prosodic event detection used supervised learning approaches. Each study used different knowledge sources, such as only acoustic or lexical/syntactic information, or a combination of them. A lot of previous work has used the BU corpus for prosodic event detection. In the work by Wightman and Ostendorf (1994), binary pitch accent, intonational phrase boundary (IPB), and break index labels were assigned to syllables based on posterior probabilities computed from acoustic evidence using decision trees (CART), …

Prosodic event detection method

We model the prosody detection problem as a binary classification task. We assume that the acoustic observations are conditionally independent of the lexical/syntactic features given the prosodic label, and thus develop two models separately according to the information sources and then combine them for the final decision, similar to Jeon and Liu (2009a). Fig. 2 shows this framework. A product rule is used to decide the event label l ∈ {presence, absence} using the two classifiers:

l̂ = argmax_l p(l | a_i) · p(l | s_i)

where a_i and s_i denote the acoustic and lexical/syntactic features of the i-th unit, respectively.
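To make the decision rule above concrete, here is a minimal sketch of the product-rule combination, assuming each of the two classifiers outputs posterior probabilities over {absence, presence}; the function name and array layout are illustrative, not from the paper.

```python
import numpy as np

def product_rule_decision(p_acoustic, p_syntactic):
    """Combine the two views' posteriors with a product rule.

    p_acoustic, p_syntactic: (n_units, 2) arrays holding p(l | a_i)
    and p(l | s_i) for l in {absence, presence} (columns 0 and 1).
    Returns the label index maximizing p(l | a_i) * p(l | s_i).
    """
    combined = p_acoustic * p_syntactic   # elementwise product of posteriors
    return combined.argmax(axis=1)        # 0 = absence, 1 = presence

# Example: one syllable where both views lean toward "presence";
# the product rule agrees (0.6 * 0.55 > 0.4 * 0.45).
pa = np.array([[0.4, 0.6]])
ps = np.array([[0.45, 0.55]])
print(product_rule_decision(pa, ps))  # -> [1]
```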

Co-training strategy for prosodic event detection

Many previous efforts on prosodic event detection adopted supervised learning approaches. However, the major drawback of these methods is that they require a hand-labeled data set, which is expensive and time-consuming to create. In this study, we exploit semi-supervised learning that uses only a small amount of hand-labeled data along with a large amount of unlabeled data. We use co-training, proposed by Blum and Mitchell (1998). It is a semi-supervised multi-view algorithm, and applies well to learning …
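The snippet is cut off above; as general background, the sketch below shows a standard confidence-based co-training loop over two views in the style of Blum and Mitchell (1998). It does not reproduce the paper's novel labeling and selection scheme; the shared labeled set, the confidence-based selection, and the parameters n_iter and n_add are simplifying assumptions for illustration.

```python
import numpy as np
from sklearn.base import clone

def co_train(clf_a, clf_s, Xa_l, Xs_l, y_l, Xa_u, Xs_u,
             n_iter=20, n_add=50):
    """Generic co-training over an acoustic view (Xa) and a
    lexical/syntactic view (Xs) of the same examples.

    Xa_l, Xs_l, y_l : the small initial labeled set in both views.
    Xa_u, Xs_u      : the unlabeled pool in both views.
    n_add           : confident examples selected per view per iteration.
    """
    clf_a, clf_s = clone(clf_a), clone(clf_s)
    for _ in range(n_iter):
        if len(Xa_u) == 0:
            break
        clf_a.fit(Xa_l, y_l)
        clf_s.fit(Xs_l, y_l)
        # Each classifier scores the unlabeled pool; pick the examples
        # it is most confident about.
        conf_a = clf_a.predict_proba(Xa_u).max(axis=1)
        conf_s = clf_s.predict_proba(Xs_u).max(axis=1)
        picked = np.unique(np.concatenate(
            [np.argsort(-conf_a)[:n_add], np.argsort(-conf_s)[:n_add]]))
        # Label each picked example with the more confident view's prediction.
        use_a = conf_a[picked] >= conf_s[picked]
        pseudo = np.where(use_a,
                          clf_a.predict(Xa_u[picked]),
                          clf_s.predict(Xs_u[picked]))
        # Move the picked examples from the unlabeled to the labeled set.
        Xa_l = np.vstack([Xa_l, Xa_u[picked]])
        Xs_l = np.vstack([Xs_l, Xs_u[picked]])
        y_l = np.concatenate([y_l, pseudo])
        keep = np.setdiff1d(np.arange(len(Xa_u)), picked)
        Xa_u, Xs_u = Xa_u[keep], Xs_u[keep]
    # Refit on the final augmented labeled set.
    clf_a.fit(Xa_l, y_l)
    clf_s.fit(Xs_l, y_l)
    return clf_a, clf_s
```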

Experimental setup

For the co-training experiments, we randomly choose 5 utterances from speakers f2b and m2b (see Table 1 for data) as the initial training set L, which includes 560 syllables and 378 words. In the initial training set, the positive event ratios are 36%, 28%, and 20% for the pitch accent, break index, and IPB tasks, respectively. All the data from f1a and m1b (103 utterances) are used for testing. As shown in the labeled data part of Table 1, the positive event ratios of the test data are 35%, 27%, and 18% for …
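The data split above is straightforward to script; the helper below is a hypothetical sketch following the paper's description (5 random utterances from f2b and m2b as the initial labeled set, all f1a and m1b utterances as the test set). Treating the remaining training-speaker utterances as the unlabeled pool is an assumption here; the snippet does not specify the pool.

```python
import random

def make_splits(utts_by_speaker, seed=0):
    """Split BU corpus utterances into an initial labeled set L,
    an unlabeled pool U (assumed), and a test set."""
    rng = random.Random(seed)
    train_pool = utts_by_speaker["f2b"] + utts_by_speaker["m2b"]
    initial_L = rng.sample(train_pool, 5)
    unlabeled = [u for u in train_pool if u not in initial_L]
    test = utts_by_speaker["f1a"] + utts_by_speaker["m1b"]
    return initial_L, unlabeled, test
```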

Conclusion and future work

In this paper, we proposed a novel labeling and selection method under the co-training framework for prosodic event detection tasks. The co-training algorithm relies on two assumptions: that the views are compatible and uncorrelated. However, real-world data rarely meet these requirements, and our proposed labeling method aimed to address this problem. We conducted experiments on the BU corpus and demonstrated that our proposed co-training approach achieved significantly better performance than using the …

Acknowledgment

This work is partly supported by an award from the US Air Force Office of Scientific Research, FA9550-10-1-0388.

References

  • Hirschberg, J., 1993. Pitch accent in context: predicting intonational prominence from text. Artificial Intelligence.
  • Ananthakrishnan, S., Narayanan, S., 2006. Combining acoustic, lexical, and syntactic evidence for automatic...
  • Ananthakrishnan, S., Narayanan, S., 2007. Improved speech recognition using acoustic and lexical correlates of pitch...
  • Ananthakrishnan, S., et al., 2008. Automatic prosodic event detection using acoustic, lexical and syntactic evidence. IEEE Transactions on Audio, Speech, and Language Processing.
  • Balcan, M., et al., 2005. Co-training and expansion: towards bridging theory and practice. In: Advances in Neural Information Processing Systems.
  • Bartlett, S., Kondrak, G., Cherry, C., 2009. On the syllabification of phonemes. In: Proceedings of NAACL-HLT, pp....
  • Blum, A., Mitchell, T., 1998. Combining labeled and unlabeled data with co-training. In: Proceedings of the Workshop on...
  • Boersma, P., 2001. Praat, a system for doing phonetics by computer. Glot International.
  • Brants, T., 2000. TnT – a statistical part-of-speech tagger. In: Proceedings of ANLP-NAACL, pp....
  • Chen, K., et al., 2006. Prosody dependent speech recognition on radio news corpus of American English. IEEE Transactions on Audio, Speech, and Language Processing.
  • Chen, K., Hasegawa-Johnson, M., Cohen, A., 2004. An automatic prosody labeling system using ANN-based...
  • Clark, S., Curran, J.R., Osborne, M., 2003. Bootstrapping POS-taggers using unlabelled data. In: Proceedings of CoNLL,...
  • Dasgupta, S., et al., 2001. PAC generalization bounds for co-training. In: Advances in Neural Information Processing Systems.
  • Dehak, N., et al., 2007. Modeling prosodic features with joint factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing.
  • Goldman, S., Zhou, Y., 2000. Enhancing supervised learning with unlabeled data. In: Proceedings of ICML, pp....
  • Grabe, E., Kochanski, G., Coleman, J., 2003. Quantitative modelling of intonational variation. In: Proceedings of...
  • Gregory, M.L., Altun, Y., 2004. Using conditional random fields to predict pitch accents in conversational speech. In:...
  • Guz, U., Cuendet, S., Hakkani-Tür, D., Tur, G., 2007. Co-training using prosodic and lexical information for sentence...
  • Jeon, J.H., Liu, Y., 2009a. Automatic prosodic events detection using syllable-based acoustic and syntactic features....
  • Jeon, J.H., Liu, Y., 2009b. Semi-supervised learning for automatic prosodic event detection using co-training...
  • Jeon, J.H., Liu, Y., 2010. Syllable-level prominence detection with acoustic evidence. In: Proceedings of Interspeech,...