A Multi-level Model for Recognition of Intonation Labels

Ostendorf, M.; Ross, K.

doi:10.1007/978-1-4612-2258-3_19


Abstract

Prosodic patterns can be an important source of information for interpreting an utterance, but because its suprasegmental nature poses a challenge to computational modelling, prosody has seen limited use in automatic speech understanding. This work describes a new computational model of prosody aimed at recognizing detailed intonation patterns, namely the locations of both pitch accents and phrase boundaries together with their specific tonal markers, using a multi-level representation to capture acoustic feature dependence at different time scales. The model assumes that an utterance is a sequence of phrases, each of which is composed of a sequence of syllable-level tone labels, which are in turn realized as a sequence of acoustic feature vectors (fundamental frequency and energy) depending in part on the segmental composition of the syllable. The variable lengths are explicitly modelled in a probabilistic representation of the complete sequence, using a dynamical system model at the syllable level that builds on existing models of intonation. Recognition and training algorithms are described, and initial experimental results are reported for prosodic labelling of radio news speech.
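The hierarchy described in the abstract (utterance, phrases, syllable-level tone labels, frame-level acoustic vectors) can be pictured with a small generative sketch. The code below is only an illustration under stated assumptions, not the chapter's model: the class names, the label set, the target values in `TONE_TARGETS`, the smoothing constant `a`, and the choice to reset the state at each phrase start are all hypothetical. It uses a simple first-order linear dynamical system in which a hidden (F0, energy) state drifts toward a per-label target and each frame is a noisy observation of that state.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical per-label targets: (F0 in Hz, energy in dB).
# Illustrative values only; they are not taken from the chapter.
TONE_TARGETS = {
    "none": (120.0, 60.0),
    "H*": (180.0, 66.0),
    "L*": (100.0, 62.0),
}


@dataclass
class Syllable:
    tone: str      # syllable-level tone label
    n_frames: int  # variable length, in frames
    frames: List[Tuple[float, float]] = field(default_factory=list)


@dataclass
class Phrase:
    syllables: List[Syllable]


@dataclass
class Utterance:
    phrases: List[Phrase]


def generate_syllable(syl, state, rng, a=0.8, noise=1.0):
    """Fill syl.frames from a first-order linear dynamical system.

    The hidden state moves toward the target of the syllable's tone label;
    each observed frame is the state plus Gaussian observation noise.
    Returns the final hidden state so the next syllable can continue from it.
    """
    target = TONE_TARGETS[syl.tone]
    syl.frames = []
    for _ in range(syl.n_frames):
        state = tuple(
            a * s + (1 - a) * t + rng.gauss(0, noise)
            for s, t in zip(state, target)
        )
        syl.frames.append(tuple(s + rng.gauss(0, noise) for s in state))
    return state


def generate_utterance(utt, seed=0, a=0.8, noise=1.0):
    """Generate frames for every syllable of every phrase, in order."""
    rng = random.Random(seed)
    for phrase in utt.phrases:
        # Assumption: the hidden state is reset at the start of each phrase.
        state = TONE_TARGETS["none"]
        for syl in phrase.syllables:
            state = generate_syllable(syl, state, rng, a=a, noise=noise)
    return utt


# Usage: two phrases whose syllables have different lengths.
example = Utterance([
    Phrase([Syllable("none", 3), Syllable("H*", 5)]),
    Phrase([Syllable("L*", 4)]),
])
generate_utterance(example, seed=1)
for p_idx, phrase in enumerate(example.phrases):
    for syl in phrase.syllables:
        print(p_idx, syl.tone, len(syl.frames))
```

Carrying the hidden state across syllable boundaries within a phrase is one way to express dependence at more than one time scale; a recognizer would invert this kind of generative process to recover the labels, which this sketch does not attempt.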




Editor information

Editors and Affiliations

ATR Interpreting Telecommunications Research Labs, 2-2, Hikaridai, Seika-cho, Soraku-gun, 619-02, Kyoto, Japan
Yoshinori Sagisaka, Nick Campbell & Norio Higuchi


Cite this chapter

Ostendorf, M., Ross, K. (1997). A Multi-level Model for Recognition of Intonation Labels. In: Sagisaka, Y., Campbell, N., Higuchi, N. (eds) Computing Prosody. Springer, New York, NY. https://doi.org/10.1007/978-1-4612-2258-3_19


DOI: https://doi.org/10.1007/978-1-4612-2258-3_19
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4612-7476-6
Online ISBN: 978-1-4612-2258-3
eBook Packages: Springer Book Archive
