Abstract
This paper investigates speech prosody for automatic story segmentation in Mandarin broadcast news. Prosodic cues that work well for English story segmentation deserve re-investigation, since the lexical tones of Mandarin may complicate the expression of pitch declination and reset. Our data-oriented study shows that speaker-normalized pitch features cannot clearly discriminate story boundaries from utterance boundaries, because they vary widely across the different Mandarin syllable tones. We therefore propose speaker- and tone-normalized pitch features, which separate utterance and story boundaries clearly. Our study also shows that speaker-normalized pause duration is quite effective at separating story boundaries from utterance boundaries, while speaker-normalized speech energy and syllable duration are not. Experiments using decision trees for story boundary detection reinforce the difference between English and Chinese: speaker- and tone-normalized pitch features should be adopted for Mandarin story segmentation. Combining different prosodic cues achieves a high F-measure of 93.04%, owing to the complementarity between pause, pitch and energy. Analysis of the decision tree uncovers five major heuristics that show how speakers jointly use pause duration and pitch to separate speech into stories.
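As a rough illustration of what speaker- and tone-normalized pitch could look like, the sketch below z-scores each F0 value against the mean and standard deviation of its own (speaker, tone) group. The abstract does not give the exact normalization formula, so the z-score form, the function name normalize_pitch, and the (speaker, tone, f0_hz) input layout are all assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_pitch(samples):
    """Z-score each F0 value within its (speaker, tone) group.

    samples: list of (speaker, tone, f0_hz) tuples (hypothetical layout).
    Returns a list of normalized values in the same order as the input.
    """
    # Collect the raw F0 values of every (speaker, tone) group.
    groups = defaultdict(list)
    for speaker, tone, f0 in samples:
        groups[(speaker, tone)].append(f0)

    # Per-group mean and population standard deviation.
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}

    normalized = []
    for speaker, tone, f0 in samples:
        mu, sigma = stats[(speaker, tone)]
        # A group with no spread carries no pitch information; map it to 0.
        normalized.append((f0 - mu) / sigma if sigma > 0 else 0.0)
    return normalized


# Made-up values: two speakers, Mandarin tones 1 and 4.
samples = [
    ("A", 1, 200.0), ("A", 1, 220.0),
    ("A", 4, 150.0), ("A", 4, 170.0),
    ("B", 1, 100.0), ("B", 1, 120.0),
]
z = normalize_pitch(samples)
```

Grouping by speaker alone would leave the systematic pitch differences between tones in the feature. Grouping by both speaker and tone removes them, which is the motivation the abstract gives for the tone-normalized variant.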
Additional information
Communicated by Changsheng Xu.
About this article
Cite this article
Xie, L. Discovering salient prosodic cues and their interactions for automatic story segmentation in Mandarin broadcast news. Multimedia Systems 14, 237–253 (2008). https://doi.org/10.1007/s00530-008-0141-1
DOI: https://doi.org/10.1007/s00530-008-0141-1