Initial Experiments on Automatic Story Segmentation in Chinese Spoken Documents Using Lexical Cohesion of Extracted Named Entities

Li, Devon; Lo, Wai-Kit; Meng, Helen

doi:10.1007/11939993_70

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4274)

Abstract

Story segmentation plays a critical role in spoken document processing. Spoken documents often arrive as a continuous audio stream without explicit story or topic boundaries, so it is important to segment these streams automatically into coherent units. This work is an initial attempt to use informative lexical terms (or key terms) in recognition transcripts of Chinese spoken documents for story segmentation, motivated by the observation that changes in the distribution of informative terms are generally associated with story changes and topic shifts. Our methods of informative lexical term extraction include the extraction of POS-tagged nouns, as well as a named entity identifier that extracts Chinese person names, transliterated person names, and location and organization names. We also adopted a lexical chaining approach that links sentences that are lexically “coherent” with each other. This leads to the definition of a lexical chain score that is used to hypothesize story boundaries. We conducted experiments on the recognition transcripts of the TDT2 Voice of America Mandarin speech corpus and compared several methods of story segmentation: the use of pauses, lexical chains of all lexical entries in the recognition transcripts, lexical chains of nouns tagged by a part-of-speech tagger, and lexical chains of extracted named entities. Lexical chains of informative terms, namely POS-tagged nouns and named entities, were found to give comparable performance (F-measures of 0.71 and 0.73 respectively), which is superior to the use of all lexical entries (F-measure of 0.69).
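
The abstract does not give the exact definition of the lexical chain score, so the following is only a minimal sketch of one plausible formulation, not the authors' method. It assumes that each sentence has already been reduced to a list of informative terms, that a chain links repeated occurrences of a term no more than `max_gap` sentences apart, and that the score at a candidate boundary counts chains ending just before it plus chains starting just after it. The function names, the `max_gap` and `threshold` parameters, the exact-match F-measure, and the toy English terms are all illustrative assumptions.

```python
from collections import defaultdict


def build_chains(sentences, max_gap=3):
    """Return (start, end) sentence indices for each lexical chain.

    A chain links repeated occurrences of one term; it is broken when the
    term is absent for more than `max_gap` sentences (assumed rule).
    """
    positions = defaultdict(list)
    for idx, terms in enumerate(sentences):
        for term in set(terms):
            positions[term].append(idx)

    chains = []
    for idxs in positions.values():
        start = prev = idxs[0]
        for i in idxs[1:]:
            if i - prev > max_gap:
                chains.append((start, prev))
                start = i
            prev = i
        chains.append((start, prev))
    return chains


def chain_scores(sentences, max_gap=3):
    """Score the gap before each sentence i (assumed definition):
    chains ending at sentence i-1 plus chains starting at sentence i.
    Chains covering a single sentence are ignored in this sketch.
    """
    n = len(sentences)
    scores = [0] * n
    for start, end in build_chains(sentences, max_gap):
        if end == start:
            continue
        if start > 0:
            scores[start] += 1
        if end + 1 < n:
            scores[end + 1] += 1
    return scores


def hypothesize_boundaries(scores, threshold):
    """Hypothesize a story boundary before every sentence whose score
    reaches the threshold (sentence 0 is never a boundary)."""
    return [i for i, s in enumerate(scores) if i > 0 and s >= threshold]


def f_measure(hypothesized, reference):
    """Harmonic mean of precision and recall over exactly matching boundaries."""
    hyp, ref = set(hypothesized), set(reference)
    hits = len(hyp & ref)
    if hits == 0:
        return 0.0
    precision = hits / len(hyp)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy transcript: two "stories" with a topic shift before sentence 3.
sentences = [
    ["beijing", "olympics"],
    ["olympics", "beijing"],
    ["beijing", "athletes"],
    ["typhoon", "taiwan"],
    ["taiwan", "typhoon"],
    ["typhoon"],
]
scores = chain_scores(sentences)
print(scores)                             # [0, 0, 1, 3, 0, 1]
print(hypothesize_boundaries(scores, 2))  # [3]
```

Restricting the term lists to nouns or named entities, as the abstract describes, would amount to filtering each sentence before calling `chain_scores`. A real evaluation would also likely accept hypothesized boundaries within some tolerance window of the reference rather than requiring an exact match.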

Author information

Authors and Affiliations

Human-Computer Communications Laboratory, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong
Devon Li, Wai-Kit Lo & Helen Meng

Editor information

Editors and Affiliations

Department of Computer Science, The University of Hong Kong, Hong Kong
Qiang Huo
Human Language Technology Department, Institute for Infocomm Research (I2R), 119613, Singapore
Bin Ma
School of Computer Engineering, Nanyang Technological University (NTU), 639798, Singapore
Eng-Siong Chng
Institute for Infocomm Research, 21 Heng Mui Keng Terrace, 119613, Singapore
Haizhou Li

About this paper

Cite this paper

Li, D., Lo, WK., Meng, H. (2006). Initial Experiments on Automatic Story Segmentation in Chinese Spoken Documents Using Lexical Cohesion of Extracted Named Entities. In: Huo, Q., Ma, B., Chng, ES., Li, H. (eds) Chinese Spoken Language Processing. ISCSLP 2006. Lecture Notes in Computer Science, vol 4274. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11939993_70

DOI: https://doi.org/10.1007/11939993_70
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49665-6
Online ISBN: 978-3-540-49666-3
eBook Packages: Computer Science, Computer Science (R0)
