Abstract
Text segmentation has played an important role in both information retrieval and natural language processing. Current segmentation methods are well suited to written, structured texts because they exploit the distinctive macro-level structures of such texts; however, segmenting transcribed multi-party conversation presents a different challenge, given its ill-formed sentences and lack of macro-level text units. This paper describes an algorithm for segmenting spoken meeting transcripts that combines semantically complex lexical relations with speech cue phrases to build lexical chains, which are then used to determine topic boundaries.
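The general idea of locating topic boundaries from lexical chains and cue phrases can be illustrated with a small sketch. This is not the algorithm described in the paper: it assumes chains are formed by plain word repetition (rather than semantically complex lexical relations), and the cue-phrase list, stopword list, `max_gap`, `cue_weight`, and `threshold` are all hypothetical values chosen for illustration.

```python
import re
from collections import defaultdict

# Hypothetical cue phrases and stopwords (assumptions, not taken from the paper).
CUE_PHRASES = ("okay", "right", "anyway", "moving on", "next")
STOPWORDS = {"the", "a", "an", "we", "to", "of", "for", "is", "by", "on", "should", "yes"}


def content_words(utterance):
    """Lowercase the utterance and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", utterance.lower()) if w not in STOPWORDS]


def build_chains(utterances, max_gap=2):
    """Build repetition-based chains as (word, first_index, last_index).

    A chain is broken when two occurrences of a word are more than
    max_gap utterances apart; single occurrences do not form a chain.
    """
    positions = defaultdict(list)
    for i, utt in enumerate(utterances):
        for word in set(content_words(utt)):
            positions[word].append(i)

    chains = []
    for word, idxs in positions.items():
        start = prev = idxs[0]
        for i in idxs[1:]:
            if i - prev > max_gap:
                if prev > start:
                    chains.append((word, start, prev))
                start = i
            prev = i
        if prev > start:
            chains.append((word, start, prev))
    return chains


def boundary_scores(utterances, chains, cue_weight=1):
    """scores[i] is the evidence for a topic boundary just before utterance i."""
    scores = [0] * len(utterances)
    for _, start, end in chains:
        if start > 0:
            scores[start] += 1          # a chain begins here
        if end + 1 < len(utterances):
            scores[end + 1] += 1        # a chain ended on the previous utterance
    for i, utt in enumerate(utterances):
        if i > 0 and utt.lower().startswith(CUE_PHRASES):
            scores[i] += cue_weight     # utterance opens with a cue phrase
    return scores


def segment(utterances, threshold=2):
    """Return the indices of utterances that are taken to open a new topic."""
    scores = boundary_scores(utterances, build_chains(utterances))
    return [i for i, s in enumerate(scores) if s >= threshold]


transcript = [
    "The budget for the project is tight",
    "We should cut the travel budget",
    "Yes the budget needs approval",
    "Okay, moving on to the schedule",
    "The schedule slips by two weeks",
    "Testing delays the schedule again",
]
print(segment(transcript))
```

In this toy transcript the "budget" chain ends at the third utterance and the "schedule" chain starts at the fourth, which also opens with a cue phrase, so the three sources of evidence coincide at a single boundary. A fuller treatment would replace the repetition test with thesaurus- or WordNet-style relations between words.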
Cite this article
Sharp, B., Chibelushi, C. Text segmentation of spoken meeting transcripts. Int J Speech Technol 11, 157 (2008). https://doi.org/10.1007/s10772-009-9048-2
DOI: https://doi.org/10.1007/s10772-009-9048-2