Text segmentation of spoken meeting transcripts

International Journal of Speech Technology

Abstract

Text segmentation plays an important role in both information retrieval and natural language processing. Current segmentation methods are well suited to written, structured texts because they exploit the distinctive macro-level structures of such texts. Segmenting transcribed multi-party conversation presents a different challenge, given its ill-formed sentences and its lack of macro-level text units. This paper describes an algorithm for segmenting spoken meeting transcripts that combines semantically complex lexical relations with speech cue phrases to build lexical chains, which are then used to determine topic boundaries.
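The general idea of combining lexical chains with cue phrases can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not give its details, so everything here is an assumption made for illustration. Chains are built from plain word repetition only (a stand-in for the richer lexical relations the paper refers to), and the cue phrase list, the stopword list, the `max_gap` and `cue_weight` parameters, and the scoring rule are all hypothetical.

```python
# Hypothetical cue phrases and stopwords; the paper's own lists are not given in the abstract.
CUE_PHRASES = ("okay", "right", "anyway", "moving on", "next")
STOPWORDS = {"the", "a", "an", "to", "of", "and", "we", "is", "it",
             "in", "on", "for", "that", "this", "i", "you"}


def tokenize(utterance):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    tokens = (w.strip(".,?!").lower() for w in utterance.split())
    return [t for t in tokens if t]


def build_chains(utterances, max_gap=2):
    """Build repetition-based chains as (word, first_index, last_index).

    A chain is extended when the word recurs within max_gap utterances;
    otherwise the old chain is closed and a new one is started.
    """
    open_chains = {}  # word -> [first_index, last_index]
    chains = []
    for i, utt in enumerate(utterances):
        for word in set(tokenize(utt)) - STOPWORDS:
            span = open_chains.get(word)
            if span is not None and i - span[1] <= max_gap:
                span[1] = i
            else:
                if span is not None:
                    chains.append((word, span[0], span[1]))
                open_chains[word] = [i, i]
    chains.extend((w, s[0], s[1]) for w, s in open_chains.items())
    return chains


def starts_with_cue(utterance):
    """True if the utterance opens with one of the (assumed) cue phrases."""
    text = " ".join(tokenize(utterance))
    return any(text == p or text.startswith(p + " ") for p in CUE_PHRASES)


def boundary_scores(utterances, cue_weight=2, max_gap=2):
    """scores[i] is the evidence for a topic boundary just before utterance i."""
    # Keep only chains that span more than one utterance.
    chains = [c for c in build_chains(utterances, max_gap) if c[2] > c[1]]
    scores = [0] * len(utterances)
    for _, first, last in chains:
        if first > 0:
            scores[first] += 1          # a chain starts here
        if last + 1 < len(utterances):
            scores[last + 1] += 1       # a chain ended just before here
    for i, utt in enumerate(utterances):
        if i > 0 and starts_with_cue(utt):
            scores[i] += cue_weight     # a cue phrase opens this utterance
    return scores


def segment(utterances, threshold=3):
    """Return indices of utterances that begin a new topic segment."""
    return [i for i, s in enumerate(boundary_scores(utterances)) if s >= threshold]


if __name__ == "__main__":
    meeting = [
        "The budget report is late again.",
        "Yes, the budget needs approval.",
        "Finance wants the budget report by Friday.",
        "Okay, moving on, the server migration.",
        "The migration needs a new server.",
        "We can schedule the server migration for Monday.",
    ]
    print(segment(meeting))
```

In this toy transcript the chains for "budget" and "report" end where the chains for "server" and "migration" begin, and the same utterance opens with a cue phrase, so the evidence for a boundary accumulates at that point. A fuller treatment would replace exact repetition with thesaurus-based relations such as synonymy or hypernymy.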

Author information

Corresponding author

Correspondence to Bernadette Sharp.

About this article

Cite this article

Sharp, B., Chibelushi, C. Text segmentation of spoken meeting transcripts. Int J Speech Technol 11, 157 (2008). https://doi.org/10.1007/s10772-009-9048-2

  • DOI: https://doi.org/10.1007/s10772-009-9048-2