An Automatic Approach for Efficient Text Segmentation

  • Conference paper
Knowledge-Based Intelligent Information and Engineering Systems (KES 2006)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4251)

Abstract

This paper presents a domain-independent approach for partitioning a text document into a set of topic-coherent segments, where the structure of the segments reflects the pattern of sub-topics in the document. The approach uses a similarity analysis based on Shannon information theory to determine how topics are distributed across the text, without relying on a thesaurus or other auxiliary knowledge bases. It first examines the document from the viewpoint of each individual word, measuring how consistently that word is distributed, and constructs a number of segmentation proposals accordingly. It then applies the K-means clustering technique to reach a consensus among these proposals and finally partitions the text into a set of topic-coherent paragraphs. Extensive experiments on real and synthetic data sources illustrate the effectiveness of the approach for text segmentation.
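The abstract names three stages: per-word distribution analysis, per-word segmentation proposals, and a K-means consensus. It does not give the formulas, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that each repeated word proposes the single boundary that minimizes the length-weighted Shannon entropy of its presence on either side, and that a plain one-dimensional K-means over those proposals yields the consensus. The names `binary_entropy`, `best_split`, `kmeans_1d`, `segment`, and the `min_count` parameter are all hypothetical.

```python
import math

def binary_entropy(p):
    """Shannon entropy (in bits) of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_split(presence):
    """Assumed proposal rule: given a 0/1 presence flag per sentence,
    return the boundary b (a cut before sentence index b) that minimizes
    the length-weighted entropy of the two sides, or None if the word
    appears in no sentence or in every sentence."""
    n, total = len(presence), sum(presence)
    if total == 0 or total == n:
        return None
    best_b, best_h, left = None, float("inf"), 0
    for b in range(1, n):
        left += presence[b - 1]
        right = total - left
        h = (b * binary_entropy(left / b)
             + (n - b) * binary_entropy(right / (n - b))) / n
        if h < best_h:  # strict: ties keep the earliest boundary
            best_b, best_h = b, h
    return best_b

def kmeans_1d(points, k, iters=50):
    """Plain Lloyd-style K-means on scalars with a deterministic init."""
    pts = sorted(points)
    centers = [pts[(2 * i + 1) * len(pts) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        updated = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)

def segment(sentences, k, min_count=2):
    """Return up to k consensus boundaries (sentence indices)."""
    tokenized = [set(s.lower().split()) for s in sentences]
    vocab = set().union(*tokenized)
    proposals = []
    for word in vocab:
        presence = [1 if word in t else 0 for t in tokenized]
        if sum(presence) < min_count:  # skip words seen too rarely
            continue
        b = best_split(presence)
        if b is not None:
            proposals.append(b)
    if not proposals:
        return []
    k = min(k, len(set(proposals)))
    return sorted({round(c) for c in kmeans_1d(proposals, k)})

doc = [
    "cats purr softly", "cats chase mice", "cats sleep often",
    "stocks rise today", "stocks fall fast", "stocks trade daily",
    "rain falls hard", "rain floods roads", "rain stops late",
]
print(segment(doc, k=2))  # prints [3, 6]
```

In this toy document only "cats", "stocks", and "rain" recur, so they are the only words that contribute proposals. The number of clusters is fixed by the caller here; how the paper chooses it is not stated in the abstract.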

This work is supported by the Key Science and Technology Plan of Zhejiang Province, China (Grant no. 2005C23047).


Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cai, K., Bu, J., Chen, C., Huang, P. (2006). An Automatic Approach for Efficient Text Segmentation. In: Gabrys, B., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2006. Lecture Notes in Computer Science, vol 4251. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11892960_51

  • DOI: https://doi.org/10.1007/11892960_51

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-46535-5

  • Online ISBN: 978-3-540-46536-2

  • eBook Packages: Computer Science (R0)
