Abstract
This paper presents a domain-independent approach for partitioning a text document into a set of topic-coherent segments, where the structure of the segments reflects the sub-topic patterns of the processed document. The approach uses similarity analysis based on Shannon information theory to determine the topic distribution in text documents, without relying on thesaurus information or other auxiliary knowledge bases. It first examines how consistently each individual word is distributed across the document and constructs a number of segmentation proposals accordingly. It then applies the K-means clustering technique to reach a consensus among these proposals and finally partitions the text into a set of topic-coherent paragraphs. Extensive experimental studies on real and synthetic data sources illustrate the effectiveness of the approach in text segmentation.
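The abstract does not give the formulas, so the following is only a minimal sketch of the two ingredients it names, not the authors' method. It assumes that each word contributes a proposal in the form of a list of candidate boundary positions (sentence indices), and that consensus means clustering all proposed positions with one-dimensional K-means. The function names `shannon_entropy`, `kmeans_1d`, and `consensus_boundaries` are hypothetical, as is the evenly spread initialisation of the cluster centers.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a word's occurrence counts across text blocks."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def kmeans_1d(points, k, iters=50):
    """Plain Lloyd's algorithm on scalar points (assumed initialisation: evenly spread centers)."""
    lo, hi = min(points), max(points)
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        # Assign every point to its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; empty groups keep their center.
        new_centers = [
            sum(g) / len(g) if g else centers[i] for i, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)


def consensus_boundaries(proposals, k):
    """Merge per-word boundary proposals (hypothetical format) into k consensus boundaries."""
    points = [b for proposal in proposals for b in proposal]
    return [round(c) for c in kmeans_1d(points, k)]
```

For example, `consensus_boundaries([[3, 8], [4, 9], [3, 9]], k=2)` merges three per-word proposals into the two boundaries `[3, 9]`. A word whose occurrences are concentrated in one block has entropy 0, while a word spread evenly over two blocks has entropy 1 bit, which is one plausible way to quantify the consistency of a word's distribution.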
This work is supported by the Key Science and Technology Plan of Zhejiang Province, China (Grant no. 2005C23047).
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cai, K., Bu, J., Chen, C., Huang, P. (2006). An Automatic Approach for Efficient Text Segmentation. In: Gabrys, B., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2006. Lecture Notes in Computer Science, vol 4251. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11892960_51
DOI: https://doi.org/10.1007/11892960_51
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46535-5
Online ISBN: 978-3-540-46536-2
eBook Packages: Computer Science (R0)