An Automatic Approach for Efficient Text Segmentation

Cai, Keke; Bu, Jiajun; Chen, Chun; Huang, Peng

doi:10.1007/11892960_51

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4251)

Included in the following conference series:

International Conference on Knowledge-Based and Intelligent Information and Engineering Systems

Abstract

This paper presents a domain-independent approach for partitioning a text document into topic-coherent segments whose structure reflects the document's pattern of sub-topics. The approach uses a similarity analysis based on Shannon information theory to determine the topic distribution of the text, without relying on a thesaurus or other auxiliary knowledge bases. It first examines, for each individual word, how consistently that word is distributed across the document, and constructs a number of segmentation proposals accordingly. It then applies K-means clustering to reach a consensus among these proposals and finally partitions the text into a set of topic-coherent paragraphs. Extensive experiments on real and synthetic data sources illustrate the effectiveness of the approach for text segmentation.
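The proposal-then-consensus idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not give the information-theoretic similarity measure, so the sketch swaps in a simple heuristic (a word that appears in two or more sentences proposes a boundary before its first and after its last occurrence) and then clusters the proposed positions with a plain one-dimensional K-means. The function names `propose_boundaries`, `kmeans_1d`, and `consensus_boundaries`, and the choice of `k`, are all assumptions made for illustration.

```python
def propose_boundaries(sentences):
    """Hypothetical per-word proposals (NOT the paper's measure).

    A word seen in 2+ sentences proposes a boundary before its first
    occurrence and after its last occurrence. A boundary value b means
    "cut between sentence b-1 and sentence b".
    """
    first, last, count = {}, {}, {}
    for i, sent in enumerate(sentences):
        for w in set(sent.lower().split()):
            first.setdefault(w, i)
            last[w] = i
            count[w] = count.get(w, 0) + 1
    n = len(sentences)
    proposals = []
    for w, c in count.items():
        if c < 2:
            continue
        if first[w] > 0:
            proposals.append(first[w])
        if last[w] + 1 < n:
            proposals.append(last[w] + 1)
    return sorted(proposals)


def kmeans_1d(points, k, iters=50):
    """Plain 1-D K-means with a deterministic, evenly spread initialization."""
    pts = sorted(points)
    centers = [pts[(2 * j + 1) * len(pts) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            groups[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        new = [sum(g) / len(g) if g else centers[j] for j, g in enumerate(groups)]
        if new == centers:
            break
        centers = new
    return sorted(centers)


def consensus_boundaries(sentences, k):
    """Cluster all word-level proposals into at most k consensus boundaries."""
    props = propose_boundaries(sentences)
    if not props:
        return []
    k = max(1, min(k, len(set(props))))
    return sorted({round(c) for c in kmeans_1d(props, k)})


sentences = [
    "cats purr softly",
    "cats chase mice",
    "mice fear cats",
    "stocks fell today",
    "stocks and bonds",
    "bonds rallied today",
]
print(consensus_boundaries(sentences, k=1))
```

On this toy input most words vote for a cut before the fourth sentence, and the single cluster center settles there. The number of clusters is fixed by hand here; the abstract does not say how the number of segments is chosen.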

This work is supported by the Key Science and Technology Plan of Zhejiang Province, China (Grant no. 2005C23047).

Author information

Authors and Affiliations

College of Computer Science, Zhejiang University, Hangzhou, 310027, China
Keke Cai, Jiajun Bu, Chun Chen & Peng Huang

Editor information

Editors and Affiliations

School of Design, Engineering and Computing, Bournemouth University, UK
Bogdan Gabrys
Centre for SMART Systems, School of Environment and Technology, University of Brighton, BN2 4GJ, Brighton, UK
Robert J. Howlett
School of Electrical and Information Engineering, Knowledge Based Intelligent Engineering Systems Centre, University of South Australia, Mawson Lakes, 5095, SA, Australia
Lakhmi C. Jain

About this paper

Cite this paper

Cai, K., Bu, J., Chen, C., Huang, P. (2006). An Automatic Approach for Efficient Text Segmentation. In: Gabrys, B., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2006. Lecture Notes in Computer Science, vol 4251. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11892960_51

DOI: https://doi.org/10.1007/11892960_51
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46535-5
Online ISBN: 978-3-540-46536-2
eBook Packages: Computer Science, Computer Science (R0)
