Abstract
Large volumes of unstructured text, such as scientific publications, commercial reports, and webpages, must be categorized quickly into semantic groups to support online information queries. However, state-of-the-art clustering methods struggle with very large document collections and high-dimensional text features. In this paper, we propose an efficient clustering algorithm for large document collections that works in three stages: 1) identifying informative topic words with a permutation test, so as to reduce the feature dimension; 2) selecting a small number of the most typical documents to perform an initial clustering; and 3) refining the clustering on all documents. The algorithm was tested on the 20 Newsgroups data set. Compared with methods that cluster the corpus directly using all documents and the full feature set, the approach reduced the time cost by an order of magnitude while slightly improving clustering quality.
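The three stages can be illustrated with a minimal Python sketch. The abstract does not state the test statistic, how "typical" documents are chosen, or which base clustering method is used, so everything below is an assumption made for illustration: the permutation test compares the variance of a word's per-document counts against token shuffles, "typical" means the highest share of informative tokens, and the clustering is a small cosine-based k-means. All function names and the toy corpus are hypothetical.

```python
import math
import random
from collections import Counter

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def informative_words(docs, n_perm=200, alpha=0.05, seed=0):
    """Stage 1 (assumed form): keep words whose per-document counts vary
    more than expected when all tokens are shuffled across documents."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    observed = {w: variance([d.count(w) for d in docs]) for w in vocab}
    exceed = Counter()
    pool = [w for d in docs for w in d]
    for _ in range(n_perm):
        rng.shuffle(pool)
        start, shuffled = 0, []
        for d in docs:  # re-split the shuffled tokens, keeping document lengths
            shuffled.append(pool[start:start + len(d)])
            start += len(d)
        for w in vocab:
            if variance([s.count(w) for s in shuffled]) >= observed[w]:
                exceed[w] += 1
    # permutation p-value with the usual +1 correction
    return [w for w in vocab if (exceed[w] + 1) / (n_perm + 1) <= alpha]

def unit_vector(doc, words):
    counts = [doc.count(w) for w in words]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v))

def nearest(vec, centroids):
    return max(range(len(centroids)), key=lambda j: cosine(vec, centroids[j]))

def mean_direction(vectors):
    total = [sum(col) for col in zip(*vectors)]
    norm = math.sqrt(sum(t * t for t in total)) or 1.0
    return [t / norm for t in total]

def kmeans(vectors, centroids, n_iter=10):
    centroids = list(centroids)
    for _ in range(n_iter):
        labels = [nearest(v, centroids) for v in vectors]
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean_direction(members)
    return [nearest(v, centroids) for v in vectors], centroids

def cluster(docs, k=2, n_seed=4):
    words = informative_words(docs)                       # stage 1
    vectors = [unit_vector(d, words) for d in docs]
    # Stage 2 (assumed): "typical" = highest share of informative tokens.
    share = [sum(d.count(w) for w in words) / len(d) for d in docs]
    seeds = sorted(range(len(docs)), key=lambda i: -share[i])[:n_seed]
    seed_vecs = [vectors[i] for i in seeds]
    centroids = [seed_vecs[0]]
    while len(centroids) < k:                             # farthest-first init
        centroids.append(min(seed_vecs,
                             key=lambda v: max(cosine(v, c) for c in centroids)))
    _, centroids = kmeans(seed_vecs, centroids)
    labels, _ = kmeans(vectors, centroids)                # stage 3: all documents
    return words, labels

def toy_doc(a, b, n_a, n_filler):
    return [a] * n_a + [b] * (10 - n_a) + ["the"] * n_filler + ["and"] * n_filler

shapes = [(6, 2), (4, 4), (5, 3), (5, 3)]
docs = ([toy_doc("rocket", "orbit", a, f) for a, f in shapes]
        + [toy_doc("goal", "team", a, f) for a, f in shapes])
words, labels = cluster(docs)
print(words, labels)
```

In this sketch the expensive step, the permutation test, runs once over the vocabulary, and the initial clustering touches only `n_seed` documents; the full collection is visited only in the final refinement pass, which reflects the cost-saving idea the abstract describes.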
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Han, B., Kang, L., Song, H. (2005). An Efficient Clustering Approach for Large Document Collections. In: Li, X., Wang, S., Dong, Z.Y. (eds) Advanced Data Mining and Applications. ADMA 2005. Lecture Notes in Computer Science, vol 3584. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11527503_29
DOI: https://doi.org/10.1007/11527503_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27894-8
Online ISBN: 978-3-540-31877-4
eBook Packages: Computer Science