An Efficient Clustering Approach for Large Document Collections

Han, Bo; Kang, Lishan; Song, Huazhu

doi:10.1007/11527503_29

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3584)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

A vast amount of unstructured text data, such as scientific publications, commercial reports and webpages, must be quickly categorized into semantic groups to facilitate online information queries. However, state-of-the-art clustering methods suffer from the sheer size of document collections and their high-dimensional text features. In this paper, we propose an efficient clustering algorithm for large document collections, which performs clustering in three stages: 1) identifying informative topic words with a permutation test, so as to reduce the feature dimension; 2) selecting a small number of the most typical documents to perform initial clustering; and 3) refining the clustering on all documents. The algorithm was tested on the 20 Newsgroups data. Experimental results showed that, compared with methods that cluster the corpus directly on all document samples and the full feature set, this approach significantly reduced the time cost, by an order of magnitude, while slightly improving the clustering quality.
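The three stages can be illustrated with a minimal sketch. The abstract does not specify the test statistic, what is permuted, how "typical" documents are chosen, or the base clustering method, so everything below is an assumption made for illustration: provisional groups come from a quick spherical k-means on a random sample, the permutation test shuffles those group labels and compares the spread of per-group word means, typical documents are taken to be those with the most mass on the retained words, and the names `three_stage_cluster`, `permutation_pvalues`, `farthest_first`, `n_typical` and `alpha` are hypothetical. The data is synthetic, not the 20 Newsgroups corpus.

```python
import numpy as np

def unit_rows(X):
    """Scale rows to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

def farthest_first(X, k):
    """Deterministic seeding: start at row 0, then add the row least similar to the seeds."""
    chosen = [0]
    for _ in range(k - 1):
        closest = (X @ X[chosen].T).max(axis=1)
        chosen.append(int(np.argmin(closest)))
    return X[chosen].copy()

def kmeans(X, k, centroids=None, n_iter=10):
    """Spherical k-means (an assumed base clusterer); returns (labels, centroids)."""
    X = unit_rows(np.asarray(X, dtype=float))
    centroids = farthest_first(X, k) if centroids is None else np.array(centroids, dtype=float)
    for _ in range(n_iter):
        labels = np.argmax(X @ centroids.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an emptied cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
        centroids = unit_rows(centroids)
    return np.argmax(X @ centroids.T, axis=1), centroids

def permutation_pvalues(X, labels, n_perm, rng):
    """Per-word p-value: how often shuffled labels match or beat the observed spread of group means."""
    def spread(lab):
        return np.stack([X[lab == g].mean(axis=0) for g in np.unique(lab)]).var(axis=0)
    observed = spread(labels)
    hits = np.zeros(X.shape[1])
    for _ in range(n_perm):
        hits += spread(rng.permutation(labels)) >= observed
    return (hits + 1) / (n_perm + 1)

def three_stage_cluster(X, k, n_typical, alpha=0.05, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Stage 1: keep words whose group-wise spread is unlikely under shuffled labels.
    # Assumption: provisional groups come from clustering a random sample.
    sample = rng.choice(len(X), size=n_typical, replace=False)
    provisional, _ = kmeans(X[sample], k)
    keep = permutation_pvalues(X[sample], provisional, n_perm, rng) <= alpha
    if not keep.any():  # fall back to all words if nothing passes the test
        keep[:] = True
    Xr = X[:, keep]
    # Stage 2: cluster a small set of "typical" documents.
    # Assumption: typical = most mass on the retained words.
    typical = np.argsort(-Xr.sum(axis=1))[:n_typical]
    _, centroids = kmeans(Xr[typical], k)
    # Stage 3: refine on all documents, starting from the stage-2 centroids.
    labels, _ = kmeans(Xr, k, centroids=centroids)
    return labels, keep

# Synthetic corpus: 60 documents, 3 topics, 15 topic words followed by 30 noise words.
rng = np.random.default_rng(42)
topics = np.repeat(np.arange(3), 20)
rates = np.full((60, 45), 0.5)            # background rate for every word
for t in range(3):
    rates[topics == t, 5 * t:5 * t + 5] = 10.0
X = rng.poisson(rates)

labels, keep = three_stage_cluster(X, k=3, n_typical=30)
print("words kept:", int(keep.sum()), "of", keep.size)
```

In this sketch the expensive clustering passes in stages 1 and 2 touch only `n_typical` documents, and stages 2 and 3 work on the reduced vocabulary, which is the source of the speed-up the abstract describes. Any comparison with the reported timings would require the authors' actual procedure and data.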

Author information

Authors and Affiliations

School of Computer Science, Wuhan University, Wuhan, Hubei, 430072, P.R.China
Bo Han & Lishan Kang
Center for Information Science and Technology, Temple University, Philadelphia, PA, 19122, U.S.A
Bo Han
School of Computer Science and Technology, Wuhan University of Technology, Wuhan, Hubei, 430070, P.R.China
Huazhu Song

Editor information

Editors and Affiliations

School of Information Technology and Electrical Engineering, The University of Queensland, 4072, Brisbane, Queensland, Australia
Xue Li
The State Key Laboratory for Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, 430072, Wuhan, China
Shuliang Wang
School of ITEE, The Univ of Queensland, St. Lucia, 4072, QLD, Australia
Zhao Yang Dong

About this paper

Cite this paper

Han, B., Kang, L., Song, H. (2005). An Efficient Clustering Approach for Large Document Collections. In: Li, X., Wang, S., Dong, Z.Y. (eds) Advanced Data Mining and Applications. ADMA 2005. Lecture Notes in Computer Science, vol 3584. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11527503_29

DOI: https://doi.org/10.1007/11527503_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27894-8
Online ISBN: 978-3-540-31877-4
eBook Packages: Computer Science, Computer Science (R0)
