An Efficient Clustering Approach for Large Document Collections

  • Conference paper
Advanced Data Mining and Applications (ADMA 2005)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3584)

Abstract

Vast amounts of unstructured text data, such as scientific publications, commercial reports and webpages, need to be categorized quickly into semantic groups to facilitate online information queries. However, state-of-the-art clustering methods suffer from the sheer number of documents and their high-dimensional text features. In this paper, we propose an efficient clustering algorithm for large document collections that performs clustering in three stages: 1) identifying informative topic words with a permutation test, so as to reduce the feature dimension; 2) selecting a small number of the most typical documents to perform an initial clustering; and 3) refining the clustering on all documents. The algorithm was tested on the 20 Newsgroups data set. Experimental results showed that, compared with methods that cluster the corpus directly using all document samples and the full feature set, this approach reduced the time cost by an order of magnitude while slightly improving clustering quality.
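
The three stages named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the test statistic, the definition of a "typical" document, or the clustering model, so everything below is an assumption made for illustration and not the authors' method: the permutation test shuffles word labels across tokens and scores each word by the sum of its squared per-document counts; "typical" documents are those with the largest share of tokens on the selected words; and clustering is a plain spherical k-means with farthest-first seeding. All function names and the synthetic two-topic corpus are hypothetical.

```python
import numpy as np

def permutation_select(counts, n_perm=99, alpha=0.05, rng=None):
    """Stage 1 (assumed statistic): keep words whose counts are more concentrated
    in a few documents than expected if tokens were exchangeable across documents."""
    rng = np.random.default_rng(rng)
    n_docs, n_words = counts.shape
    # Expand the count matrix into one (document id, word id) pair per token.
    doc_ids, word_ids = np.nonzero(counts)
    reps = counts[doc_ids, word_ids]
    doc_tok, word_tok = np.repeat(doc_ids, reps), np.repeat(word_ids, reps)
    observed = (counts.astype(float) ** 2).sum(axis=0)
    exceed = np.zeros(n_words)
    for _ in range(n_perm):
        shuffled = rng.permutation(word_tok)  # breaks the word/document link
        flat = np.bincount(doc_tok * n_words + shuffled, minlength=n_docs * n_words)
        null_stat = (flat.reshape(n_docs, n_words).astype(float) ** 2).sum(axis=0)
        exceed += null_stat >= observed
    p_values = (1.0 + exceed) / (1.0 + n_perm)
    return np.flatnonzero(p_values <= alpha)

def unit_rows(x):
    """Scale every row to unit Euclidean length (all-zero rows stay zero)."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)

def farthest_first(x, k):
    """Deterministic seeding: start at row 0, then add the least similar row."""
    chosen = [0]
    while len(chosen) < k:
        sim = (x @ x[chosen].T).max(axis=1)
        chosen.append(int(np.argmin(sim)))
    return x[chosen].copy()

def spherical_kmeans(x, centroids, n_iter=20):
    """Cosine-similarity k-means on unit-length rows."""
    centroids = centroids.copy()
    for _ in range(n_iter):
        labels = np.argmax(x @ centroids.T, axis=1)
        for k in range(len(centroids)):
            members = x[labels == k]
            if len(members):  # an emptied cluster keeps its old centroid
                centroids[k] = members.mean(axis=0)
        centroids = unit_rows(centroids)
    return np.argmax(x @ centroids.T, axis=1), centroids

def three_stage_cluster(counts, k, n_typical, rng=None):
    words = permutation_select(counts, rng=rng)            # stage 1: fewer features
    reduced = unit_rows(counts[:, words].astype(float))
    # Stage 2 (assumed notion of "typical"): largest share of tokens on selected words.
    share = counts[:, words].sum(axis=1) / np.maximum(counts.sum(axis=1), 1)
    typical = np.argsort(-share)[:n_typical]
    seeds = farthest_first(reduced[typical], k)
    _, centroids = spherical_kmeans(reduced[typical], seeds)
    labels, _ = spherical_kmeans(reduced, centroids)       # stage 3: refine on all docs
    return labels, words

# Synthetic corpus (hypothetical): words 0-3 belong to topic A, words 4-7 to
# topic B, words 8-11 are background words shared by both topics.
rng = np.random.default_rng(0)
topic_a = [0.15] * 4 + [0.0] * 4 + [0.1] * 4
topic_b = [0.0] * 4 + [0.15] * 4 + [0.1] * 4
counts = np.vstack([rng.multinomial(50, topic_a, size=20),
                    rng.multinomial(50, topic_b, size=20)])
labels, words = three_stage_cluster(counts, k=2, n_typical=12, rng=1)
print("selected words:", words)
print("cluster labels:", labels)
```

In this sketch the expensive initial clustering runs only on the `n_typical` documents and the reduced vocabulary, and the pass over the full collection starts from centroids that are already close to their final position. That ordering is one plausible reading of where the reported time saving comes from; the paper itself should be consulted for the actual procedure.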

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Han, B., Kang, L., Song, H. (2005). An Efficient Clustering Approach for Large Document Collections. In: Li, X., Wang, S., Dong, Z.Y. (eds) Advanced Data Mining and Applications. ADMA 2005. Lecture Notes in Computer Science, vol 3584. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11527503_29

  • DOI: https://doi.org/10.1007/11527503_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-27894-8

  • Online ISBN: 978-3-540-31877-4

  • eBook Packages: Computer Science, Computer Science (R0)
