A Fast and Effective Partitioning Algorithm for Document Clustering

  • Conference paper
Data Engineering and Management (ICDEM 2010)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 6411)

Abstract

Fast, high-quality document clustering is one of the most important tasks in the modern information era. Given the huge amount of available data and the aim of creating better-quality clusters, scores of algorithms with different quality-complexity trade-offs have been proposed. Some of these algorithms attempt to minimize the computational overhead in terms of criterion functions defined over the whole clustering solution. In this paper, we propose a novel document clustering algorithm that uses a graph-based criterion function. Our algorithm is a partitioning algorithm. Most commonly used partitioning clustering algorithms suffer from the drawback of becoming trapped in locally optimal solutions, whereas the algorithm proposed in this paper usually reaches the globally optimal solution. Its performance improves as the number of clusters increases. We carried out experiments comparing our algorithm with two well-known document clustering algorithms, k-means and k-means++. The results confirm the superiority of our algorithm.
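
The abstract names k-means and k-means++ as the baselines but does not spell out the steps of the proposed graph-based algorithm, so the following is only a minimal sketch of the k-means++ baseline, not the authors' method. It assumes documents are already represented as small term-frequency vectors; the function names (`normalize`, `kmeanspp_seeds`, `kmeans`) and the toy data are hypothetical and chosen for illustration.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length (common for document vectors)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng):
    """k-means++ seeding: first center uniformly at random, each later
    center with probability proportional to its squared distance to the
    nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        total = sum(d2)
        if total == 0:  # every point coincides with an existing center
            centers.append(rng.choice(points))
            continue
        r = rng.random() * total
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if w > 0 and acc >= r:
                centers.append(p)
                break
    return centers


def kmeans(points, k, seed=0, iters=20):
    """Plain Lloyd iterations started from k-means++ seeds."""
    rng = random.Random(seed)
    centers = kmeanspp_seeds(points, k, rng)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update step: each center becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy term-frequency vectors: three documents about one topic, three about another.
docs = [normalize(v) for v in (
    [3, 1, 0, 0], [2, 2, 0, 0], [4, 1, 0, 0],
    [0, 0, 1, 3], [0, 0, 2, 2], [0, 0, 1, 4],
)]
labels, centers = kmeans(docs, k=2)
print(labels)
```

Plain k-means differs only in the seeding step, where all initial centers are drawn uniformly at random; this sensitivity to initialization is one source of the local-optimum behaviour the abstract refers to.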

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kumar, R., Ranjan, A., Dhar, J. (2012). A Fast and Effective Partitioning Algorithm for Document Clustering. In: Kannan, R., Andres, F. (eds) Data Engineering and Management. ICDEM 2010. Lecture Notes in Computer Science, vol 6411. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27872-3_40

  • DOI: https://doi.org/10.1007/978-3-642-27872-3_40

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-27871-6

  • Online ISBN: 978-3-642-27872-3

  • eBook Packages: Computer Science, Computer Science (R0)
