Comparison of Selected Methods for Document Clustering

  • Conference paper

Part of the book series: Advances in Intelligent and Soft Computing (AINSC, volume 86)

Abstract

Seventeen cluster analysis techniques proposed for document clustering are compared in terms of internal and external clustering quality measures and computing time demands. Fifteen of them are combinations of three basic methods (direct, repeated bisection, and agglomerative) with five clustering criterion functions for solution assessment (two intra-cluster, one inter-cluster, and two complex ones), all implemented in the CLUTO software package. In addition, for the agglomerative method, single linkage and complete linkage were applied as criterion functions, giving the remaining two techniques. The methods were compared on the 20 Newsgroups collection, with e-mail messages represented as binary vectors. The document clustering experiments showed that, in terms of entropy and purity, the direct method provides the best results. In terms of computing time, the repeated bisection (divisive) method was the fastest.
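
The abstract names entropy and purity as the external quality measures but does not define them. The sketch below uses the definitions common in the document clustering literature (per-cluster values weighted by cluster size, with entropy normalized by the logarithm of the number of classes); whether the paper uses exactly this normalization is an assumption. The function name `purity_and_entropy` and its arguments are hypothetical and are not part of CLUTO.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(cluster_ids, class_labels):
    """Return (purity, entropy) of a clustering against known class labels.

    Assumed definitions (not confirmed by the abstract):
      purity  = sum over clusters of (largest class count in cluster) / n
      entropy = sum over clusters of (n_r / n) * E_r, where
                E_r = -(1 / log q) * sum_i p_i * log p_i,
                p_i = share of class i in cluster r, q = number of classes.
    Higher purity and lower entropy indicate a better clustering.
    """
    if len(cluster_ids) != len(class_labels) or len(cluster_ids) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(cluster_ids)
    q = len(set(class_labels))

    # Count how many documents of each class fall into each cluster.
    per_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster[cluster][label] += 1

    purity = 0.0
    entropy = 0.0
    for counts in per_cluster.values():
        n_r = sum(counts.values())
        purity += max(counts.values()) / n
        if q > 1:  # with a single class, entropy is zero by convention
            e_r = -sum((k / n_r) * math.log(k / n_r) for k in counts.values())
            entropy += (n_r / n) * (e_r / math.log(q))
    return purity, entropy


if __name__ == "__main__":
    # Two clusters that match the two classes exactly.
    print(purity_and_entropy([0, 0, 1, 1], ["a", "a", "b", "b"]))
    # One cluster mixing both classes evenly.
    print(purity_and_entropy([0, 0, 0, 0], ["a", "a", "b", "b"]))
```

Under these definitions a clustering that reproduces the newsgroup labels exactly has purity 1 and entropy 0, while a single cluster that mixes all classes evenly has entropy 1.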

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sevcik, R., Rezankova, H., Husek, D. (2011). Comparison of Selected Methods for Document Clustering. In: Mugellini, E., Szczepaniak, P.S., Pettenati, M.C., Sokhn, M. (eds) Advances in Intelligent Web Mastering – 3. Advances in Intelligent and Soft Computing, vol 86. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-18029-3_11

  • DOI: https://doi.org/10.1007/978-3-642-18029-3_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-18028-6

  • Online ISBN: 978-3-642-18029-3

  • eBook Packages: Engineering, Engineering (R0)
