Abstract
We compare 17 cluster analysis techniques proposed for document clustering in terms of internal and external clustering quality measures and computing time. The techniques combine three basic methods (direct, repeated bisection, and agglomerative) with five clustering criterion functions for solution assessment (two intra-cluster, one inter-cluster, and two complex ones), all implemented in the CLUTO software package. For the agglomerative method, single linkage and complete linkage were additionally applied as criterion functions, which brings the total to 17 techniques. The methods were compared on the 20 Newsgroups collection, using a binary vector representation of e-mail messages. The experiments showed that, in terms of entropy and purity, the direct method provides the best results. In terms of computing time, the repeated bisection (divisive) method was the fastest.
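The external quality measures named above, entropy and purity, can be illustrated with a minimal sketch. It assumes the common definitions in which purity is the fraction of documents that belong to the majority class of their cluster, and entropy is the size-weighted class entropy of each cluster normalized by the logarithm of the number of classes. The function name `purity_and_entropy` and its arguments are hypothetical and are not taken from the paper or from CLUTO.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(clusters, classes):
    """Return (purity, entropy) of a clustering against reference classes.

    clusters: cluster id assigned to each document
    classes:  reference class label of each document (same order)
    Assumed definitions: higher purity is better, lower entropy is better.
    """
    n = len(clusters)
    num_classes = len(set(classes))

    # Group the reference labels by the cluster they were assigned to.
    members = defaultdict(list)
    for cluster_id, label in zip(clusters, classes):
        members[cluster_id].append(label)

    purity = 0.0
    entropy = 0.0
    for labels in members.values():
        counts = Counter(labels)
        size = len(labels)
        # Purity: share of all documents that sit in their cluster's majority class.
        purity += max(counts.values()) / n
        # Entropy: class entropy of this cluster, normalized to [0, 1]
        # and weighted by cluster size. Undefined for a single class, so skipped.
        if num_classes > 1:
            h = -sum((m / size) * math.log(m / size) for m in counts.values())
            entropy += (size / n) * h / math.log(num_classes)
    return purity, entropy
```

A clustering that reproduces the reference classes exactly gets purity 1 and entropy 0 under these definitions, while a single cluster holding two equally sized classes gets purity 0.5 and entropy 1.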
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sevcik, R., Rezankova, H., Husek, D. (2011). Comparison of Selected Methods for Document Clustering. In: Mugellini, E., Szczepaniak, P.S., Pettenati, M.C., Sokhn, M. (eds) Advances in Intelligent Web Mastering – 3. Advances in Intelligent and Soft Computing, vol 86. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-18029-3_11
DOI: https://doi.org/10.1007/978-3-642-18029-3_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-18028-6
Online ISBN: 978-3-642-18029-3
eBook Packages: Engineering, Engineering (R0)