Abstract
Fast, high-quality document clustering is one of the most important tasks in the modern information era. Given the huge amount of available data and the aim of creating better-quality clusters, scores of algorithms with different quality-complexity trade-offs have been proposed. Some of these algorithms attempt to minimize the computational overhead in terms of criterion functions defined over the whole clustering solution. In this paper, we propose a novel algorithm for document clustering that uses a graph-based criterion function. Our algorithm is partitional in nature. Most commonly used partitional clustering algorithms suffer from the drawback of getting trapped in locally optimal solutions. The algorithm proposed in this paper, however, usually leads to the globally optimal solution, and its performance improves as the number of clusters increases. We have carried out experiments comparing our algorithm with two well-known document clustering algorithms, k-means and k-means++. The results obtained confirm the superiority of our algorithm.
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kumar, R., Ranjan, A., Dhar, J. (2012). A Fast and Effective Partitioning Algorithm for Document Clustering. In: Kannan, R., Andres, F. (eds) Data Engineering and Management. ICDEM 2010. Lecture Notes in Computer Science, vol 6411. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27872-3_40
DOI: https://doi.org/10.1007/978-3-642-27872-3_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-27871-6
Online ISBN: 978-3-642-27872-3
eBook Packages: Computer Science, Computer Science (R0)