A Fast and Effective Partitioning Algorithm for Document Clustering

Kumar, Rajeev; Ranjan, Alok; Dhar, Joydip

doi:10.1007/978-3-642-27872-3_40

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 6411)

Included in the following conference series:

International Conference on Data Engineering and Management

Abstract

Fast, high-quality document clustering is one of the most important tasks in the modern information era. Given the huge amount of available data and the aim of creating better-quality clusters, scores of algorithms with different quality-complexity trade-offs have been proposed. Some of these algorithms attempt to minimize the computational overhead in terms of certain criterion functions defined over the whole clustering solution. In this paper, we propose a novel partitioning algorithm for document clustering that uses a graph-based criterion function. Most commonly used partitioning clustering algorithms suffer from the drawback of becoming trapped in locally optimal solutions. The algorithm proposed in this paper, however, usually leads to the globally optimal solution, and its performance improves as the number of clusters increases. We carried out detailed experiments comparing our algorithm with two well-known document clustering algorithms, k-means and k-means++. The results confirm the superiority of our algorithm.
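For orientation, the two baselines named in the abstract can be sketched as follows. This is a minimal illustration of standard k-means with k-means++ seeding, not the graph-based algorithm proposed in the paper, whose details are not given here. The function names (`sq_dist`, `kmeanspp_seeds`, `kmeans`, `sse`) are hypothetical, and squared Euclidean distance on small dense vectors is an assumption made for brevity; document clustering typically works on sparse term vectors.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def kmeanspp_seeds(points, k, rng):
    """k-means++ seeding: the first center is uniform at random; each later
    center is drawn with probability proportional to its squared distance
    from the nearest center chosen so far."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))


def kmeans(points, centers, iters=50):
    """Lloyd iterations: assign each point to its nearest center, then move
    each center to the mean of its cluster. This converges to a local
    optimum that depends on the initial centers."""
    centers = [tuple(c) for c in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        new_centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)  # an empty cluster keeps its center
        ]
        if new_centers == centers:
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    return centers, labels


if __name__ == "__main__":
    # Toy data: three small groups of 2-D points (illustrative only).
    data = [(cx + dx, cy + dy)
            for cx, cy in [(0, 0), (10, 10), (0, 10)]
            for dx, dy in [(0, 0), (1, 0), (0, 1), (1, 1)]]
    rng = random.Random(0)
    seeds = kmeanspp_seeds(data, 3, rng)
    final_centers, labels = kmeans(data, seeds)
    print("SSE at seeds:", sse(data, seeds))
    print("SSE after k-means:", sse(data, final_centers))
```

Because Lloyd iterations only ever lower the sum of squared errors from the starting centers, the final quality depends on the seeding; this is the local-optimum behaviour that the abstract attributes to common partitioning algorithms.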

Author information

Authors and Affiliations

Department of Information Technology, ABV - Indian Institute of Information Technology and Management, Gwalior, India
Rajeev Kumar & Alok Ranjan
Department of Applied Sciences, ABV - Indian Institute of Information Technology and Management, Gwalior, India
Joydip Dhar

Editor information

Editors and Affiliations

Department of Computer Science, Bishop Heber College (Autonomous), 620017, Tiruchirappalli, India
Rajkumar Kannan
National Institute of Informatics (NII), 101-8430, Tokyo, Japan
Frederic Andres

About this paper

Cite this paper

Kumar, R., Ranjan, A., Dhar, J. (2012). A Fast and Effective Partitioning Algorithm for Document Clustering. In: Kannan, R., Andres, F. (eds) Data Engineering and Management. ICDEM 2010. Lecture Notes in Computer Science, vol 6411. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27872-3_40

DOI: https://doi.org/10.1007/978-3-642-27872-3_40
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-27871-6
Online ISBN: 978-3-642-27872-3
eBook Packages: Computer Science, Computer Science (R0)
