Harmony K-means algorithm for document clustering

Mahdavi, Mehrdad; Abolhassani, Hassan

doi:10.1007/s10618-008-0123-0

Harmony K-means algorithm for document clustering

Published: 11 December 2008

Volume 18, pages 370–391, (2009)
Cite this article

Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Mehrdad Mahdavi¹ &
Hassan Abolhassani^1,2

969 Accesses
88 Citations
Explore all metrics

Abstract

Fast and high quality document clustering is a crucial task in organizing information, search engine results, enhancing web crawling, and information retrieval or filtering. Recent studies have shown that the most commonly used partition-based clustering algorithm, the K-means algorithm, is more suitable for large datasets. However, the K-means algorithm can generate a local optimal solution. In this paper we propose a novel Harmony K-means Algorithm (HKA) that deals with document clustering based on Harmony Search (HS) optimization method. It is proved by means of finite Markov chain theory that the HKA converges to the global optimum. To demonstrate the effectiveness and speed of HKA, we have applied HKA algorithms on some standard datasets. We also compare the HKA with other meta-heuristic and model-based document clustering approaches. Experimental results reveal that the HKA algorithm converges to the best known optimum faster than other methods and the quality of clusters are comparable.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

References

Aggarwal CC, Gates SC, Yu PS (1999) On the merits of building categorization systems by supervised clustering. In: Proceedings of the fifth ACM SIGKDD Int’l conference on knowledge discovery and data mining, pp 352–356
Anderberg MR (1973) Cluster analysis for applications. Academic Press Inc, New York, NY
MATH Google Scholar
Boley D, Gini M, Gross R, Han EH, Hastings K, Karypis G et al (1999) Partitioning-based clustering for web document categorization. Decis Support Syst 27(3): 329–341. doi:10.1016/S0167-9236(99)00055-X
Article Google Scholar
Cios K, Pedrycs W, Swiniarski R (1998) Data mining methods for knowledge discovery. Kluwer Academic Publishers
Coello CAC (2000) Constraint-handling using an evolutionary multi objective optimization technique. Civ Eng Environ Syst 17: 319–346. doi:10.1080/02630250008970288
Article Google Scholar
Cui X, Potok TE, Palathingal P (2005) Document clustering using particle swarm optimization. In: IEEE swarm intelligence symposium, pp 185–191
Cutting DR, Pedersen JO, Karger DR, Tukey JW (1992) Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the ACM SIGIR. Copenhagen, pp 318–329
Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Knowledge discovery and data mining, pp 269–274
Everitt B (1980) Cluster analysis, 2nd edn. Halsted Press, New York
MATH Google Scholar
Geem ZW, Kim JH, Loganathan GV (2002) Harmony search optimization: application to pipe network design. Int J Model Simul 22: 125–133
Google Scholar
Geem ZW, Tseng C, Park Y (2005) Harmony search for generalized orienteering problem: best touring in China, Springer. Lect Notes Comput Sci 3412: 741–750
Article Google Scholar
Grira N, Crucianu M, Boujemaa N (2005) Unsupervised and semi-supervised clustering: a brief survey. In: 7th ACM SIGMM international workshop on multimedia information retrieval, pp 9–16
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 264–323. CSUR. doi:10.1145/331499.331504
Jones G, Robertson AM, Santimetvirul C, Willett P (1995) Non-hierarchic document clustering using a genetic algorithm. Inform Res 1(1)
Kennedy J, Eberhart RC, Shi Y (2001) Swarm intelligence. Morgan Kaufmann, New York
Google Scholar
Klein RW, Dubes RC (1989) Experiments in projection and clustering by simulated annealing. Pattern Recognit 22: 213–220. doi:10.1016/0031-3203(89)90067-8
Article MATH Google Scholar
Labroche N, Monmarche’ N, Venturini G (2003) AntClust: ant clustering and web usage mining. In: Genetic and evolutionary computation conference, pp 25–36
Larsen B, Aone C (1999) Fast and effective text mining using linear-time document clustering. In: Proceedings of the fifth ACM SIGKDD Int’l conference on knowledge discovery and data mining, pp 16–22
Lee KS, Geem ZW (2004) A new meta-heuristic algorithm for continues engineering optimization: harmony search theory and practice. Comput Method Appl Mech Eng 194: 3902–3933. doi:10.1016/j.cma.2004.09.007
Article Google Scholar
Mahdavi M, Fesanghary M, Damangir E (2007) An improved harmony search algorithm for solving optimization problems. Appl Math Comput 188: 1567–1579
Article MATH MathSciNet Google Scholar
McQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth berkeley symposium on mathematical statistics and probability, pp 281–297
Merwe VD, Engelbrecht AP (2003) Data clustering using particle swarm optimization. In: Proceedings of IEEE congress on evolutionary computation (CEC 2003), Australia, pp 215–220
Omran M, Salman A, Engelbrecht AP (2002) Image classification using particle swarm optimization. In: Proceedings of the 4th Asia-Pacific conference on simulated evolution and learning (SEAL 2002), Singapore, pp 370–374
Quinlan RJ (1993) C4.5: programs for machine learning. Morgan Kaufmann
Raghavan VV, Birchand K (1979) A clustering strategy based on a formalism of the reproductive process in a natural system. In: Proceedings of the second international conference on information storage and retrieval, pp 10–22
Rudolph G (1994) Convergence analysis of canonical genetic algorithms. IEEE Trans Neural Netw 5(1): 96–101. doi:10.1109/72.265964
Article Google Scholar
Salton G (1989) Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley
Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manage 24: 513–523. doi:10.1016/0306-4573(88)90021-0
Article Google Scholar
Selim SZ, Ismail MA (1984) K-means type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell 6: 81–87
Article MATH Google Scholar
Steinbach M, Karypis G, Kumar V (2000a) A comparison of document clustering techniques. KDD’2000. Technical report of University of Minnesota
Steinbach M, Karypis G, Kumar V (2000b) A comparison of document clustering techniques. In: KDD workshop on text mining
Stumme G, Hotho A, Berendt B (2001) Semantic Web Mining. Freiburg, September 3rd, 12th Europ. Conf. on Machine Learning (ECML’01)/5th Europ. Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD’01)
Stumme G, Hotho A, Berendt B (2006) Semantic Web Mining State of the art and future directions. J Web Semantics: Sci Serv Agents World Wide Web 124–143
TREC (1999) Text REtrieval conference. http://trec.nist.gov
TRECQ (1999) Text REtrieval conference relevance judgments. http://trec.nist.gov/data/qrels_eng/index.html
Zhao Y, Karypis G (2004) Empirical and theoretical comparisons of selected riterion functions for document clustering. Mach Learn 55(3): 311–331. doi:10.1023/B:MACH.0000027785.44527.d6
Article MATH Google Scholar
Zhao Y, Karypis G (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov 10: 141–168. doi:10.1007/s10618-005-0361-3
Article MathSciNet Google Scholar
Zhong S (2006) Semi-supervised model-based document clustering: a comparative study. Mach Learn 65(1). doi:10.1007/s10994-006-6540-7
Zhong S, Ghosh J (2003) A unified framework for model-based lustering. J Mach Learn Res 4: 1001–1037. JMLR doi:10.1162/jmlr.2003.4.6.1001
Article MathSciNet Google Scholar
Zhong S, Ghosh J (2005) Generative model-based clustering of documents: a comparative study. Knowl Inf Syst 8: 374–384. KAIS doi:10.1007/s10115-004-0194-1
Article Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
Mehrdad Mahdavi & Hassan Abolhassani
School of Computer Science, Institute for Studies in Theoretical Physics and Mathematics (IPM), Tehran, Iran
Hassan Abolhassani

Authors

Mehrdad Mahdavi
View author publications
You can also search for this author in PubMed Google Scholar
Hassan Abolhassani
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Mehrdad Mahdavi.

Additional information

Responsible editor: Charu Aggarwal.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Mahdavi, M., Abolhassani, H. Harmony K-means algorithm for document clustering. Data Min Knowl Disc 18, 370–391 (2009). https://doi.org/10.1007/s10618-008-0123-0

Download citation

Received: 21 October 2007
Accepted: 18 November 2008
Published: 11 December 2008
Issue Date: June 2009
DOI: https://doi.org/10.1007/s10618-008-0123-0

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Harmony K-means algorithm for document clustering

Abstract

Access this article

Similar content being viewed by others

A novel hybrid multi-verse optimizer with K-means for text documents clustering

A Hybrid Algorithm for Document Clustering Using Optimized Kernel Matrix and Unsupervised Constraints

Algorithm for Clustering of Web Search Results from a Hyper-heuristic Approach

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Harmony K-means algorithm for document clustering

Abstract

Access this article

Similar content being viewed by others

A novel hybrid multi-verse optimizer with K-means for text documents clustering

A Hybrid Algorithm for Document Clustering Using Optimized Kernel Matrix and Unsupervised Constraints

Algorithm for Clustering of Web Search Results from a Hyper-heuristic Approach

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation