Skip to main content
Log in

Harmony K-means algorithm for document clustering

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

Fast and high quality document clustering is a crucial task in organizing information, search engine results, enhancing web crawling, and information retrieval or filtering. Recent studies have shown that the most commonly used partition-based clustering algorithm, the K-means algorithm, is more suitable for large datasets. However, the K-means algorithm can generate a local optimal solution. In this paper we propose a novel Harmony K-means Algorithm (HKA) that deals with document clustering based on Harmony Search (HS) optimization method. It is proved by means of finite Markov chain theory that the HKA converges to the global optimum. To demonstrate the effectiveness and speed of HKA, we have applied HKA algorithms on some standard datasets. We also compare the HKA with other meta-heuristic and model-based document clustering approaches. Experimental results reveal that the HKA algorithm converges to the best known optimum faster than other methods and the quality of clusters are comparable.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  • Aggarwal CC, Gates SC, Yu PS (1999) On the merits of building categorization systems by supervised clustering. In: Proceedings of the fifth ACM SIGKDD Int’l conference on knowledge discovery and data mining, pp 352–356

  • Anderberg MR (1973) Cluster analysis for applications. Academic Press Inc, New York, NY

    MATH  Google Scholar 

  • Boley D, Gini M, Gross R, Han EH, Hastings K, Karypis G et al (1999) Partitioning-based clustering for web document categorization. Decis Support Syst 27(3): 329–341. doi:10.1016/S0167-9236(99)00055-X

    Article  Google Scholar 

  • Cios K, Pedrycs W, Swiniarski R (1998) Data mining methods for knowledge discovery. Kluwer Academic Publishers

  • Coello CAC (2000) Constraint-handling using an evolutionary multi objective optimization technique. Civ Eng Environ Syst 17: 319–346. doi:10.1080/02630250008970288

    Article  Google Scholar 

  • Cui X, Potok TE, Palathingal P (2005) Document clustering using particle swarm optimization. In: IEEE swarm intelligence symposium, pp 185–191

  • Cutting DR, Pedersen JO, Karger DR, Tukey JW (1992) Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the ACM SIGIR. Copenhagen, pp 318–329

  • Dhillon IS (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In: Knowledge discovery and data mining, pp 269–274

  • Everitt B (1980) Cluster analysis, 2nd edn. Halsted Press, New York

    MATH  Google Scholar 

  • Geem ZW, Kim JH, Loganathan GV (2002) Harmony search optimization: application to pipe network design. Int J Model Simul 22: 125–133

    Google Scholar 

  • Geem ZW, Tseng C, Park Y (2005) Harmony search for generalized orienteering problem: best touring in China, Springer. Lect Notes Comput Sci 3412: 741–750

    Article  Google Scholar 

  • Grira N, Crucianu M, Boujemaa N (2005) Unsupervised and semi-supervised clustering: a brief survey. In: 7th ACM SIGMM international workshop on multimedia information retrieval, pp 9–16

  • Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 264–323. CSUR. doi:10.1145/331499.331504

  • Jones G, Robertson AM, Santimetvirul C, Willett P (1995) Non-hierarchic document clustering using a genetic algorithm. Inform Res 1(1)

  • Kennedy J, Eberhart RC, Shi Y (2001) Swarm intelligence. Morgan Kaufmann, New York

    Google Scholar 

  • Klein RW, Dubes RC (1989) Experiments in projection and clustering by simulated annealing. Pattern Recognit 22: 213–220. doi:10.1016/0031-3203(89)90067-8

    Article  MATH  Google Scholar 

  • Labroche N, Monmarche’ N, Venturini G (2003) AntClust: ant clustering and web usage mining. In: Genetic and evolutionary computation conference, pp 25–36

  • Larsen B, Aone C (1999) Fast and effective text mining using linear-time document clustering. In: Proceedings of the fifth ACM SIGKDD Int’l conference on knowledge discovery and data mining, pp 16–22

  • Lee KS, Geem ZW (2004) A new meta-heuristic algorithm for continues engineering optimization: harmony search theory and practice. Comput Method Appl Mech Eng 194: 3902–3933. doi:10.1016/j.cma.2004.09.007

    Article  Google Scholar 

  • Mahdavi M, Fesanghary M, Damangir E (2007) An improved harmony search algorithm for solving optimization problems. Appl Math Comput 188: 1567–1579

    Article  MATH  MathSciNet  Google Scholar 

  • McQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth berkeley symposium on mathematical statistics and probability, pp 281–297

  • Merwe VD, Engelbrecht AP (2003) Data clustering using particle swarm optimization. In: Proceedings of IEEE congress on evolutionary computation (CEC 2003), Australia, pp 215–220

  • Omran M, Salman A, Engelbrecht AP (2002) Image classification using particle swarm optimization. In: Proceedings of the 4th Asia-Pacific conference on simulated evolution and learning (SEAL 2002), Singapore, pp 370–374

  • Quinlan RJ (1993) C4.5: programs for machine learning. Morgan Kaufmann

  • Raghavan VV, Birchand K (1979) A clustering strategy based on a formalism of the reproductive process in a natural system. In: Proceedings of the second international conference on information storage and retrieval, pp 10–22

  • Rudolph G (1994) Convergence analysis of canonical genetic algorithms. IEEE Trans Neural Netw 5(1): 96–101. doi:10.1109/72.265964

    Article  Google Scholar 

  • Salton G (1989) Automatic text processing: the transformation, analysis, and retrieval of information by computer. Addison-Wesley

  • Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Process Manage 24: 513–523. doi:10.1016/0306-4573(88)90021-0

    Article  Google Scholar 

  • Selim SZ, Ismail MA (1984) K-means type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell 6: 81–87

    Article  MATH  Google Scholar 

  • Steinbach M, Karypis G, Kumar V (2000a) A comparison of document clustering techniques. KDD’2000. Technical report of University of Minnesota

  • Steinbach M, Karypis G, Kumar V (2000b) A comparison of document clustering techniques. In: KDD workshop on text mining

  • Stumme G, Hotho A, Berendt B (2001) Semantic Web Mining. Freiburg, September 3rd, 12th Europ. Conf. on Machine Learning (ECML’01)/5th Europ. Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD’01)

  • Stumme G, Hotho A, Berendt B (2006) Semantic Web Mining State of the art and future directions. J Web Semantics: Sci Serv Agents World Wide Web 124–143

  • TREC (1999) Text REtrieval conference. http://trec.nist.gov

  • TRECQ (1999) Text REtrieval conference relevance judgments. http://trec.nist.gov/data/qrels_eng/index.html

  • Zhao Y, Karypis G (2004) Empirical and theoretical comparisons of selected riterion functions for document clustering. Mach Learn 55(3): 311–331. doi:10.1023/B:MACH.0000027785.44527.d6

    Article  MATH  Google Scholar 

  • Zhao Y, Karypis G (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov 10: 141–168. doi:10.1007/s10618-005-0361-3

    Article  MathSciNet  Google Scholar 

  • Zhong S (2006) Semi-supervised model-based document clustering: a comparative study. Mach Learn 65(1). doi:10.1007/s10994-006-6540-7

  • Zhong S, Ghosh J (2003) A unified framework for model-based lustering. J Mach Learn Res 4: 1001–1037. JMLR doi:10.1162/jmlr.2003.4.6.1001

    Article  MathSciNet  Google Scholar 

  • Zhong S, Ghosh J (2005) Generative model-based clustering of documents: a comparative study. Knowl Inf Syst 8: 374–384. KAIS doi:10.1007/s10115-004-0194-1

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Mehrdad Mahdavi.

Additional information

Responsible editor: Charu Aggarwal.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Mahdavi, M., Abolhassani, H. Harmony K-means algorithm for document clustering. Data Min Knowl Disc 18, 370–391 (2009). https://doi.org/10.1007/s10618-008-0123-0

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-008-0123-0

Keywords

Navigation