Skip to main content

Advertisement

Log in

Parallel clustering of high dimensional data by integrating multi-objective genetic algorithm with divide and conquer

  • Published:
Applied Intelligence Aims and scope Submit manuscript

Abstract

This paper applies divide and conquer approach in an iterative way to handle the clustering process. The target is a parallelized effective and efficient approach that produces the intended clustering result. We achieve scalability by first partitioning a large dataset into subsets of manageable sizes based on the specifications of the machine to be used in the clustering process; then cluster the partitions separately in parallel. The centroid of each obtained cluster is treated like the root of a tree with instances in its cluster as leaves. The partitioning and clustering process is iteratively applied on the centroids with the trees growing up until we get the final clustering; the outcome is a forest with one tree per cluster. Finally, a conquer process is performed to get the actual intended clustering, where each instance (leaf node) belongs to the final cluster represented by the root of its tree. We use multi-objective genetic algorithm combined with validity indices to decide on the number of classes. This approach fits well for interactive online clustering. It facilitates for incremental clustering because chunks of instances are clustered as stand alone sets, and then the results are merged with existing clusters. This is attractive and feasible because we consider the clustering of only centroids after the first clustering stage. The reported test results demonstrate the applicability and effectiveness of the proposed approach.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Similar content being viewed by others

References

  1. Aggarwal CC, Yu PS (2000) Finding generalized projected clusters in high dimensional spaces. In: Proceedings of ACM SIGMOD international conference on management of data, pp 70–81

  2. Aggarwal CC, Procopiuc CM, Wolf JL, Yu PS, Park JS (1999) Fast algorithms for projected clustering. In: Proceedings ACM SIGMOD international conference on management of data, pp 61–72

  3. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of ACM SIGMOD international conference on management of data, pp 94–105

  4. Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the international conference on very large data bases, pp 487–499

  5. Andersson J (2000) A survey of multiobjective optimization in engineering design. Technical report LiTH-IKP-R-1097, Department of Mechanical Engineering, Linkping University, Linkping, Sweden

  6. Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) Optics: ordering points to identify the clustering structure. In: Proceedings of ACM SIGMOD conference on management of data, pp 49–60

  7. Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803–821

    Article  MATH  MathSciNet  Google Scholar 

  8. Ben-Dor A, Shamir R, Yahkini Z (1999) Clustering gene expression patterns. J Comput Biol 6(3):281–297

    Article  Google Scholar 

  9. Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York

    MATH  Google Scholar 

  10. Bezdek JC, Boggavarapu S, Hall LO, Bensaid A (1994) Genetic algorithm guided clustering. In: Proceedings of the international conference on evolutionary computation, pp 34–39

  11. Bezdek JC, Hathaway RJ (1994) Optimization of fuzzy clustering criteria using genetic algorithms. In: Proceedings of the international conference on evolutionary computation, pp 589–594

  12. Cheng C-H, Fu AW, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining, pp 84–93

  13. Cole RM (1998) Clustering with genetic algorithms. Master’s thesis, Nedlands 6907, Australia

  14. Deb K, Agrawal S, Pratab A, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 6(2):182–197

    Article  Google Scholar 

  15. Demiriz A, Bennett K, Embrechts M (1999) Semi-supervised clustering using genetic algorithms. Technical report 9901, Rensselaer Polytechnic Institute, Troy, NY

  16. Dimitriadou E, Dolnicar S, Weingessel A (2002) An examination of indexes for determining the number of clusters in binary data sets. Psychometrika 67(1):137–160

    Article  MathSciNet  Google Scholar 

  17. Fayyad U, Piatetsky-Shapiro G, Smyth P (1996) The KDD process for extracting useful knowledge from volumes of data. Commun ACM 39(11):27–34

    Article  Google Scholar 

  18. Fonseca CM, Flemming PJ (1995) An overview of evolutionary algorithms in multiobjective optimization. Evol Comput 3(1):1–16

    Article  Google Scholar 

  19. Fowlkes E, Mallows C (1983) A method for comparing two hierarchical clusterings. J Am Stat Assoc 78:553–569

    Article  MATH  Google Scholar 

  20. Friedman J, Meulman J (2004) Clustering objects on subsets of attributes. J R Stat Soc Ser B 4:815–849

    Article  MathSciNet  Google Scholar 

  21. Guha S, Rastogi R, Shim K (1998) Cure: an efficient clustering algorithm for large databases. In: Proceedings of ACM SIGMOD conference on management of data, pp 73–84

  22. Guha S, Rastogi R, Shim K (1999) Rock: a robust clustering algorithm for categorical attributes. In: Proceedings of IEEE international conference on data engineering, pp 512–521

  23. Hall LO, Özyurt IB, Bezdek JC (1999) Clustering with a genetically optimized approach. IEEE Trans Evol Comput 3(2):103–112

    Article  Google Scholar 

  24. Hartigan J (1975) Clustering algorithms. Wiley, New York

    MATH  Google Scholar 

  25. Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 58–65

  26. Huang JZ, Ng MK, Rong H, Li Z (2005) Automated variable weighting in k-means type clustering. IEEE Trans Pattern Anal Mach Intell 27(5):657–668

    Article  Google Scholar 

  27. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 316(3):264–323

    Article  Google Scholar 

  28. Kaski S, Nikkilä J, Kohonen T (1998) Methods for interpreting a self-organized map in data analysis. In: Proceedings of the European symposium on artificial neural networks, Brussels, Belgium, pp 185–190

  29. Kaufman L, Rouseeauw PL (1990) Finding group in data: an introduction to cluster analysis. Wiley, New York

    Google Scholar 

  30. Likas A, Vlassis N, Verbeek J (2003) The global k-means clustering algorithm. Pattern Recognit 36(2):451–461

    Article  Google Scholar 

  31. Liu Y, Özyer T, Alhajj R, Barker K (2004) Validity analysis of clustering obtained using multi-objective genetic algorithm. In: Proceedings of the international conference on intelligent systems design and applications. Lecture notes in computer science. Springer, Berlin

    Google Scholar 

  32. Liu Y, Özyer T, Alhajj R, Barker K (2005) Cluster validity analysis of alternative solutions from multi-objective optimization. In: Proceedings of SIAM international conference on data mining

  33. Liu Y, Özyer T, Alhajj R, Barker K (2005) Integrating multi-objective genetic algorithm and validity analysis for locating and ranking alternative clustering. Eur J Inform 29(1):33–40

    MATH  Google Scholar 

  34. Lu Y, Lu S, Fotouhi F, Deng Y, Brown S (2004) FGKA: a fast genetic k-means clustering algorithm. In: Proceedings of ACM symposium on applied computing, pp 162–163

  35. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. University of California Press, Berkeley, pp 281–297

    Google Scholar 

  36. Nagesh H, Goil S, Choudhary A (1999) Mafia: efficient and scalable subspace clustering for very large data sets. Technical report, Stanford University, Northwestern University, June 1999

  37. Nakayama H (2005) Multi-objective optimization and its engineering applications. In: Branke J, Deb K, Miettinen K, Steuer RE (eds) Practical approaches to multi-objective optimization. Dagstuhl seminar proceedings, N 04461. Internationales Begegnungs- und Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany. http://drops.dagstuhl.de/opus/volltexte/2005/234

  38. Ng R, Han J (1994) Efficient and effective clustering methods for spatial data mining. In: Proceedings of the international conference on very large data bases, Santiago de Chile, Chile, pp 144–155

  39. Noel SE, Szu HH (1997) Multiple-resolution clustering for recursive divide and conquer. In: Szu HH (ed) Wavelet applications IV. Proceedings of SPIE, vol 3078, pp 266–279, April 1997

  40. Özyer T, Alhajj R (2006) Achieving natural clustering by validating results of iterative evolutionary clustering approach. In: Proceedings of the IEEE conference on intelligent systems

  41. Özyer T, Alhajj R (2006) Clustering by integrating multi-objective optimization with weighted k-means and validity analysis. In: 7th international conference on intelligent data engineering and automated learning. Springer, Berlin

    Google Scholar 

  42. Özyer T, Liu Y, Alhajj R, Barker K (2004) Multi-objective genetic algorithm based clustering approach and its application to gene expression data. In: Proceedings of biennial international conference on advances in information systems. Lecture notes in computer science. Springer, Berlin

    Google Scholar 

  43. Sander J, Ester M, Kriegel HP, Xu X (1998) Density-based clustering in spatial databases: the algorithm gdbscan and its applications. Data Min Knowl Discov 2(2):169–194

    Article  Google Scholar 

  44. Schikuta E, Erhart M (1998) Bang-clustering: a novel grid-clustering algorithm for huge data sets. In: SSPR/SPR, pp 867–874

  45. Scott AJ, Symons MJ (1971) Model-based Gaussian and non-Gaussian clustering. Biometrics 27:387–397

    Article  Google Scholar 

  46. Sharan R, Shamir R (2000) Click: a clustering algorithm with applications to gene expression analysis. In: Proceedings of the international conference on intelligent systems on molecular biology, pp 307–316

  47. Strehl A, Gosh J (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617

    Article  Google Scholar 

  48. Thierens D (1999) Scalability problems of simple genetic algorithms. Evol Comput 7(4):331–352

    Article  Google Scholar 

  49. Wang W, Yang J, Muntz RR (1997) Sting: a statistical information grid approach to spatial data mining. In: Proceedings of the international conference on very large databases, pp 186–195

  50. Zitzler E (1999) Evolutionary algorithms for multiobjective optimization: methods and applications. PhD thesis, Zurich: Swiss Federal Institute of Technology (ETH), Aachen, Germany

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Reda Alhajj.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Özyer, T., Alhajj, R. Parallel clustering of high dimensional data by integrating multi-objective genetic algorithm with divide and conquer. Appl Intell 31, 318–331 (2009). https://doi.org/10.1007/s10489-008-0129-8

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10489-008-0129-8

Keywords

Navigation