Parallel clustering of high dimensional data by integrating multi-objective genetic algorithm with divide and conquer

Özyer, Tansel; Alhajj, Reda

doi:10.1007/s10489-008-0129-8

Parallel clustering of high dimensional data by integrating multi-objective genetic algorithm with divide and conquer

Published: 14 May 2008

Volume 31, pages 318–331, (2009)
Cite this article

Applied Intelligence Aims and scope Submit manuscript

Tansel Özyer¹ &
Reda Alhajj²

269 Accesses
22 Citations
Explore all metrics

Abstract

This paper applies divide and conquer approach in an iterative way to handle the clustering process. The target is a parallelized effective and efficient approach that produces the intended clustering result. We achieve scalability by first partitioning a large dataset into subsets of manageable sizes based on the specifications of the machine to be used in the clustering process; then cluster the partitions separately in parallel. The centroid of each obtained cluster is treated like the root of a tree with instances in its cluster as leaves. The partitioning and clustering process is iteratively applied on the centroids with the trees growing up until we get the final clustering; the outcome is a forest with one tree per cluster. Finally, a conquer process is performed to get the actual intended clustering, where each instance (leaf node) belongs to the final cluster represented by the root of its tree. We use multi-objective genetic algorithm combined with validity indices to decide on the number of classes. This approach fits well for interactive online clustering. It facilitates for incremental clustering because chunks of instances are clustered as stand alone sets, and then the results are merged with existing clusters. This is attractive and feasible because we consider the clustering of only centroids after the first clustering stage. The reported test results demonstrate the applicability and effectiveness of the proposed approach.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A novel intelligent Fuzzy-AHP based evolutionary algorithm for detecting communities in complex networks

Article 29 February 2024

Elmira Pourabbasi, Vahid Majidnezhad, … Yasser jafari

Data clustering: application and trends

Article 27 November 2022

Gbeminiyi John Oyewole & George Alex Thopil

Puma optimizer (PO): a novel metaheuristic optimization algorithm and its application in machine learning

Article 19 January 2024

Benyamin Abdollahzadeh, Nima Khodadadi, … Seyedali Mirjalili

References

Aggarwal CC, Yu PS (2000) Finding generalized projected clusters in high dimensional spaces. In: Proceedings of ACM SIGMOD international conference on management of data, pp 70–81
Aggarwal CC, Procopiuc CM, Wolf JL, Yu PS, Park JS (1999) Fast algorithms for projected clustering. In: Proceedings ACM SIGMOD international conference on management of data, pp 61–72
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of ACM SIGMOD international conference on management of data, pp 94–105
Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of the international conference on very large data bases, pp 487–499
Andersson J (2000) A survey of multiobjective optimization in engineering design. Technical report LiTH-IKP-R-1097, Department of Mechanical Engineering, Linkping University, Linkping, Sweden
Ankerst M, Breunig MM, Kriegel HP, Sander J (1999) Optics: ordering points to identify the clustering structure. In: Proceedings of ACM SIGMOD conference on management of data, pp 49–60
Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803–821
Article MATH MathSciNet Google Scholar
Ben-Dor A, Shamir R, Yahkini Z (1999) Clustering gene expression patterns. J Comput Biol 6(3):281–297
Article Google Scholar
Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press, New York
MATH Google Scholar
Bezdek JC, Boggavarapu S, Hall LO, Bensaid A (1994) Genetic algorithm guided clustering. In: Proceedings of the international conference on evolutionary computation, pp 34–39
Bezdek JC, Hathaway RJ (1994) Optimization of fuzzy clustering criteria using genetic algorithms. In: Proceedings of the international conference on evolutionary computation, pp 589–594
Cheng C-H, Fu AW, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: Proceedings of ACM SIGKDD international conference on knowledge discovery and data mining, pp 84–93
Cole RM (1998) Clustering with genetic algorithms. Master’s thesis, Nedlands 6907, Australia
Deb K, Agrawal S, Pratab A, Meyarivan T (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 6(2):182–197
Article Google Scholar
Demiriz A, Bennett K, Embrechts M (1999) Semi-supervised clustering using genetic algorithms. Technical report 9901, Rensselaer Polytechnic Institute, Troy, NY
Dimitriadou E, Dolnicar S, Weingessel A (2002) An examination of indexes for determining the number of clusters in binary data sets. Psychometrika 67(1):137–160
Article MathSciNet Google Scholar
Fayyad U, Piatetsky-Shapiro G, Smyth P (1996) The KDD process for extracting useful knowledge from volumes of data. Commun ACM 39(11):27–34
Article Google Scholar
Fonseca CM, Flemming PJ (1995) An overview of evolutionary algorithms in multiobjective optimization. Evol Comput 3(1):1–16
Article Google Scholar
Fowlkes E, Mallows C (1983) A method for comparing two hierarchical clusterings. J Am Stat Assoc 78:553–569
Article MATH Google Scholar
Friedman J, Meulman J (2004) Clustering objects on subsets of attributes. J R Stat Soc Ser B 4:815–849
Article MathSciNet Google Scholar
Guha S, Rastogi R, Shim K (1998) Cure: an efficient clustering algorithm for large databases. In: Proceedings of ACM SIGMOD conference on management of data, pp 73–84
Guha S, Rastogi R, Shim K (1999) Rock: a robust clustering algorithm for categorical attributes. In: Proceedings of IEEE international conference on data engineering, pp 512–521
Hall LO, Özyurt IB, Bezdek JC (1999) Clustering with a genetically optimized approach. IEEE Trans Evol Comput 3(2):103–112
Article Google Scholar
Hartigan J (1975) Clustering algorithms. Wiley, New York
MATH Google Scholar
Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. In: Proceedings of ACM international conference on knowledge discovery and data mining, pp 58–65
Huang JZ, Ng MK, Rong H, Li Z (2005) Automated variable weighting in k-means type clustering. IEEE Trans Pattern Anal Mach Intell 27(5):657–668
Article Google Scholar
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 316(3):264–323
Article Google Scholar
Kaski S, Nikkilä J, Kohonen T (1998) Methods for interpreting a self-organized map in data analysis. In: Proceedings of the European symposium on artificial neural networks, Brussels, Belgium, pp 185–190
Kaufman L, Rouseeauw PL (1990) Finding group in data: an introduction to cluster analysis. Wiley, New York
Google Scholar
Likas A, Vlassis N, Verbeek J (2003) The global k-means clustering algorithm. Pattern Recognit 36(2):451–461
Article Google Scholar
Liu Y, Özyer T, Alhajj R, Barker K (2004) Validity analysis of clustering obtained using multi-objective genetic algorithm. In: Proceedings of the international conference on intelligent systems design and applications. Lecture notes in computer science. Springer, Berlin
Google Scholar
Liu Y, Özyer T, Alhajj R, Barker K (2005) Cluster validity analysis of alternative solutions from multi-objective optimization. In: Proceedings of SIAM international conference on data mining
Liu Y, Özyer T, Alhajj R, Barker K (2005) Integrating multi-objective genetic algorithm and validity analysis for locating and ranking alternative clustering. Eur J Inform 29(1):33–40
MATH Google Scholar
Lu Y, Lu S, Fotouhi F, Deng Y, Brown S (2004) FGKA: a fast genetic k-means clustering algorithm. In: Proceedings of ACM symposium on applied computing, pp 162–163
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. University of California Press, Berkeley, pp 281–297
Google Scholar
Nagesh H, Goil S, Choudhary A (1999) Mafia: efficient and scalable subspace clustering for very large data sets. Technical report, Stanford University, Northwestern University, June 1999
Nakayama H (2005) Multi-objective optimization and its engineering applications. In: Branke J, Deb K, Miettinen K, Steuer RE (eds) Practical approaches to multi-objective optimization. Dagstuhl seminar proceedings, N 04461. Internationales Begegnungs- und Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany. http://drops.dagstuhl.de/opus/volltexte/2005/234
Ng R, Han J (1994) Efficient and effective clustering methods for spatial data mining. In: Proceedings of the international conference on very large data bases, Santiago de Chile, Chile, pp 144–155
Noel SE, Szu HH (1997) Multiple-resolution clustering for recursive divide and conquer. In: Szu HH (ed) Wavelet applications IV. Proceedings of SPIE, vol 3078, pp 266–279, April 1997
Özyer T, Alhajj R (2006) Achieving natural clustering by validating results of iterative evolutionary clustering approach. In: Proceedings of the IEEE conference on intelligent systems
Özyer T, Alhajj R (2006) Clustering by integrating multi-objective optimization with weighted k-means and validity analysis. In: 7th international conference on intelligent data engineering and automated learning. Springer, Berlin
Google Scholar
Özyer T, Liu Y, Alhajj R, Barker K (2004) Multi-objective genetic algorithm based clustering approach and its application to gene expression data. In: Proceedings of biennial international conference on advances in information systems. Lecture notes in computer science. Springer, Berlin
Google Scholar
Sander J, Ester M, Kriegel HP, Xu X (1998) Density-based clustering in spatial databases: the algorithm gdbscan and its applications. Data Min Knowl Discov 2(2):169–194
Article Google Scholar
Schikuta E, Erhart M (1998) Bang-clustering: a novel grid-clustering algorithm for huge data sets. In: SSPR/SPR, pp 867–874
Scott AJ, Symons MJ (1971) Model-based Gaussian and non-Gaussian clustering. Biometrics 27:387–397
Article Google Scholar
Sharan R, Shamir R (2000) Click: a clustering algorithm with applications to gene expression analysis. In: Proceedings of the international conference on intelligent systems on molecular biology, pp 307–316
Strehl A, Gosh J (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617
Article Google Scholar
Thierens D (1999) Scalability problems of simple genetic algorithms. Evol Comput 7(4):331–352
Article Google Scholar
Wang W, Yang J, Muntz RR (1997) Sting: a statistical information grid approach to spatial data mining. In: Proceedings of the international conference on very large databases, pp 186–195
Zitzler E (1999) Evolutionary algorithms for multiobjective optimization: methods and applications. PhD thesis, Zurich: Swiss Federal Institute of Technology (ETH), Aachen, Germany

Download references

Author information

Authors and Affiliations

Department of Computer Engineering, TOBB ETU Economics and Technology University, Sogutozu Cad No. 43, Sogutozu, 06560, Ankara, Turkey
Tansel Özyer
Department of Computer Science, University of Calgary, Calgary, Alberta, Canada
Reda Alhajj

Authors

Tansel Özyer
View author publications
You can also search for this author in PubMed Google Scholar
Reda Alhajj
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Reda Alhajj.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Özyer, T., Alhajj, R. Parallel clustering of high dimensional data by integrating multi-objective genetic algorithm with divide and conquer. Appl Intell 31, 318–331 (2009). https://doi.org/10.1007/s10489-008-0129-8

Download citation

Received: 17 May 2007
Accepted: 25 April 2008
Published: 14 May 2008
Issue Date: December 2009
DOI: https://doi.org/10.1007/s10489-008-0129-8

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Parallel clustering of high dimensional data by integrating multi-objective genetic algorithm with divide and conquer

Abstract

Access this article

Similar content being viewed by others

A novel intelligent Fuzzy-AHP based evolutionary algorithm for detecting communities in complex networks

Data clustering: application and trends

Puma optimizer (PO): a novel metaheuristic optimization algorithm and its application in machine learning

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Parallel clustering of high dimensional data by integrating multi-objective genetic algorithm with divide and conquer

Abstract

Access this article

Similar content being viewed by others

A novel intelligent Fuzzy-AHP based evolutionary algorithm for detecting communities in complex networks

Data clustering: application and trends

Puma optimizer (PO): a novel metaheuristic optimization algorithm and its application in machine learning

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation