Abstract
This paper proposes a grid-based hierarchical clustering algorithm (GACH) as an efficient and robust method for exploring clusters in high-dimensional data without prior knowledge. GACH discovers the initial positions of potential clusters automatically and then combines them hierarchically to obtain the final clusters. To overcome the curse of dimensionality, it first projects the data patterns onto a two-dimensional space (i.e., a plane spanned by two features). A simple and fast feature selection algorithm is proposed to choose these two informative features. GACH then meshes the plane with grid lines and detects the crowded grid points. The data patterns nearest to these grid points are taken as the initial members of potential clusters. GACH refines these clusters by returning the patterns to their original dimensions. In the merging phase, it combines closely adjacent clusters in a hierarchical bottom-up manner to construct the final clusters. The main features of GACH are: (1) it discovers the clusters automatically, (2) the obtained clusters are stable, (3) it is efficient for high-dimensional data sets, and (4) its merging process involves a threshold that can be obtained in advance for well-clustered data. To assess the proposed algorithm, we apply it to several benchmark data sets and compare the validity of the obtained clusters with the results of other clustering algorithms. This comparison shows that GACH is accurate, efficient and feasible for discovering clusters in high-dimensional data.
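The projection and grid-detection steps described above can be illustrated with a minimal Python sketch. It is not the algorithm as published: the abstract does not specify the feature selection rule, the grid resolution, or the crowding criterion, so the variance-based selection in `pick_two_features` and the parameters `n_lines` and `min_count` are assumptions made purely for illustration. The refinement in the original dimensions and the hierarchical merging phase are omitted.

```python
from collections import defaultdict
from statistics import pvariance


def pick_two_features(data):
    """Stand-in feature selection (an assumption, not the paper's method):
    return the indices of the two features with the highest variance."""
    n_features = len(data[0])
    variances = [pvariance([row[j] for row in data]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return ranked[0], ranked[1]


def crowded_grid_points(data, features, n_lines=5, min_count=3):
    """Mesh the 2-D plane with n_lines grid lines per axis, snap every pattern
    to its nearest grid point, and keep the grid points that attract at least
    min_count patterns. Returns {grid_point: [pattern indices]}."""
    f1, f2 = features
    lo = [min(row[f] for row in data) for f in (f1, f2)]
    hi = [max(row[f] for row in data) for f in (f1, f2)]
    # Grid spacing per axis; fall back to 1.0 if a feature is constant.
    step = [(h - l) / (n_lines - 1) or 1.0 for l, h in zip(lo, hi)]

    cells = defaultdict(list)
    for idx, row in enumerate(data):
        gx = round((row[f1] - lo[0]) / step[0])
        gy = round((row[f2] - lo[1]) / step[1])
        cells[(gx, gy)].append(idx)

    # Crowded grid points seed the potential clusters.
    return {pt: members for pt, members in cells.items() if len(members) >= min_count}
```

Calling `crowded_grid_points(data, pick_two_features(data))` on a list of equal-length numeric tuples returns the seed clusters as lists of pattern indices. Under the abstract's description, these seeds would then be refined in the full feature space and merged bottom-up.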
Additional information
Communicated by W. Pedrycz.
About this article
Cite this article
Mansoori, E.G. GACH: a grid-based algorithm for hierarchical clustering of high-dimensional data. Soft Comput 18, 905–922 (2014). https://doi.org/10.1007/s00500-013-1105-8
DOI: https://doi.org/10.1007/s00500-013-1105-8