GACH: a grid-based algorithm for hierarchical clustering of high-dimensional data

  • Methodologies and Application
Soft Computing

Abstract

This paper proposes a grid-based hierarchical clustering algorithm (GACH) as an efficient and robust method for exploring clusters in high-dimensional data with no prior knowledge. It automatically discovers the initial positions of the potential clusters and then combines them hierarchically to obtain the final clusters. To this end, GACH first projects the data patterns onto a two-dimensional space (i.e., a plane spanned by two features) to overcome the curse of dimensionality in high-dimensional data. To choose these two well-informed features, a simple and fast feature selection algorithm is proposed. Then, by meshing the plane with grid lines, GACH detects the crowded grid points. The data patterns nearest to these grid points are taken as the initial members of potential clusters. By returning the patterns to their true dimensions, GACH refines these clusters. In the merging phase, GACH combines closely adjacent clusters in a hierarchical bottom-up manner to construct the members of the final clusters. The main features of GACH are: (1) it automatically discovers the clusters, (2) the obtained clusters are stable, (3) it is efficient for data sets with high dimensions, and (4) its merging process involves a threshold which can be obtained in advance for well-clustered data. To assess the proposed algorithm, we apply it to several benchmark data sets and compare the validity of the obtained clusters with the results of other clustering algorithms. This comparison shows that GACH is accurate, efficient and feasible for discovering clusters in high-dimensional data.
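
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify the feature selector, the crowding criterion, the refinement rule, or the merging criterion, so the sketch assumes (hypothetically) variance-based feature ranking, a minimum count per grid point, nearest-centroid reassignment in the full space, and centroid-distance merging. All function names and the parameters `n_lines`, `min_count`, and `threshold` are illustrative assumptions.

```python
import numpy as np

def select_two_features(X):
    # Assumption: keep the two highest-variance features.
    # The paper's own feature selection algorithm is not given in the abstract.
    return np.argsort(X.var(axis=0))[-2:]

def crowded_grid_points(P, n_lines=8, min_count=5):
    # Mesh the plane with n_lines grid lines per axis, snap every pattern
    # to its nearest grid point, and keep grid points with enough patterns.
    lo, hi = P.min(axis=0), P.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    idx = np.rint((P - lo) / span * (n_lines - 1)).astype(int)
    keys, counts = np.unique(idx, axis=0, return_counts=True)
    crowded = keys[counts >= min_count]
    return lo + crowded / (n_lines - 1) * span  # coordinates in the plane

def nearest(X, centers):
    # Index of the closest center (Euclidean distance) for every pattern.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)

def centroids(X, labels):
    ids = np.unique(labels)
    return ids, np.array([X[labels == i].mean(axis=0) for i in ids])

def merge_close(X, labels, threshold):
    # Bottom-up merging (assumed criterion): repeatedly merge the two
    # clusters with the closest centroids while that distance is at most
    # the threshold.
    labels = labels.copy()
    while True:
        ids, cents = centroids(X, labels)
        if len(ids) < 2:
            return labels
        d = np.linalg.norm(cents[:, None, :] - cents[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)
        a, b = np.unravel_index(d.argmin(), d.shape)
        if d[a, b] > threshold:
            return labels
        labels[labels == ids[b]] = ids[a]

def gach_sketch(X, n_lines=8, min_count=5, threshold=1.0):
    P = X[:, select_two_features(X)]            # project onto a plane
    seeds = crowded_grid_points(P, n_lines, min_count)
    if len(seeds) == 0:
        raise ValueError("no crowded grid points; lower min_count")
    labels = nearest(P, seeds)                  # initial members in 2-D
    _, cents = centroids(X, labels)
    labels = nearest(X, cents)                  # refine in full dimensions
    return merge_close(X, labels, threshold)    # hierarchical merging

# Usage: two well-separated blobs in ten dimensions.
rng = np.random.default_rng(0)
A = rng.normal(0.0, 0.1, (50, 10))
B = rng.normal(0.0, 0.1, (50, 10))
B[:, :2] += 10.0
X = np.vstack([A, B])
labels = gach_sketch(X, threshold=1.0)
```

With a small merging threshold the two blobs stay separate; a threshold larger than the distance between their centroids collapses them into one cluster, which is the role the abstract assigns to the merging threshold.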

Author information

Corresponding author

Correspondence to Eghbal G. Mansoori.

Additional information

Communicated by W. Pedrycz.

About this article

Cite this article

Mansoori, E.G. GACH: a grid-based algorithm for hierarchical clustering of high-dimensional data. Soft Comput 18, 905–922 (2014). https://doi.org/10.1007/s00500-013-1105-8

  • DOI: https://doi.org/10.1007/s00500-013-1105-8
