An application of the minimal spanning tree approach to the cluster stability problem

  • Original Paper
Central European Journal of Operations Research

Abstract

Among the areas of data and text mining employed today in operations research (OR), science, the economy and technology, clustering theory serves as a preprocessing step in data analysis. An important component of clustering theory is determining the true number of clusters, a problem that has not been satisfactorily solved. In this paper, we address it through the cluster stability approach. For several candidate numbers of clusters, we estimate the stability of the partitions obtained by clustering samples. Partitions are considered consistent if their clusters are stable. Cluster validity is measured by the total number of edges in the clusters’ minimal spanning trees that connect points from different samples; that is, we use the Friedman and Rafsky two-sample test statistic. Under the homogeneity hypothesis that the samples are well mingled within the clusters, this statistic is asymptotically normally distributed. Resting on this fact, we compute the standard score of the edge count for each cluster and represent the quality of a partition by its worst cluster, the one with the minimal standard score. It is natural to expect that the true number of clusters is characterized by the empirical distribution of these scores having the shortest left tail. The proposed methodology sequentially constructs this distribution and estimates its left asymmetry. Several numerical experiments demonstrate the ability of the approach to detect the true number of clusters.
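
To make the statistic in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It counts the minimal spanning tree edges that join points from two different samples and converts that count to a standard score. Several things here are assumptions rather than details taken from the paper: the function names (`mst_edges`, `friedman_rafsky_score`, `worst_cluster_score`) are hypothetical, Prim's algorithm with Euclidean distance is one possible way to build the tree, and the mean and variance are the permutation moments usually attributed to Friedman and Rafsky (1979), conditional on the tree's degree sequence.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (i, j) index pairs."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    in_tree[0] = True
    # best_dist[k]: distance from point k to its nearest point already in the tree.
    best_dist = [math.dist(points[0], p) for p in points]
    best_from = [0] * n
    edges = []
    for _ in range(n - 1):
        j = min((k for k in range(n) if not in_tree[k]), key=lambda k: best_dist[k])
        in_tree[j] = True
        edges.append((best_from[j], j))
        for k in range(n):
            if not in_tree[k]:
                d = math.dist(points[j], points[k])
                if d < best_dist[k]:
                    best_dist[k], best_from[k] = d, j
    return edges


def friedman_rafsky_score(points, labels):
    """Return (R, z) for one cluster.

    labels[i] is 0 or 1 and says which sample point i came from.
    R is the number of MST edges joining points from different samples,
    z is its standard score under the hypothesis of well mingled samples.
    """
    N = len(points)
    m = sum(1 for s in labels if s == 0)
    n = N - m
    if N < 4:
        raise ValueError("need at least 4 points for the variance formula")
    edges = mst_edges(points)
    R = sum(1 for i, j in edges if labels[i] != labels[j])
    degree = [0] * N
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1
    # C: number of edge pairs that share a common node.
    C = sum(d * (d - 1) // 2 for d in degree)
    mean = 2 * m * n / N
    var = (2 * m * n / (N * (N - 1))) * (
        (2 * m * n - N) / N
        + (C - N + 2) / ((N - 2) * (N - 3)) * (N * (N - 1) - 4 * m * n + 2)
    )
    if var <= 0:
        raise ValueError("degenerate tree or sample sizes: variance is not positive")
    return R, (R - mean) / math.sqrt(var)


def worst_cluster_score(clusters):
    """Partition quality as the minimal standard score over its clusters.

    clusters is an iterable of (points, labels) pairs, one per cluster.
    """
    return min(friedman_rafsky_score(p, s)[1] for p, s in clusters)
```

A low (strongly negative) score means the two samples share few tree edges inside that cluster, which is evidence against the cluster being stable. Repeating `worst_cluster_score` over many pairs of samples for a given number of clusters would yield the empirical distribution whose left tail the abstract refers to; how that tail is measured is not specified in the abstract and is not sketched here.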

References

  • Akteke-Öztürk B, Weber G-W, Kropat E (2008) Continuous optimization approach for minimum sum of squares. In: ISI proceedings of the 20th Mini-EURO Conference “continuous optimization and knowledge-based technologies”. Neringa, Lithuania, pp 253–258

  • Akume D, Weber G-W (2002) Cluster algorithms: theory and methods. J Comput Technol Vychisl Tekhnol 7(1): 15–27

  • Bagirov A (2009) Large scale non smooth optimization problems in data mining. In: Proceedings of the XIII international conference applied stochastic models and data analysis (ASMDA). Vilnius

  • Bagirov A, Ugon J, Webb D (2009) A new global k-means algorithm for clustering large data sets. In: Proceedings of the XIII international conference applied stochastic models and data analysis (ASMDA). Vilnius

  • Baringhaus L, Franz C (2004) On a new multivariate two-sample test. J Multivar Anal 88(1): 190–206

  • Barzily Z, Volkovich Z, Akteke-Öztürk B, Weber G-W (2008) Cluster stability using minimal spanning trees. In: Proceedings of the 20th mini conference “continuous optimization and knowledge-based technologies”. EurOPT’, Lithuania, pp 248–253

  • Barzily Z, Volkovich Z, Akteke-Öztürk B, Weber G-W (2009) On a minimal spanning tree approach in the cluster validation problem. Informatica 20(2): 187–202

  • Ben-Hur A, Guyon I (2003) Detecting stable clusters using principal component analysis, methods in molecular biology. In: Brownstein MJ, Khodursky A (eds) Humana Press, NJ, pp 159–182

  • Ben-Hur A, Elisseeff A, Guyon I (2002) A stability based method for discovering structure in clustered data. In: Pacific symposium on biocomputing. pp 6–17

  • Büyükbebeci E (2009) Comparison of MARS, CMARS and CART in predicting default probabilities for emerging markets, M.Sc. term project Report/Thesis in financial mathematics. Institute of Applied Mathematics of METU, Ankara

  • Calinski T, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat 3: 1–27

  • Celeux G, Govaert G (1992) A classification EM algorithm and two stochastic versions. Comput Stat Data Anal 14: 315–332

  • Cheng R, Milligan G (1996) Measuring the influence of individual data points in a cluster analysis. J Classif 13: 315–335

  • Conover WJ, Johnson ME, Johnson MM (1981) Comparative study of tests of homogeneity of variances, with applications to the outer continental shelf bidding data. Technometrics 23: 351–361

  • Dhillon I, Kogan J, Nicholas C (2003) Feature selection and document clustering, a comprehensive survey of text mining. In: Berry M (ed) Springer, Berlin, pp 73–100

  • Dudoit S, Fridlyand J (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol 3(7): 0036.1–0036.21

  • Duran BS (1976) A survey of nonparametric tests for scale. Commun Stat Theory Methods 5: 1287–1312

  • Friedman JH, Rafsky LC (1979) Multivariate generalizations of the Wolfowitz and Smirnov two-sample tests. Ann Stat 7: 697–717

  • Gordon AD (1999) Classification. Chapman and Hall, CRC, Boca Raton

  • Hartigan JA (1975) Clustering algorithms. Wiley, New York

  • Hartigan JA (1985) Statistical theory in clustering. J Classif 2: 63–76

  • Hastie T, Tibshirani R, Friedman JH (2001) The elements of statistical learning: data mining, inference and prediction. Springer, Berlin

  • Henze N (1988) A multivariate two-sample test based on the number of nearest neighbor type coincidences. Ann Stat 16: 772–783

  • Henze N, Penrose M (1999) On the multivariate runs test. Ann Stat 27: 290–298

  • Jain AK, Moreau JV (1987) Bootstrap technique in cluster analysis. Pattern Recognit 20(5): 547–568

  • Jain A, Xu X, Ho T, Xiao F (2002) Uniformity testing using minimal spanning tree. ICPR 4: 281–284

  • Karasözen B, Rubinov A, Weber G-W (2006) Optimization in Data Mining. Eur J Oper Res 173(3): 701–704

  • Kaufman L, Rousseeuw PJ (1990) Finding groups in data. Wiley, New York

  • Klebanov L (2005) N-distances and their applications. The Karolinum Press: Charles University in Prague, Prague

  • Klebanov L (2003) One class of distribution free multivariate tests. Sanct-Petersburg Math Soc Preprint, 3

  • Kropat E, Weber G-W, Pedamallu CS (2009) Regulatory networks under ellipsoidal uncertainty-optimization theory and dynamical systems. Preprint at IAM, METU

  • Krzanowski W, Lai Y (1985) A criterion for determining the number of groups in a dataset using sum of squares clustering. Biometrics 44: 23–34

  • Kuhn H (1955) The Hungarian method for the assignment problem. Naval Res Logistics Q 2: 83–97

  • Lange T, Roth V, Braun M, Buhmann JM (2004) Stability-based validation of clustering solutions. Neural Comput 15(6): 1299–1323

  • Levine E, Domany E (2001) Resampling method for unsupervised estimation of cluster validity. Neural Comput 13: 2573–2593

  • Milligan G, Cooper M (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50: 159–179

  • Mufti GB, Bertrand P, El-Moubarki L (2005) Determining the number of groups from measures of cluster validity. In: Proceedings of ASMDA 2005. pp 404–414

  • Nesetril J, Milkova E, Nesetrilova H (2001) Otakar Boruvka on minimum spanning tree problem, Translation of both the 1926 papers, comments, history. Discrete Math 3–36

  • Özögür-Akyüz S, Weber G-W (2009) Infinite kernel learning by infinite and semi-infinite programming. In: Proceedings of the second global conference on power control and optimization, AIP conference proceedings 1159. Bali, Indonesia, June 1–3, Hakim AH, Vasant P, Barsoum N (guest eds)

  • Roth V, Lange T, Braun M, Buhmann J (2002) A resampling approach to cluster validation, COMPSTAT, available at http://www.cs.uni-bonn.De/~braunm

  • Sezgin Alp Ö, Büyükbebeci E, Iscanoglu Cekic A, Yerlikaya-Özkurt F, Taylan P, Weber G-W, - CMARS and GAM & CQP—modern optimization methods applied to international credit default prediction, preprint at IAM, METU, submitted for publication

  • Smith S, Jain A (1984) Testing for uniformity in multidimensional data. IEEE Trans Pattern Anal Mach Intell 6: 73–80

  • Sugar C, James G (2003) Finding the number of clusters in a data set: an information theoretic approach. J Am Stat Assoc 98: 750–763

  • Taylan P, Weber G-W, Yerlikaya F (2008) Continuous optimization applied in MARS for modern applications in finance, science and technology. In: ISI proceedings of 20th Mini-EURO conference continuous optimization and knowledge-based technologies. EurOPT 2008 317-322, Neringa, Lithuania

  • Tibshirani R, Walther G (2005) Cluster validation by prediction strength. J Comput Graph Stat 14(3): 511–528

  • Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters via the gap statistic. J Royal Stat Soc B 63(2): 411–423

  • Varma S, Simon R (2004) Iterative class discovery and feature selection using minimal spanning trees. BMC Bioinformatics 5:126

  • Volkovich Z, Barzily Z, Morozensky L (2006) A cluster stability criteria based on the two-sample test concept. In: Proceeding of the second workshop on algorithmic techniques for data mining (ATDM). Springer, pp 329–338

  • Volkovich Z, Barzily Z, Morozensky L (2008) A statistical model of cluster stability. Pattern Recognit 41(7): 2174–2188

  • Volkovich Z, Barzily Z, Avros R, Toledano-Kitai D (2009) On application of the K-nearest neighbors approach for cluster validation. In: Proceeding of the XIII international conference applied stochastic models and data analysis (ASMDA). Vilnius

  • Volkovich Z, Barzily Z, Weber G-W, Toledano-Kitai D (2009) Cluster stability estimation based on a minimal spanning trees approach. The second global conference on power and optimization (PCO). Bali, Indonesia

  • Weber G-W, Batmaz I, Köksal G, Taylan P, Yerlikaya-Özkurt F CMARS: A new contribution to nonparametric regression with multivariate adaptive regression splines supported by continuous optimisation, preprint at IAM, METU, submitted for publication

  • Weber G-W, Taylan P, Yildirak K, Görgülü ZK (2009) Financial regression and organization. To appear in the special issue on optimization in finance, of dynamics of continuous, discrete and impulsive systems (Series B)

  • Wishart D (1969) Mode analysis: a generalization of nearest neighbor which reduces chaining effects. Numer Taxonomy 76:282–311, AJ Cole, Academic Press, London

  • Xu Y, Olman V, Xu D (2002) Clustering gene expression data using a graph-theoretic approach: an application of minimum spanning trees. Bioinformatics 18: 535–545

  • Zahn C (1971) Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Trans Comput C-20(1): 68–86

  • Zech G, Aslan B (2005) New test for the multivariate two-sample problem based on the concept of minimum energy. J Stat Comput Simul 75(2): 109–119

Author information

Correspondence to Z. Volkovich.

About this article

Cite this article

Volkovich, Z., Barzily, Z., Weber, GW. et al. An application of the minimal spanning tree approach to the cluster stability problem. Cent Eur J Oper Res 20, 119–139 (2012). https://doi.org/10.1007/s10100-010-0157-4

  • DOI: https://doi.org/10.1007/s10100-010-0157-4
