On strategies for building effective ensembles of relative clustering validity criteria

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Evaluation and validation are essential tasks for achieving meaningful clustering results. Relative validity criteria are measures usually employed in practice to select and validate clustering solutions, as they enable the evaluation of single partitions and the comparison of partition pairs in relative terms based only on the data under analysis. There is a plethora of relative validity measures described in the clustering literature, which makes it difficult to choose an appropriate measure for a given application. One reason for such a variety is that no single measure can capture all different aspects of the clustering problem and, as such, each of them is prone to fail in particular application scenarios. In the present work, we take advantage of the diversity in relative validity measures from the clustering literature. Previous work showed that when randomly selecting different relative validity criteria for an ensemble (from an initial set of 28 different measures), one can expect with great certainty to only improve results over the worst criterion included in the ensemble. In this paper, we propose a method for selecting measures that meet a minimum level of effectiveness and show some degree of complementarity (from the same set of 28 measures) into ensembles. The resulting ensembles show superior performance when compared to any single ensemble member (and not just the worst one) over a variety of different datasets. One can also expect greater stability in terms of evaluation over different datasets, even when considering different ensemble strategies. Our results are based on more than a thousand datasets, synthetic and real, from different sources.
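
As an illustration only, the general idea of combining several relative validity criteria can be sketched with a simple mean-rank (Borda-style) aggregation. This is not the selection method proposed in the paper. The criterion names, the scores, and the helper functions `rank_descending` and `ensemble_ranking` are hypothetical, and all criteria are assumed to be oriented so that larger values indicate better partitions.

```python
from statistics import mean


def rank_descending(scores):
    """Rank scores so that 1 is best (largest); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score equals the current one (a tie).
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # average of 1-based positions in the tie block
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks


def ensemble_ranking(criteria_scores):
    """Average the rank of each candidate partition across all criteria."""
    per_criterion = [rank_descending(s) for s in criteria_scores.values()]
    n_partitions = len(per_criterion[0])
    return [mean(r[i] for r in per_criterion) for i in range(n_partitions)]


# Hypothetical scores of three candidate partitions under three criteria
# (assumption: every criterion is oriented so that larger is better).
scores = {
    "criterion_a": [0.62, 0.48, 0.55],
    "criterion_b": [310.0, 295.0, 340.0],
    "criterion_c": [0.9, 0.4, 0.7],
}

combined = ensemble_ranking(scores)
# The partition with the lowest mean rank is the ensemble's choice.
best = min(range(len(combined)), key=combined.__getitem__)
```

In this toy example the first partition obtains the lowest mean rank and would be chosen by the ensemble, even though `criterion_b` on its own prefers the third partition.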

Notes

  1. They are used only in very particular applications, such as evaluation of clustering stability via resampling [9] or assessment of diversity in clustering ensembles [44].

References

  1. Albalate A, Suendermann D (2009) A combination approach to cluster validation based on statistical quantiles. In: International joint conference on bioinformatics, systems biology and intelligent computing—IJCBS, pp 549–555

  2. Baya AE, Granitto PM (2013) How many clusters: a validation index for arbitrary-shaped clusters. IEEE/ACM Trans Comput Biol Bioinf 10(2):401–414

  3. Bezdek JC, Pal NR (1998) Some new indexes of cluster validity. IEEE Trans Syst Man Cybern B 28(3):301–315

  4. Bolshakova N, Azuaje F (2003) Cluster validation techniques for genome expression data. Sig Process 83(4):825–833

  5. Calinski T, Harabasz J (1974) A dendrite method for cluster analysis. Commun Stat 3:1–27

  6. Cormack GV, Clarke CLA, Buettcher S (2009) Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In: Proceedings of the 32nd international ACM SIGIR conference on research and development in information retrieval, SIGIR ’09, pp 758–759

  7. Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1:224–227

  8. de Borda JC (1781) Mémoire sur les élections au scrutin. Histoire de l’Academie Royale des Sciences, pp 657–665

  9. Dudoit S, Fridlyand J (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol 3(7):0036.1–0036.21

  10. Dunn JC (1974) Well separated clusters and optimal fuzzy partitions. J Cybern 4:95–104

  11. Dwork C, Kumar R, Naor M, Sivakumar D (2001) Rank aggregation methods for the web. In: Proceedings of the 10th international conference on World Wide Web, pp 613–622

  12. Estivill-Castro V (2002) Why so many clustering algorithms: a position paper. ACM SIGKDD Explor 4(1):65–75

  13. Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 32(200):675–701

  14. Färber I, Günnemann S, Kriegel HP, Kröger P, Müller E, Schubert E, Seidl T, Zimek A (2010) On using class-labels in evaluation of clusterings. In: MultiClust: 1st international workshop on discovering, summarizing and using multiple clusterings held in conjunction with KDD 2010, Washington, DC

  15. Gan G, Ma C, Wu J (2007) Data clustering: theory, algorithms, and applications. ASA-SIAM

  16. Geusebroek JM, Burghouts GJ, Smeulders AWM (2005) The Amsterdam library of object images. Int J Comput Vision 61(1):103–112

  17. Ghosh J, Acharya A (2011) Cluster ensembles. Wiley Interdiscip Rev Data Mining Knowl Discov 1(4):305–315

  18. Halkidi M, Batistakis Y, Vazirgiannis M (2001) On clustering validation techniques. J Intell Inf Syst 17:107–145

  19. Hartigan JA (1975) Clustering algorithms. Wiley, New York

  20. Hill RS (1980) A stopping rule for partitioning dendrograms. Bot Gaz 141:321–324

  21. Horta D, Campello RJGB (2012) Automatic aspect discrimination in data clustering. Pattern Recogn 45(12):4370–4388

  22. Hruschka ER, Campello RJGB, Castro LN (2004) Improving the efficiency of a clustering genetic algorithm. In: Ibero-American conference on artificial intelligence—IBERAMIA, vol 3315, pp 861–870

  23. Hruschka ER, Campello RJGB, Castro LN (2006) Evolving clusters in gene-expression data. Inf Sci 176:1898–1927

  24. Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218

  25. Hubert LJ, Levin JR (1976) A general statistical framework for assessing categorical clustering in free recall. Psychol Bull 10:1072–1080

  26. Jaccard P (1901) Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bull Soc Vaudoise Sci Nat 37:241–272

  27. Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recogn Lett 31:651–666

  28. Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice Hall, Englewood Cliffs

  29. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31:264–323

  30. Kaufman L, Rousseeuw P (1990) Finding groups in data. Wiley, New York

  31. Klementiev A, Roth D, Small K (2007) An unsupervised learning algorithm for rank aggregation. In: Proceedings of the 18th European conference on machine learning (ECML), Warsaw, Poland, pp 616–623

  32. Kolde R, Laur S, Adler P, Vilo J (2012) Robust rank aggregation for gene list integration and meta-analysis. Bioinformatics 28(4):573–580

  33. Kriegel HP, Kröger P, Sander J, Zimek A (2011a) Density-based clustering. Wiley Interdiscip Rev Data Mining Knowl Discov 1(3):231–240

  34. Kriegel HP, Kröger P, Schubert E, Zimek A (2011b) Interpreting and unifying outlier scores. In: Proceedings of the 11th SIAM international conference on data mining (SDM), Mesa, AZ, pp 13–24

  35. Kuncheva L, Whitaker C (2003) Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Mach Learn 51(2):181–207

  36. Lazarevic A, Kumar V (2005) Feature bagging for outlier detection. In: Proceedings of the 11th ACM International conference on knowledge discovery and data mining (SIGKDD), Chicago, IL, pp 157–166

  37. Machado JB, Campello RJGB, Amaral WC (2007) Design of OBF-TS fuzzy models based on multiple clustering validity criteria. In: International conference on tools with artificial intelligence—ICTAI, pp 336–339

  38. Marquis de Condorcet MJANC (1785) Essai sur l’application de l’analyse à la probabilité des décisions rendues à la pluralité des voix. L’Imprimerie Royale, Paris

  39. Maulik U, Bandyopadhyay S (2002) Performance evaluation of some clustering algorithms and validity indices. IEEE Trans Pattern Anal Mach Intell 24(12):1650–1654

  40. MacQueen JB (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley symposium on mathematical statistics and probability, pp 281–297

  41. Milligan GW (1981) A monte carlo study of thirty internal criterion measures for cluster analysis. Psychometrika 46(2):187–199

  42. Milligan GW, Cooper MC (1985) An examination of procedures for determining the number of clusters in a data set. Psychometrika 50(2):159–179

  43. Moulavi D, Jaskowiak PA, Campello RJGB, Zimek A, Sander J (2014) Density-based clustering validation. In: Proceedings of the 14th SIAM International conference on data mining (SDM), Philadelphia, PA, pp 839–847

  44. Naldi M, Carvalho ACPLF, Campello RJGB (2013) Cluster ensemble selection based on relative validity indexes. Data Min Knowl Disc 27(2):259–289

  45. Nemenyi PB (1963) Distribution-free multiple comparisons. PhD thesis, Princeton University

  46. Pakhira MK, Bandyopadhyay S, Maulik U (2004) Validity index for crisp and fuzzy clusters. Pattern Recogn 37:487–501

  47. Pihur V, Datta S, Datta S (2007) Weighted rank aggregation of cluster validation measures: a Monte Carlo cross-entropy approach. Bioinformatics 23(13):1607–1615

  48. Pihur V, Datta S, Datta S (2009) Rankaggreg, an R package for weighted rank aggregation. BMC Bioinf 10(1):62

  49. Polikar R (2012) Ensemble learning. In: Ma Y, Zhang C (eds) Ensemble machine learning. Springer, Berlin, pp 1–34

  50. Rabbany R, Takaffoli M, Fagnan J, Zaiane OR, Campello RJGB (2012) Relative validity criteria for community mining algorithms. IEEE/ACM international conference on advances in social networks analysis and mining—ASONAM, pp 258–265

  51. Ratkowsky DA, Lance GN (1978) A criterion for determining the number of groups in a classification. Aust Comput J 10:115–117

  52. Rokach L (2010) Ensemble-based classifiers. Artif Intell Rev 33:1–39

  53. Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65

  54. Schalekamp F, van Zuylen A (2009) Rank aggregation: together we’re strong. In: Proceedings of the workshop on algorithm engineering and experiments (ALENEX) SIAM, New York, NY, pp 38–51

  55. Schubert E, Wojdanowski R, Zimek A, Kriegel HP (2012) On evaluation of outlier rankings and outlier scores. In: Proceedings of the 12th SIAM international conference on data mining (SDM), Anaheim, CA, pp 1047–1058

  56. Sheng W, Swift S, Zhang L, Liu X (2005) A weighted sum validity function for clustering with a hybrid niching genetic algorithm. IEEE Trans Syst Man Cybern B 35(6):1156–1167

  57. Spearman C (1904) The proof and measurement of association between two things. Am J Psychol 100(3/4):441–471

  58. Vendramin L, Campello RJGB, Hruschka ER (2009) On the comparison of relative clustering validity criteria. In: Proceedings of the 9th SIAM international conference on data mining (SDM). Sparks, NV, pp 733–744

  59. Vendramin L, Campello RJGB, Hruschka ER (2010) Relative clustering validity criteria: a comparative overview. Stat Anal Data Mining 3(4):209–235

  60. Vendramin L, Jaskowiak PA, Campello RJGB (2013) On the combination of relative clustering validity criteria. In: Proceedings of the 25th international conference on scientific and statistical database management (SSDBM), Baltimore, MD, pp 4:1–4:12

  61. Xu R, Wunsch DC II (2005) Survey of clustering algorithms. IEEE Trans Neural Netw 16:645–678

  62. Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL (2001) Model-based clustering and data transformations for gene expression data. Bioinformatics 17(10):977–987

  63. Zimek A, Campello RJGB, Sander J (2013) Ensembles for unsupervised outlier detection: challenges and research questions. ACM SIGKDD Explor 15(1):11–22

  64. Zimek A, Campello RJGB, Sander J (2014) Data perturbation for outlier detection ensembles. In: Proceedings of the 26th international conference on scientific and statistical database management (SSDBM), Aalborg, Denmark, pp 13:1–13:12

Acknowledgments

This project was partially funded by Canadian Research Agency NSERC and by Brazilian Research Agencies CNPq and FAPESP. Pablo A. Jaskowiak thanks FAPESP (Grants #2012/15751-9 and #2011/04247-5). Ricardo J. G. B. Campello thanks CNPq (Grant #304137/2013-8) and FAPESP (Grants #2010/20032-6 and #2013/18698-4).

Author information

Corresponding author

Correspondence to Pablo A. Jaskowiak.

Cite this article

Jaskowiak, P.A., Moulavi, D., Furtado, A.C.S. et al. On strategies for building effective ensembles of relative clustering validity criteria. Knowl Inf Syst 47, 329–354 (2016). https://doi.org/10.1007/s10115-015-0851-6

  • DOI: https://doi.org/10.1007/s10115-015-0851-6