
An empirical evaluation of random transformations applied to ensemble clustering

Multimedia Tools and Applications

Abstract

Ensemble clustering techniques have improved in recent years, offering better average performance across domains and data sets. Benefits range from finding novel clusterings that are unattainable by any single clustering algorithm to providing clustering stability, so that quality is little affected by noise, outliers or sampling variations. The main clustering ensemble strategies are: combining the results of different clustering algorithms; producing different results by resampling the data, as in bagging and boosting techniques; and executing a given algorithm multiple times with different parameters or initializations. Ensemble techniques are often developed for supervised settings and later adapted to unsupervised ones. Recently, Blaser and Fryzlewicz proposed an ensemble technique for classification based on resampling and transforming the input data; specifically, they employed random rotations to significantly improve Random Forest performance. In this work, we empirically study the effects of random transformations based on rotation matrices, Mahalanobis distance and density proximity on ensemble clustering. Our experiments considered 12 data sets and 25 variations of random transformations, yielding a total of 5,580 data sets applied to 8 algorithms and evaluated by 4 clustering measures. Statistical tests identified 17 random transformations that can be applied to ensembles and to standard clustering algorithms with positive effects on cluster quality. In our results, the best performing transformations were the Mahalanobis-based ones.
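To make the two transformation families mentioned above concrete, the sketch below shows how a data matrix could be randomly rotated and how a Mahalanobis-based whitening could be applied before each ensemble member is clustered. This is a minimal Python/NumPy illustration under our own assumptions (the function names and the QR-based rotation sampler are not taken from the paper), not the authors' implementation.

```python
import numpy as np

def random_rotation(d, rng=None):
    """Sample a random d x d rotation matrix by QR-decomposing a Gaussian
    matrix (a standard uniform-rotation construction); assumption: any such
    sampler suffices for an illustrative ensemble member."""
    rng = np.random.default_rng() if rng is None else rng
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    q *= np.sign(np.diag(r))        # fix column signs so the draw is uniform
    if np.linalg.det(q) < 0:        # force det = +1 (proper rotation)
        q[:, 0] = -q[:, 0]
    return q

def mahalanobis_whitening(X):
    """Linearly transform X so that Euclidean distances in the new space
    match Mahalanobis distances in the original space (assumes the sample
    covariance is non-singular)."""
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    w = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T   # cov^{-1/2}
    return (X - X.mean(axis=0)) @ w

# Hypothetical usage: transform each ensemble member's copy of the data
# independently, cluster each copy, then combine the partitions with a
# consensus function (e.g., evidence accumulation).
# X_rot = X @ random_rotation(X.shape[1]).T
# X_mah = mahalanobis_whitening(X)
```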


Notes

  1. Experiments indicate the algorithm is stable with respect to variations in the value of h.

  2. This technique is available in the e1071 package for R.

  3. This technique is available in the hkclustering package for R.

References

  1. Arthur D, Vassilvitskii S (2007) K-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, New Orleans, Louisiana, pp 1027–1035

  2. Barthélemy J, Leclerc B (1991) The median procedure for partitions. Mathematics Subject Classification 19:3–34

  3. Ben-Hur A, Elisseeff A, Guyon I (2001) A stability based method for discovering structure in clustered data. In: Pacific symposium on biocomputing. Hawaii, vol 7, pp 6–17

  4. Blaser R, Fryzlewicz P (2016) Random rotation ensembles. J Mach Learn Res 17:1–26

  5. Breiman L (1996) Bagging predictors. Machine Learning 24(2):123–140. https://doi.org/10.1023/A:1018054314350

  6. Jain BJ (2016) Condorcet’s jury theorem for consensus clustering and its implications for diversity. arXiv:1604.07711

  7. Campello RJGB, Moulavi D, Zimek A, Sander J (2015) Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans Knowl Discov Data (TKDD) 10(1):5. https://doi.org/10.1145/2733381

  8. Conover WJ, Iman RL (1979) On multiple-comparisons procedures. Los Alamos Scientific Laboratory Tech. Rep. LA-7677-MS, pp 1–14

  9. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39(1):1–38

  10. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7(Jan):1–30

  11. Diaconis P, Shahshahani M (1994) On the eigenvalues of random matrices. Journal of Applied Probability, pp 49–62. https://doi.org/10.2307/3214948

  12. Dudoit S, Fridlyand J (2003) Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19(9):1090–1099

  13. Dunn JC (1974) Well-separated clusters and optimal fuzzy partitions. J Cybern 4(1):95–104. https://doi.org/10.1080/01969727408546059

  14. Efron B (1979) Bootstrap methods: another look at the jackknife. The Annals of Statistics, pp 1–26

  15. Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Conference on knowledge discovery and data mining. Portland, Oregon, USA, vol 96, pp 226–231

  16. Fern XZ, Brodley CE (2003) Random projection for high dimensional data clustering: a cluster ensemble approach. In: Proceedings of the 20th international conference on machine learning (ICML-03). Washington, DC, pp 186–193

  17. Fred ALN, Jain AK (2002) Data clustering using evidence accumulation. In: 16th international conference on pattern recognition, 2002. Proceedings. https://doi.org/10.1109/ICPR.2002.1047450, vol 4. IEEE, Quebec, pp 276–280

  18. Fred ALN, Jain AK (2005) Combining multiple clusterings using evidence accumulation. IEEE Trans Pattern Anal Mach Intell 27(6):835–850

  19. Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

  20. Frossyniotis D, Likas A, Stafylopatis A (2004) A clustering method based on boosting. Pattern Recogn Lett 25(6):641–654

  21. Householder AS (1958) Unitary triangularization of a nonsymmetric matrix. Journal of the ACM (JACM) 5(4):339–342. https://doi.org/10.1145/320941.320947

  22. Hubert L, Arabie P (1985) Comparing partitions. Journal of Classification 2(1):193–218. https://doi.org/10.1007/BF01908075

  23. Hartigan JA, Wong MA (1979) Algorithm AS 136: a k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics) 28(1):100–108. http://www.jstor.org/stable/2346830

  24. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Computing Surveys (CSUR) 31(3):264–323

  25. Leisch F (1999) Bagged clustering. SFB Adaptive Information Systems and Modelling in Economics and Management Science

  26. Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 19 Jul 2017

  27. Lloyd S (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2):129–137. https://doi.org/10.1109/TIT.1982.1056489

  28. Mahalanobis PC (1936) On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India 2:49–55

  29. Mehta P, Bukov M, Wang C-H, Day AGR, Richardson C, Fisher CK, Schwab DJ (2019) A high-bias, low-variance introduction to machine learning for physicists. Physics Reports. https://doi.org/10.1016/j.physrep.2019.03.001

  30. Minaei-Bidgoli B, Topchy A, Punch WF (2004) Ensembles of partitions via data resampling. In: International conference on information technology: coding and computing, 2004. Proceedings. ITCC 2004. https://doi.org/10.1109/ITCC.2004.1286629, vol 2. IEEE, Las Vegas, pp 188–192

  31. Minaei-Bidgoli B, Parvin H, Alinejad-Rokny H, Alizadeh H, Punch WF (2014) Effects of resampling method and adaptation on clustering ensemble efficacy. Artificial Intelligence Review 41(1):27–48. https://doi.org/10.1007/s10462-011-9295-x

  32. Moreau JV, Jain AK (1987) The bootstrap approach to clustering. In: Pattern recognition theory and applications. Springer, Berlin, pp 63–71

  33. Ostrovsky R, Rabani Y, Schulman LJ, Swamy C (2006) The effectiveness of Lloyd-type methods for the k-means problem. In: 47th annual IEEE symposium on foundations of computer science, 2006. FOCS ’06. Washington, DC, USA, pp 165–176

  34. Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65. https://doi.org/10.1016/0377-0427(87)90125-7

  35. Schapire RE (1990) The strength of weak learnability. Machine Learning 5(2):197–227. https://doi.org/10.1023/A:1022648800760

  36. Siersdorfer S, Sizov S (2004) Restrictive clustering and metaclustering for self-organizing document collections. In: Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval. https://doi.org/10.1145/1008992.1009032. ACM, New York, pp 226–233

  37. Silva GR, Albertini MK (2017) Using multiple clustering algorithms to generate constraint rules and create consensus clusters. In: 2017 Brazilian conference on intelligent systems (BRACIS). https://doi.org/10.1109/BRACIS.2017.78. IEEE, Uberlandia, pp 312–317

  38. Stoyanov K (2015) Hierarchical k-means clustering and its application in customer segmentation. Ph.D. thesis, University of Essex, UK

  39. Strehl A, Ghosh J (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3(Dec):583–617

  40. Strehl A, Ghosh J (2003) Cluster ensembles-a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617

  41. Topchy A, Jain AK, Punch W (2004) A mixture model for clustering ensembles. In: Proceedings of the 2004 SIAM international conference on data mining. https://doi.org/10.1137/1.9781611972740.35. SIAM, Florida, pp 379–390

  42. Topchy A, Jain AK, Punch WF (2003) Combining multiple weak clusterings. In: Third IEEE international conference on data mining, 2003. ICDM 2003. IEEE, Melbourne, pp 331–338

  43. Vendramin L, Campello RJGB, Hruschka ER (2010) Relative clustering validity criteria: a comparative overview. Statistical Analysis and Data Mining 3(4):209–235. https://doi.org/10.1002/sam.10080

  44. Witten IH, Frank E, Hall MA, Pal CJ (2016) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, CA, USA

  45. Wu J, Liu H, Xiong H, Cao J, Chen J (2015) K-means-based consensus clustering: a unified view. IEEE Trans Knowl Data Eng 27(1):155–169. https://doi.org/10.1109/TKDE.2014.2316512

  46. Yu Z, Luo P, You J, Wong HS, Leung H, Wu S, Zhang J, Han G (2016) Incremental semi-supervised clustering ensemble for high dimensional data clustering. IEEE Trans Knowl Data Eng 28(3):701–714

Acknowledgments

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, and by the CAPES-PrInt internationalization funding program.

Author information

Corresponding author

Correspondence to Gabriel Damasceno Rodrigues.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Rodrigues, G.D., Albertini, M.K. & Yang, X. An empirical evaluation of random transformations applied to ensemble clustering. Multimed Tools Appl 79, 34253–34285 (2020). https://doi.org/10.1007/s11042-020-08947-x
