
An empirical evaluation of random transformations applied to ensemble clustering

Multimedia Tools and Applications

Abstract

Ensemble clustering techniques have improved in recent years, offering better average performance across domains and data sets. Benefits range from finding novel clusterings that are unattainable by any single clustering algorithm to providing clustering stability, so that quality is little affected by noise, outliers or sampling variations. The main clustering ensemble strategies are: combining the results of different clustering algorithms; producing different results by resampling the data, as in bagging and boosting techniques; and executing a given algorithm multiple times with different parameters or initializations. Ensemble techniques are often developed for supervised settings and later adapted to unsupervised ones. Recently, Blaser and Fryzlewicz proposed an ensemble technique for classification based on resampling and transforming the input data; specifically, they employed random rotations to significantly improve Random Forest performance. In this work, we empirically study the effects of random transformations based on rotation matrices, Mahalanobis distance and density proximity on ensemble clustering. Our experiments considered 12 data sets and 25 variations of random transformations, yielding a total of 5,580 data sets applied to 8 algorithms and evaluated by 4 clustering measures. Statistical tests identified 17 random transformations that can be applied to ensembles and to standard clustering algorithms with positive effects on cluster quality. In our results, the best performing transformations were the Mahalanobis-based ones.
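To make the two transformation families mentioned above concrete, the sketch below shows how a data matrix could be randomly rotated and how a Mahalanobis-based whitening could be applied before each ensemble member is clustered. This is a minimal Python/NumPy illustration under our own assumptions (the function names and the QR-based rotation sampler are not taken from the paper), not the authors' implementation.

```python
import numpy as np

def random_rotation(d, rng=None):
    """Sample a random d x d rotation matrix by QR-decomposing a Gaussian
    matrix (a standard uniform-rotation construction); assumption: any such
    sampler suffices for an illustrative ensemble member."""
    rng = np.random.default_rng() if rng is None else rng
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    q *= np.sign(np.diag(r))        # fix column signs so the draw is uniform
    if np.linalg.det(q) < 0:        # force det = +1 (proper rotation)
        q[:, 0] = -q[:, 0]
    return q

def mahalanobis_whitening(X):
    """Linearly transform X so that Euclidean distances in the new space
    match Mahalanobis distances in the original space (assumes the sample
    covariance is non-singular)."""
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    w = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T   # cov^{-1/2}
    return (X - X.mean(axis=0)) @ w

# Hypothetical usage: transform each ensemble member's copy of the data
# independently, cluster each copy, then combine the partitions with a
# consensus function (e.g., evidence accumulation).
# X_rot = X @ random_rotation(X.shape[1]).T
# X_mah = mahalanobis_whitening(X)
```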


Notes

  1. Experiments indicate the algorithm is stable with respect to variations in the value of h.

  2. This technique is available in the e1071 package for R.

  3. This technique is available in the hkclustering package for R.

References

  1. Arthur D, Vassilvitskii S (2007) K-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms. Society for Industrial and Applied Mathematics, New Orleans, Louisiana, pp 1027–1035

  2. Barthélemy J, Leclerc B (1991) The median procedure for partitions. Mathematics Subject Classification 19:3–34

  3. Ben-Hur A, Elisseeff A, Guyon I (2001) A stability based method for discovering structure in clustered data. In: Pacific symposium on biocomputing. Hawaii, vol 7, pp 6–17

  4. Blaser R, Fryzlewicz P (2016) Random rotation ensembles. J Mach Learn Res 17:1–26

  5. Breiman L (1996) Bagging predictors. Machine Learning 24(2):123–140. https://doi.org/10.1023/A:1018054314350

  6. Jain BJ (2016) Condorcet’s jury theorem for consensus clustering and its implications for diversity. arXiv:1604.07711

  7. Campello RJGB, Moulavi D, Zimek A, Sander J (2015) Hierarchical density estimates for data clustering, visualization, and outlier detection. ACM Trans Knowl Discov Data (TKDD) 10(1):5. https://doi.org/10.1145/2733381

  8. Conover WJ, Iman RL (1979) On multiple-comparisons procedures. Los Alamos Scientific Laboratory Tech. Rep. LA-7677-MS, pp 1–14

  9. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 39(1):1–38

  10. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7(Jan):1–30

  11. Diaconis P, Shahshahani M (1994) On the eigenvalues of random matrices. Journal of Applied Probability, pp 49–62. https://doi.org/10.2307/3214948

  12. Dudoit S, Fridlyand J (2003) Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19(9):1090–1099

  13. Dunn JC (1974) Well-separated clusters and optimal fuzzy partitions. J Cybern 4(1):95–104. https://doi.org/10.1080/01969727408546059

  14. Efron B (1979) Bootstrap methods: another look at the jackknife. The Annals of Statistics, pp 1–26

  15. Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Conference on knowledge discovery and data mining. Portland, Oregon, USA, vol 96, pp 226–231

  16. Fern XZ, Brodley CE (2003) Random projection for high dimensional data clustering: a cluster ensemble approach. In: Proceedings of the 20th international conference on machine learning (ICML-03). Washington, DC, pp 186–193

  17. Fred ALN, Jain AK (2002) Data clustering using evidence accumulation. In: 16th international conference on pattern recognition, 2002. Proceedings. https://doi.org/10.1109/ICPR.2002.1047450, vol 4. IEEE, Quebec, pp 276–280

  18. Fred ALN, Jain AK (2005) Combining multiple clusterings using evidence accumulation. IEEE Trans Pattern Anal Mach Intell 27(6):835–850

  19. Friedman M (1937) The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association 32(200):675–701

  20. Frossyniotis D, Likas A, Stafylopatis A (2004) A clustering method based on boosting. Pattern Recogn Lett 25(6):641–654

  21. Householder AS (1958) Unitary triangularization of a nonsymmetric matrix. Journal of the ACM (JACM) 5(4):339–342. https://doi.org/10.1145/320941.320947

  22. Hubert L, Arabie P (1985) Comparing partitions. Journal of Classification 2(1):193–218. https://doi.org/10.1007/BF01908075

  23. Hartigan JA, Wong MA (1979) Algorithm AS 136: a k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics) 28(1):100–108. http://www.jstor.org/stable/2346830

  24. Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Computing Surveys (CSUR) 31(3):264–323

  25. Leisch F (1999) Bagged clustering. SFB Adaptive Information Systems and Modelling in Economics and Management Science

  26. Lichman M (2013) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed 19 Jul 2017

  27. Lloyd S (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2):129–137. https://doi.org/10.1109/TIT.1982.1056489

  28. Mahalanobis PC (1936) On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India 2:49–55

  29. Mehta P, Bukov M, Wang C-H, Day AGR, Richardson C, Fisher CK, Schwab DJ (2019) A high-bias, low-variance introduction to machine learning for physicists. Physics Reports. https://doi.org/10.1016/j.physrep.2019.03.001

  30. Minaei-Bidgoli B, Topchy A, Punch WF (2004) Ensembles of partitions via data resampling. In: International conference on information technology: coding and computing, 2004. Proceedings. ITCC 2004. https://doi.org/10.1109/ITCC.2004.1286629, vol 2. IEEE, Las Vegas, pp 188–192

  31. Minaei-Bidgoli B, Parvin H, Alinejad-Rokny H, Alizadeh H, Punch WF (2014) Effects of resampling method and adaptation on clustering ensemble efficacy. Artificial Intelligence Review 41(1):27–48. https://doi.org/10.1007/s10462-011-9295-x

  32. Moreau JV, Jain AK (1987) The bootstrap approach to clustering. In: Pattern recognition theory and applications. Springer, Berlin, pp 63–71

  33. Ostrovsky R, Rabani Y, Schulman LJ, Swamy C (2006) The effectiveness of Lloyd-type methods for the k-means problem. In: 47th annual IEEE symposium on foundations of computer science, 2006. FOCS ’06. Washington, DC, USA, pp 165–176

  34. Rousseeuw PJ (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math 20:53–65. https://doi.org/10.1016/0377-0427(87)90125-7

  35. Schapire RE (1990) The strength of weak learnability. Machine Learning 5(2):197–227. https://doi.org/10.1023/A:1022648800760

  36. Siersdorfer S, Sizov S (2004) Restrictive clustering and metaclustering for self-organizing document collections. In: Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval. https://doi.org/10.1145/1008992.1009032. ACM, New York, pp 226–233

  37. Silva GR, Albertini MK (2017) Using multiple clustering algorithms to generate constraint rules and create consensus clusters. In: 2017 Brazilian conference on intelligent systems (BRACIS). https://doi.org/10.1109/BRACIS.2017.78. IEEE, Uberlandia, pp 312–317

  38. Stoyanov K (2015) Hierarchical k-means clustering and its application in customer segmentation. Ph.D. thesis, University of Essex, UK

  39. Strehl A, Ghosh J (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3(Dec):583–617

  40. Strehl A, Ghosh J (2003) Cluster ensembles-a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3:583–617

  41. Topchy A, Jain AK, Punch W (2004) A mixture model for clustering ensembles. In: Proceedings of the 2004 SIAM international conference on data mining. https://doi.org/10.1137/1.9781611972740.35. SIAM, Florida, pp 379–390

  42. Topchy A, Jain AK, Punch WF (2003) Combining multiple weak clusterings. In: Third IEEE international conference on data mining, 2003. ICDM 2003. IEEE, Melbourne, pp 331–338

  43. Vendramin L, Campello RJGB, Hruschka ER (2010) Relative clustering validity criteria: a comparative overview. Statistical Analysis and Data Mining 3(4):209–235. https://doi.org/10.1002/sam.10080

  44. Witten IH, Frank E, Hall MA, Pal CJ (2016) Data mining: practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, CA, USA

  45. Wu J, Liu H, Xiong H, Cao J, Chen J (2015) K-means-based consensus clustering: a unified view. IEEE Trans Knowl Data Eng 27(1):155–169. https://doi.org/10.1109/TKDE.2014.2316512

  46. Yu Z, Luo P, You J, Wong HS, Leung H, Wu S, Zhang J, Han G (2016) Incremental semi-supervised clustering ensemble for high dimensional data clustering. IEEE Trans Knowl Data Eng 28(3):701–714

Acknowledgments

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001, and by the CAPES-PrInt internationalization funding program.

Author information

Corresponding author

Correspondence to Gabriel Damasceno Rodrigues.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Rodrigues, G.D., Albertini, M.K. & Yang, X. An empirical evaluation of random transformations applied to ensemble clustering. Multimed Tools Appl 79, 34253–34285 (2020). https://doi.org/10.1007/s11042-020-08947-x
