
Effects of resampling method and adaptation on clustering ensemble efficacy

Artificial Intelligence Review

Abstract

Clustering ensembles combine multiple partitions of data into a single clustering solution of better quality. Inspired by the success of supervised bagging and boosting algorithms, we propose non-adaptive and adaptive resampling schemes for the integration of multiple independent and dependent clusterings. We investigate the effectiveness of bagging techniques, comparing the efficacy of sampling with and without replacement, in conjunction with several consensus algorithms. In our adaptive approach, individual partitions in the ensemble are sequentially generated by clustering specially selected subsamples of the given dataset. The sampling probability for each data point dynamically depends on the consistency of its previous assignments in the ensemble, so new subsamples increasingly focus on the problematic regions of the input feature space. A measure of data point clustering consistency is defined to guide this adaptation. Experimental results show improved stability and accuracy for clustering structures obtained via bootstrapping, subsampling, and adaptive techniques. A meaningful consensus partition for an entire set of data points emerges from multiple clusterings of bootstraps and subsamples. Subsamples of small size can reduce computational cost and measurement complexity for many unsupervised data mining tasks with distributed sources of data. This empirical study also compares the performance of adaptive and non-adaptive clustering ensembles using different consensus functions on a number of datasets. By focusing attention on the data points with the least consistent clustering assignments, one can better approximate the inter-cluster boundaries, or at least create diversity in the boundaries, which improves clustering accuracy and convergence speed as a function of the number of partitions in the ensemble. The comparison of adaptive and non-adaptive approaches is a new avenue for research, and this study helps to pave the way for the useful application of distributed data mining methods.
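
The resampling schemes described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the consistency measure, the base clusterer, or the consensus function, so everything named here is an assumption made for illustration. In particular, `consistency` is a hypothetical label-free score computed from pairwise co-occurrence counts, `two_means_1d` is a toy base clusterer on scalar data, and `floor` is an invented smoothing parameter that keeps every point reachable.

```python
import random

def draw_indices(n, size, replace, rng, weights=None):
    """Bootstrap (replace=True) or uniform subsample (replace=False) of range(n)."""
    if replace:
        return rng.choices(range(n), weights=weights, k=size)
    return rng.sample(range(n), size)  # no duplicates

def update_counts(labels, together, seen):
    """Accumulate pairwise counts from one partition.

    `labels` maps each sampled point index to its cluster label."""
    for a in labels:
        for b in labels:
            if a != b:
                seen[a][b] += 1
                together[a][b] += labels[a] == labels[b]

def consistency(i, together, seen):
    """Hypothetical measure (assumption): 1 if every pairing of point i was
    always resolved the same way, 0 if every pairing was split 50/50."""
    scores = [abs(2 * together[i][j] / seen[i][j] - 1)
              for j in range(len(seen)) if seen[i][j] > 0]
    return sum(scores) / len(scores) if scores else 0.0

def sampling_probabilities(together, seen, floor=0.05):
    """Give less consistent points a higher chance of being drawn next."""
    raw = [1.0 - consistency(i, together, seen) + floor
           for i in range(len(seen))]
    total = sum(raw)
    return [w / total for w in raw]

def two_means_1d(values, iters=10):
    """Toy base clusterer (assumption): 2-means on scalars, centers start at min and max."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        labels = [int(abs(v - hi) < abs(v - lo)) for v in values]
        left = [v for v, lab in zip(values, labels) if lab == 0]
        right = [v for v, lab in zip(values, labels) if lab == 1]
        if left:
            lo = sum(left) / len(left)
        if right:
            hi = sum(right) / len(right)
    return labels

def adaptive_ensemble(data, n_partitions, sample_size, seed=0):
    """Generate partitions sequentially, re-weighting points after each one."""
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]
    seen = [[0] * n for _ in range(n)]
    probs = None  # the first sample is drawn uniformly
    for _ in range(n_partitions):
        idx = draw_indices(n, sample_size, True, rng, weights=probs)
        cluster_ids = two_means_1d([data[i] for i in idx])
        update_counts(dict(zip(idx, cluster_ids)), together, seen)
        probs = sampling_probabilities(together, seen)
    return together, seen, probs

data = [0.0, 0.1, 0.2, 0.3, 1.0, 1.7, 1.8, 1.9, 2.0]
together, seen, probs = adaptive_ensemble(data, n_partitions=20, sample_size=6)
```

Working from pairwise counts avoids having to align cluster labels across partitions, and the ratio `together[i][j] / seen[i][j]` is the kind of co-association value that a consensus function could later cluster. A non-adaptive ensemble corresponds to calling `draw_indices` with `weights=None` on every round, with `replace=True` for bootstraps or `replace=False` for subsamples.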


References

  • Aeberhard S, Coomans D, de Vel O (1992) Comparison of classifiers in high dimensional settings. Technical Report no. 92-02, Department of Computer Science and Department of Mathematics and Statistics, James Cook University of North Queensland

  • Ayad HG, Kamel MS (2008) Cumulative voting consensus method for partitions with a variable number of clusters. IEEE Trans Pattern Anal Mach Intell 30(1)

  • Barthelemy J, Leclerc B (1995) The median procedure for partition. In: Cox IJ et al (eds) Partitioning data sets. AMS DIMACS series in discrete mathematics, vol 19, pp 3–34

  • Ben-Hur A, Elisseeff A, Guyon I (2002) A stability based method for discovering structure in clustered data. In: Pacific symposium on biocomputing, vol 7, pp 6–17

  • Breiman L (1996) Bagging predictors. Mach Learn 24(2): 123–140

  • Breiman L (1998) Arcing classifiers. Ann Stat 26(3): 801–849

  • Dhillon IS, Modha DS (2000) A data-clustering algorithm on distributed memory multiprocessors. In: Proceedings of large-scale parallel KDD systems workshop, ACM SIGKDD, in large-scale parallel data mining, lecture notes in artificial intelligence, vol 1759, pp 245–260

  • Dixon JK (1979) Pattern recognition with partly missing data. IEEE Trans Syst Man Cybern SMC 9:617–621

  • Duda RO, Hart PE, Stork DG (2001) Pattern classification, 2nd edn. John Wiley & Sons, New York

  • Dudoit S, Fridlyand J (2003) Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19(9): 1090–1099

  • Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7: 1–26

  • Fern X, Brodley CE (2003) Random projection for high dimensional data clustering: a cluster ensemble approach. In: Proceedings of 20th international conference on Machine Learning, ICML 2003

  • Fischer B, Buhmann JM (2002) Data resampling for path based clustering. In: Van Gool L (ed) Pattern recognition: Symposium of the DAGM. Springer, LNCS, vol 2449, pp 206–214

  • Fischer B, Buhmann JM (2003) Path-based clustering for grouping of smooth curves and texture segmentation. IEEE Trans PAMI 25(4): 513–518

  • Fred ALN, Jain AK (2002) Data clustering using evidence accumulation. In: Proceedings of the 16th international conference on pattern recognition, ICPR 2002, Quebec City, pp 276–280

  • Fred ALN, Jain AK (2005) Combining multiple clustering using evidence accumulation. IEEE Trans Pattern Anal Mach Intell 27(6)

  • Frossyniotis D, Likas A, Stafylopatis A (2004) A clustering method based on boosting. Pattern Recognit Lett 25(6): 641–654

  • Hastie T, Tibshirani R, Friedman JH (2001) The elements of statistical learning. Springer, Berlin

  • Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Englewood Cliffs

  • Jain AK, Moreau JV (1987) The bootstrap approach to clustering. In: Devijver PA, Kittler J (eds) Pattern recognition theory and applications. Springer, Berlin, pp 63–71

  • Jiamthapthaksin R, Eick CF, Lee S (2010) GAC-GEO: a generic agglomerative clustering framework for geo-referenced datasets. Knowl Inf Syst

  • Levine E, Domany E (2001) Resampling method for unsupervised estimation of cluster validity. Neural Comput 13: 2573–2593

  • Minaei-Bidgoli B, Punch WF (2003) Using genetic algorithms for data mining optimization in an educational web-based system. GECCO: 2252–2263

  • Minaei-Bidgoli B, Topchy A, Punch WF (2004a) Ensembles of partitions via data resampling. In: Proceedings of international conference on information technology, ITCC 04, Las Vegas

  • Minaei-Bidgoli B, Topchy A, Punch WF (2004b) A comparison of resampling methods for clustering ensembles. In: Proceedings of conference on machine learning methods technology and application, MLMTA 04, Las Vegas

  • Mohammadi M, Alizadeh H, Minaei-Bidgoli B (2008) Neural network ensembles using clustering ensemble and genetic algorithm. In: Proceedings of international conference on convergence and hybrid information technology, ICCIT08, 11–13 Nov 2008, published by IEEE CS, Busan, Korea

  • Monti S, Tamayo P, Mesirov J, Golub T (2003) Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach Learn 52(1)

  • Odewahn SC, Stockwell EB, Pennington RL, Humphreys RM, Zumach WA (1992) Automated star/galaxy discrimination with neural networks. Astron J 103: 308–331

  • Park BH, Kargupta H (2003) Distributed data mining. In: Ye N (ed) The handbook of data mining. Lawrence Erlbaum Associates, Hillsdale

  • Parvin H, Alizadeh H, Minaei-Bidgoli B, Analoui M (2008a) CCHR: combination of classifiers using heuristic retraining. In: Proceedings of international conference on networked computing and advanced information management (NCM 2008), Korea, Sep 2008, published by IEEE CS

  • Parvin H, Alizadeh H, Minaei-Bidgoli B, Analoui M (2008b) A scalable method for improving the performance of classifiers in multiclass applications by pairwise classifiers and GA. In: Proceedings of international conference on networked computing and advanced information management (NCM 2008), Korea, Sep 2008, published by IEEE CS

  • Parvin H, Alizadeh H, Minaei-Bidgoli B (2008c) A new approach to improve the vote-based classifier selection. In: Proceedings of international conference on networked computing and advanced information management (NCM 2008), Korea, Sep 2008, published by IEEE CS

  • Parvin H, Alizadeh H, Moshki M, Minaei-Bidgoli B, Mozayani N (2008d) Divide & conquer classification and optimization by genetic algorithm. In: Proceedings of international conference on convergence and hybrid information technology, ICCIT08, Nov 11–13 2008, published by IEEE CS, Busan, Korea

  • Pfitzner D, Leibbrandt R, Powers D (2009) Characterization and evaluation of similarity measures for pairs of clusterings. Knowl Inf Syst

  • Roth V, Lange T, Braun M, Buhmann JM (2002) A resampling approach to cluster validation. In: Proceedings in computational statistics: 15th symposium COMPSTAT 2002. Physica-Verlag, Heidelberg, pp 123–128

  • Saha S, Bandyopadhyay S (2009) A new multiobjective clustering technique based on the concepts of stability and symmetry. Knowl Inf Syst

  • Strehl A, Ghosh J (2003) Cluster ensembles: a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3: 583–617

  • Tan PN, Steinbach M, Kumar V (2005) Introduction to data mining, 1st edn. Addison-Wesley, Reading

  • Topchy A, Jain AK, Punch WF (2003) Combining multiple weak clusterings. In: Proceedings of 3rd IEEE international conference on data mining, pp 331–338

  • Topchy A, Jain AK, Punch WF (2004a) A mixture model for clustering ensembles. In: Proceedings of SIAM international conference on data mining, SDM 04, pp 379–390

  • Topchy A, Minaei-Bidgoli B, Jain AK, Punch WF (2004b) Adaptive clustering ensembles. In: Proceedings of international conference on pattern recognition, ICPR 2004, Cambridge, UK

  • Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Record 25(2): 103–114

  • Zhang B, Hsu M, Forman G (2000) Accurate recasting of parameter estimation algorithms using sufficient statistics for efficient parallel speed-up demonstrated for center-based data clustering algorithms. In: Proceedings of 4th European conference on principles and practice of knowledge discovery in databases, in principles of data mining and knowledge discovery

Author information

Corresponding author

Correspondence to Hamid Alinejad-Rokny.

Additional information

This work is based on an earlier work of Minaei-Bidgoli et al. (2004b).

About this article

Cite this article

Minaei-Bidgoli, B., Parvin, H., Alinejad-Rokny, H. et al. Effects of resampling method and adaptation on clustering ensemble efficacy. Artif Intell Rev 41, 27–48 (2014). https://doi.org/10.1007/s10462-011-9295-x
