Effects of resampling method and adaptation on clustering ensemble efficacy

Minaei-Bidgoli, Behrouz; Parvin, Hamid; Alinejad-Rokny, Hamid; Alizadeh, Hosein; Punch, William F.

doi:10.1007/s10462-011-9295-x

Published: 27 December 2011

Artificial Intelligence Review, volume 41, pages 27–48 (2014)

Behrouz Minaei-Bidgoli¹, Hamid Parvin¹, Hamid Alinejad-Rokny²,⁴, Hosein Alizadeh¹ & William F. Punch³

Abstract

Clustering ensembles combine multiple partitions of data into a single clustering solution of better quality. Inspired by the success of supervised bagging and boosting algorithms, we propose non-adaptive and adaptive resampling schemes for the integration of multiple independent and dependent clusterings. We investigate the effectiveness of bagging techniques, comparing the efficacy of sampling with and without replacement, in conjunction with several consensus algorithms. In our adaptive approach, individual partitions in the ensemble are sequentially generated by clustering specially selected subsamples of the given dataset. The sampling probability for each data point dynamically depends on the consistency of its previous assignments in the ensemble. New subsamples are then drawn to increasingly focus on the problematic regions of the input feature space. A measure of data point clustering consistency is therefore defined to guide this adaptation. Experimental results show improved stability and accuracy for clustering structures obtained via bootstrapping, subsampling, and adaptive techniques. A meaningful consensus partition for an entire set of data points emerges from multiple clusterings of bootstraps and subsamples. Subsamples of small size can reduce computational cost and measurement complexity for many unsupervised data mining tasks with distributed sources of data. This empirical study also compares the performance of adaptive and non-adaptive clustering ensembles using different consensus functions on a number of datasets. By focusing attention on the data points with the least consistent clustering assignments, one can better approximate the inter-cluster boundaries, or at least create diversity in the boundaries, which improves clustering accuracy and convergence speed as a function of the number of partitions in the ensemble.
The comparison of adaptive and non-adaptive approaches is a new avenue for research, and this study helps to pave the way for the useful application of distributed data mining methods.
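
The resampling schemes summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 1-D k-means base clusterer, the co-association matrix, the thresholded connected-components consensus, the `consistency` score, and the weight formula `1.05 - consistency` are all assumptions chosen only to keep the example short and dependency-free. The three modes correspond to sampling with replacement (bootstrap), sampling without replacement (subsample), and weighted sampling that favours points whose earlier assignments were less consistent (adaptive).

```python
import random

def kmeans_1d(values, k, rng, iters=15):
    """Tiny 1-D k-means used as the base clusterer; returns k centroids."""
    centers = rng.sample(values, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[min(range(k), key=lambda c: abs(v - centers[c]))].append(v)
        # Keep the old centroid if a cluster received no points.
        centers = [sum(g) / len(g) if g else centers[c]
                   for c, g in enumerate(groups)]
    return centers

def consistency(i, together, seen):
    """Hypothetical consistency score in [0.5, 1] (an assumption, not the
    paper's measure): how decisively point i has been grouped with, or
    separated from, every other point it was sampled with so far."""
    scores = []
    for j in range(len(seen)):
        if j != i and seen[i][j]:
            r = together[i][j] / seen[i][j]
            scores.append(max(r, 1.0 - r))
    return sum(scores) / len(scores) if scores else 0.5

def coassociation(data, k, n_partitions, sample_size, mode, seed=0):
    """Cluster resampled data repeatedly and return the co-association matrix.

    mode is "bootstrap", "subsample", or "adaptive".
    """
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]  # times i and j shared a cluster
    seen = [[0] * n for _ in range(n)]      # times i and j were both sampled
    weights = [1.0] * n
    for _ in range(n_partitions):
        if mode == "subsample":      # without replacement
            draw = rng.sample(range(n), sample_size)
        elif mode == "bootstrap":    # with replacement, uniform
            draw = rng.choices(range(n), k=sample_size)
        else:                        # adaptive: with replacement, weighted
            draw = rng.choices(range(n), weights=weights, k=sample_size)
        centers = kmeans_1d([data[i] for i in draw], k, rng)
        # Only the points that were actually sampled receive a label.
        labels = {i: min(range(k), key=lambda c: abs(data[i] - centers[c]))
                  for i in set(draw)}
        for i in labels:
            for j in labels:
                seen[i][j] += 1
                together[i][j] += labels[i] == labels[j]
        if mode == "adaptive":
            # Less consistent (or never sampled) points get a larger weight.
            weights = [1.05 - consistency(i, together, seen) for i in range(n)]
    return [[together[i][j] / seen[i][j] if seen[i][j] else 0.0
             for j in range(n)] for i in range(n)]

def consensus(co, threshold=0.5):
    """Link points whose co-association exceeds the threshold and label the
    resulting connected components (one simple consensus function)."""
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Two well-separated synthetic 1-D groups of 30 points each.
data_rng = random.Random(1)
data = ([data_rng.gauss(0, 0.3) for _ in range(30)]
        + [data_rng.gauss(10, 0.3) for _ in range(30)])
results = {}
for mode in ("bootstrap", "subsample", "adaptive"):
    co = coassociation(data, k=2, n_partitions=30, sample_size=40, mode=mode)
    results[mode] = consensus(co)
    print(mode, "->", len(set(results[mode])), "consensus clusters")
```

Each partition labels only the points it sampled, yet the accumulated co-association matrix covers the whole dataset, which is how a consensus partition for all points can emerge from clusterings of bootstraps or subsamples. On data this cleanly separated every point is perfectly consistent, so the adaptive weights stay close to uniform; the weighting only matters when some points sit near a cluster boundary.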

References

Aeberhard S, Coomans D, de Vel O (1992) Comparison of classifiers in high dimensional settings. Technical Report no. 92-02, Department of Computer Science and Department of Mathematics and Statistics, James Cook University of North Queensland
Ayad HG, Kamel MS (2008) Cumulative voting consensus method for partitions with a variable number of clusters. IEEE Trans Pattern Anal Mach Intell 30(1)
Barthelemy J, Leclerc B (1995) The median procedure for partition. In: Cox IJ et al (eds) Partitioning data sets. AMS DIMACS series in discrete mathematics, vol 19, pp 3–34
Ben-Hur A, Elisseeff A, Guyon I (2002) A stability based method for discovering structure in clustered data. In: Pacific symposium on biocomputing, vol 7, pp 6–17
Breiman L (1996) Bagging predictors. J Mach Learn 24(2): 123–140
Breiman L (1998) Arcing classifiers. Ann Stat 26(3): 801–849
Dhillon IS, Modha DS (2000) A data-clustering algorithm on distributed memory multiprocessors. In: Proceedings of large-scale parallel KDD systems workshop, ACM SIGKDD, in large-scale parallel data mining, lecture notes in artificial intelligence, vol 1759, pp 245–260
Dixon JK (1979) Pattern recognition with partly missing data. IEEE Trans Syst Man Cybern SMC 9:617–621
Duda RO, Hart PE, Stork DG (2001) Pattern classification, 2nd edn. John Wiley & Sons, New York
Dudoit S, Fridlyand J (2003) Bagging to improve the accuracy of a clustering procedure. Bioinformatics 19(9): 1090–1099
Efron B (1979) Bootstrap methods: another look at the jackknife. Ann Stat 7: 1–26
Fern X, Brodley CE (2003) Random projection for high dimensional data clustering: a cluster ensemble approach. In: Proceedings of 20th international conference on Machine Learning, ICML 2003
Fischer B, Buhmann JM (2002) Data resampling for path based clustering. In: Van Gool L (ed) Pattern recognition – Symposium of the DAGM. Springer, LNCS, vol 2449, pp 206–214
Fischer B, Buhmann JM (2003) Path-based clustering for grouping of smooth curves and texture segmentation. IEEE Trans PAMI 25(4): 513–518
Fred ALN, Jain AK (2002) Data clustering using evidence accumulation. In: Proceedings of the 16th international conference on pattern recognition, ICPR 2002, Quebec City, pp 276–280
Fred ALN, Jain AK (2005) Combining multiple clustering using evidence accumulation. IEEE Trans Pattern Anal Mach Intell 27(6)
Frossyniotis D, Likas A, Stafylopatis A (2004) A clustering method based on boosting. Pattern Recognit Lett 25(6): 641–654
Hastie T, Tibshirani R, Friedman JH (2001) The elements of statistical learning. Springer, Berlin
Jain AK, Dubes RC (1988) Algorithms for clustering data. Prentice-Hall, Englewood Cliffs
Jain AK, Moreau JV (1987) The bootstrap approach to clustering. In: Devijver PA, Kittler J (eds) Pattern recognition theory and applications. Springer, Berlin, pp 63–71
Jiamthapthaksin R, Eick CF, Lee S (2010) GAC-GEO: a generic agglomerative clustering framework for geo-referenced datasets. Knowl Inf Syst
Levine E, Domany E (2001) Resampling method for unsupervised estimation of cluster validity. Neural Comput 13: 2573–2593
Minaei-Bidgoli B, Punch WF (2003) Using genetic algorithms for data mining optimization in an educational web-based system. GECCO: 2252–2263
Minaei-Bidgoli B, Topchy A, Punch WF (2004a) Ensembles of partitions via data resampling. In: Proceedings of international conference on information technology, ITCC 04, Las Vegas
Minaei-Bidgoli B, Topchy A, Punch WF (2004b) A comparison of resampling methods for clustering ensembles. In: Proceedings of conference on machine learning methods technology and application, MLMTA 04, Las Vegas
Mohammadi M, Alizadeh H, Minaei-Bidgoli B (2008) Neural network ensembles using clustering ensemble and genetic algorithm. In: Proceedings of international conference on convergence and hybrid information technology, ICCIT08, 11–13 Nov 2008, published by IEEE CS, Busan, Korea
Monti S, Tamayo P, Mesirov J, Golub T (2003) Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. J Mach Learn 52(1)
Odewahn SC, Stockwell EB, Pennington RL, Humphreys RM, Zumach WA (1992) Automated star/galaxy discrimination with neural networks. Astron J 103: 308–331
Park BH, Kargupta H (2003) Distributed data mining. In: Ye N (ed) The handbook of data mining. Lawrence Erlbaum Associates, Hillsdale
Parvin H, Alizadeh H, Minaei-Bidgoli B, Analoui M (2008a) CCHR: combination of classifiers using heuristic retraining. In: Proceedings of international conference on networked computing and advanced information management (NCM 2008), Korea, Sep 2008, published by IEEE CS
Parvin H, Alizadeh H, Minaei-Bidgoli B, Analoui M (2008b) An scalable method for improving the performance of classifiers in multiclass applications by pairwise classifiers and GA. In: Proceedings of international conference on networked computing and advanced information management (NCM 2008), Korea, Sep 2008, published by IEEE CS
Parvin H, Alizadeh H, Minaei-Bidgoli B (2008c) A new approach to improve the vote-based classifier selection. In: Proceedings of international conference on networked computing and advanced information management (NCM 2008), Korea, Sep 2008, published by IEEE CS
Parvin H, Alizadeh H, Moshki M, Minaei-Bidgoli B, Mozayani N (2008d) Divide & conquer classification and optimization by genetic algorithm. In: Proceedings of international conference on convergence and hybrid information technology, ICCIT08, Nov 11–13 2008, published by IEEE CS, Busan, Korea
Pfitzner D, Leibbrandt R, Powers D (2009) Characterization and evaluation of similarity measures for pairs of clusterings. Knowl Inf Syst
Roth V, Lange T, Braun M, Buhmann JM (2002) A resampling approach to cluster validation. In: Proceedings in computational statistics: 15th symposium COMPSTAT 2002. Physica-Verlag, Heidelberg, pp 123–128
Saha S, Bandyopadhyay S (2009) A new multiobjective clustering technique based on the concepts of stability and symmetry. Knowl Inf Syst
Strehl A, Ghosh J (2003) Cluster ensembles-a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3: 583–617
Tan PN, Steinbach M, Kumar V (2005) Introduction to data mining, 1st edn. Addison-Wesley, Reading
Topchy A, Jain AK, Punch WF (2003) Combining multiple weak clusterings. In: Proceedings of 3rd IEEE international conference on data mining, pp 331–338
Topchy A, Jain AK, Punch WF (2004a) A mixture model for clustering ensembles. In: Proceedings of SIAM international conference on data mining, SDM 04, pp 379–390
Topchy A, Minaei-Bidgoli B, Jain AK, Punch WF (2004b) Adaptive clustering ensembles. In: Proceedings of international conference on pattern recognition, ICPR 2004, Cambridge, UK
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Record 25(2): 103–114
Zhang B, Hsu M, Forman G (2000) Accurate recasting of parameter estimation algorithms using sufficient statistics for efficient parallel speed-up demonstrated for center-based data clustering algorithms. In: Proceedings of 4th European conference on principles and practice of knowledge discovery in databases, in principles of data mining and knowledge discovery

Author information

Authors and Affiliations

Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
Behrouz Minaei-Bidgoli, Hamid Parvin & Hosein Alizadeh
Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
Hamid Alinejad-Rokny
Department of Computer Science and Engineering, Michigan State University, 3115 Engineering Building, East Lansing, MI, 48824, USA
William F. Punch
7 Tir Street, Tirkhatir Street, Kafshgarkola Street, Imam Square, Ghaemshahr, Mazandaran, 4761764467, Iran
Hamid Alinejad-Rokny

Corresponding author

Correspondence to Hamid Alinejad-Rokny.

Additional information

This work is based on an earlier work by Minaei-Bidgoli et al. (2004b).

About this article

Cite this article

Minaei-Bidgoli, B., Parvin, H., Alinejad-Rokny, H. et al. Effects of resampling method and adaptation on clustering ensemble efficacy. Artif Intell Rev 41, 27–48 (2014). https://doi.org/10.1007/s10462-011-9295-x

Published: 27 December 2011
Issue Date: January 2014
DOI: https://doi.org/10.1007/s10462-011-9295-x
