Abstract
Many problems can be reduced to the problem of combining multiple clusterings. In this paper, we first summarize different application scenarios of combining multiple clusterings and offer a new perspective that views the problem as a categorical clustering problem. We then show the connections between various consensus and clustering criteria and discuss complexity results for the problem. Finally, we propose a new method to determine the final clustering. Experiments on kinship terms and on clustering popular music from heterogeneous feature sets show the effectiveness of combining multiple clusterings.
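The categorical view mentioned in the abstract can be illustrated with a minimal sketch. It assumes each base clustering is given as a list of labels, one per object, so that every clustering acts as one categorical attribute of the objects. The function names (`label_matrix`, `one_hot`, `co_association`, `majority_consensus`) and the 0.5 threshold are hypothetical choices made for this illustration. The consensus step is a simple co-association (majority linking) heuristic, not the method proposed in the paper.

```python
from itertools import combinations


def label_matrix(clusterings):
    """Rows are objects, columns are base clusterings (categorical attributes)."""
    return [list(row) for row in zip(*clusterings)]


def one_hot(rows):
    """Binary encoding: one indicator column per (clustering, label) pair."""
    n_attrs = len(rows[0])
    columns = [(j, v) for j in range(n_attrs)
               for v in sorted({r[j] for r in rows})]
    return [[1 if r[j] == v else 0 for (j, v) in columns] for r in rows]


def co_association(rows):
    """Fraction of base clusterings that place objects i and k together."""
    n, m = len(rows), len(rows[0])
    agree = [[1.0] * n for _ in range(n)]
    for i, k in combinations(range(n), 2):
        share = sum(a == b for a, b in zip(rows[i], rows[k])) / m
        agree[i][k] = agree[k][i] = share
    return agree


def majority_consensus(rows, threshold=0.5):
    """Illustrative heuristic (an assumption, not the paper's method):
    link objects that are co-clustered in more than `threshold` of the
    base clusterings, then return the connected components as labels."""
    n = len(rows)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    agree = co_association(rows)
    for i, k in combinations(range(n), 2):
        if agree[i][k] > threshold:
            parent[find(i)] = find(k)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


if __name__ == "__main__":
    # Three made-up base clusterings of six objects.
    base = [
        [0, 0, 0, 1, 1, 1],
        ["a", "a", "b", "b", "c", "c"],
        [1, 1, 1, 0, 0, 0],
    ]
    rows = label_matrix(base)
    print(one_hot(rows)[0])
    print(majority_consensus(rows))
```

Under this encoding, any algorithm for clustering categorical or binary data could be applied to the rows. The majority-linking step is included only to make the sketch end with a concrete consensus partition.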
Cite this article
Li, T., Ogihara, M. & Ma, S. On combining multiple clusterings: an overview and a new perspective. Appl Intell 33, 207–219 (2010). https://doi.org/10.1007/s10489-009-0160-4
DOI: https://doi.org/10.1007/s10489-009-0160-4