On combining multiple clusterings: an overview and a new perspective

Applied Intelligence

Abstract

Many problems can be reduced to the problem of combining multiple clusterings. In this paper, we first summarize the application scenarios in which multiple clusterings must be combined and offer a new perspective that treats the problem as one of categorical clustering. We then show the connections between various consensus and clustering criteria and discuss complexity results for the problem. Finally, we propose a new method to determine the final clustering. Experiments on kinship terms and on clustering popular music from heterogeneous feature sets show the effectiveness of combining multiple clusterings.
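
To make the categorical-clustering perspective concrete, the sketch below treats each input clustering as one categorical attribute, so that every object becomes a row of cluster labels. It then groups objects that share a label in a majority of the clusterings. This is only an illustrative co-association heuristic under assumed names (`label_matrix`, `co_association`, `majority_consensus`, and the 0.5 threshold are all hypothetical); it is not the method proposed in the paper.

```python
from itertools import combinations


def label_matrix(clusterings):
    """Turn m clusterings of n objects into n categorical rows of length m."""
    n = len(clusterings[0])
    if any(len(c) != n for c in clusterings):
        raise ValueError("all clusterings must label the same objects")
    return [tuple(c[i] for c in clusterings) for i in range(n)]


def co_association(rows):
    """Fraction of clusterings in which each pair of objects shares a cluster."""
    n, m = len(rows), len(rows[0])
    agree = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        share = sum(a == b for a, b in zip(rows[i], rows[j])) / m
        agree[i][j] = agree[j][i] = share
    return agree


def majority_consensus(rows, threshold=0.5):
    """Link pairs whose agreement exceeds the threshold (an assumed cutoff),
    then return the connected components as the consensus labels."""
    n = len(rows)
    agree = co_association(rows)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if agree[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel components as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Three clusterings of six objects; label names differ across clusterings.
clusterings = [
    [0, 0, 0, 1, 1, 1],
    ["a", "a", "b", "b", "c", "c"],
    [1, 1, 1, 2, 2, 2],
]
rows = label_matrix(clusterings)
consensus = majority_consensus(rows)
```

Because only label equality within a clustering is used, the label names themselves never need to be aligned across clusterings, which is what makes the categorical view convenient.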

Corresponding author

Correspondence to Tao Li.

About this article

Cite this article

Li, T., Ogihara, M. & Ma, S. On combining multiple clusterings: an overview and a new perspective. Appl Intell 33, 207–219 (2010). https://doi.org/10.1007/s10489-009-0160-4

  • DOI: https://doi.org/10.1007/s10489-009-0160-4