Abstract
Data clustering is a fundamental and very popular method of data analysis. Its subjective nature, however, means that different clustering algorithms or different parameter settings can produce widely varying and sometimes conflicting results. This has led to the use of clustering comparison measures to quantify the degree of similarity between alternative clusterings. Existing measures, though, can be limited in their ability to assess similarity and sometimes generate unintuitive results. They also cannot be applied to compare clusterings that contain different data points, a task that is important in scenarios such as data stream analysis. In this paper, we introduce a new clustering similarity measure, known as ADCO, which aims to address some limitations of existing measures by allowing greater flexibility of comparison through the use of density profiles to characterize a clustering. In particular, it adopts a ‘data mining style’ philosophy of clustering comparison, whereby two clusterings are considered more similar if they are likely to give rise to similar types of prediction models. Furthermore, we show that this new measure can be applied as a highly effective objective function within a new algorithm, known as MAXIMUS, for generating alternate clusterings.
References
Aggarwal CC (2003) A framework for diagnosing changes in evolving data streams. In: Proceedings of ACM SIGMOD international conference on management of data, pp 575–586
Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of the 29th international conference on very large data bases, pp 81–92
Bacardit J, Garrell JM (2004) Analysis and improvements of the adaptive discretization intervals knowledge representation. In: GECCO, vol 2, pp 726–738
Bae E, Bailey J (2006) Coala: a novel approach for the extraction of an alternate clustering of high quality and high dissimilarity. In: International conference on data mining, pp 53–62
Bae E, Bailey J, Dong G (2006) Clustering similarity comparison using density profiles. In: Australian joint conference on artificial intelligence, pp 342–351
Borg I, Groenen P (1997) Modern multidimensional scaling: theory and applications. Springer, Berlin
Caruana R, Elhawary M, Nguyen N, Smith C (2006) Meta clustering. In: International conference on data mining, pp 107–118
Chmielewski MR, Grzymala-Busse JW (1996) Global discretization of continuous attributes as preprocessing for machine learning. In: International journal of approximate reasoning, pp 294–301
Davidson I (2005a) Clustering with constraints: feasibility issues and the k-means algorithm. In: SIAM international conference on data mining
Davidson I (2005b) Agglomerative hierarchical clustering with constraints: theoretical and empirical results. In: Pacific Asia conference on knowledge discovery, pp 59–70
Davidson I, Ravi S (2006) Identifying and generating easy sets of constraints for clustering. In: Conference on artificial intelligence
Dunn J (1974) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cybern 3: 32–57
Ekman G (1963) A direct method for multidimensional ratio scaling. Psychometrika 28(1): 33–41
Estivill-Castro V (2002) Why so many clustering algorithms: a position paper. SIGKDD Explor Newsl 4(1): 65–75
Fayyad UM, Irani KB (1992) On the handling of continuous-valued attributes in decision tree generation. Mach Learn 8: 87–102
Fred A, Jain A (2003) Robust data clustering. In: Proceedings of conference on computer vision and pattern recognition, pp 128–133
Fred ALN, Jain AK (2005) Combining multiple clusterings using evidence accumulation. IEEE Trans Pattern Anal Mach Intell 27(6): 835–850
Gondek D (2004) Non-redundant data clustering. In: International conference on data mining, pp 75–82
Gondek D, Hofmann T (2003) Conditional information bottleneck clustering. In: International conference on data mining, pp 36–42
Gondek D, Hofmann T (2004) Non-redundant data clustering. In: International conference on data mining, pp 75–82
Gower JC, Legendre P (1986) Metric and dissimilarity properties of dissimilarity coefficients. J Classif 3: 5–48
Gregson RAM (1975) Psychometrics of similarity. Academic Press, San Diego
Hamers L, Hemeryck Y, Herweyers G, Janssen M, Keters H, Rousseau R, Vanhoutte A (1989) Similarity measures in scientometric research: the Jaccard index versus Salton’s cosine formula. Inf Process Manag 25(3): 315–318
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1): 193–218
Karypis G, Aggarwal R, Kumar V, Shekhar S (1997) Multilevel hypergraph partitioning: application in VLSI domain. In: Design automation conference, p 526
Kendall K (1999) A database of computer attacks for the evaluation of intrusion detection systems. Master's thesis, Massachusetts Institute of Technology
Kuhn HW (1955) The Hungarian method for the assignment problem. Naval Res Logist Q 2: 83–97
Larsen B, Aone C (1999) Fast and effective text mining using linear time document clustering. In: Proceedings of the conference on knowledge discovery and data mining, pp 16–22
Meila M (2002) Comparing clusterings. Technical Report, Department of Statistics, University of Washington
Meila M (2003) Comparing clusterings—technical report. http://citeseer.ist.psu.edu/meila02comparing.html
Meila M (2005) Comparing clusterings—an axiomatic view. In: International conference on machine learning
Meilǎ M (2005) Comparing clusterings: an axiomatic view. In: Proceedings of the 22nd international conference on Machine learning, pp 577–584
Mixed Integer Linear Programming (MILP) Solver (2007). http://lpsolve.sourceforge.net
Mirkin B (2005) Clustering for data mining: a data recovery approach. Chapman and Hall/CRC, Boca Raton
Rand W (1971a) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66: 846–850
Rand WM (1971b) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66: 622–626
Ratanamahatana C (2003) Cloni: clustering of square root of n interval discretization. Data Mining IV, Info. and Comm. Tech 29
UCI Machine Learning Repository (2008). http://archive.ics.uci.edu/ml
Richeldi M, Rossotto M (1995) Class-driven statistical discretization of continuous attributes (extended abstract). In: Proceedings of the 8th European conference on machine learning. Springer, London, UK, pp 335–338
Strehl A, Ghosh J (2003) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3: 583–617
Streilein WW, Cunningham RK, Webster SE (2001) Improved detection of low-profile probe and denial-of-service attacks. In: Proceedings of workshop on statistical and machine learning techniques in computer intrusion detection
Sung AH, Mukkamala S (2003) Identifying important features for intrusion detection using support vector machines and neural networks. In: Proceedings of the symposium on applications and the internet (SAINT), pp 209–217
Theodoridis S, Koutroumbas K (1999) Pattern recognition. Academic Press, San Diego
Tishby N, Pereira F, Bialek W (1999) The information bottleneck method. In: Allerton conference on communication, control and computing, pp 368–377
Topchy A, Jain AK (2005) Clustering ensembles: models of consensus and weak partitions. IEEE Trans Pattern Anal Mach Intell 27(12): 1866–1881
Topchy AP, Law MHC, Jain AK, Fred AL (2004a) Analysis of consensus partition in cluster ensemble. In: Proceedings of the 4th IEEE international conference on data mining, pp 225–232
Topchy A, Martin H, Law C, Jain A, Fred A (2004b) Analysis of consensus partition in cluster ensemble. In: International conference on data mining, pp 225–232
Torgo L, Soares C (1998) Dynamic discretization of continuous attributes. In: Proceedings of the 6th Ibero-American conference on AI, pp 160–169
Wallace DL (1983) Comment. J Am Stat Assoc 78(383): 569–576
Yang Y, Webb GI (2009) Discretization for naive-bayes learning: managing discretization bias and variance. Mach Learn 74(1): 39–74
Zhou D, Li J, Zha H (2005) A new mallows distance based metric for comparing clusterings. In: Proceedings of the 22nd international conference on machine learning, pp 1028–1035
Additional information
Responsible editor: Charu Aggarwal.
Part of this work appeared in a preliminary form in Bae et al. (2006). See Sect. 2 for discussion.
Cite this article
Bae, E., Bailey, J. & Dong, G. A clustering comparison measure using density profiles and its application to the discovery of alternate clusterings. Data Min Knowl Disc 21, 427–471 (2010). https://doi.org/10.1007/s10618-009-0164-z
DOI: https://doi.org/10.1007/s10618-009-0164-z