A clustering comparison measure using density profiles and its application to the discovery of alternate clusterings

Bae, Eric; Bailey, James; Dong, Guozhu

doi:10.1007/s10618-009-0164-z

Published: 16 January 2010

Volume 21, pages 427–471, (2010)

Data Mining and Knowledge Discovery

Eric Bae¹,
James Bailey¹ &
Guozhu Dong²


Abstract

Data clustering is a fundamental and widely used method of data analysis. Its subjective nature, however, means that different clustering algorithms or different parameter settings can produce widely varying and sometimes conflicting results. This has led to the use of clustering comparison measures to quantify the degree of similarity between alternative clusterings. Existing measures, though, can be limited in their ability to assess similarity and sometimes generate unintuitive results. They also cannot be applied to compare clusterings that contain different data points, an activity which is important for scenarios such as data stream analysis. In this paper, we introduce a new clustering similarity measure, known as ADCO, which aims to address some limitations of existing measures by allowing greater flexibility of comparison via the use of density profiles to characterize a clustering. In particular, it adopts a ‘data mining style’ philosophy to clustering comparison, whereby two clusterings are considered to be more similar if they are likely to give rise to similar types of prediction models. Furthermore, we show that this new measure can be applied as a highly effective objective function within a new algorithm, known as MAXIMUS, for generating alternate clusterings.
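The abstract does not give the definition of ADCO, so the following is only a minimal sketch of one way a density-profile comparison could work, not the authors' formulation. It assumes equal-width binning of each attribute, the same number of clusters in both clusterings, a brute-force search over cluster pairings, and normalisation by the larger self-similarity. The names `density_profile`, `best_match_score` and `profile_similarity` are hypothetical.

```python
from itertools import permutations


def density_profile(points, labels, n_clusters, n_bins, lo, hi):
    """Count, per cluster, how many points fall in each (attribute, bin) cell."""
    n_attrs = len(points[0])
    profile = [[0] * (n_attrs * n_bins) for _ in range(n_clusters)]
    for point, label in zip(points, labels):
        for a, value in enumerate(point):
            width = (hi[a] - lo[a]) / n_bins
            # Equal-width binning (an assumption); the top edge is clamped
            # into the last bin.
            b = min(int((value - lo[a]) / width), n_bins - 1) if width > 0 else 0
            profile[label][a * n_bins + b] += 1
    return profile


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def best_match_score(p, q):
    """Largest total dot product over all one-to-one cluster pairings.

    Brute force over permutations, so only practical for a handful of clusters.
    """
    k = len(p)
    return max(
        sum(dot(p[i], q[perm[i]]) for i in range(k))
        for perm in permutations(range(k))
    )


def profile_similarity(p, q):
    """Similarity in [0, 1]; 1 means the profiles match up to cluster relabelling."""
    denom = max(best_match_score(p, p), best_match_score(q, q))
    return best_match_score(p, q) / denom if denom else 0.0


# Toy data: two attributes, each scaled to [0, 1], split into two bins.
points = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8)]
lo, hi = (0.0, 0.0), (1.0, 1.0)
a = density_profile(points, [0, 0, 1, 1], 2, 2, lo, hi)
b = density_profile(points, [1, 1, 0, 0], 2, 2, lo, hi)  # same split, labels swapped
c = density_profile(points, [0, 1, 0, 1], 2, 2, lo, hi)  # a different split
print(profile_similarity(a, b), profile_similarity(a, c))
```

Because each profile is built from bin counts rather than point identities, this kind of comparison does not require the two clusterings to cover the same data points, which is the property the abstract highlights for data stream scenarios.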

References

Aggarwal CC (2003) A framework for diagnosing changes in evolving data streams. In: Proceedings of ACM SIGMOD international conference on management of data, pp 575–586
Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of the 29th international conference on very large data bases, pp 81–92
Bacardit J, Garrell JM (2004) Analysis and improvements of the adaptive discretization intervals knowledge representation. In: GECCO, vol 2, pp 726–738
Bae E, Bailey J (2006) COALA: a novel approach for the extraction of an alternate clustering of high quality and high dissimilarity. In: International conference on data mining, pp 53–62
Bae E, Bailey J, Dong G (2006) Clustering similarity comparison using density profiles. In: Australian joint conference on artificial intelligence, pp 342–351
Borg I, Groenen P (1997) Modern multidimensional scaling: theory and applications. Springer, Berlin
Caruana R, Elhawary M, Nguyen N, Smith C (2006) Meta clustering. In: International conference on data mining, pp 107–118
Chmielewski MR, Grzymala-Busse JW (1996) Global discretization of continuous attributes as preprocessing for machine learning. In: International journal of approximate reasoning, pp 294–301
Davidson I (2005a) Clustering with constraints: feasibility issues and the k-means algorithm. In: SIAM international conference on data mining
Davidson I (2005b) Agglomerative hierarchical clustering with constraints: theoretical and empirical results. In: Pacific Asia conference on knowledge discovery, pp 59–70
Davidson I, Ravi S (2006) Identifying and generating easy sets of constraints for clustering. In: Conference on artificial intelligence
Dunn J (1974) A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. J Cybern 3: 32–57
Ekman G (1963) A direct method for multidimensional ratio scaling. Psychometrika 28(1): 33–41
Estivill-Castro V (2002) Why so many clustering algorithms: a position paper. SIGKDD Explor Newsl 4(1): 65–75
Fayyad UM, Irani KB (1992) On the handling of continuous-valued attributes in decision tree generation. Mach Learn 8: 87–102
Fred A, Jain A (2003) Robust data clustering. In: Proceedings of conference on computer vision and pattern recognition, pp 128–133
Fred ALN, Jain AK (2005) Combining multiple clusterings using evidence accumulation. IEEE Trans Pattern Anal Mach Intell 27(6): 835–850
Gondek D (2004) Non-redundant data clustering. In: International conference on data mining, pp 75–82
Gondek D, Hofmann T (2003) Conditional information bottleneck clustering. In: International conference on data mining, pp 36–42
Gondek D, Hofmann T (2004) Non-redundant data clustering. In: International conference on data mining, pp 75–82
Gower JC, Legendre P (1986) Metric and dissimilarity properties of dissimilarity coefficients. J Classif 3: 5–48
Gregson RAM (1975) Psychometrics of similarity. Academic Press, San Diego
Hamers L, Hemeryck Y, Herweyers G, Janssen M, Keters H, Rousseau R, Vanhoutte A (1989) Similarity measures in scientometric research: the Jaccard index versus Salton’s cosine formula. Inf Process Manag 25(3): 315–318
Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1): 193–218
Karypis G, Aggarwal R, Kumar V, Shekhar S (1997) Multilevel hypergraph partitioning: application in VLSI domain. In: Design automation conference, p 526
Kendall K (1999) A database of computer attacks for the evaluation of intrusion detection systems. Masters Thesis, Massachusetts Institute of Technology
Kuhn HW (1955) The Hungarian method for the assignment problem. Naval Res Logist Q 2: 83–97
Larsen B, Aone C (1999) Fast and effective text mining using linear time document clustering. In: Proceedings of the conference on knowledge discovery and data mining, pp 16–22
Meila M (2002) Comparing clusterings. Technical Report, Department of Statistics, University of Washington
Meila M (2003) Comparing clusterings—technical report. http://citeseer.ist.psu.edu/meila02comparing.html
Meila M (2005) Comparing clusterings—an axiomatic view. In: International conference on machine learning
Meilǎ M (2005) Comparing clusterings: an axiomatic view. In: Proceedings of the 22nd international conference on Machine learning, pp 577–584
Mixed Integer Linear Programming (MILP) Solver (2007). http://lpsolve.sourceforge.net
Mirkin B (2005) Clustering for data mining: a data recovery approach. Chapman and Hall/CRC, Boca Raton
Rand W (1971a) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66: 846–850
Rand WM (1971b) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66: 622–626
Ratanamahatana C (2003) CloNI: clustering of square root of n interval discretization. Data Mining IV, Info. and Comm. Tech 29
Repository U (2008) http://archive.ics.uci.edu/ml
Richeldi M, Rossotto M (1995) Class-driven statistical discretization of continuous attributes (extended abstract). In: Proceedings of the 8th European conference on machine learning. Springer, London, UK, pp 335–338
Strehl A, Ghosh J (2003) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3: 583–617
Streilein WW, Cunningham RK, Webster SE (2001) Improved detection of low-profile probe and denial-of-service attacks. In: Proceedings of workshop on statistical and machine learning techniques in computer intrusion detection
Sung AH, Mukkamala S (2003) Identifying important features for intrusion detection using support vector machines and neural networks. In: Proceedings of the symposium on applications and the internet (SAINT), pp 209–217
Theodoridis S, Koutroumbas K (1999) Pattern recognition. Academic Press, San Diego
Tishby N, Pereira F, Bialek W (1999) The information bottleneck method. Allerton Conference on Communication, Control and Computing, pp 368–377
Topchy A, Jain AK (2005) Clustering ensembles: models of consensus and weak partitions. IEEE Trans Pattern Anal Mach Intell 27(12): 1866–1881
Topchy AP, Law MHC, Jain AK, Fred AL (2004a) Analysis of consensus partition in cluster ensemble. In: Proceedings of the 4th IEEE international conference on data mining, pp 225–232
Topchy A, Martin H, Law C, Jain A, Fred A (2004b) Analysis of consensus partition in cluster ensemble. In: International conference on data mining, pp 225–232
Torgo L, Soares C (1998) Dynamic discretization of continuous attributes. In: Proceedings of the 6th Ibero-American conference on AI, pp 160–169
Wallace DL (1983) Comment. J Am Stat Assoc 78(383): 569–576
Yang Y, Webb GI (2009) Discretization for naive-Bayes learning: managing discretization bias and variance. Mach Learn 74(1): 39–74
Zhou D, Li J, Zha H (2005) A new mallows distance based metric for comparing clusterings. In: Proceedings of the 22nd international conference on machine learning, pp 1028–1035

Author information

Authors and Affiliations

NICTA Victoria Laboratory, Department of Computer Science and Software Engineering, The University of Melbourne, Parkville, Melbourne, VIC, Australia
Eric Bae & James Bailey
Department of Computer Science and Engineering, Wright State University, Dayton, OH, USA
Guozhu Dong

Corresponding author

Correspondence to James Bailey.

Additional information

Responsible editor: Charu Aggarwal.

Part of this work appeared in a preliminary form in Bae et al. (2006). See Sect. 2 for discussion.

About this article

Cite this article

Bae, E., Bailey, J. & Dong, G. A clustering comparison measure using density profiles and its application to the discovery of alternate clusterings. Data Min Knowl Disc 21, 427–471 (2010). https://doi.org/10.1007/s10618-009-0164-z

Received: 22 September 2008
Accepted: 28 December 2009
Published: 16 January 2010
Issue Date: November 2010
DOI: https://doi.org/10.1007/s10618-009-0164-z
