Research article, ACM Other Conferences (SSDBM conference proceedings)
DOI: 10.1145/2484838.2484844

On the combination of relative clustering validity criteria

Published: 29 July 2013

Abstract

Many different relative clustering validity criteria exist that are very useful as quantitative measures for assessing the quality of data partitions. These criteria are endowed with particular features that may make each of them more suitable for specific classes of problems. Nevertheless, the performance of each criterion is usually unknown to the user a priori. Hence, choosing a specific criterion is not a trivial task. A possible approach to circumvent this drawback consists of combining different relative criteria in order to obtain more robust evaluations. However, this approach has so far been applied only in an ad hoc fashion; its real potential is not well understood. In this paper, we present an extensive study on the combination of relative criteria considering both synthetic and real datasets. The experiments involved 28 criteria and 4 different combination strategies applied to a varied collection of data partitions produced by 5 clustering algorithms. In total, 427,680 partitions of 972 synthetic datasets and 14,000 partitions of a collection of 400 image datasets were considered. Based on the results, we discuss the shortcomings and possible benefits of combining different relative criteria into a committee.
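
To make the idea of a committee of criteria concrete, the sketch below combines several relative criteria by averaging the ranks they assign to a set of candidate partitions. The abstract does not name the four combination strategies studied, so mean-rank aggregation is only an assumed, illustrative choice here; the function names and all score values are hypothetical and are not taken from the paper.

```python
from statistics import mean


def ranks(scores, higher_is_better=True):
    """Rank candidate partitions under one criterion (1 = best).

    Tied scores share the average of the rank positions they occupy.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next partition has the same score.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            result[order[j]] = avg_rank
        pos = end + 1
    return result


def committee_mean_rank(criteria):
    """Combine criteria by mean rank (an assumed strategy, for illustration).

    `criteria` is a list of (scores, higher_is_better) pairs, where every
    `scores` list covers the same candidate partitions in the same order.
    Returns one mean rank per partition; lower means better.
    """
    per_criterion = [ranks(s, hib) for s, hib in criteria]
    return [mean(column) for column in zip(*per_criterion)]


# Hypothetical scores of three criteria for four candidate partitions.
silhouette = [0.41, 0.62, 0.55, 0.30]             # higher is better
davies_bouldin = [1.10, 0.70, 0.65, 1.40]         # lower is better
calinski_harabasz = [120.0, 210.0, 180.0, 90.0]   # higher is better

combined = committee_mean_rank([
    (silhouette, True),
    (davies_bouldin, False),
    (calinski_harabasz, True),
])
best = min(range(len(combined)), key=combined.__getitem__)
print("mean ranks:", combined)
print("partition preferred by the committee:", best)
```

Working on ranks rather than raw values sidesteps the fact that the criteria live on different scales and differ in whether larger or smaller values are better. In this toy data the criteria disagree on the top partition, and the committee settles the disagreement by majority of rank positions.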


    Published In

    SSDBM '13: Proceedings of the 25th International Conference on Scientific and Statistical Database Management
    July 2013
    401 pages
    ISBN:9781450319218
    DOI:10.1145/2484838

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. clustering validation
    2. combinations of validity criteria
    3. relative validity criteria

    Qualifiers

    • Research-article

    Conference

    SSDBM '13

    Acceptance Rates

    Overall Acceptance Rate 56 of 146 submissions, 38%

    Cited By

    • (2022) Electricity Pattern Analysis by Clustering Domestic Load Profiles Using Discrete Wavelet Transform. Energies 15:4 (1350). DOI: 10.3390/en15041350. Online publication date: 13-Feb-2022
    • (2022) Monitoring a Bolted Vibrating Structure Using Multiple Acoustic Emission Sensors: A Benchmark. Data 7:3 (31). DOI: 10.3390/data7030031. Online publication date: 2-Mar-2022
    • (2022) The area under the ROC curve as a measure of clustering quality. Data Mining and Knowledge Discovery 36:3 (1219-1245). DOI: 10.1007/s10618-022-00829-0. Online publication date: 1-May-2022
    • (2022) Similarity-Based Unsupervised Evaluation of Outlier Detection. Similarity Search and Applications (234-248). DOI: 10.1007/978-3-031-17849-8_19. Online publication date: 5-Oct-2022
    • (2022) A Variational Bayesian Clustering Approach to Acoustic Emission Interpretation Including Soft Labels. Belief Functions: Theory and Applications (23-32). DOI: 10.1007/978-3-031-17801-6_3. Online publication date: 30-Sep-2022
    • (2021) Country transition index based on hierarchical clustering to predict next COVID-19 waves. Scientific Reports 11:1. DOI: 10.1038/s41598-021-94661-z. Online publication date: 27-Jul-2021
    • (2020) Concept Drift Detection in Data Stream Clustering and its Application on Weather Data. International Journal of Agricultural and Environmental Information Systems 11:1 (67-85). DOI: 10.4018/IJAEIS.2020010104. Online publication date: 1-Jan-2020
    • (2020) Learning in the presence of concept recurrence in data stream clustering. Journal of Big Data 7:1. DOI: 10.1186/s40537-020-00354-1. Online publication date: 15-Sep-2020
    • (2020) Weighted Cluster Ensemble Based on Partition Relevance Analysis With Reduction Step. IEEE Access 8 (113720-113736). DOI: 10.1109/ACCESS.2020.3003046. Online publication date: 2020
    • (2020) Ensembles of Cluster Validation Indices for Label Noise Filtering. Intelligent Systems: Theory, Research and Innovation in Applications (71-98). DOI: 10.1007/978-3-030-38704-4_4. Online publication date: 4-Mar-2020

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media