Abstract
In recent years, the availability of huge transactional and experimental data sets and the resulting requirements for data mining have created a need for clustering algorithms that scale and can be applied in diverse domains. Consequently, a variety of algorithms have been proposed that apply to different fields and may produce different partitionings of the same data set, depending on the specific clustering criterion used. Moreover, since clustering is an unsupervised process, most algorithms rely on assumptions in order to define a partitioning of a data set. In most applications, therefore, the final clustering scheme requires some form of evaluation.
In this paper we present a clustering validity procedure that, taking into account the inherent features of a data set, evaluates the results of different clustering algorithms applied to it. A validity index, S_Dbw, is defined according to well-known clustering criteria so as to enable the selection of the algorithm providing the best partitioning of a data set. We evaluate the reliability of our approach both theoretically and experimentally, considering three representative clustering algorithms run on synthetic and real data sets. The index performed favorably in all studies, giving an indication of the algorithm that is suitable for the considered application.
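The selection procedure described above can be illustrated with a minimal sketch: score each candidate partitioning of the same data set with a validity index and keep the one with the best score. The abstract does not give the definition of S_Dbw, so the index below is a hypothetical stand-in (within-cluster scatter divided by the smallest squared distance between centroids), not the authors' index. The names `toy_validity_index`, `select_best`, `algo_a`, and `algo_b`, and the toy data, are assumptions made for illustration only.

```python
from statistics import mean


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(mean(coord) for coord in zip(*points))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def toy_validity_index(clusters):
    """Stand-in index (NOT S_Dbw); lower is better.

    Mean within-cluster scatter divided by the smallest squared distance
    between cluster centroids. Assumes at least two non-empty clusters
    whose centroids do not coincide.
    """
    centers = [centroid(c) for c in clusters]
    scatter = mean(
        mean(sq_dist(p, ctr) for p in c) for c, ctr in zip(clusters, centers)
    )
    separation = min(
        sq_dist(centers[i], centers[j])
        for i in range(len(centers))
        for j in range(i + 1, len(centers))
    )
    return scatter / separation


def select_best(partitionings):
    """Score every candidate partitioning and return (best_name, scores)."""
    scores = {name: toy_validity_index(cl) for name, cl in partitionings.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Two hypothetical algorithms partition the same eight 2-D points.
candidates = {
    # Groups the two visibly separate blobs.
    "algo_a": [
        [(0, 0), (0, 1), (1, 0), (1, 1)],
        [(10, 10), (10, 11), (11, 10), (11, 11)],
    ],
    # Mixes points from both blobs in each cluster.
    "algo_b": [
        [(0, 0), (0, 1), (10, 10), (10, 11)],
        [(1, 0), (1, 1), (11, 10), (11, 11)],
    ],
}

best, scores = select_best(candidates)
print(best)  # algo_a
```

In this toy case the partitioning that keeps the two blobs apart gets the lower (better) score. Swapping in a different index only requires replacing `toy_validity_index`; the selection loop stays the same.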
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Halkidi, M., Vazirgiannis, M. (2001). A Data Set Oriented Approach for Clustering Algorithm Selection. In: De Raedt, L., Siebes, A. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2001. Lecture Notes in Computer Science, vol 2168. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44794-6_14
DOI: https://doi.org/10.1007/3-540-44794-6_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42534-2
Online ISBN: 978-3-540-44794-8
eBook Packages: Springer Book Archive