A Data Set Oriented Approach for Clustering Algorithm Selection

Halkidi, Maria; Vazirgiannis, Michalis

doi:10.1007/3-540-44794-6_14


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2168)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery


Abstract

In recent years, the availability of huge transactional and experimental data sets, together with growing data mining requirements, has created a need for clustering algorithms that scale and can be applied in diverse domains. As a result, a variety of algorithms have been proposed; they apply to different fields and may produce different partitionings of a data set, depending on the specific clustering criterion used. Moreover, since clustering is an unsupervised process, most algorithms rely on assumptions in order to define a partitioning of a data set. In most applications, therefore, the final clustering scheme requires some sort of evaluation.

In this paper we present a clustering validity procedure that, taking into account the inherent features of a data set, evaluates the results of different clustering algorithms applied to it. A validity index, S_Dbw, is defined according to well-known clustering criteria so as to enable the selection of the algorithm that provides the best partitioning of a data set. We evaluate the reliability of our approach both theoretically and experimentally, considering three representative clustering algorithms run on synthetic and real data sets. The approach performed favorably in all studies, indicating which algorithm is suitable for the considered application.
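
The selection procedure described above can be illustrated with a minimal sketch. The abstract does not give the formula for S_Dbw, so the code below does not implement it; it swaps in a simple stand-in index (mean distance of points to their cluster centroid, divided by the smallest distance between centroids), where lower values suggest more compact and better separated clusters. The function names, the toy data, and the labels "algorithm_a" and "algorithm_b" are all hypothetical and serve only to show the general shape of index-based algorithm selection.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def compactness_separation_score(data, labels):
    """Stand-in validity index (NOT S_Dbw; an assumption for illustration).

    Returns the mean point-to-centroid distance divided by the smallest
    centroid-to-centroid distance. Lower is better.
    """
    clusters = {}
    for point, label in zip(data, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters to measure separation")

    centers = {label: centroid(pts) for label, pts in clusters.items()}
    scatter = sum(dist(p, centers[l]) for p, l in zip(data, labels)) / len(data)

    keys = list(centers)
    separation = min(
        dist(centers[a], centers[b])
        for i, a in enumerate(keys)
        for b in keys[i + 1:]
    )
    if separation == 0:
        return float("inf")  # coincident centroids: worst possible score
    return scatter / separation


def select_best_partitioning(data, candidates, index=compactness_separation_score):
    """Score each candidate labelling of the same data set and return the
    name with the lowest index value, together with all scores."""
    scores = {name: index(data, labels) for name, labels in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Toy data: two well-separated groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# Hypothetical outputs of two clustering algorithms on the same data.
candidates = {
    "algorithm_a": [0, 0, 0, 1, 1, 1],  # recovers the two groups
    "algorithm_b": [0, 1, 0, 1, 0, 1],  # mixes the two groups
}

best, scores = select_best_partitioning(data, candidates)
print(best, scores)
```

Because `select_best_partitioning` takes the index as a parameter, an implementation of S_Dbw taken from the full paper could be passed in place of the stand-in without changing the selection logic.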



Author information

Authors and Affiliations

Department of Informatics, Athens University of Economics & Business, Patision 76, 10434, Athens, Greece (Hellas)
Maria Halkidi & Michalis Vazirgiannis


Editor information

Editors and Affiliations

Department of Computer Science, Albert-Ludwigs University Freiburg, Georges Köhler-Allee, Geb. 079, 79110, Freiburg, Germany
Luc De Raedt
Inst. of Information and Computing Sciences, Dept. of Mathematics and Computer Science, University of Utrecht, Padualaan 14, de Uithof, 3508 TB Utrecht, The Netherlands
Arno Siebes


Cite this paper

Halkidi, M., Vazirgiannis, M. (2001). A Data Set Oriented Approach for Clustering Algorithm Selection. In: De Raedt, L., Siebes, A. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2001. Lecture Notes in Computer Science, vol 2168. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44794-6_14


DOI: https://doi.org/10.1007/3-540-44794-6_14
Published: 28 August 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42534-2
Online ISBN: 978-3-540-44794-8
eBook Packages: Springer Book Archive
