Abstract
Identifying clusters in a dataset is valuable, yet most existing clustering algorithms require the number of clusters as an input. This paper introduces a graphical method that computes the number of clusters, which other clustering algorithms can then use as input. Because the method is cubic in the number of input points and most databases are extremely large, it is convenient to run it on a random sample to detect the clusters faster. Speed alone is not enough, however, so a level of confidence (accuracy) is also computed for each run of the algorithm. Three different sampling algorithms were used to draw the random sample, and their efficiency was compared in terms of the number of clusters detected and the running time. In addition, central points, or spies, that cover each cluster are assigned to it according to the number of points it contains. To give a complete picture of how the clustering evolves and of the future tendencies of the clusters, a time series analysis was conducted as well. All tests were run on datasets from the UCI Machine Learning Repository and on artificially generated datasets. We present the experimental results and show the effect of the sampling algorithms and of the number of clusters.
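The paper does not name the three sampling algorithms in the abstract, but a standard way to draw a uniform random sample from a very large (or streamed) dataset in one pass is reservoir sampling (Knuth's Algorithm R). A minimal sketch, purely illustrative and not necessarily one of the three algorithms compared in the paper:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Draw a uniform random sample of k items from an iterable of
    unknown (possibly very large) size in a single pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep item i with probability k / (i + 1) by choosing a
            # random slot in [0, i]; replace only if it falls in the
            # reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

A clustering run would then operate on `reservoir_sample(points, k)` instead of the full database, trading some accuracy (hence the per-run confidence level) for a large reduction in the cubic method's input size.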
© 2003 Springer-Verlag Berlin Heidelberg
Romero, A., Krishnamoorthy, M.S. (2003). Clusterability and Centroid Approximation. In: Palade, V., Howlett, R.J., Jain, L. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2003. Lecture Notes in Computer Science(), vol 2774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45226-3_88
Print ISBN: 978-3-540-40804-8
Online ISBN: 978-3-540-45226-3