Abstract
Identifying clusters in a dataset is valuable, yet most existing clustering algorithms require the number of clusters as an input. This paper introduces a graphical method that computes the number of clusters, which other clustering algorithms can then use as input. Because the method is cubic in the number of input points and most databases are extremely large, it is convenient to run it on a random sample to detect the clusters faster. Speed alone is not enough, however, so a level of confidence (accuracy) is also computed for each run of the algorithm. Three different sampling algorithms were used to draw the random sample, and their efficiency was compared in terms of the number of clusters detected and the running time. In addition, central points, or spies, that cover each cluster are assigned to it according to the number of points it contains. To give a complete picture of how the clustering evolves and of the future tendencies of the clusters, a time series analysis was conducted as well. All tests were run on datasets from the UCI Machine Learning Repository and on artificially generated datasets. We present the experimental results and show the effect of the sampling algorithms and of the number of clusters.
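The paper does not name the three sampling algorithms in the abstract, but a standard way to draw a uniform random sample from a very large (or streamed) dataset in one pass is reservoir sampling (Knuth's Algorithm R). A minimal sketch, purely illustrative and not necessarily one of the three algorithms compared in the paper:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Draw a uniform random sample of k items from an iterable of
    unknown (possibly very large) size in a single pass (Algorithm R)."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep item i with probability k / (i + 1) by choosing a
            # random slot in [0, i]; replace only if it falls in the
            # reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

A clustering run would then operate on `reservoir_sample(points, k)` instead of the full database, trading some accuracy (hence the per-run confidence level) for a large reduction in the cubic method's input size.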
© 2003 Springer-Verlag Berlin Heidelberg
Romero, A., Krishnamoorthy, M.S. (2003). Clusterability and Centroid Approximation. In: Palade, V., Howlett, R.J., Jain, L. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2003. Lecture Notes in Computer Science(), vol 2774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45226-3_88
Print ISBN: 978-3-540-40804-8
Online ISBN: 978-3-540-45226-3