
Clusterability and Centroid Approximation

  • Conference paper
Knowledge-Based Intelligent Information and Engineering Systems (KES 2003)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2774)

Abstract

Identifying clusters in a dataset is valuable, but most existing data clustering algorithms require the number of clusters as an input. This paper introduces a graphical method that outputs the number of clusters; once that number is known, other clustering algorithms can use it. The method's running time is cubic in the number of input data points. Because most databases are extremely large, it is convenient to run the method on a random sample so that clusters are detected faster. Speed is not the only concern, however: the level of confidence in the result matters as well, so the level of accuracy is calculated for each run of the algorithm. Three different sampling algorithms were used to choose the random sample, and their efficiency was compared in terms of the number of clusters they detect and their running time. In addition, central points, or spies, that cover the clusters are assigned to each cluster, depending on the number of points it contains. To give a complete picture of how the clustering evolves and of the clusters' future tendencies, a time series analysis was also conducted. All tests used datasets from the UCI Machine Learning Repository and artificially generated datasets. We present the experimental results and show the effect of the sampling algorithms and of the number of clusters.
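
The abstract does not name the three sampling algorithms it compares, so the following is only an illustrative sketch of one classic way to draw a uniform random sample from a large dataset in a single pass: reservoir sampling (Algorithm R). The function name `reservoir_sample` and its parameters are assumptions made for this example, not identifiers from the paper, and the technique is not claimed to be one of the three the authors used.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(points: Iterable[T], k: int,
                     seed: Optional[int] = None) -> List[T]:
    """Draw up to k items uniformly at random from a stream in one pass.

    Hypothetical helper for illustration only; the paper's own sampling
    algorithms are not described in the abstract.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, point in enumerate(points):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(point)
        else:
            # Keep the new item with probability k / (i + 1) by picking
            # a slot in [0, i] and replacing only if it falls below k.
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = point
    return reservoir


# Usage: sample 100 points out of 10,000 before running a costly method.
sample = reservoir_sample(range(10_000), 100, seed=0)
```

Because the stated cost is cubic in the number of input points, running the method on a sample of size n instead of the full N points reduces the work by a factor of roughly (N/n)^3; for a 1% sample that is about 10^6.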

© 2003 Springer-Verlag Berlin Heidelberg

Cite this paper

Romero, A., Krishnamoorthy, M.S. (2003). Clusterability and Centroid Approximation. In: Palade, V., Howlett, R.J., Jain, L. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2003. Lecture Notes in Computer Science, vol 2774. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45226-3_88

  • DOI: https://doi.org/10.1007/978-3-540-45226-3_88

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40804-8

  • Online ISBN: 978-3-540-45226-3

  • eBook Packages: Springer Book Archive
