Using Self-Similarity to Cluster Large Data Sets

Data Mining and Knowledge Discovery

Abstract

Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years. However, clustering is still a challenging task, since many published algorithms fail to scale well with the size of the data set and the number of dimensions that describe the points, to find clusters of arbitrary shape, or to deal effectively with the presence of noise. In this paper, we present a new clustering algorithm based on the self-similarity properties of data sets. Self-similarity is the property of being invariant with respect to the scale used to look at the data set. While fractals are self-similar at every scale used to look at them, many data sets exhibit self-similarity over a range of scales. Self-similarity can be measured using the fractal dimension. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will (providing the best answer possible at that point), and is incremental. We show via experiments that FC deals effectively with large data sets, high dimensionality, and noise, and is capable of recognizing clusters of arbitrary shape.
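The placement rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the box-counting dimension as the fractal dimension estimate (the abstract does not say which estimator FC uses), it assumes every cluster has already been seeded with some points, and the names FractalCluster, assign, and radii are hypothetical. Each cluster stores only the set of occupied grid cells at a few cell sizes, so a point is examined once and never revisited.

```python
import math


class FractalCluster:
    """Occupied grid cells of one cluster at several cell sizes (hypothetical helper)."""

    def __init__(self, radii):
        self.radii = list(radii)                  # cell side lengths; at least two are needed
        self.cells = [set() for _ in self.radii]  # one set of occupied cells per cell size

    @staticmethod
    def _cell(point, r):
        # Index of the grid cell of side r that contains the point.
        return tuple(math.floor(x / r) for x in point)

    def add(self, point):
        for r, occupied in zip(self.radii, self.cells):
            occupied.add(self._cell(point, r))

    def dimension(self, extra=None):
        """Box-counting estimate: least-squares slope of log N(r) against log(1/r).

        N(r) is the number of occupied cells of side r. If `extra` is given,
        the estimate is computed as if that point had already been added.
        """
        xs, ys = [], []
        for r, occupied in zip(self.radii, self.cells):
            n = len(occupied)
            if extra is not None and self._cell(extra, r) not in occupied:
                n += 1
            xs.append(math.log(1.0 / r))
            ys.append(math.log(n))  # assumes the cluster is not empty
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        den = sum((x - mx) ** 2 for x in xs)
        return num / den


def assign(point, clusters):
    """Add the point to the cluster whose estimate changes least; return its index."""
    deltas = [abs(c.dimension(extra=point) - c.dimension()) for c in clusters]
    best = min(range(len(clusters)), key=deltas.__getitem__)
    clusters[best].add(point)
    return best


# Seed two synthetic clusters: a diagonal line segment and a filled square.
radii = [1 / 2, 1 / 4, 1 / 8, 1 / 16]
line = FractalCluster(radii)
for i in range(1024):
    line.add((i / 1024, i / 1024))
square = FractalCluster(radii)
for i in range(32):
    for j in range(32):
        square.add((2 + i / 32, j / 32))
clusters = [line, square]
```

With this seeding, the line's estimate is approximately 1 and the square's approximately 2. Calling assign((0.3, 0.3), clusters) selects the line and assign((2.4, 0.7), clusters) selects the square, because each point falls in cells its own cluster already occupies and so leaves that cluster's estimate unchanged. The sketch has no noise handling; a point far from every cluster is still placed in whichever cluster changes least.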

Cite this article

Barbará, D., Chen, P. Using Self-Similarity to Cluster Large Data Sets. Data Mining and Knowledge Discovery 7, 123–152 (2003). https://doi.org/10.1023/A:1022493416690
