Finding Irregularly Shaped Clusters Based on Entropy

  • Conference paper
Advances in Data Mining. Applications and Theoretical Aspects (ICDM 2010)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6171)


Abstract

In data clustering, the more traditional algorithms rely on similarity criteria that depend on a metric distance. This dependence constrains the shape of the clusters they can find: the clusters are generally hyperspherical in the metric’s space, because each element of a cluster lies within a radial distance of a given center. In this paper we propose a clustering algorithm that does not depend on simple distance metrics and can therefore find clusters of arbitrary shape in n-dimensional space. Our proposal draws on concepts from Shannon’s information theory and evolutionary computation. Each cluster is a subset of the data in which entropy is minimized. This is a highly non-linear and usually non-convex optimization problem, which rules out traditional optimization techniques. To solve it we apply a rugged genetic algorithm (the so-called Vasconcelos’ GA). To test the efficiency of our proposal, we artificially created several data sets with known properties in three-dimensional space. The results show that our algorithm is able to find highly irregular clusters that traditional algorithms cannot. Some previous work relies on similar approaches (such as ENCLUS and CLIQUE); the differences between those approaches and ours are also discussed.
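To make the idea of an entropy-based cluster criterion concrete, here is a minimal Python sketch. It is not the authors' objective function, which the abstract does not specify. It assumes that points are discretized onto a regular grid and that a candidate labeling is scored by the size-weighted sum of per-cluster Shannon entropies. The names `cluster_entropy`, `partition_entropy`, and `bin_width` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def cluster_entropy(points, bin_width=1.0):
    """Shannon entropy (in bits) of the grid-cell distribution of `points`.

    Assumption: each point is mapped to the grid cell that contains it,
    and cell frequencies are used as probabilities.
    """
    cells = Counter(
        tuple(math.floor(x / bin_width) for x in p) for p in points
    )
    n = sum(cells.values())
    return -sum((c / n) * math.log2(c / n) for c in cells.values())


def partition_entropy(points, labels, bin_width=1.0):
    """Size-weighted sum of per-cluster entropies for one labeling.

    Assumption: this weighting is an illustrative choice, not the
    objective used in the paper.
    """
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    n = len(points)
    return sum(
        len(g) / n * cluster_entropy(g, bin_width) for g in groups.values()
    )


# Two tight groups of points in three-dimensional space.
points = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (5.1, 5.1, 5.1), (5.2, 5.2, 5.2)]
good = partition_entropy(points, [0, 0, 1, 1])  # labels follow the groups
bad = partition_entropy(points, [0, 1, 0, 1])   # labels mix the groups
```

In this sketch the labeling that follows the two groups scores lower than the one that mixes them, and an optimizer such as a genetic algorithm would search over labelings for low scores. A score like this one involves no center or radius, which is what allows non-spherical clusters in principle. A usable objective would also need a constraint that rules out trivial solutions such as one point per cluster.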




Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kuri-Morales, A., Aldana-Bobadilla, E. (2010). Finding Irregularly Shaped Clusters Based on Entropy. In: Perner, P. (ed.) Advances in Data Mining. Applications and Theoretical Aspects. ICDM 2010. Lecture Notes in Computer Science, vol 6171. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14400-4_5


  • DOI: https://doi.org/10.1007/978-3-642-14400-4_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14399-1

  • Online ISBN: 978-3-642-14400-4

  • eBook Packages: Computer Science, Computer Science (R0)
