What are Clusters in High Dimensions and are they Difficult to Find?

  • Conference paper
Clustering High--Dimensional Data (CHDD 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7627)

Abstract

The distribution of distances between points in a high-dimensional data set tends to look quite different from the distribution of distances in a low-dimensional data set. Concentration of norm is one of the phenomena from which high-dimensional data sets can suffer: under certain general assumptions, the distances from any point to its closest and to its farthest neighbour tend to become almost identical in relative terms as the dimension grows. Since cluster analysis is usually based on distances, such effects must be taken into account, and their influence on cluster analysis needs to be considered. This paper investigates the consequences that the special properties of high-dimensional data have for cluster analysis. We discuss questions such as when clustering in high dimensions is meaningful at all, whether the clusters can be mere artifacts, and what algorithmic problems clustering methods face in high dimensions.
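
The concentration effect described in the abstract can be illustrated with a small simulation. The following is a minimal sketch and is not taken from the paper: it assumes i.i.d. uniform points in the unit cube and the Euclidean distance, and the function name `relative_contrast`, the sample size, and the chosen dimensions are illustrative assumptions. It computes the relative contrast (d_max - d_min) / d_min between the farthest and the closest neighbour of a random query point.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for the Euclidean distances from one
    random query point to n_points i.i.d. uniform points in [0, 1]^dim.

    Assumption for illustration only: uniform, independent coordinates.
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance (Python 3.8+)
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # The contrast is expected to shrink as the dimension grows.
    for dim in (2, 10, 100, 1000):
        print(f"dim={dim:5d}  relative contrast={relative_contrast(dim):.3f}")
```

Under these assumptions the printed contrast is large in two dimensions and becomes small in very high dimensions, which is the sense in which the closest and the farthest neighbour become "almost identical" in relative distance. Data with genuine cluster structure or strongly dependent attributes need not behave this way.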

Notes

  1. For fuzzy clustering, the number of clusters is usually denoted by c, so that fuzzy c-means clustering (FCM) is the common term. But for consistency reasons, we always denote the number of clusters by k.

  2. In Fig. 10 the separation between the two clusters is chosen larger for illustration purposes.

Author information

Correspondence to Frank Klawonn.

Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Klawonn, F., Höppner, F., Jayaram, B. (2015). What are Clusters in High Dimensions and are they Difficult to Find?. In: Masulli, F., Petrosino, A., Rovetta, S. (eds) Clustering High--Dimensional Data. CHDD 2012. Lecture Notes in Computer Science, vol 7627. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48577-4_2

  • DOI: https://doi.org/10.1007/978-3-662-48577-4_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-48576-7

  • Online ISBN: 978-3-662-48577-4

  • eBook Packages: Computer Science, Computer Science (R0)
