Abstract
The distribution of distances between points in a high-dimensional data set tends to look quite different from that in a low-dimensional data set. Concentration of norm is one of the phenomena from which high-dimensional data sets can suffer. It means that in high dimensions, under certain general assumptions, the distances from any point to its closest and to its farthest neighbour tend to be almost identical in relative terms. Since cluster analysis is usually based on distances, such effects must be taken into account and their influence on cluster analysis needs to be considered. This paper investigates the consequences that the special properties of high-dimensional data have for cluster analysis. We discuss questions such as when clustering in high dimensions is meaningful at all, whether clusters can be mere artifacts, and what algorithmic problems clustering methods face in high dimensions.
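The concentration effect described above can be illustrated with a small simulation. The following is only a minimal sketch under assumptions that are not taken from the chapter: points are drawn independently and uniformly from the unit hypercube, distances are Euclidean, and concentration is measured by the ratio (d_max - d_min) / d_min, often called relative contrast. The function names `relative_contrast` and `mean_contrast` are hypothetical.

```python
import math
import random


def relative_contrast(n_points, dim, rng):
    """Return (d_max - d_min) / d_min for the Euclidean distances from one
    random query point to n_points random points.

    Assumption: all coordinates are i.i.d. uniform on [0, 1].
    """
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


def mean_contrast(n_points, dim, trials=10, seed=0):
    """Average the relative contrast over several independent trials."""
    rng = random.Random(seed)
    total = sum(relative_contrast(n_points, dim, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    # Print the averaged relative contrast for increasing dimension.
    for dim in (2, 10, 100, 1000):
        print(dim, mean_contrast(100, dim))
```

Under these assumptions the printed ratio should shrink as the dimension grows, meaning that the nearest and the farthest neighbour become hard to tell apart by distance alone. Other data distributions or distance measures may behave differently.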
Notes
- 1. For fuzzy clustering, the number of clusters is usually denoted by c, so that fuzzy c-means clustering (FCM) is the common term. But for consistency reasons, we always denote the number of clusters by k.
- 2. In Fig. 10 the separation between the two clusters is chosen larger for illustration purposes.
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Klawonn, F., Höppner, F., Jayaram, B. (2015). What are Clusters in High Dimensions and are they Difficult to Find? In: Masulli, F., Petrosino, A., Rovetta, S. (eds) Clustering High-Dimensional Data. CHDD 2012. Lecture Notes in Computer Science, vol 7627. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48577-4_2
DOI: https://doi.org/10.1007/978-3-662-48577-4_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-48576-7
Online ISBN: 978-3-662-48577-4
eBook Packages: Computer Science, Computer Science (R0)