DOI: 10.1145/1553374.1553485 · ICML Conference Proceedings · research-article

Nearest neighbors in high-dimensional data: the emergence and influence of hubs

Published: 14 June 2009

ABSTRACT

High dimensionality can pose severe difficulties, widely recognized as different aspects of the curse of dimensionality. In this paper we study a new aspect of the curse pertaining to the distribution of k-occurrences, i.e., the number of times a point appears among the k nearest neighbors of other points in a data set. We show that, as dimensionality increases, this distribution becomes considerably skewed and hub points emerge (points with very high k-occurrences). We examine the origin of this phenomenon, showing that it is an inherent property of high-dimensional vector space, and explore its influence on applications based on measuring distances in vector spaces, notably classification, clustering, and information retrieval.
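The central quantity of the abstract, the k-occurrence count N_k(x), can be sketched directly: for each point, count how often it appears among the k nearest neighbors of the other points. The following is a minimal illustration (not code from the paper); the function name, the Gaussian data, and the parameter choices are assumptions made for the example.

```python
import numpy as np

def k_occurrences(X, k):
    """For each row of X, count how often it is among the k nearest
    neighbors (Euclidean) of the other rows."""
    n = X.shape[0]
    # pairwise squared Euclidean distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)           # a point is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]     # indices of each point's k NNs
    return np.bincount(nn.ravel(), minlength=n)

rng = np.random.default_rng(0)
for dim in (3, 100):
    X = rng.standard_normal((200, dim))    # 200 i.i.d. Gaussian points
    nk = k_occurrences(X, k=5)
    # The mean of N_k is always exactly k; the paper's observation is that
    # the *spread* (skewness) of N_k tends to grow with dimensionality,
    # producing hub points with very large counts.
    print(dim, nk.mean(), nk.max())
```

Since every point contributes exactly k neighbor slots, the counts always sum to n·k; what changes with dimension is how unevenly those slots are distributed across points.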


Published in

ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning
June 2009, 1331 pages
ISBN: 9781605585161
DOI: 10.1145/1553374
Copyright © 2009 by the author(s)/owner(s).

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall acceptance rate: 140 of 548 submissions, 26%
