DOI: 10.1145/1553374.1553485
research-article

Nearest neighbors in high-dimensional data: the emergence and influence of hubs

Published: 14 June 2009

Abstract

High dimensionality can pose severe difficulties, widely recognized as different aspects of the curse of dimensionality. In this paper we study a new aspect of the curse pertaining to the distribution of k-occurrences, i.e., the number of times a point appears among the k nearest neighbors of other points in a data set. We show that, as dimensionality increases, this distribution becomes considerably skewed and hub points emerge (points with very high k-occurrences). We examine the origin of this phenomenon, showing that it is an inherent property of high-dimensional vector space, and explore its influence on applications based on measuring distances in vector spaces, notably classification, clustering, and information retrieval.
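
The k-occurrence count described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes i.i.d. standard normal data, Euclidean distance, and arbitrarily chosen values for `n`, `k`, and the dimensionalities, and the helper names `k_occurrences` and `skewness` are hypothetical. It counts how often each point appears in the k-nearest-neighbor lists of the other points and reports the skewness of that count distribution.

```python
import numpy as np


def k_occurrences(X, k):
    """Count how often each point appears among the k nearest neighbors of other points."""
    n = X.shape[0]
    # Squared Euclidean distances via the Gram matrix identity.
    sq = np.sum(X * X, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is never its own neighbor
    # Column indices of the k smallest distances in each row (order within the k is irrelevant).
    knn = np.argpartition(d2, k - 1, axis=1)[:, :k]
    return np.bincount(knn.ravel(), minlength=n)


def skewness(values):
    """Standardized third central moment of a sample."""
    v = np.asarray(values, dtype=float)
    c = v - v.mean()
    return np.mean(c ** 3) / np.mean(c ** 2) ** 1.5


if __name__ == "__main__":
    rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
    n, k = 2000, 5                  # illustrative sizes, not taken from the paper
    for d in (3, 20, 100):
        X = rng.standard_normal((n, d))
        counts = k_occurrences(X, k)
        print(f"d={d:3d}  max k-occurrence={counts.max():3d}  skewness={skewness(counts):.2f}")
```

The mean k-occurrence is always exactly `k`, since every point contributes `k` neighbor slots, so only the shape of the distribution can change with dimensionality. If the abstract's claim holds for this synthetic setting, the skewness and the maximum count should grow as `d` increases, with the largest counts belonging to the hub points.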

Published In

ICML '09: Proceedings of the 26th Annual International Conference on Machine Learning
June 2009
1331 pages
ISBN:9781605585161
DOI:10.1145/1553374

Sponsors

  • NSF
  • Microsoft Research
  • MITACS

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICML '09
Sponsor:
  • Microsoft Research

Acceptance Rates

Overall Acceptance Rate 140 of 548 submissions, 26%

Article Metrics

  • Downloads (last 12 months): 40
  • Downloads (last 6 weeks): 10
Reflects downloads up to 13 Feb 2025

Cited By

  • (2024) Opinion Dynamics With Set-Based Confidence: Convergence Criteria and Periodic Solutions. IEEE Control Systems Letters, 8, 2373-2378. DOI: 10.1109/LCSYS.2024.3479275. Online publication date: 2024
  • (2024) DELVE: feature selection for preserving biological trajectories in single-cell data. Nature Communications, 15:1. DOI: 10.1038/s41467-024-46773-z. Online publication date: 29-Mar-2024
  • (2023) Representation Learning and Spectral Clustering for the Development and External Validation of Dynamic Sepsis Phenotypes: Observational Cohort Study. Journal of Medical Internet Research, 25, e45614. DOI: 10.2196/45614. Online publication date: 23-Jun-2023
  • (2023) Random forest kernel for high-dimension low sample size classification. Statistics and Computing, 34:1. DOI: 10.1007/s11222-023-10309-0. Online publication date: 20-Oct-2023
  • (2022) Bi-shifting semantic auto-encoder for zero-shot learning. Electronic Research Archive, 30:1, 140-167. DOI: 10.3934/era.2022008. Online publication date: 2022
  • (2022) A Neighborhood Framework for Resource-Lean Content Flagging. Transactions of the Association for Computational Linguistics, 10, 484-502. DOI: 10.1162/tacl_a_00472. Online publication date: 4-May-2022
  • (2022) Person Authentication using Visual Representations of Keyboard Typing Dynamics. 2022 Ninth International Conference on Social Networks Analysis, Management and Security (SNAMS), 1-6. DOI: 10.1109/SNAMS58071.2022.10062739. Online publication date: 29-Nov-2022
  • (2022) Research on End Face Defect Detection of Small Sample Workpiece Based on Meta Learning. 2022 4th International Symposium on Smart and Healthy Cities (ISHC), 37-40. DOI: 10.1109/ISHC56805.2022.00017. Online publication date: Dec-2022
  • (2022) Updated review of advances in microRNAs and complex diseases: taxonomy, trends and challenges of computational models. Briefings in Bioinformatics, 23:5. DOI: 10.1093/bib/bbac358. Online publication date: 2-Sep-2022
  • (2022) Fast Hubness-Reduced Nearest Neighbor Search for Entity Alignment in Knowledge Graphs. SN Computer Science, 3:6. DOI: 10.1007/s42979-022-01417-1. Online publication date: 1-Oct-2022
