ABSTRACT
High dimensionality can pose severe difficulties, widely recognized as different aspects of the curse of dimensionality. In this paper we study a new aspect of the curse pertaining to the distribution of k-occurrences, i.e., the number of times a point appears among the k nearest neighbors of other points in a data set. We show that, as dimensionality increases, this distribution becomes considerably skewed and hub points (points with very high k-occurrences) emerge. We examine the origin of this phenomenon, showing that it is an inherent property of high-dimensional vector spaces, and explore its influence on applications that rely on measuring distances in vector spaces, notably classification, clustering, and information retrieval.
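To make the k-occurrence statistic concrete, the following is a minimal sketch (not code from the paper) that computes N_k, the number of times each point appears among the k nearest neighbors of the other points, on synthetic i.i.d. uniform data, and reports the skewness of its distribution as dimensionality grows. The function name `k_occurrences`, the sample size, and the choice of k are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import skew

def k_occurrences(X, k):
    """N_k(x): how often each point of X appears among the
    k nearest neighbors of the other points (self excluded)."""
    d = cdist(X, X)                    # pairwise Euclidean distances
    np.fill_diagonal(d, np.inf)        # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]  # indices of each point's k nearest neighbors
    return np.bincount(nn.ravel(), minlength=X.shape[0])

# Illustrative experiment: on i.i.d. uniform data, the skewness of N_k
# tends to rise with dimensionality, and a few "hub" points accumulate
# very large counts, mirroring the trend described in the abstract.
rng = np.random.default_rng(0)
for dim in (3, 20, 100):
    X = rng.uniform(size=(2000, dim))
    Nk = k_occurrences(X, k=5)
    print(f"d={dim:3d}  skewness={skew(Nk):.2f}  max N_k={Nk.max()}")
```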