ABSTRACT
High dimensionality can pose severe difficulties, widely recognized as different aspects of the curse of dimensionality. In this paper we study a new aspect of the curse pertaining to the distribution of k-occurrences, i.e., the number of times a point appears among the k nearest neighbors of other points in a data set. We show that, as dimensionality increases, this distribution becomes considerably skewed and hub points (points with very high k-occurrences) emerge. We examine the origin of this phenomenon, showing that it is an inherent property of high-dimensional vector spaces, and explore its influence on applications that rely on measuring distances in vector spaces, notably classification, clustering, and information retrieval.
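To make the k-occurrence statistic concrete, the following is a minimal sketch (not code from the paper) that computes N_k, the number of times each point appears among the k nearest neighbors of the other points, on synthetic i.i.d. uniform data, and reports the skewness of its distribution as dimensionality grows. The function name `k_occurrences`, the sample size, and the choice of k are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import skew

def k_occurrences(X, k):
    """N_k(x): how often each point of X appears among the
    k nearest neighbors of the other points (self excluded)."""
    d = cdist(X, X)                    # pairwise Euclidean distances
    np.fill_diagonal(d, np.inf)        # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]  # indices of each point's k nearest neighbors
    return np.bincount(nn.ravel(), minlength=X.shape[0])

# Illustrative experiment: on i.i.d. uniform data, the skewness of N_k
# tends to rise with dimensionality, and a few "hub" points accumulate
# very large counts, mirroring the trend described in the abstract.
rng = np.random.default_rng(0)
for dim in (3, 20, 100):
    X = rng.uniform(size=(2000, dim))
    Nk = k_occurrences(X, k=5)
    print(f"d={dim:3d}  skewness={skew(Nk):.2f}  max N_k={Nk.max()}")
```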