Instability results for Euclidean distance, nearest neighbor search on high dimensional Gaussian data

https://doi.org/10.1016/j.ipl.2021.106115

Highlights

  • Sufficient conditions are provided for nearest-neighbor instability to hold (to fail) on Gaussian data.

  • The data size must grow sub-exponentially (exponentially) in a certain function of the covariance matrix.

  • A strategy is provided for generalizing the results to a wider class of data distributions.

Abstract

In 1998, Beyer et al. described a nearest neighbor query as unstable if the query point has nearly identical distance from all points in the dataset. Subsequently, researchers have proven that, as data dimensionality goes to infinity, the probability of query instability approaches one for various kinds of data distributions, dataset size functions, and distance metrics. This paper addresses the problem of characterizing query instability behavior over centered Gaussian data generation distributions and Euclidean distance. Sufficient conditions are established on the covariance matrices and dataset size function under which the probability of query instability approaches one. Furthermore, conditions are also established under which the query instability probability is strictly bounded away from one for a non-vanishing set of query points.

Introduction

Nearest neighbor search on high-dimensional data is a widely studied problem, in part because commonly used distance functions can exhibit different behavior in low- versus high-dimensional spaces. To analyze this behavior, Beyer et al. [3] described a nearest neighbor query, with respect to a reference query point q_d ∈ R^d, as unstable if q_d has nearly identical distance from all points in the dataset (see Fig. 2 in [3]).

This paper addresses the problem raised in [5], namely, characterizing query instability behavior over centered Gaussian data generation distributions and Euclidean distance. Sufficient conditions are established on the covariance matrices and dataset size function under which the probability of query instability approaches one. Furthermore, conditions are also established under which the query instability probability is strictly bounded away from one for a non-vanishing set of query points. The focus of this paper is on the theoretical behavior of the so-called 'curse of dimensionality'.

Given x, y ∈ R^d, the Euclidean distance between x and y is denoted ‖x − y‖, where ‖·‖ is the two-norm. Given n(·): N → N, a d-dimensional dataset of size n(d) is represented by i.i.d. random vectors X_1, …, X_{n(d)} ~ N[0, Σ_d], the d-variate normal distribution with mean zero and covariance matrix Σ_d. This matrix has orthogonal decomposition Σ_d = V_d Λ_d V_d^T, with the diagonal entries of Λ_d denoted λ_1^d ≥ λ_2^d ≥ … ≥ λ_d^d > 0. A sequence of distributions {N[0, Σ_d]}_{d=1}^∞ and a dataset size function n(d) admit nearest neighbor instability with respect to Euclidean distance if, for any ε > 0 and any sequence of query points {q_d ∈ R^d}_{d=1}^∞,

lim_{d→∞} Pr[ max_{1≤i≤n(d)} ‖X_i − q_d‖ ≤ (1 + ε) · min_{1≤i≤n(d)} ‖X_i − q_d‖ ] = 1.
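The definition can be illustrated numerically. The following sketch (not from the paper; it assumes the isotropic case Σ_d = I_d and the query point q_d = 0, chosen purely for illustration) estimates the instability probability by Monte Carlo and shows it rising toward one as d grows while n stays fixed:

```python
import numpy as np

def instability_prob(d, n, eps=0.5, trials=200, seed=0):
    """Monte Carlo estimate of
    Pr[max_i ||X_i - q_d|| <= (1 + eps) * min_i ||X_i - q_d||]
    for n i.i.d. N(0, I_d) points, with query point q_d = 0."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((n, d))    # rows X_i ~ N(0, I_d)
        dist = np.linalg.norm(X, axis=1)   # ||X_i - q_d|| with q_d = 0
        if dist.max() <= (1 + eps) * dist.min():
            hits += 1
    return hits / trials

for d in (2, 20, 200, 2000):
    print(d, instability_prob(d, n=50))
```

In low dimension the minimum distance over 50 points can be far smaller than the maximum, so the event rarely holds; in high dimension the distances concentrate around √d and the estimate approaches one.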

On the other hand, {N[0, Σ_d]}_{d=1}^∞ and n(d) fail to admit nearest neighbor instability with respect to Euclidean distance if there exist ε_0 > 0, α_0 > 0, ψ_0 < 1, and a sequence of query sets {Q_d ⊆ R^d}_{d=1}^∞ such that, for all sufficiently large d:

(a) Pr[X_1 ∈ Q_d] ≥ α_0,

(b) Pr[ max_{1≤i≤n(d)} ‖X_i − q_d‖ ≤ (1 + ε_0) · min_{1≤i≤n(d)} ‖X_i − q_d‖ ] ≤ ψ_0 for all q_d ∈ Q_d.

Theorem 1

Assume (Σ_{i=1}^d λ_i^d) / λ_1^d → ∞ as d → ∞.

(1) Assume n(d) grows slower than exp( (Σ_{i=1}^d λ_i^d) / λ_1^d ). Then {N[0, Σ_d]}_{d=1}^∞ and n(d) admit nearest neighbor instability with respect to Euclidean distance.

(2) Assume, for all sufficiently large d, n(d) ≥ exp( (4/9) (Σ_{i=1}^d λ_i^d) / λ_1^d ). Then {N[0, Σ_d]}_{d=1}^∞ and n(d) fail to admit nearest neighbor instability.
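Both parts of Theorem 1 depend on the covariance spectrum only through the ratio (Σ_{i=1}^d λ_i^d) / λ_1^d, an "effective dimension" of Σ_d. The sketch below (example spectra chosen for illustration, not from the paper) contrasts spectra for which this ratio diverges with one for which it stays bounded, so that the theorem's hypothesis fails:

```python
import numpy as np

def effective_dim(lam):
    """The ratio (sum_i lambda_i^d) / lambda_1^d appearing in Theorem 1."""
    lam = np.asarray(lam, dtype=float)
    return lam.sum() / lam.max()

for d in (10, 100, 1000):
    identity   = np.ones(d)                 # lambda_i = 1:    ratio = d
    polynomial = 1.0 / np.arange(1, d + 1)  # lambda_i = 1/i:  ratio ~ ln d
    geometric  = 0.5 ** np.arange(d)        # lambda_i = 2^-i: ratio < 2
    print(d, effective_dim(identity), effective_dim(polynomial),
          effective_dim(geometric))
```

For the identity spectrum, part (1) applies to any n(d) growing slower than e^d; for the harmonic spectrum the threshold shrinks to roughly d itself (exp of ln d); for the geometric spectrum the ratio stays below 2, the hypothesis Σλ_i^d / λ_1^d → ∞ fails, and Theorem 1 says nothing.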


Related work

For simplicity, related work is described in terms of Euclidean distance; however, the results described therein are not limited to that metric. Related work is organized into four groups.

Technical results proven in the appendix

Lemma 1

Pr[ ‖X_1‖ ≤ √( (4/9) Σ_{i=1}^d λ_i^d ) ] ≥ (1/√(2π)) · exp( −(4/(9 λ_1^d)) Σ_{i=1}^d λ_i^d ).

Lemma 2

Fix q_d ∈ R^d.

(1) E[‖X_1 − q_d‖²] = Σ_{i=1}^d λ_i^d + ‖q_d‖².

(2) Var[‖X_1 − q_d‖²] = 2 Σ_{i=1}^d (λ_i^d)² + 4 q_d^T Λ_d q_d.

(3) √( Σ_{i=1}^d λ_i^d + ‖q_d‖² ) ≥ E[‖X_1 − q_d‖] ≥ (1/(6√2)) √( Σ_{i=1}^d λ_i^d + ‖q_d‖² ).
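Parts (1) and (2) of Lemma 2 can be checked by simulation. The sketch below (spectrum and query point chosen for illustration) works in the eigenbasis of Σ_d, where the covariance is the diagonal matrix Λ_d; this is without loss of generality, since rotation by V_d preserves Euclidean distances:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
lam = np.array([3.0, 2.0, 1.5, 1.0, 0.5])  # illustrative eigenvalues of Sigma_d
q = rng.standard_normal(d)                  # an arbitrary query point

# In the eigenbasis of Sigma_d, X_i ~ N(0, Lambda_d) with Lambda_d = diag(lam).
X = rng.standard_normal((400_000, d)) * np.sqrt(lam)
sq = ((X - q) ** 2).sum(axis=1)             # ||X_i - q_d||^2

print(sq.mean(), lam.sum() + q @ q)                        # part (1)
print(sq.var(), 2 * (lam ** 2).sum() + 4 * q @ (lam * q))  # part (2)
```

Each printed pair agrees to within Monte Carlo error.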

Lemma 3

Fix ε > 0 and q_d ∈ R^d.

(1) Pr[ max_{1≤i≤n(d)} ‖X_i − q_d‖ ≤ (1 + ε) · min_{1≤i≤n(d)} ‖X_i − q_d‖ ] ≥ Pr[ |‖X_1 − q_d‖ − E[‖X_1 − q_d‖]| ≤ E[‖X_1 − q_d‖] · ε/(2 + ε) ]^{n(d)}.

(2) Pr[ max_{1≤i≤n(d)} ‖X_i − q_d‖ ≤ (1 + ε) · min_{1≤i≤n(d)} ‖X_i − q_d‖ ] ≤ Pr[ 2(1 + ε) E[‖X_{n(d)} − q_d‖] ≤ (1 + ε) ‖X_{n(d)} − q_d‖ ] + Pr[ 2(1 + ε) E[‖X_{n(d)} − q_d‖] > max_{1≤i≤n(d)−1} ‖X_i − q_d‖ ].
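The lower bound in part (1) holds because, with t = ε/(2 + ε), the event that every ‖X_i − q_d‖ lies within relative distance t of its mean forces max/min ≤ (1 + t)/(1 − t) = 1 + ε, and by independence that joint event has probability p^{n(d)} where p is the per-point concentration probability. A Monte Carlo sketch (isotropic covariance and q_d = 0, chosen for illustration) confirms the inequality numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, eps, trials = 50, 5, 0.5, 20_000
q = np.zeros(d)  # illustrative query point, isotropic case Sigma_d = I_d

# Left-hand side: Pr[max dist <= (1 + eps) * min dist] over n i.i.d. points.
X = rng.standard_normal((trials, n, d))
dist = np.linalg.norm(X - q, axis=2)
lhs = (dist.max(axis=1) <= (1 + eps) * dist.min(axis=1)).mean()

# Per-point concentration probability
# p = Pr[|dist - E dist| <= E dist * eps / (2 + eps)].
samples = np.linalg.norm(rng.standard_normal((200_000, d)) - q, axis=1)
m = samples.mean()  # Monte Carlo proxy for E||X_1 - q_d||
p = (np.abs(samples - m) <= m * eps / (2 + eps)).mean()

print(lhs, p ** n)   # the first number dominates the second
```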

Proof of Theorem 1, part (1)

Fix ε > 0 and q_d ∈ R^d. Let Z denote a

Directions for future work

One direction involves generalizing Theorem 1 to a wider class of data generation distributions, namely, those whose p.d.f. has the form f: x ∈ R^d ↦ exp(−U(x)), where U is a real-valued function satisfying the following condition: there exists a constant c > 0 such that H[U] − c·I_d is positive definite, where H[U] denotes the Hessian matrix of U. A generalization of part (1) of Theorem 1 would follow from Theorem 5.2.15 in [14]. A generalization of part (2) would be more difficult to prove.
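For the Gaussian case itself this condition is easy to verify: with U(x) = x^T Σ^{-1} x / 2 (up to an additive normalizing constant), the Hessian is H[U] = Σ^{-1}, so any c below 1/λ_1, the reciprocal of the largest eigenvalue of Σ, works. A small sketch (spectrum chosen for illustration):

```python
import numpy as np

def strong_logconcavity_constant(hess_U):
    """Largest c with H[U] - c * I_d positive semidefinite: the smallest
    eigenvalue of the (here constant) Hessian of the potential U."""
    return np.linalg.eigvalsh(hess_U).min()

# Centered Gaussian: U(x) = x^T Sigma^{-1} x / 2 up to an additive constant,
# so H[U] = Sigma^{-1} and c = 1 / (largest eigenvalue of Sigma).
lam = np.array([4.0, 2.0, 1.0])   # illustrative spectrum of Sigma_d
c = strong_logconcavity_constant(np.linalg.inv(np.diag(lam)))
print(c)   # 1 / 4
```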

Another direction for

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Approved for Public Release; Distribution Unlimited. Public Release Case Number 20-1792.

1. The author's affiliation with The MITRE Corporation is provided for identification purposes only, and is not intended to convey or imply MITRE's concurrence with, or support for, the positions, opinions, or viewpoints expressed by the author. ©2020 The MITRE Corporation. All rights reserved.
