Instability results for Euclidean distance, nearest neighbor search on high dimensional Gaussian data

https://doi.org/10.1016/j.ipl.2021.106115

Highlights

  • Sufficient conditions are provided for nearest-neighbor instability to hold (to fail) on Gaussian data.

  • The data size must grow sub-exponentially (exponentially) in a certain function of the covariance matrix.

  • A strategy is provided for generalizing the results to a wider class of data distributions.

Abstract

In 1998, Beyer et al. described a nearest neighbor query as unstable if the query point has nearly identical distance from all points in the dataset. Subsequently, researchers have proven that, as data dimensionality goes to infinity, the probability of query instability approaches one for various kinds of data distributions, dataset size functions, and distance metrics. This paper addresses the problem of characterizing query instability behavior over centered Gaussian data generation distributions and Euclidean distance. Sufficient conditions are established on the covariance matrices and dataset size function under which the probability of query instability approaches one. Furthermore, conditions are also established under which the query instability probability is strictly bounded away from one for a non-vanishing set of query points.

Introduction

Nearest neighbor search on high-dimensional data is a widely studied problem, in part because commonly used distance functions can exhibit different behavior in low- versus high-dimensional spaces. To analyze this behavior, Beyer et al. [3] described a nearest neighbor query, with respect to a reference query point q_d ∈ R^d, as unstable if q_d has nearly identical distance from all points in the dataset (see Fig. 2 in [3]).

This paper addresses the problem raised in [5], namely, characterizing query instability behavior over centered Gaussian data generation distributions and Euclidean distance. Sufficient conditions are established on the covariance matrices and dataset size function under which the probability of query instability approaches one. Furthermore, conditions are also established under which the query instability probability is strictly bounded away from one for a non-vanishing set of query points. The focus of this paper is on the theoretical behavior of the so-called 'curse of dimensionality'.

Given x, y ∈ R^d, the Euclidean distance between x and y is denoted ‖x − y‖, where ‖·‖ is the two-norm. Given n(·): N → N, a d-dimensional dataset of size n(d) is represented by i.i.d. random vectors X_1, …, X_{n(d)} ~ N[0, Σ_d], the d-variate normal distribution with mean zero and covariance matrix Σ_d. This matrix has orthogonal decomposition Σ_d = V_d Λ_d V_d^T, with the diagonal entries of Λ_d denoted λ_1^d ≥ λ_2^d ≥ … ≥ λ_d^d > 0. A sequence of distributions {N[0, Σ_d]}_{d=1}^∞ and a dataset size function n(d) admit nearest neighbor instability with respect to Euclidean distance if, for any ε > 0 and any sequence of query points {q_d ∈ R^d}_{d=1}^∞,

lim_{d→∞} Pr[ max_{1≤i≤n(d)} ‖X_i − q_d‖ ≤ (1 + ε) · min_{1≤i≤n(d)} ‖X_i − q_d‖ ] = 1.
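The definition can be illustrated numerically. The following sketch (not from the paper; it assumes the isotropic case Σ_d = I_d and the query point q_d = 0, chosen purely for illustration) estimates the instability probability by Monte Carlo and shows it rising toward one as d grows while n stays fixed:

```python
import numpy as np

def instability_prob(d, n, eps=0.5, trials=200, seed=0):
    """Monte Carlo estimate of
    Pr[max_i ||X_i - q_d|| <= (1 + eps) * min_i ||X_i - q_d||]
    for n i.i.d. N(0, I_d) points, with query point q_d = 0."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((n, d))    # rows X_i ~ N(0, I_d)
        dist = np.linalg.norm(X, axis=1)   # ||X_i - q_d|| with q_d = 0
        if dist.max() <= (1 + eps) * dist.min():
            hits += 1
    return hits / trials

for d in (2, 20, 200, 2000):
    print(d, instability_prob(d, n=50))
```

In low dimension the minimum distance over 50 points can be far smaller than the maximum, so the event rarely holds; in high dimension the distances concentrate around √d and the estimate approaches one.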

On the other hand, {N[0, Σ_d]}_{d=1}^∞ and n(d) fail to admit nearest neighbor instability with respect to Euclidean distance if there exist ε_0 > 0, α_0 > 0, ψ_0 < 1, and a sequence of query sets {Q_d ⊆ R^d}_{d=1}^∞ such that, for all sufficiently large d:

(a) Pr[X_1 ∈ Q_d] ≥ α_0,

(b) Pr[ max_{1≤i≤n(d)} ‖X_i − q_d‖ ≤ (1 + ε_0) · min_{1≤i≤n(d)} ‖X_i − q_d‖ ] ≤ ψ_0 for all q_d ∈ Q_d.

Theorem 1

Assume (Σ_{i=1}^d λ_i^d) / λ_1^d → ∞ as d → ∞.

(1) Assume n(d) grows slower than exp( (Σ_{i=1}^d λ_i^d) / λ_1^d ). Then {N[0, Σ_d]}_{d=1}^∞ and n(d) admit nearest neighbor instability with respect to Euclidean distance.

(2) Assume, for all sufficiently large d, n(d) ≥ exp( (4/9) (Σ_{i=1}^d λ_i^d) / λ_1^d ). Then {N[0, Σ_d]}_{d=1}^∞ and n(d) fail to admit nearest neighbor instability.
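Both parts of Theorem 1 depend on the covariance spectrum only through the ratio (Σ_{i=1}^d λ_i^d) / λ_1^d, an "effective dimension" of Σ_d. The sketch below (example spectra chosen for illustration, not from the paper) contrasts spectra for which this ratio diverges with one for which it stays bounded, so that the theorem's hypothesis fails:

```python
import numpy as np

def effective_dim(lam):
    """The ratio (sum_i lambda_i^d) / lambda_1^d appearing in Theorem 1."""
    lam = np.asarray(lam, dtype=float)
    return lam.sum() / lam.max()

for d in (10, 100, 1000):
    identity   = np.ones(d)                 # lambda_i = 1:    ratio = d
    polynomial = 1.0 / np.arange(1, d + 1)  # lambda_i = 1/i:  ratio ~ ln d
    geometric  = 0.5 ** np.arange(d)        # lambda_i = 2^-i: ratio < 2
    print(d, effective_dim(identity), effective_dim(polynomial),
          effective_dim(geometric))
```

For the identity spectrum, part (1) applies to any n(d) growing slower than e^d; for the harmonic spectrum the threshold shrinks to roughly d itself (exp of ln d); for the geometric spectrum the ratio stays below 2, the hypothesis Σλ_i^d / λ_1^d → ∞ fails, and Theorem 1 says nothing.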


Related work

For simplicity, related work is described in terms of Euclidean distance; however, the results described therein are not limited to that metric. Related work is organized into four groups.

Technical results proven in the appendix

Lemma 1

Pr[ ‖X_1‖ ≤ √( (4/9) Σ_{i=1}^d λ_i^d ) ] ≥ (1/√(2π)) · exp( −(4/(9 λ_1^d)) Σ_{i=1}^d λ_i^d ).

Lemma 2

Fix q_d ∈ R^d.

(1) E[‖X_1 − q_d‖²] = Σ_{i=1}^d λ_i^d + ‖q_d‖².

(2) Var[‖X_1 − q_d‖²] = 2 Σ_{i=1}^d (λ_i^d)² + 4 q_d^T Λ_d q_d.

(3) √( Σ_{i=1}^d λ_i^d + ‖q_d‖² ) ≥ E[‖X_1 − q_d‖] ≥ (1/(6√2)) √( Σ_{i=1}^d λ_i^d + ‖q_d‖² ).
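Parts (1) and (2) of Lemma 2 can be checked by simulation. The sketch below (spectrum and query point chosen for illustration) works in the eigenbasis of Σ_d, where the covariance is the diagonal matrix Λ_d; this is without loss of generality, since rotation by V_d preserves Euclidean distances:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
lam = np.array([3.0, 2.0, 1.5, 1.0, 0.5])  # illustrative eigenvalues of Sigma_d
q = rng.standard_normal(d)                  # an arbitrary query point

# In the eigenbasis of Sigma_d, X_i ~ N(0, Lambda_d) with Lambda_d = diag(lam).
X = rng.standard_normal((400_000, d)) * np.sqrt(lam)
sq = ((X - q) ** 2).sum(axis=1)             # ||X_i - q_d||^2

print(sq.mean(), lam.sum() + q @ q)                        # part (1)
print(sq.var(), 2 * (lam ** 2).sum() + 4 * q @ (lam * q))  # part (2)
```

Each printed pair agrees to within Monte Carlo error.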

Lemma 3

Fix ε > 0 and q_d ∈ R^d.

(1) Pr[ max_{1≤i≤n(d)} ‖X_i − q_d‖ ≤ (1 + ε) · min_{1≤i≤n(d)} ‖X_i − q_d‖ ] ≥ Pr[ |‖X_1 − q_d‖ − E[‖X_1 − q_d‖]| ≤ E[‖X_1 − q_d‖] · ε/(2 + ε) ]^{n(d)}.

(2) Pr[ max_{1≤i≤n(d)} ‖X_i − q_d‖ ≤ (1 + ε) · min_{1≤i≤n(d)} ‖X_i − q_d‖ ] ≤ Pr[ 2(1 + ε) E[‖X_{n(d)} − q_d‖] ≤ (1 + ε) ‖X_{n(d)} − q_d‖ ] + Pr[ 2(1 + ε) E[‖X_{n(d)} − q_d‖] > max_{1≤i≤n(d)−1} ‖X_i − q_d‖ ].
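The lower bound in part (1) holds because, with t = ε/(2 + ε), the event that every ‖X_i − q_d‖ lies within relative distance t of its mean forces max/min ≤ (1 + t)/(1 − t) = 1 + ε, and by independence that joint event has probability p^{n(d)} where p is the per-point concentration probability. A Monte Carlo sketch (isotropic covariance and q_d = 0, chosen for illustration) confirms the inequality numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, eps, trials = 50, 5, 0.5, 20_000
q = np.zeros(d)  # illustrative query point, isotropic case Sigma_d = I_d

# Left-hand side: Pr[max dist <= (1 + eps) * min dist] over n i.i.d. points.
X = rng.standard_normal((trials, n, d))
dist = np.linalg.norm(X - q, axis=2)
lhs = (dist.max(axis=1) <= (1 + eps) * dist.min(axis=1)).mean()

# Per-point concentration probability
# p = Pr[|dist - E dist| <= E dist * eps / (2 + eps)].
samples = np.linalg.norm(rng.standard_normal((200_000, d)) - q, axis=1)
m = samples.mean()  # Monte Carlo proxy for E||X_1 - q_d||
p = (np.abs(samples - m) <= m * eps / (2 + eps)).mean()

print(lhs, p ** n)   # the first number dominates the second
```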

Proof of Theorem 1, part (1)

Fix ε > 0 and q_d ∈ R^d. Let Z denote a

Directions for future work

One direction involves generalizing Theorem 1 to a wider class of data generation distributions, namely, those whose p.d.f. has the form f: x ∈ R^d ↦ exp(−U(x)), where U is a real-valued function satisfying the following condition: there exists a constant c > 0 such that H[U] − c·I_d is positive definite, where H[U] denotes the Hessian matrix of U. A generalization of part (1) of Theorem 1 would follow from Theorem 5.2.15 in [14]. A generalization of part (2) would be more difficult to prove.
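For the Gaussian case itself this condition is easy to verify: with U(x) = x^T Σ^{-1} x / 2 (up to an additive normalizing constant), the Hessian is H[U] = Σ^{-1}, so any c below 1/λ_1, the reciprocal of the largest eigenvalue of Σ, works. A small sketch (spectrum chosen for illustration):

```python
import numpy as np

def strong_logconcavity_constant(hess_U):
    """Largest c with H[U] - c * I_d positive semidefinite: the smallest
    eigenvalue of the (here constant) Hessian of the potential U."""
    return np.linalg.eigvalsh(hess_U).min()

# Centered Gaussian: U(x) = x^T Sigma^{-1} x / 2 up to an additive constant,
# so H[U] = Sigma^{-1} and c = 1 / (largest eigenvalue of Sigma).
lam = np.array([4.0, 2.0, 1.0])   # illustrative spectrum of Sigma_d
c = strong_logconcavity_constant(np.linalg.inv(np.diag(lam)))
print(c)   # 1 / 4
```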

Another direction for

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Approved for Public Release; Distribution Unlimited. Public Release Case Number 20-1792.

1. The author's affiliation with The MITRE Corporation is provided for identification purposes only, and is not intended to convey or imply MITRE's concurrence with, or support for, the positions, opinions, or viewpoints expressed by the author. ©2020 The MITRE Corporation. All rights reserved.
