Randomized Partition Trees for Nearest Neighbor Search

Dasgupta, Sanjoy; Sinha, Kaushik

doi:10.1007/s00453-014-9885-5

Published: 09 May 2014

Algorithmica, Volume 72, pages 237–263 (2015)

Abstract

The $k$-d tree was one of the first spatial data structures proposed for nearest neighbor search. Its efficacy is diminished in high-dimensional spaces, but several variants, with randomization and overlapping cells, have proved to be successful in practice. We analyze three such schemes. We show that the probability that they fail to find the nearest neighbor, for any data set and any query point, is directly related to a simple potential function that captures the difficulty of the point configuration. We then bound this potential function in several situations of interest: when the data are drawn from a doubling measure; when the data and query distributions are identical and are supported on a set of bounded doubling dimension; and when the data are documents from a topic model.

References

1. Ailon, N., Chazelle, B.: The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput. 39, 302–322 (2009)
2. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51(1), 117–122 (2008)
3. Arya, S., Mount, D., Netanyahu, N., Silverman, R., Wu, A.: An optimal algorithm for approximate nearest neighbor searching. J. ACM 45, 891–923 (1998)
4. Assouad, P.: Plongements lipschitziens dans ${\mathbb{R}}^n$. Bull. Soc. Math. France 111(4), 429–448 (1983)
5. Bentley, J.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)
6. Beygelzimer, A., Kakade, S., Langford, J.: Cover trees for nearest neighbor. In: Proceedings of the 23rd International Conference on Machine Learning (2006)
7. Cayton, L., Dasgupta, S.: A learning framework for nearest-neighbor search. In: Advances in Neural Information Processing Systems (2007)
8. Clarkson, K.: Nearest neighbor queries in metric spaces. Discret. Comput. Geom. 22, 63–93 (1999)
9. Clarkson, K.: Nearest-neighbor searching and metric space dimensions. In: Darrell, T., Indyk, P. (eds.) Nearest-Neighbor Methods for Learning and Vision: Theory and Practice. MIT Press, Cambridge (2005)
10. Dasgupta, S., Freund, Y.: Random projection trees and low dimensional manifolds. In: ACM Symposium on Theory of Computing, pp. 537–546 (2008)
11. Dasgupta, S., Sinha, K.: Randomized partition trees for exact nearest neighbor search. In: 26th Annual Conference on Learning Theory (2013)
12. Gupta, A., Krauthgamer, R., Lee, J.R.: Bounded geometries, fractals, and low-distortion embeddings. In: 44th Annual IEEE Symposium on Foundations of Computer Science, pp. 534–543 (2003)
13. Karger, D., Ruhl, M.: Finding nearest neighbors in growth-restricted metrics. In: ACM Symposium on Theory of Computing, pp. 741–750 (2002)
14. Kleinberg, J.: Two algorithms for nearest-neighbor search in high dimensions. In: 29th ACM Symposium on Theory of Computing (1997)
15. Krauthgamer, R., Lee, J.: Navigating nets: simple algorithms for proximity search. In: ACM-SIAM Symposium on Discrete Algorithms (2004)
16. Liu, T., Moore, A., Gray, A., Yang, K.: An investigation of practical approximate nearest neighbor algorithms. In: Advances in Neural Information Processing Systems (2004)
17. Maneewongvatana, S., Mount, D.: The analysis of a probabilistic approach to nearest neighbor searching. In: Seventh International Workshop on Algorithms and Data Structures, pp. 276–286 (2001)
18. McFee, B., Lanckriet, G.: Large-scale music similarity search with spatial trees. In: 12th Conference of the International Society for Music Retrieval (2011)
19. Stone, C.: Consistent nonparametric regression. Ann. Stat. 5, 595–645 (1977)

Acknowledgments

The authors are grateful to the National Science Foundation for support under grant IIS-1162581, and to the anonymous reviewers for their detailed feedback.

Author information

Authors and Affiliations

University of California, San Diego, CA, USA
Sanjoy Dasgupta
Wichita State University, Wichita, KS, USA
Kaushik Sinha

Corresponding author

Correspondence to Sanjoy Dasgupta.

Additional information

A preliminary abstract of this work appeared in [11].

Appendices

Appendix A: Summation Lemma

Lemma 6

Suppose that for some constants $A, B > 0$ and $d_o \ge 1$,

$$\begin{aligned} F(m) \le A \left( \frac{B}{m} \right) ^{1/d_o} \end{aligned}$$

for all integers $m \ge n_o$. Pick any $0 < \beta < 1$ and define $\ell = \log _{1/\beta } (n/n_o)$. Assume for convenience that $\ell $ is an integer. Then:

$$\begin{aligned} \sum _{i = 0}^\ell F(\beta ^i n) \le \frac{A d_o}{1-\beta } \left( \frac{B}{n_o} \right) ^{1/d_o} \end{aligned}$$

and, if $n_o \ge B(A/2)^{d_o}$,

$$\begin{aligned} \sum _{i=0}^\ell F(\beta ^i n) \ln \frac{2e}{F(\beta ^i n)} \le \frac{A d_o}{1-\beta } \left( \frac{B}{n_o} \right) ^{1/d_o} \left( \frac{1}{1-\beta } \ln \frac{1}{\beta } + \ln \frac{2e}{A} + \frac{1}{d_o} \ln \frac{n_o}{B} \right) . \end{aligned}$$

Proof

Writing the first series in reverse,

$$\begin{aligned} \sum _{i = 0}^{\ell } F(\beta ^i n) =\sum _{i=0}^\ell F\left( \frac{n_o}{\beta ^i} \right)&\le \sum _{i=0}^\ell A \left( \frac{B \beta ^i}{n_o} \right) ^{1/d_o} \\&= A \left( \frac{B}{n_o} \right) ^{1/d_o} \sum _{i=0}^\ell \beta ^{i/d_o} \\&\le \frac{A}{1-\beta ^{1/d_o}} \left( \frac{B}{n_o} \right) ^{1/d_o} \le \frac{A d_o}{1-\beta } \left( \frac{B}{n_o} \right) ^{1/d_o}. \end{aligned}$$

The last inequality is obtained by using

$$\begin{aligned} (1-x)^p \ge 1-px \quad \hbox {for } 0 < x < 1,\ p \ge 1 \end{aligned}$$

to get $(1 - (1-\beta )/d_o)^{d_o} \ge \beta $ and thus $1-\beta ^{1/d_o} \ge (1-\beta )/d_o$.
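The first bound can be sanity-checked numerically. The sketch below is only an illustration, not part of the original proof: it assumes the extremal choice $F(m) = A(B/m)^{1/d_o}$, evaluates $F$ at real (not necessarily integer) arguments, and sets $n = n_o/\beta^\ell$ so that $\ell$ is an integer. The function name `lemma6_first_bound` is hypothetical.

```python
def lemma6_first_bound(A, B, d_o, beta, n_o, ell):
    """Return (lhs, rhs) of the first inequality of Lemma 6.

    Assumption: F is taken to be the extremal function
    F(m) = A * (B / m) ** (1 / d_o), and n = n_o / beta ** ell.
    """
    n = n_o / beta ** ell

    def F(m):
        return A * (B / m) ** (1.0 / d_o)

    # Left-hand side: sum of F(beta^i * n) for i = 0, ..., ell.
    lhs = sum(F(beta ** i * n) for i in range(ell + 1))
    # Right-hand side: (A * d_o / (1 - beta)) * (B / n_o)^(1 / d_o).
    rhs = A * d_o / (1.0 - beta) * (B / n_o) ** (1.0 / d_o)
    return lhs, rhs


if __name__ == "__main__":
    lhs, rhs = lemma6_first_bound(A=1.0, B=1.0, d_o=3, beta=0.5, n_o=8, ell=10)
    print(lhs <= rhs)
```

For these particular values the right-hand side is $\frac{3}{1/2} \cdot (1/8)^{1/3} = 3$, and the geometric sum on the left stays below it.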

Now we move on to the second bound. The lower bound on $n_o$ implies that $A (B/m)^{1/d_o} \le 2$ for all $m \ge n_o$. Since $x \ln (2e/x)$ is increasing when $x \le 2$, we have

$$\begin{aligned} \sum _{i=0}^\ell F(\beta ^i n) \ln \frac{2e}{F(\beta ^i n)}&\le \sum _{i=0}^\ell A \left( \frac{B}{\beta ^i n} \right) ^{1/d_o} \ln \frac{2e}{A (B/(\beta ^i n))^{1/d_o}}. \end{aligned}$$

The lemma now follows from algebraic manipulations that invoke the first bound as well as the inequality

$$\begin{aligned} \sum _{i = 0}^\ell i A \left( \frac{B\beta ^i}{n_o} \right) ^{1/d_o} \le \frac{A d_o^2}{(1-\beta )^2} \left( \frac{B}{n_o} \right) ^{1/d_o} , \end{aligned}$$

which in turn follows from

$$\begin{aligned} \sum _{i = 0}^\ell i \beta ^{i/d_o}&\le \sum _{i=1}^\infty i \beta ^{i/d_o} =\sum _{i=1}^\infty \sum _{j = i}^\infty \beta ^{j/d_o} =\sum _{i=1}^\infty \frac{\beta ^{i/d_o}}{1-\beta ^{1/d_o}} =\frac{\beta ^{1/d_o}}{(1-\beta ^{1/d_o})^2} \\&\le \frac{d_o^2}{(1-\beta )^2}. \end{aligned}$$
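One way to spell out the algebraic manipulations mentioned above is the following; it is a reconstruction from the two displayed inequalities rather than a quotation from the original proof. Reversing the order of summation (so that $\beta ^i n$ becomes $n_o/\beta ^i$) and expanding the logarithm gives

$$\begin{aligned} \sum _{i=0}^\ell A \left( \frac{B}{\beta ^i n} \right) ^{1/d_o} \ln \frac{2e}{A (B/(\beta ^i n))^{1/d_o}}&= \sum _{i=0}^\ell A \left( \frac{B \beta ^i}{n_o} \right) ^{1/d_o} \left( \ln \frac{2e}{A} + \frac{1}{d_o} \ln \frac{n_o}{B} + \frac{i}{d_o} \ln \frac{1}{\beta } \right) \\&\le \frac{A d_o}{1-\beta } \left( \frac{B}{n_o} \right) ^{1/d_o} \left( \ln \frac{2e}{A} + \frac{1}{d_o} \ln \frac{n_o}{B} \right) + \frac{1}{d_o} \ln \frac{1}{\beta } \cdot \frac{A d_o^2}{(1-\beta )^2} \left( \frac{B}{n_o} \right) ^{1/d_o} \\&= \frac{A d_o}{1-\beta } \left( \frac{B}{n_o} \right) ^{1/d_o} \left( \frac{1}{1-\beta } \ln \frac{1}{\beta } + \ln \frac{2e}{A} + \frac{1}{d_o} \ln \frac{n_o}{B} \right) . \end{aligned}$$

The first bound may be applied to the terms without the factor $i$ because $\ln (2e/A) + (1/d_o) \ln (n_o/B) = \ln \bigl (2e / (A (B/n_o)^{1/d_o})\bigr ) \ge 1 > 0$, which follows from $A (B/n_o)^{1/d_o} \le 2$.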

Appendix B: Clarkson’s Lemma

Suppose we are given a finite set of points $S \subset {\mathbb {R}}^d$. How many of these points can have a specific $x \in S$ as one of their $\ell $ nearest neighbors? Stone [19] showed that the answer is $\le \ell \gamma _d$, where $\gamma _d$ is a constant exponential in $d$ but independent of $|S|$ and $\ell $. This was a key step towards establishing the universal consistency of nearest neighbor classification in Euclidean spaces.

Clarkson [9] extended this result to metric spaces of bounded doubling dimension and to approximate nearest neighbors. Before stating his result, we introduce some notation. For any point $z \in {\mathbb {R}}^d$, any set $A \subset {\mathbb {R}}^d$, and any integer $\ell \ge 1$, let $\text{ NN }_\ell (z,A)$ denote the $\ell $th nearest neighbor of $z$ in $A$, breaking ties arbitrarily. For $\gamma \ge 1$, we say $x \in A$ is an $(\ell ,\gamma )$-NN of $z$ in $A$ if

$$\begin{aligned} \Vert x - z \Vert \le \gamma \Vert z - \text{ NN }_\ell (z,A) \Vert , \end{aligned}$$

that is, $x$ is at most $\gamma $ times further away than $z$’s $\ell $th nearest neighbor.
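The definition can be turned into a direct brute-force test. The sketch below is illustrative only: the function names are hypothetical, points are plain tuples of floats, ties are broken by sort order, and the caller is assumed to decide whether $z$ itself belongs to $A$.

```python
import math


def lth_nn_distance(z, A, ell):
    """Distance from z to its ell-th nearest neighbor in A (ell counts from 1)."""
    dists = sorted(math.dist(z, a) for a in A)
    return dists[ell - 1]


def is_ell_gamma_nn(x, z, A, ell, gamma):
    """True if ||x - z|| <= gamma * ||z - NN_ell(z, A)||."""
    return math.dist(x, z) <= gamma * lth_nn_distance(z, A, ell)


if __name__ == "__main__":
    A = [(1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
    z = (0.0, 0.0)
    print(is_ell_gamma_nn((2.0, 0.0), z, A, ell=1, gamma=2.0))
    print(is_ell_gamma_nn((5.0, 0.0), z, A, ell=1, gamma=2.0))
```

With $\gamma = 1$ the test reduces to asking whether $x$ is among the $\ell$ nearest neighbors of $z$, up to ties.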

Recall also that we define the aspect ratio of a finite set $S \subset {\mathbb {R}}^d$ to be

$$\begin{aligned} \varDelta (S) =\frac{\max _{x,y \in S} \Vert x-y\Vert }{\min _{x, y \in S, x \ne y} \Vert x-y\Vert }. \end{aligned}$$
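For a small point set the aspect ratio can be computed directly from all pairwise distances. This is a minimal quadratic-time sketch; the function name `aspect_ratio` is hypothetical, and the input is assumed to contain at least two distinct points given as tuples of floats.

```python
import math
from itertools import combinations


def aspect_ratio(S):
    """Largest interpoint distance divided by the smallest nonzero one."""
    dists = [math.dist(x, y) for x, y in combinations(S, 2)]
    # Ignore zero distances, which only arise from repeated points.
    positive = [d for d in dists if d > 0]
    return max(positive) / min(positive)


if __name__ == "__main__":
    print(aspect_ratio([(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]))
```

In the example the pairwise distances are 1, 3 and 4, so the ratio is 4.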

The following is shown in [9, Lemma 5.1].

Lemma 7

Pick any integer $\ell \ge 1$ and any $\gamma \ge 1$. If a finite set $S \subset {\mathbb {R}}^d$ has doubling dimension $d_o$, then any $s \in S$ can be an $(\ell ,\gamma )$-NN of at most $(8\gamma )^{d_o} \ell \log _2 \varDelta (S)$ other points of $S$.

Proof

Pick any $s \in S$ and any $r > 0$. Consider the annulus $A_r = \{x \in S: r < \Vert x - s\Vert \le 2r\}$. Since $A_r$ lies in the ball of radius $2r$ around $s$, Lemma 8 (applied with radius $2r$ and $\epsilon = r/(2\gamma )$) shows that $A_r$ can be covered by $\le (8 \gamma )^{d_o}$ balls of radius $r/(2\gamma )$. Consider any such ball $B$: if $B \cap A_r$ contains $\ge \ell +1$ points, then each of these points has $\ell $ neighbors within distance $r/\gamma $, while its distance to $s$ exceeds $r$, and thus it does not have $s$ as an $(\ell , \gamma )$-NN. Therefore, there are at most $\ell (8 \gamma )^{d_o}$ points in $A_r$ that have $s$ as an $(\ell , \gamma )$-NN.

We finish by noticing that by the definition of aspect ratio, $S$ can be covered by $\log _2 \varDelta (S)$ annuli $A_r$, with successively doubling radii.
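The statement of Lemma 7 can be checked by brute force on a toy example. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the points are distinct tuples of floats, neighbors of $x$ are taken in $S \setminus \{x\}$, and the doubling dimension $d_o$ is supplied by the caller rather than computed (the example passes $d_o = 1$ for collinear points, which is an assumption about the definition in use).

```python
import math
from itertools import combinations


def count_reverse_neighbors(S, s, ell, gamma):
    """Count points x != s in S that have s as an (ell, gamma)-NN in S minus {x}."""
    count = 0
    for x in S:
        if x == s:
            continue
        # Sorted distances from x to every other point of S.
        others = sorted(math.dist(x, y) for y in S if y != x)
        if math.dist(x, s) <= gamma * others[ell - 1]:
            count += 1
    return count


def lemma7_bound(S, ell, gamma, d_o):
    """Evaluate (8 * gamma)^d_o * ell * log2(aspect ratio of S)."""
    dists = [math.dist(x, y) for x, y in combinations(S, 2)]
    aspect = max(dists) / min(dists)  # assumes distinct points
    return (8 * gamma) ** d_o * ell * math.log2(aspect)


if __name__ == "__main__":
    S = [(float(i),) for i in range(16)]  # 16 evenly spaced points on a line
    print(count_reverse_neighbors(S, S[0], ell=1, gamma=1.0))
    print(lemma7_bound(S, ell=1, gamma=1.0, d_o=1))
```

On evenly spaced points the count is far below the bound, which is to be expected since the lemma is a worst-case statement.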

Lemma 8

Suppose $S \subset {\mathbb {R}}^d$ has doubling dimension $d_o$. Pick any $r \ge \epsilon > 0$. If $B$ is a ball of radius $r$, then $S \cap B$ can be covered by $(2r/\epsilon )^{d_o}$ balls of radius $\epsilon $.

Proof

By the definition of doubling dimension, $S \cap B$ can be covered by $2^{d_o}$ balls of radius $r/2$, and thus $2^{2d_o}$ balls of radius $r/4$, and so on. More generally, $S \cap B$ can be covered by $2^{\ell d_o}$ balls of radius $r/2^\ell $ for any integer $\ell \ge 0$. Now take $\ell = \lceil \log _2 (r/\epsilon ) \rceil \le \log _2 (2r/\epsilon )$.
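The arithmetic of this halving argument is easy to trace in code. The sketch below only evaluates the counts that appear in the proof; it does not construct a cover. The function name `lemma8_cover` is hypothetical, and $d_o$ is taken to be a positive integer for simplicity.

```python
import math


def lemma8_cover(r, eps, d_o):
    """Return (ell, n_balls, radius, bound) for the halving argument.

    ell     : number of halvings, ceil(log2(r / eps))
    n_balls : 2^(ell * d_o), the number of balls after ell halvings
    radius  : r / 2^ell, the radius of those balls (at most eps)
    bound   : (2r / eps)^d_o, the count claimed in Lemma 8
    """
    ell = math.ceil(math.log2(r / eps))
    n_balls = 2 ** (ell * d_o)
    radius = r / 2 ** ell
    bound = (2 * r / eps) ** d_o
    return ell, n_balls, radius, bound


if __name__ == "__main__":
    print(lemma8_cover(r=1.0, eps=0.3, d_o=2))
```

In the proof of Lemma 7 the same computation is used with radius $2r$ and $\epsilon = r/(2\gamma )$, which gives the count $(8\gamma )^{d_o}$.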

About this article

Cite this article

Dasgupta, S., Sinha, K. Randomized Partition Trees for Nearest Neighbor Search. Algorithmica 72, 237–263 (2015). https://doi.org/10.1007/s00453-014-9885-5

Received: 24 September 2013
Accepted: 25 April 2014
Published: 09 May 2014
Issue Date: May 2015
DOI: https://doi.org/10.1007/s00453-014-9885-5
