Abstract
The \(k\)-d tree was one of the first spatial data structures proposed for nearest neighbor search. Its efficacy is diminished in high-dimensional spaces, but several variants, with randomization and overlapping cells, have proved to be successful in practice. We analyze three such schemes. We show that the probability that they fail to find the nearest neighbor, for any data set and any query point, is directly related to a simple potential function that captures the difficulty of the point configuration. We then bound this potential function in several situations of interest: when the data are drawn from a doubling measure; when the data and query distributions are identical and are supported on a set of bounded doubling dimension; and when the data are documents from a topic model.



References
Ailon, N., Chazelle, B.: The fast Johnson-Lindenstrauss transform and approximate nearest neighbors. SIAM J. Comput. 39, 302–322 (2009)
Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51(1), 117–122 (2008)
Arya, S., Mount, D., Netanyahu, N., Silverman, R., Wu, A.: An optimal algorithm for approximate nearest neighbor searching. J. ACM 45, 891–923 (1998)
Assouad, P.: Plongements lipschitziens dans \({\mathbb{R}}^n\). Bull. Soc. Math. France 111(4), 429–448 (1983)
Bentley, J.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)
Beygelzimer, A., Kakade, S., Langford, J.: Cover trees for nearest neighbor. In: Proceedings of the 23rd International Conference on Machine Learning (2006)
Cayton, L., Dasgupta, S.: A learning framework for nearest-neighbor search. In: Advances in Neural Information Processing Systems (2007)
Clarkson, K.: Nearest neighbor queries in metric spaces. Discret. Comput. Geom. 22, 63–93 (1999)
Clarkson, K.: Nearest-neighbor searching and metric space dimensions. In: Darrell, T., Indyk, P. (eds.) Nearest-Neighbor Methods for Learning and Vision: Theory and Practice. MIT Press, Cambridge (2005)
Dasgupta, S., Freund, Y.: Random projection trees and low dimensional manifolds. In: ACM Symposium on Theory of Computing, pp. 537–546 (2008)
Dasgupta, S., Sinha, K.: Randomized partition trees for exact nearest neighbor search. In: 26th Annual Conference on Learning Theory (2013)
Gupta, A., Krauthgamer, R., Lee, J.R.: Bounded geometries, fractals, and low-distortion embeddings. In: 44th Annual IEEE Symposium on Foundations of Computer Science, pp. 534–543 (2003)
Karger, D., Ruhl, M.: Finding nearest neighbors in growth-restricted metrics. In: ACM Symposium on Theory of Computing, pp. 741–750 (2002)
Kleinberg, J.: Two algorithms for nearest-neighbor search in high dimensions. In: 29th ACM Symposium on Theory of Computing (1997)
Krauthgamer, R., Lee, J.: Navigating nets: simple algorithms for proximity search. In: ACM-SIAM Symposium on Discrete Algorithms (2004)
Liu, T., Moore, A., Gray, A., Yang, K.: An investigation of practical approximate nearest neighbor algorithms. In: Advances in Neural Information Processing Systems (2004)
Maneewongvatana, S., Mount, D.: The analysis of a probabilistic approach to nearest neighbor searching. In: Seventh International Workshop on Algorithms and Data Structures, pp. 276–286 (2001)
McFee, B., Lanckriet, G.: Large-scale music similarity search with spatial trees. In: 12th Conference of the International Society for Music Retrieval (2011)
Stone, C.: Consistent nonparametric regression. Ann. Stat. 5, 595–645 (1977)
Acknowledgments
The authors are grateful to the National Science Foundation for support under grant IIS-1162581, and to the anonymous reviewers for their detailed feedback.
Additional information
A preliminary abstract of this work appeared in [11].
Appendices
Appendix A: Summation Lemma
Lemma 6
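Suppose that for some constants \(A, B > 0\) and \(d_o \ge 1\),
\[ F(m) \le A \left( \frac{B}{m} \right)^{1/d_o} \]
for all integers \(m \ge n_o\). Pick any \(0 < \beta < 1\) and define \(\ell = \log _{1/\beta } (n/n_o)\). Assume for convenience that \(\ell \) is an integer. Then:
\[ \sum_{i=0}^{\ell} F(\beta^i n) \le \frac{A d_o}{1-\beta} \left( \frac{B}{n_o} \right)^{1/d_o} \]
and, if \(n_o \ge B(A/2)^{d_o}\),
\[ \sum_{i=0}^{\ell} F(\beta^i n) \ln \frac{2e}{F(\beta^i n)} \le \frac{A d_o}{1-\beta} \left( \frac{B}{n_o} \right)^{1/d_o} \left( \frac{1}{1-\beta} \ln \frac{1}{\beta} + \ln \frac{2e}{A} + \frac{1}{d_o} \ln \frac{n_o}{B} \right). \]
Proof
Writing the first series in reverse,
\[ \sum_{i=0}^{\ell} F(\beta^i n) = \sum_{i=0}^{\ell} F\left( \frac{n_o}{\beta^i} \right) \le \sum_{i=0}^{\ell} A \left( \frac{B \beta^i}{n_o} \right)^{1/d_o} \le A \left( \frac{B}{n_o} \right)^{1/d_o} \frac{1}{1 - \beta^{1/d_o}} \le \frac{A d_o}{1-\beta} \left( \frac{B}{n_o} \right)^{1/d_o}. \]
The last inequality is obtained by using
\[ (1 - x)^p \ge 1 - px \quad \text{for } 0 < x < 1,\ p \ge 1 \]
to get \((1 - (1-\beta )/d_o)^{d_o} \ge \beta \) and thus \(1-\beta ^{1/d_o} \ge (1-\beta )/d_o\).
Now we move on to the second bound. The lower bound on \(n_o\) implies that \(A (B/m)^{1/d_o} \le 2\) for all \(m \ge n_o\). Since \(x \ln (2e/x)\) is increasing when \(x \le 2\), we have
\[ F(m) \ln \frac{2e}{F(m)} \le A \left( \frac{B}{m} \right)^{1/d_o} \ln \frac{2e}{A (B/m)^{1/d_o}} \quad \text{for all } m \ge n_o. \]
The lemma now follows from algebraic manipulations that invoke the first bound as well as the inequality
\[ \sum_{i=0}^{\ell} i \beta^{i/d_o} \le \frac{1}{(1 - \beta^{1/d_o})^2} \le \frac{d_o^2}{(1-\beta)^2}, \]
which in turn follows from
\[ \sum_{i=1}^{\infty} i x^{i-1} = \frac{1}{(1-x)^2} \quad \text{for } 0 < x < 1. \]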
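The first bound of the lemma can be checked numerically. The sketch below is illustrative only and rests on assumptions: it writes \(F\) for the bounded quantity, takes the hypothesis to be \(F(m) \le A(B/m)^{1/d_o}\), and takes the first bound to be \(\sum_{i=0}^{\ell} F(\beta^i n) \le \frac{A d_o}{1-\beta}(B/n_o)^{1/d_o}\). The function names are hypothetical. It evaluates the extremal case in which \(F\) equals its upper bound, and also checks the inequality \(1-\beta^{1/d_o} \ge (1-\beta)/d_o\) used in the proof.

```python
import math

def first_bound_check(A, B, d_o, n_o, beta, ell):
    """Return (series, bound) for the assumed first bound of the summation lemma.

    Assumption: F(m) = A * (B/m)**(1/d_o), i.e. F meets its upper bound exactly.
    """
    def F(m):
        return A * (B / m) ** (1.0 / d_o)

    n = n_o / beta ** ell  # chosen so that ell = log_{1/beta}(n / n_o)
    series = sum(F(beta ** i * n) for i in range(ell + 1))
    bound = A * d_o / (1.0 - beta) * (B / n_o) ** (1.0 / d_o)
    return series, bound

def bernoulli_gap(beta, d_o):
    """Slack in 1 - beta**(1/d_o) >= (1 - beta)/d_o; nonnegative when the inequality holds."""
    return (1.0 - beta ** (1.0 / d_o)) - (1.0 - beta) / d_o

series, bound = first_bound_check(A=1.0, B=1.0, d_o=2, n_o=100, beta=0.5, ell=10)
print(series <= bound, bernoulli_gap(0.5, 2) >= 0)
```

The geometric series is dominated by its largest term \(F(n_o)\), which is why the bound depends on \(n_o\) and not on \(n\).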
Appendix B: Clarkson’s Lemma
Suppose we are given a finite set of points \(S \subset {\mathbb {R}}^d\). How many of these points can have a specific \(x \in S\) as one of their \(\ell \) nearest neighbors? Stone [19] showed that the answer is \(\le \ell \gamma _d\), where \(\gamma _d\) is a constant exponential in \(d\) but independent of \(|S|\) and \(\ell \). This was a key step towards establishing the universal consistency of nearest neighbor classification in Euclidean spaces.
Clarkson [9] extended this result to metric spaces of bounded doubling dimension and to approximate nearest neighbors. Before stating his result, we introduce some notation. For any point \(z \in {\mathbb {R}}^d\), any set \(A \subset {\mathbb {R}}^d\), and any integer \(\ell \ge 1\), let \(\text{NN}_\ell (z,A)\) denote the \(\ell \)th nearest neighbor of \(z\) in \(A\), breaking ties arbitrarily. For \(\gamma \ge 1\), we say \(x \in A\) is an \((\ell ,\gamma )\)-NN of \(z\) in \(A\) if
\[ \Vert x - z\Vert \le \gamma \, \Vert z - \text{NN}_\ell (z,A)\Vert , \]
that is, \(x\) is at most \(\gamma \) times further away than \(z\)’s \(\ell \)th nearest neighbor.
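As an illustration, the \((\ell ,\gamma )\)-NN condition can be tested by brute force. This is a minimal sketch, assuming the condition reads \(\Vert x - z\Vert \le \gamma \Vert z - \text{NN}_\ell (z,A)\Vert \), as the verbal description suggests. The function name `is_l_gamma_nn` is hypothetical, and points are represented as coordinate tuples.

```python
import math

def is_l_gamma_nn(x, z, A, ell, gamma):
    """True if x is at most gamma times as far from z as z's ell-th nearest neighbor in A.

    A is a list of coordinate tuples with at least ell points; ties are broken by sort order.
    """
    if ell < 1 or ell > len(A):
        raise ValueError("ell must satisfy 1 <= ell <= len(A)")
    # Distance from z to its ell-th nearest neighbor in A.
    ell_th = sorted(math.dist(z, a) for a in A)[ell - 1]
    return math.dist(x, z) <= gamma * ell_th

A = [(1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]
z = (0.0, 0.0)
print(is_l_gamma_nn((5.0, 0.0), z, A, ell=2, gamma=1.0))  # 5 <= 1 * 2 is False
print(is_l_gamma_nn((5.0, 0.0), z, A, ell=2, gamma=3.0))  # 5 <= 3 * 2 is True
```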
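Recall also that we define the aspect ratio of a finite set \(S \subset {\mathbb {R}}^d\) to be
\[ \varDelta (S) = \frac{\max _{x, y \in S} \Vert x - y\Vert }{\min _{x, y \in S,\, x \ne y} \Vert x - y\Vert }. \]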
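For concreteness, a short sketch of computing the aspect ratio, assuming the usual definition as the largest interpoint distance divided by the smallest distance between distinct points. The function name `aspect_ratio` is hypothetical, and the quadratic-time scan is chosen for clarity rather than efficiency.

```python
import math
from itertools import combinations

def aspect_ratio(S):
    """Largest pairwise distance in S divided by the smallest; S holds distinct coordinate tuples."""
    if len(S) < 2:
        raise ValueError("need at least two points")
    dists = [math.dist(p, q) for p, q in combinations(S, 2)]
    if min(dists) == 0:
        raise ValueError("points must be distinct")
    return max(dists) / min(dists)

print(aspect_ratio([(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]))  # distances 1, 3, 4
```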
The following is shown in [9, Lemma 5.1].
Lemma 7
Pick any integer \(\ell \ge 1\) and any \(\gamma \ge 1\). If a finite set \(S \subset {\mathbb {R}}^d\) has doubling dimension \(d_o\), then any \(s \in S\) can be an \((\ell ,\gamma )\)-NN of at most \((8\gamma )^{d_o} \ell \log _2 \varDelta (S)\) other points of \(S\).
Proof
Pick any \(s \in S\) and any \(r > 0\). Consider the annulus \(A_r = \{x \in S: r < \Vert x - s\Vert \le 2r\}\). By Lemma 8, \(A_r\) can be covered by \(\le (8 \gamma )^{d_o}\) balls of radius \(r/(2\gamma )\). Consider any such ball \(B\): if \(B \cap A_r\) contains \(\ge \ell +1\) points, then each of these points has \(\ell \) neighbors within distance \(r/\gamma \), and thus does not have \(s\) as an \((\ell , \gamma )\)-NN. Therefore, there are at most \(\ell (8 \gamma )^{d_o}\) points in \(A_r\) that have \(s\) as an \((\ell , \gamma )\)-NN.
We finish by noticing that by the definition of aspect ratio, \(S\) can be covered by \(\log _2 \varDelta (S)\) annuli \(A_r\), with successively doubling radii.
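The annulus decomposition used in this proof can be sketched as follows. The indexing is a choice made for this illustration, not taken from the proof: annulus \(k\) has inner radius \(r_k = 2^{k-1} d_{\min }\), where \(d_{\min }\) is the smallest distance from \(s\) to another point of \(S\). The function name `annuli_around` is hypothetical.

```python
import math
from collections import defaultdict

def annuli_around(s, S):
    """Group the points of S other than s into annuli r_k < ||x - s|| <= 2 r_k.

    Here r_k = (d_min / 2) * 2**k, with d_min the smallest distance from s to
    another point. Points are coordinate tuples.
    """
    dists = {p: math.dist(p, s) for p in S if p != s}
    d_min = min(dists.values())
    groups = defaultdict(list)
    for p, d in dists.items():
        # Smallest k >= 0 with d <= d_min * 2**k.
        k = max(0, math.ceil(math.log2(d / d_min)))
        groups[k].append(p)
    return dict(groups)

S = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (8.0, 0.0)]
print(annuli_around((0.0, 0.0), S))
```

Each point falls in exactly one annulus, so summing the per-annulus count from the proof over the nonempty annuli bounds the total number of points that can have \(s\) as an \((\ell ,\gamma )\)-NN.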
Lemma 8
Suppose \(S \subset {\mathbb {R}}^d\) has doubling dimension \(d_o\). Pick any \(r \ge \epsilon > 0\). If \(B\) is a ball of radius \(r\), then \(S \cap B\) can be covered by \((2r/\epsilon )^{d_o}\) balls of radius \(\epsilon \).
Proof
By the definition of doubling dimension, \(S \cap B\) can be covered by \(2^{d_o}\) balls of radius \(r/2\), and thus \(2^{2d_o}\) balls of radius \(r/4\), and so on. More generally, \(S \cap B\) can be covered by \(2^{\ell d_o}\) balls of radius \(r/2^\ell \) for any integer \(\ell \ge 0\). Now take \(\ell = \lceil \log _2 (r/\epsilon ) \rceil \le \log _2 (2r/\epsilon )\).
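The counting in this proof is easy to check numerically. The sketch below, with the hypothetical name `covering_counts`, computes the number of halvings \(\ell = \lceil \log _2 (r/\epsilon ) \rceil \), the resulting ball count \(2^{\ell d_o}\), and the stated bound \((2r/\epsilon )^{d_o}\). It illustrates the arithmetic only and does not construct a cover.

```python
import math

def covering_counts(r, eps, d_o):
    """Return (halvings, balls, bound) for the repeated-halving argument, given r >= eps > 0."""
    if not (r >= eps > 0):
        raise ValueError("need r >= eps > 0")
    halvings = math.ceil(math.log2(r / eps))  # after this many halvings the radius is at most eps
    balls = 2 ** (halvings * d_o)             # each halving multiplies the count by 2**d_o
    bound = (2 * r / eps) ** d_o              # the count claimed in the lemma
    return halvings, balls, bound

halvings, balls, bound = covering_counts(r=5.0, eps=1.0, d_o=3)
print(halvings, balls, bound, 5.0 / 2 ** halvings <= 1.0)
```

With \(r = 5\), \(\epsilon = 1\), and \(d_o = 3\), three halvings bring the radius down to \(5/8 \le 1\), giving \(2^9 = 512\) balls against a bound of \(10^3 = 1000\).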
Cite this article
Dasgupta, S., Sinha, K. Randomized Partition Trees for Nearest Neighbor Search. Algorithmica 72, 237–263 (2015). https://doi.org/10.1007/s00453-014-9885-5
DOI: https://doi.org/10.1007/s00453-014-9885-5