DOI:10.1145/1265530.1265545
Article

Finding near neighbors through cluster pruning

Published: 11 June 2007

Abstract

Finding near(est) neighbors is a classic, difficult problem in data management and retrieval, with applications in text and image search, in finding similar objects, and in matching patterns. Here we study cluster pruning, an extremely simple randomized technique. During preprocessing we randomly choose a subset of data points to be leaders; the remaining data points are partitioned by which leader is the closest. For query processing, we find the leader(s) closest to the query point. We then seek the nearest neighbors for the query point among only the points in the clusters of the closest leader(s). Recursion may be used in both preprocessing and in search. Such schemes seek approximate nearest neighbors that are "almost as good" as the nearest neighbors. How good are these approximations, and how much do they save in computation?
Our contributions are: (1) we quantify metrics that allow us to study the tradeoff between processing and the quality of the approximate nearest neighbors; (2) we give rigorous theoretical analysis of our schemes, under natural generative processes (generalizing Gaussian mixtures) for the data points; (3) we run experiments on both synthetic data from such generative processes and data from a document corpus, confirming that we save orders of magnitude in query processing cost at modest compromises in the quality of retrieved points. In particular, we show that p-spheres, a state-of-the-art solution, is outperformed by our simple scheme whether the data points are stored in main or in external memory.
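
The preprocessing and query steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes Euclidean distance, a single level (no recursion), and a leader count of about sqrt(n), none of which the abstract specifies. The names dist, preprocess, query, and the probes parameter are hypothetical and chosen for illustration only.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length tuples (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def preprocess(points, num_leaders, seed=0):
    """Pick random leaders and attach every point to its closest leader.

    Returns a dict mapping each leader's index to the indices in its cluster.
    """
    rng = random.Random(seed)
    leader_ids = rng.sample(range(len(points)), num_leaders)
    clusters = {i: [] for i in leader_ids}
    for idx, p in enumerate(points):
        nearest = min(leader_ids, key=lambda i: dist(p, points[i]))
        clusters[nearest].append(idx)
    return clusters


def query(points, clusters, q, k=1, probes=1):
    """Return indices of the k closest points found in the clusters of the
    `probes` leaders nearest to q. The result is approximate whenever
    probes is smaller than the number of leaders."""
    closest = sorted(clusters, key=lambda i: dist(q, points[i]))[:probes]
    candidates = [idx for lid in closest for idx in clusters[lid]]
    return sorted(candidates, key=lambda idx: dist(q, points[idx]))[:k]


if __name__ == "__main__":
    rng = random.Random(1)
    points = [(rng.random(), rng.random()) for _ in range(1000)]
    # sqrt(n) leaders is an assumption made for this sketch only.
    clusters = preprocess(points, num_leaders=int(math.sqrt(len(points))))
    print(query(points, clusters, (0.5, 0.5), k=3, probes=2))
```

In this sketch, a query computes distances to the leaders and to the members of the probed clusters only, rather than to every data point. Raising probes trades more computation for better recall, and probing every leader reduces to exact search.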

Supplementary Material

Low Resolution (p103-chierichetti_56k.mp4)
High Resolution (p103-chierichetti_768k.mp4)


Published In

PODS '07: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2007
328 pages
ISBN:9781595936851
DOI:10.1145/1265530
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clustering
  2. generative model
  3. nearest neighbor

Qualifiers

  • Article

Conference

SIGMOD/PODS07

Acceptance Rates

PODS '07 Paper Acceptance Rate 28 of 187 submissions, 15%;
Overall Acceptance Rate 642 of 2,707 submissions, 24%

Cited By

  • (2024) Early Exit Strategies for Approximate k-NN Search in Dense Retrieval. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pp. 3647-3652. DOI: 10.1145/3627673.3679903. Online publication date: 21-Oct-2024
  • (2024) A Learning-to-Rank Formulation of Clustering-Based Approximate Nearest Neighbor Search. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2261-2265. DOI: 10.1145/3626772.3657931. Online publication date: 10-Jul-2024
  • (2024) Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 152-162. DOI: 10.1145/3626772.3657769. Online publication date: 10-Jul-2024
  • (2023) Self-Supervised Object Detection from Egocentric Videos. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5202-5214. DOI: 10.1109/ICCV51070.2023.00482. Online publication date: 1-Oct-2023
  • (2022) MSPPIR: Multi-Source Privacy-Preserving Image Retrieval in cloud computing. Future Generation Computer Systems, 134, pp. 78-92. DOI: 10.1016/j.future.2022.03.040. Online publication date: Sep-2022
  • (2020) Keyphrase generation for Vietnamese administrative documents: a collaborative approach. 2020 12th International Conference on Knowledge and Systems Engineering (KSE), pp. 43-48. DOI: 10.1109/KSE50997.2020.9287477. Online publication date: 12-Nov-2020
  • (2019) Index Maintenance Strategy and Cost Model for Extended Cluster Pruning. Similarity Search and Applications, pp. 32-39. DOI: 10.1007/978-3-030-32047-8_3. Online publication date: 23-Sep-2019
  • (2018) Alternative patterns of the multidimensional Hilbert curve. Multimedia Tools and Applications, 77:7, pp. 8419-8440. DOI: 10.1007/s11042-017-4744-4. Online publication date: 1-Apr-2018
  • (2017) PIC: Enable Large-Scale Privacy Preserving Content-Based Image Search on Cloud. IEEE Transactions on Parallel and Distributed Systems, 28:11, pp. 3258-3271. DOI: 10.1109/TPDS.2017.2712148. Online publication date: 6-Oct-2017
  • (2015) PIC. Proceedings of the 2015 44th International Conference on Parallel Processing (ICPP), pp. 949-958. DOI: 10.1109/ICPP.2015.104. Online publication date: 1-Sep-2015
