Article

Finding near neighbors through cluster pruning

Authors:

Flavio Chierichetti, Alessandro Panconesi, Prabhakar Raghavan, Alessandro Tiberi, Eli Upfal

PODS '07: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

Pages 103 - 112

https://doi.org/10.1145/1265530.1265545

Published: 11 June 2007

Abstract

Finding near(est) neighbors is a classic, difficult problem in data management and retrieval, with applications in text and image search, in finding similar objects, and in matching patterns. Here we study cluster pruning, an extremely simple randomized technique. During preprocessing we randomly choose a subset of data points to be leaders; the remaining data points are partitioned by which leader is the closest. For query processing, we find the leader(s) closest to the query point. We then seek the nearest neighbors for the query point among only the points in the clusters of the closest leader(s). Recursion may be used in both preprocessing and in search. Such schemes seek approximate nearest neighbors that are "almost as good" as the nearest neighbors. How good are these approximations, and how much do they save in computation?

Our contributions are: (1) we quantify metrics that allow us to study the tradeoff between processing and the quality of the approximate nearest neighbors; (2) we give rigorous theoretical analysis of our schemes, under natural generative processes (generalizing Gaussian mixtures) for the data points; (3) we report experiments on both synthetic data from such generative processes and data from a document corpus, confirming that we save orders of magnitude in query processing cost at modest compromises in the quality of retrieved points. In particular, we show that p-spheres, a state-of-the-art solution, is outperformed by our simple scheme whether the data points are stored in main or in external memory.
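
The preprocessing and query steps described in the abstract can be sketched as follows. This is a minimal, single-level illustration and not the authors' implementation: the Euclidean distance, the function names (build_index, query), the num_probes parameter, and the choice to keep each leader inside its own cluster are all assumptions made for this sketch. The recursive variant mentioned in the abstract is not shown.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_index(points, num_leaders, rng):
    """Preprocessing: pick leaders at random, attach each point to its closest leader.

    Returns a dict mapping leader index -> list of member point indices.
    Each leader ends up in its own cluster (a choice made for this sketch).
    """
    leader_ids = rng.sample(range(len(points)), num_leaders)
    clusters = {lid: [] for lid in leader_ids}
    for i, p in enumerate(points):
        closest = min(leader_ids, key=lambda lid: euclidean(p, points[lid]))
        clusters[closest].append(i)
    return clusters


def query(points, clusters, q, k=1, num_probes=1):
    """Query: search only the clusters of the num_probes leaders closest to q.

    Returns (indices of the k best candidates, number of distance computations).
    """
    nearest_leaders = sorted(
        clusters, key=lambda lid: euclidean(q, points[lid])
    )[:num_probes]
    candidates = [i for lid in nearest_leaders for i in clusters[lid]]
    candidates.sort(key=lambda i: euclidean(q, points[i]))
    # Cost counts one distance per leader plus one per candidate examined.
    return candidates[:k], len(clusters) + len(candidates)


if __name__ == "__main__":
    rng = random.Random(0)
    points = [(rng.random(), rng.random()) for _ in range(400)]
    clusters = build_index(points, num_leaders=20, rng=rng)
    ids, cost = query(points, clusters, (0.5, 0.5), k=3, num_probes=2)
    print(ids, cost, "distance computations instead of", len(points))
```

Raising num_probes trades more distance computations for a better chance of returning the true nearest neighbors; probing every cluster degenerates to an exhaustive scan and returns the exact answer.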

Supplementary Material

Low Resolution (p103-chierichetti_56k.mp4), 31.37 MB

High Resolution (p103-chierichetti_768k.mp4), 142.63 MB

Index Terms

Finding near neighbors through cluster pruning
1. Information systems
  1. Information retrieval
    1. Information retrieval query processing
    2. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering
2. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis
      1. Sorting and searching
  2. Randomness, geometry and discrete structures
    1. Computational geometry

Information & Contributors

Information

Published In

PODS '07: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems

June 2007

328 pages

ISBN:9781595936851

DOI:10.1145/1265530

General Chair: Phokion Kolaitis, IBM Almaden

Program Chair: Leonid Libkin, University of Edinburgh

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGMOD/PODS07

Sponsor:

SIGMOD/PODS07: International Conference on Management of Data

June 11 - 13, 2007

Beijing, China

Acceptance Rates

PODS '07 Paper Acceptance Rate 28 of 187 submissions, 15%;

Overall Acceptance Rate 642 of 2,707 submissions, 24%

Article Metrics

Total Citations: 23
Total Downloads: 1,022
Downloads (Last 12 months): 18
Downloads (Last 6 weeks): 0

Reflects downloads up to 19 Feb 2025

Citations

Cited By

Busolin F, Lucchese C, Nardini F, Orlando S, Perego R, Trani S, Serra E, Spezzano F (2024). Early Exit Strategies for Approximate k-NN Search in Dense Retrieval. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pp. 3647-3652. Online publication date: 21-Oct-2024.
https://dl.acm.org/doi/10.1145/3627673.3679903
Vecchiato T, Lucchese C, Nardini F, Bruch S, Hui Yang G, Wang H, Han S, Hauff C, Zuccon G, Zhang Y (2024). A Learning-to-Rank Formulation of Clustering-Based Approximate Nearest Neighbor Search. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2261-2265. Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3626772.3657931
Bruch S, Nardini F, Rulli C, Venturini R, Hui Yang G, Wang H, Han S, Hauff C, Zuccon G, Zhang Y (2024). Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 152-162. Online publication date: 10-Jul-2024.
https://dl.acm.org/doi/10.1145/3626772.3657769
Akiva P, Huang J, Liang K, Kovvuri R, Chen X, Feiszli M, Dana K, Hassner T (2023). Self-Supervised Object Detection from Egocentric Videos. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 5202-5214. Online publication date: 1-Oct-2023.
https://doi.org/10.1109/ICCV51070.2023.00482
Gu Q, Xia Z, Sun X (2022). MSPPIR: Multi-Source Privacy-Preserving Image Retrieval in cloud computing. Future Generation Computer Systems, 134, pp. 78-92. Online publication date: Sep-2022.
https://doi.org/10.1016/j.future.2022.03.040
Nguyen T, Vuong T, Tran V, Nguyen L, Phan X (2020). Keyphrase generation for Vietnamese administrative documents: a collaborative approach. 2020 12th International Conference on Knowledge and Systems Engineering (KSE), pp. 43-48. Online publication date: 12-Nov-2020.
https://doi.org/10.1109/KSE50997.2020.9287477
Højsgaard A, Jónsson B, Bonnet P (2019). Index Maintenance Strategy and Cost Model for Extended Cluster Pruning. Similarity Search and Applications, pp. 32-39. Online publication date: 23-Sep-2019.
https://doi.org/10.1007/978-3-030-32047-8_3
Franco P, Nguyen G, Mullot R, Ogier J (2018). Alternative patterns of the multidimensional Hilbert curve. Multimedia Tools and Applications, 77:7, pp. 8419-8440. Online publication date: 1-Apr-2018.
https://dl.acm.org/doi/10.1007/s11042-017-4744-4
Zhang L, Jung T, Liu K, Li X, Ding X, Gu J, Liu Y (2017). PIC: Enable Large-Scale Privacy Preserving Content-Based Image Search on Cloud. IEEE Transactions on Parallel and Distributed Systems, 28:11, pp. 3258-3271. Online publication date: 6-Oct-2017.
https://dl.acm.org/doi/10.1109/TPDS.2017.2712148
Zhang L, Jung T, Feng P, Liu K, Li X, Liu Y (2015). PIC. Proceedings of the 2015 44th International Conference on Parallel Processing (ICPP), pp. 949-958. Online publication date: 1-Sep-2015.
https://dl.acm.org/doi/10.1109/ICPP.2015.104
