CanDE: A Lightweight Locality-Sensitive Hashing Add-on for Candidate-Based Distribution Estimation

Abstract:

Locality-sensitive hashing (LSH) is a widely used technique for approximate nearest neighbor search (ANNS). In an LSH-based solution for ANNS, the computation of query-to-data (Q2D) distances accounts for a considerable fraction of the query time, but this distance information is thrown away after the nearest neighbors are identified. In this paper, we propose CanDE (Candidate-based Distribution Estimation), a lightweight add-on to LSH that reuses this information for a wide range of analytics tasks, including Q2D distance distribution estimation (QDDE), kernel density estimation (KDE), and query-time recall estimation (QTRE). This allows for significant savings in indexing costs and query time for multiple tasks associated with the original query.

The main technical hurdle that CanDE addresses is the accurate estimation of some important statistics of the dataset via importance sampling. We discover that the existing estimators of these statistics are not accurate, because they approximate the actual number of collisions in the LSH index (called the collision rate) using the theoretical collision probability of the LSH function family, and this approximation is crude. To address this issue, we propose more accurate estimators based on a novel scheme called inferred collision rate (ICR), which gives a much better approximation to the actual collision rate. Furthermore, we propose an efficient algorithm for computing ICR from the nearest neighbor candidates returned by ANNS. Our evaluation shows that CanDE outperforms existing solutions on multiple analytics tasks while adding only about 8% to 19% query time overhead to ANNS.
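
The gap between the theoretical collision probability and the actual collision rate in one fixed index can be illustrated with a small sketch. It is not the ICR scheme or any part of CanDE, which the abstract does not specify. It assumes random-hyperplane hashing (SimHash) as the LSH family, and the function names and parameter values (`bits_per_table`, `num_tables`, `dim`) are hypothetical choices made only for this illustration.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a random direction on the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def simhash_key(vec, hyperplanes):
    """Hash a vector to one sign bit per hyperplane (assumed LSH family)."""
    return tuple(
        sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0.0 for h in hyperplanes
    )


def theoretical_collision_probability(q, x, bits_per_table):
    """Probability that unit vectors q and x share a bucket in one table:
    (1 - theta / pi) ** bits_per_table for random-hyperplane hashing."""
    cos = max(-1.0, min(1.0, sum(a * b for a, b in zip(q, x))))
    theta = math.acos(cos)
    return (1.0 - theta / math.pi) ** bits_per_table


def actual_collision_rate(q, x, tables):
    """Fraction of tables in this particular index where q and x collide."""
    hits = sum(simhash_key(q, planes) == simhash_key(x, planes) for planes in tables)
    return hits / len(tables)


# Hypothetical index parameters, chosen only for illustration.
rng = random.Random(0)
dim, bits_per_table, num_tables = 16, 4, 20

# One fixed index: num_tables tables, each with bits_per_table hyperplanes.
tables = [
    [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits_per_table)]
    for _ in range(num_tables)
]

q = random_unit_vector(dim, rng)
x = random_unit_vector(dim, rng)

p = theoretical_collision_probability(q, x, bits_per_table)
r = actual_collision_rate(q, x, tables)
print(f"theoretical collision probability: {p:.4f}")
print(f"actual collision rate in this index: {r:.4f}")
```

With a finite number of tables the actual rate can only be a multiple of `1 / num_tables`, so it generally differs from the theoretical value. An importance-sampling estimator that weights samples by the theoretical probability inherits that mismatch, which is the inaccuracy the abstract attributes to existing estimators.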

Published in: 2024 IEEE International Conference on Big Data (BigData)

Date of Conference: 15-18 December 2024

Date Added to IEEE Xplore: 16 January 2025

DOI: 10.1109/BigData62323.2024.10826065

Conference Location: Washington, DC, USA
