Abstract:
Locality sensitive hashing (LSH) is a widely used technique for approximate nearest neighbor search (ANNS). In an LSH-based solution for ANNS, the computation of query-to-data (Q2D) distances accounts for a considerable fraction of the query time, but this distance information is discarded after the nearest neighbors are identified. In this paper, we propose CanDE (Candidate-based Distribution Estimation), a lightweight add-on to LSH that reuses this information for a wide range of analytics tasks, including Q2D distance distribution estimation (QDDE), kernel density estimation (KDE), and query-time recall estimation (QTRE). This allows for significant savings in indexing costs and query time for multiple tasks associated with the original query. The main technical hurdle that CanDE addresses is the accurate estimation of some important statistics of the dataset via importance sampling. We discover that the existing estimators of these statistics are not accurate, because they approximate the actual number of collisions (called the collision rate) in the LSH index using the theoretical collision probability (of the LSH function family), and this approximation is crude. To address this issue, we propose more accurate estimators based on a novel scheme called inferred collision rate (ICR), which gives a much better approximation to the actual collision rate. Furthermore, we propose an efficient algorithm for computing ICR from the nearest neighbor candidates returned by ANNS. Our evaluation shows that CanDE outperforms existing solutions on multiple analytics tasks while adding only about 8% to 19% query-time overhead to ANNS.
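The gap between a theoretical collision probability and the collision rate actually observed in a finite index can be illustrated with a small sketch. This is not the ICR scheme or any algorithm from the paper. It assumes sign random projection (SimHash) as the LSH family, for which a single hash bit collides with probability 1 - θ/π, where θ is the angle between the two vectors. It also assumes that "collision rate" means the fraction of hash tables in which a query and a data point share a bucket. The function names and parameters (`theoretical_collision_prob`, `empirical_collision_rate`, `k`, `num_tables`) are hypothetical.

```python
import math
import random


def theoretical_collision_prob(x, q, k):
    """Probability that x and q agree on all k sign-random-projection bits.

    Assumes the SimHash family: one bit collides with probability
    1 - theta / pi, where theta is the angle between x and q.
    """
    dot = sum(a * b for a, b in zip(x, q))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_q = math.sqrt(sum(b * b for b in q))
    cos = max(-1.0, min(1.0, dot / (norm_x * norm_q)))
    theta = math.acos(cos)
    return (1.0 - theta / math.pi) ** k


def empirical_collision_rate(x, q, k, num_tables, rng):
    """Fraction of num_tables independent tables in which x and q collide.

    Each (hypothetical) table uses k fresh Gaussian projections; x and q
    collide in a table only if all k sign bits match.
    """
    dim = len(x)
    collisions = 0
    for _ in range(num_tables):
        same_bucket = True
        for _ in range(k):
            w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            bit_x = sum(a * b for a, b in zip(w, x)) >= 0
            bit_q = sum(a * b for a, b in zip(w, q)) >= 0
            if bit_x != bit_q:
                same_bucket = False
                break
        if same_bucket:
            collisions += 1
    return collisions / num_tables


if __name__ == "__main__":
    rng = random.Random(0)
    x, q, k = [1.0, 0.0], [1.0, 1.0], 2
    print("theoretical:", theoretical_collision_prob(x, q, k))
    # A small index: the observed rate is a multiple of 1/16 and can sit
    # noticeably away from the theoretical value.
    print("16 tables:  ", empirical_collision_rate(x, q, k, 16, rng))
    # A very large index: the observed rate concentrates near the theory.
    print("20000 tables:", empirical_collision_rate(x, q, k, 20000, rng))
```

With only a handful of tables, the observed rate is quantized and noisy, so substituting the theoretical probability for it can bias any importance-sampling weight built on that value. This is the kind of discrepancy the abstract describes as crude.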
Published in: 2024 IEEE International Conference on Big Data (BigData)
Date of Conference: 15-18 December 2024
Date Added to IEEE Xplore: 16 January 2025