
Bidirectionally Densifying LSH Sketches with Empty Bins

Published: 18 June 2021

Abstract

As an efficient tool for approximate similarity computation and search, Locality Sensitive Hashing (LSH) has been widely used in many research areas, including databases, data mining, information retrieval, and machine learning. Classical LSH methods typically need to perform hundreds or even thousands of hashing operations to compute the LSH sketch of each input item (e.g., a set or a vector); this cost is too expensive, and even impractical, for applications that must process data in real time. To address this issue, several fast methods such as OPH and BCWS have been proposed to compute LSH sketches efficiently; however, these methods may generate many sketches with empty bins, which can introduce large errors into similarity estimation and also limit their use for fast similarity search. To solve this problem, we propose a novel densification method, BiDens. Compared with existing densification methods, BiDens fills a sketch's empty bins with the values of its non-empty bins more efficiently, in either the forward or the backward direction. Furthermore, it densifies empty bins so as to satisfy the densification principle (i.e., the LSH property). Theoretical analysis and experimental results on similarity estimation, fast similarity search, and kernel linearization using real-world datasets demonstrate that BiDens is up to 106 times faster than state-of-the-art methods while achieving the same or even better accuracy.
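The empty-bin problem described in the abstract, and the idea of filling empty bins from non-empty ones, can be illustrated with a small Python sketch. This is a simplified illustration under stated assumptions, not the paper's BiDens algorithm: `oph_sketch` is a toy One Permutation Hashing over integer elements, and `densify` picks a fill direction per bin by a seeded coin flip rather than the paper's hashed direction rule; all function names here are this example's own.

```python
import random

def oph_sketch(items, k=8, universe=2**32, seed=1):
    # One Permutation Hashing (OPH): hash every element once, split the
    # hash range into k equal-width bins, and keep the minimum hash value
    # landing in each bin. Sparse sets leave some bins empty (None).
    rng = random.Random(seed)
    a = rng.randrange(1, universe)
    b = rng.randrange(universe)
    width = universe // k
    bins = [None] * k
    for x in items:
        h = (a * x + b) % universe      # elements assumed to be ints here
        i = min(h // width, k - 1)
        if bins[i] is None or h < bins[i]:
            bins[i] = h
    return bins

def densify(bins, seed=7):
    # Toy densification: each empty bin copies the nearest non-empty bin,
    # walking either forward or backward (chosen per bin), circularly.
    if all(v is None for v in bins):
        raise ValueError("cannot densify an all-empty sketch")
    rng = random.Random(seed)
    k = len(bins)
    out = list(bins)
    for i in range(k):
        if out[i] is not None:
            continue
        step = rng.choice((1, -1))      # stand-in for a hashed direction
        j = (i + step) % k
        while bins[j] is None:
            j = (j + step) % k
        out[i] = bins[j]
    return out

sketch = oph_sketch({3, 17, 42}, k=8)   # 3 elements fill at most 3 of 8 bins
dense = densify(sketch)                 # every bin now holds a value
```

Given two densified sketches of equal length, the fraction of positions where they agree serves as the similarity estimate, and the absence of empty bins is what makes such bin-wise comparison (and bucketing for fast search) well defined.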

Supplementary Material

MP4 File (3448016.3452833.mp4)




Published In

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN:9781450383431
DOI:10.1145/3448016
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. locality sensitive hashing
  2. similarity
  3. sketch

Qualifiers

  • Research-article

Funding Sources

  • Natural Science Basic Research Plan in Zhejiang Province of China
  • National Natural Science Foundation of China
  • Shenzhen Basic Research Grant
  • MoE-CMCC Artificial Intelligence Project
  • Natural Science Basic Research Plan in Shaanxi Province of China

Conference

SIGMOD/PODS '21

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%


Cited By

  • (2025) Efficient and Secure Traffic Scheduling Based on Private Sketch. Mathematics 13(2):288. DOI: 10.3390/math13020288. Online publication date: 17-Jan-2025
  • (2024) Pb-Hash: Partitioned b-bit Hashing. Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval, 239-246. DOI: 10.1145/3664190.3672523. Online publication date: 2-Aug-2024
  • (2024) A Unified Framework for Mining Batch and Periodic Batch in Data Streams. IEEE Transactions on Knowledge and Data Engineering 36(11):5544-5561. DOI: 10.1109/TKDE.2024.3399024. Online publication date: Nov-2024
  • (2024) Priority Sketch: A Priority-aware Measurement Framework. 2024 International Conference on Satellite Internet (SAT-NET), 18-23. DOI: 10.1109/SAT-NET62854.2024.00012. Online publication date: 25-Oct-2024
  • (2024) BitMatcher: Bit-level Counter Adjustment for Sketches. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 4815-4827. DOI: 10.1109/ICDE60146.2024.00366. Online publication date: 13-May-2024
  • (2024) CodingSketch: A Hierarchical Sketch with Efficient Encoding and Recursive Decoding. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 1592-1605. DOI: 10.1109/ICDE60146.2024.00130. Online publication date: 13-May-2024
  • (2023) Double-Anonymous Sketch: Achieving Top-K-fairness for Finding Global Top-K Frequent Items. Proceedings of the ACM on Management of Data 1(1):1-26. DOI: 10.1145/3588933. Online publication date: 30-May-2023
  • (2023) HyperCalm Sketch: One-Pass Mining Periodic Batches in Data Streams. 2023 IEEE 39th International Conference on Data Engineering (ICDE), 14-26. DOI: 10.1109/ICDE55515.2023.00009. Online publication date: Apr-2023
  • (2022) Stingy Sketch. Proceedings of the VLDB Endowment 15(7):1426-1438. DOI: 10.14778/3523210.3523220. Online publication date: 1-Mar-2022
