
Bidirectionally Densifying LSH Sketches with Empty Bins

Published: 18 June 2021

Abstract

As an efficient tool for approximate similarity computation and search, Locality Sensitive Hashing (LSH) has been widely used in many research areas, including databases, data mining, information retrieval, and machine learning. Classical LSH methods typically need to perform hundreds or even thousands of hashing operations to compute the LSH sketch of each input item (e.g., a set or a vector); this cost is too expensive, and even impractical, for applications that must process data in real time. To address this issue, several fast methods such as OPH and BCWS have been proposed to compute LSH sketches efficiently; however, these methods may generate many sketches with empty bins, which can introduce large errors into similarity estimation and also limit their use for fast similarity search. To solve this problem, we propose a novel densification method, BiDens. Compared with existing densification methods, BiDens fills a sketch's empty bins with the values of its non-empty bins more efficiently, in either the forward or the backward direction. Furthermore, it densifies empty bins so as to satisfy the densification principle (i.e., the LSH property). Theoretical analysis and experimental results on similarity estimation, fast similarity search, and kernel linearization using real-world datasets demonstrate that BiDens is up to 106 times faster than state-of-the-art methods while achieving the same or even better accuracy.
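The empty-bin problem described in the abstract, and the idea of filling empty bins from non-empty ones, can be illustrated with a small Python sketch. This is a simplified illustration under stated assumptions, not the paper's BiDens algorithm: `oph_sketch` is a toy One Permutation Hashing over integer elements, and `densify` picks a fill direction per bin by a seeded coin flip rather than the paper's hashed direction rule; all function names here are this example's own.

```python
import random

def oph_sketch(items, k=8, universe=2**32, seed=1):
    # One Permutation Hashing (OPH): hash every element once, split the
    # hash range into k equal-width bins, and keep the minimum hash value
    # landing in each bin. Sparse sets leave some bins empty (None).
    rng = random.Random(seed)
    a = rng.randrange(1, universe)
    b = rng.randrange(universe)
    width = universe // k
    bins = [None] * k
    for x in items:
        h = (a * x + b) % universe      # elements assumed to be ints here
        i = min(h // width, k - 1)
        if bins[i] is None or h < bins[i]:
            bins[i] = h
    return bins

def densify(bins, seed=7):
    # Toy densification: each empty bin copies the nearest non-empty bin,
    # walking either forward or backward (chosen per bin), circularly.
    if all(v is None for v in bins):
        raise ValueError("cannot densify an all-empty sketch")
    rng = random.Random(seed)
    k = len(bins)
    out = list(bins)
    for i in range(k):
        if out[i] is not None:
            continue
        step = rng.choice((1, -1))      # stand-in for a hashed direction
        j = (i + step) % k
        while bins[j] is None:
            j = (j + step) % k
        out[i] = bins[j]
    return out

sketch = oph_sketch({3, 17, 42}, k=8)   # 3 elements fill at most 3 of 8 bins
dense = densify(sketch)                 # every bin now holds a value
```

Given two densified sketches of equal length, the fraction of positions where they agree serves as the similarity estimate, and the absence of empty bins is what makes such bin-wise comparison (and bucketing for fast search) well defined.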

Supplementary Material

MP4 File (3448016.3452833.mp4)




Published In

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021
2969 pages
ISBN:9781450383431
DOI:10.1145/3448016
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. locality sensitive hashing
  2. similarity
  3. sketch

Qualifiers

  • Research-article

Funding Sources

  • Natural Science Basic Research Plan in Zhejiang Province of China
  • National Natural Science Foundation of China
  • Shenzhen Basic Research Grant
  • MoE-CMCC Artificial Intelligence Project
  • Natural Science Basic Research Plan in Shaanxi Province of China

Conference

SIGMOD/PODS '21

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%


Cited By

  • (2025) Efficient and Secure Traffic Scheduling Based on Private Sketch. Mathematics 13(2):288. DOI: 10.3390/math13020288. Online publication date: 17-Jan-2025
  • (2024) Pb-Hash: Partitioned b-bit Hashing. Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval, 239-246. DOI: 10.1145/3664190.3672523. Online publication date: 2-Aug-2024
  • (2024) A Unified Framework for Mining Batch and Periodic Batch in Data Streams. IEEE Transactions on Knowledge and Data Engineering 36(11):5544-5561. DOI: 10.1109/TKDE.2024.3399024. Online publication date: Nov-2024
  • (2024) Priority Sketch: A Priority-aware Measurement Framework. 2024 International Conference on Satellite Internet (SAT-NET), 18-23. DOI: 10.1109/SAT-NET62854.2024.00012. Online publication date: 25-Oct-2024
  • (2024) BitMatcher: Bit-level Counter Adjustment for Sketches. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 4815-4827. DOI: 10.1109/ICDE60146.2024.00366. Online publication date: 13-May-2024
  • (2024) CodingSketch: A Hierarchical Sketch with Efficient Encoding and Recursive Decoding. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 1592-1605. DOI: 10.1109/ICDE60146.2024.00130. Online publication date: 13-May-2024
  • (2023) Double-Anonymous Sketch: Achieving Top-K-fairness for Finding Global Top-K Frequent Items. Proceedings of the ACM on Management of Data 1(1):1-26. DOI: 10.1145/3588933. Online publication date: 30-May-2023
  • (2023) HyperCalm Sketch: One-Pass Mining Periodic Batches in Data Streams. 2023 IEEE 39th International Conference on Data Engineering (ICDE), 14-26. DOI: 10.1109/ICDE55515.2023.00009. Online publication date: Apr-2023
  • (2022) Stingy Sketch. Proceedings of the VLDB Endowment 15(7):1426-1438. DOI: 10.14778/3523210.3523220. Online publication date: 1-Mar-2022
