Bidirectionally Densifying LSH Sketches with Empty Bins

Published: 18 June 2021
DOI: 10.1145/3448016.3452833

ABSTRACT

As an efficient tool for approximate similarity computation and search, Locality Sensitive Hashing (LSH) has been widely used in many research areas, including databases, data mining, information retrieval, and machine learning. Classical LSH methods typically need to perform hundreds or even thousands of hashing operations to compute the LSH sketch of each input item (e.g., a set or a vector), which is too expensive, and even impractical, for applications that must process data in real time. To address this issue, several fast methods, such as OPH and BCWS, have been proposed to compute LSH sketches efficiently; however, these methods may generate many sketches with empty bins, which can introduce large errors into similarity estimation and also limit their use for fast similarity search. To solve this problem, we propose a novel densification method, BiDens. Compared with existing densification methods, BiDens fills a sketch's empty bins with values of its non-empty bins more efficiently, in either the forward or backward direction. Furthermore, it densifies empty bins in a way that satisfies the densification principle (i.e., the LSH property). Theoretical analysis and experimental results on similarity estimation, fast similarity search, and kernel linearization using real-world datasets demonstrate that BiDens is up to 106 times faster than state-of-the-art methods while achieving the same or even better accuracy.
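
To make the setting concrete, below is a minimal, illustrative Python sketch, not the paper's BiDens algorithm: a One Permutation Hashing (OPH) sketch whose empty bins are marked None, followed by a simple bidirectional filling step that copies each empty bin from the nearest non-empty bin, scanning forward or backward according to a per-bin random direction. The function names (oph_sketch, densify_bidirectionally), parameters, and toy data are hypothetical and only illustrate the general idea of densification.

```python
import random

def oph_sketch(item_set, k, universe_size, seed=0):
    # One Permutation Hashing: permute the universe once, split the permuted
    # range into k equal-width bins, and keep the minimum permuted value in
    # each bin. Bins that receive no element stay empty (None).
    rng = random.Random(seed)
    perm = list(range(universe_size))
    rng.shuffle(perm)  # one shared permutation, reused for all items
    bin_width = (universe_size + k - 1) // k
    sketch = [None] * k
    for x in item_set:
        h = perm[x]
        b = h // bin_width
        if sketch[b] is None or h < sketch[b]:
            sketch[b] = h
    return sketch

def densify_bidirectionally(sketch, seed=0):
    # Illustrative densification: each empty bin copies the value of the
    # nearest non-empty bin, scanning circularly forward or backward according
    # to a per-bin random direction shared by all items (fixed seed).
    k = len(sketch)
    rng = random.Random(seed)
    directions = [rng.choice((+1, -1)) for _ in range(k)]
    dense = list(sketch)
    for b in range(k):
        if dense[b] is not None:
            continue
        j = b
        for _ in range(k):
            j = (j + directions[b]) % k
            if sketch[j] is not None:
                dense[b] = sketch[j]
                break
    return dense

# Usage: estimate Jaccard similarity by comparing densified sketches position-wise.
A = {1, 4, 7, 20, 33, 41}
B = {1, 4, 8, 20, 41, 50}
sa = densify_bidirectionally(oph_sketch(A, k=16, universe_size=64))
sb = densify_bidirectionally(oph_sketch(B, k=16, universe_size=64))
estimate = sum(x == y for x, y in zip(sa, sb)) / len(sa)
print(f"estimated Jaccard: {estimate:.2f}, exact: {len(A & B) / len(A | B):.2f}")
```

A practical implementation would replace the explicit permutation with a hash function and would choose the fill directions and offsets so that the densified sketch provably retains the LSH property, which is what the paper's densification principle requires.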

Supplemental Material

3448016.3452833.mp4 (MP4 video, 36.6 MB)

Published in

SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
June 2021, 2969 pages
ISBN: 9781450383431
DOI: 10.1145/3448016
Copyright © 2021 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
