DOI: 10.1145/3514221.3517880
research-article

HAP: An Efficient Hamming Space Index Based on Augmented Pigeonhole Principle

Published: 11 June 2022

ABSTRACT

Emerging deep learning techniques often map complex data objects (e.g., images, documents) to compact binary vectors (i.e., hash codes) for efficient similarity search. In this paper, we study the problem of indexing large-scale binary databases to support fast Hamming distance-based similarity queries. Existing Hamming space indices usually divide long binary vectors into short disjoint pieces and apply the Pigeonhole Principle to prune unnecessary candidates. In our work, we relax the disjoint partition constraint by allowing dimension redundancy, which yields a tighter pruning bound that we call the Augmented Pigeonhole Principle (APP). Intuitively, APP enables more optimization opportunities by capturing the correlation between the database and the query workload. Based on APP, we propose HAP, an efficient Hamming space index framework that supports both Hamming range queries and k-NN queries.
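As background for the disjoint-partition baseline mentioned above, the classic Pigeonhole Principle filter can be sketched as follows. If two codes differ in at most tau bits and each code is split into m disjoint pieces, at least one pair of aligned pieces differs in at most floor(tau / m) bits. The sketch below is an illustration of that baseline only, not of APP or of the HAP index; the function names (`build_index`, `range_query`, and so on) and the use of Python integers as binary codes are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Hamming distance between two codes stored as Python ints."""
    return bin(a ^ b).count("1")


def split(code, n_bits, m):
    """Split an n_bits-wide code into m disjoint pieces (n_bits divisible by m)."""
    width = n_bits // m
    mask = (1 << width) - 1
    return [(code >> (i * width)) & mask for i in range(m)]


def build_index(db, n_bits, m):
    """One hash table per piece position: piece value -> list of record ids."""
    tables = [defaultdict(list) for _ in range(m)]
    for idx, code in enumerate(db):
        for i, piece in enumerate(split(code, n_bits, m)):
            tables[i][piece].append(idx)
    return tables


def neighbors(piece, width, radius):
    """Yield every width-bit value within Hamming distance `radius` of piece."""
    for r in range(radius + 1):
        for bits in combinations(range(width), r):
            flipped = piece
            for b in bits:
                flipped ^= 1 << b
            yield flipped


def range_query(db, tables, query, tau, n_bits, m):
    """Return ids of all codes within Hamming distance tau of query."""
    width = n_bits // m
    piece_tau = tau // m  # pigeonhole bound on the closest piece
    candidates = set()
    for i, q_piece in enumerate(split(query, n_bits, m)):
        for value in neighbors(q_piece, width, piece_tau):
            candidates.update(tables[i].get(value, ()))
    # Verification step: the filter admits false positives but no false negatives.
    return sorted(i for i in candidates if hamming(db[i], query) <= tau)
```

For example, with the four 8-bit codes 0b00000000, 0b00000011, 0b11110000 and 0b11111111, two pieces, and tau = 2, a query for 0b00000001 probes each piece table with radius 1, collects the first three codes as candidates, and returns ids 0 and 1 after verification.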

To guide index construction and run-time query optimization, we introduce a novel deep learning-based query cardinality estimator named SimCardNet. To further reduce the index space cost, we propose a learned index compression scheme that combines piecewise linear approximation (PLA) with Elias-Fano encoding. We also study the problem of optimizing the execution time of a batch of queries using our index framework. Experimental results on large-scale binary databases show that our indexing scheme outperforms the state-of-the-art baselines in both space and time efficiency.
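To make the Elias-Fano component concrete, here is a minimal sketch of plain Elias-Fano encoding for a sorted list of integers. It is the textbook scheme only and does not reproduce the paper's learned compression, which combines it with PLA. The function names are hypothetical, and for readability the low bits are kept in a Python list and the high bits in a Python int rather than in packed bit arrays.

```python
def ef_encode(values, universe):
    """Encode a non-decreasing list of ints drawn from [0, universe)."""
    n = len(values)
    if n == 0:
        return 0, [], 0, 0
    # Number of low bits per element: floor(log2(universe / n)), at least 0.
    low_bits = max(0, (universe // n).bit_length() - 1)
    low_mask = (1 << low_bits) - 1
    lows = [v & low_mask for v in values]
    # High parts in unary: element i sets bit (high part + i) of the bitvector.
    high = 0
    for i, v in enumerate(values):
        high |= 1 << ((v >> low_bits) + i)
    return low_bits, lows, high, n


def ef_decode(low_bits, lows, high, n):
    """Recover the original sorted list from its Elias-Fano encoding."""
    out = []
    pos = 0
    i = 0
    while i < n:
        if (high >> pos) & 1:
            # The i-th set bit sits at position (high part + i).
            out.append(((pos - i) << low_bits) | lows[i])
            i += 1
        pos += 1
    return out
```

Each element costs `low_bits` bits for its low part, and the high bitvector needs about two bits per element, which is why the encoding suits sorted id lists such as the posting lists of an index.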


    • Published in

      SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data
      June 2022
      2597 pages
      ISBN: 9781450392495
      DOI: 10.1145/3514221

      Copyright © 2022 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States




      Acceptance Rates

      Overall Acceptance Rate: 785 of 4,003 submissions, 20%
