research-article

HAP: An Efficient Hamming Space Index Based on Augmented Pigeonhole Principle

Authors:
Qiyu Liu, HKUST, Hong Kong, Hong Kong
Yanyan Shen, Shanghai Jiao Tong University, Shanghai, China
Lei Chen, HKUST, Hong Kong, Hong Kong
SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data, June 2022, Pages 917–930. https://doi.org/10.1145/3514221.3517880

Published: 11 June 2022

ABSTRACT

Emerging deep learning techniques often map complex data objects (e.g., images, documents) to compact binary vectors (i.e., hash codes) for efficient similarity search. In this paper, we study the problem of indexing large-scale binary databases to support fast Hamming distance-based similarity queries. Existing Hamming space indices usually divide long binary vectors into short disjoint pieces and apply the Pigeonhole Principle to prune unnecessary candidates. In our work, we relax the disjoint partition constraint by allowing dimension redundancy, which yields a tighter pruning bound named the Augmented Pigeonhole Principle (APP). Intuitively, APP enables more optimization opportunities by capturing the correlation between database and query workloads. Based on APP, we propose HAP, an efficient Hamming space index framework that supports both Hamming range queries and k-NN queries.
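As a point of reference, the classical disjoint-partition filter that existing indices rely on can be sketched as follows. This is only a minimal illustration of the basic Pigeonhole Principle (if two codes differ in at most τ bits and are split into m disjoint pieces, at least one piece differs in at most ⌊τ/m⌋ bits); it is not the APP bound or the HAP index. The names `PigeonholeIndex`, `chunks`, and `ball` are hypothetical, and the sketch assumes codes are stored as Python integers and that m divides the code length.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Hamming distance between two codes stored as Python ints."""
    return bin(a ^ b).count("1")


def chunks(code: int, n_bits: int, m: int) -> list[int]:
    """Split an n_bits-wide code into m disjoint, equal-width pieces."""
    width = n_bits // m
    mask = (1 << width) - 1
    return [(code >> (i * width)) & mask for i in range(m)]


def ball(chunk: int, width: int, radius: int):
    """Yield every width-bit value within Hamming distance `radius` of chunk."""
    for r in range(radius + 1):
        for positions in combinations(range(width), r):
            v = chunk
            for p in positions:
                v ^= 1 << p
            yield v


class PigeonholeIndex:
    """Hypothetical baseline: one hash table per disjoint piece."""

    def __init__(self, codes, n_bits: int, m: int):
        if n_bits % m != 0:
            raise ValueError("this sketch assumes m divides n_bits")
        self.codes, self.n_bits, self.m = list(codes), n_bits, m
        self.width = n_bits // m
        self.tables = [defaultdict(list) for _ in range(m)]
        for idx, code in enumerate(self.codes):
            for i, piece in enumerate(chunks(code, n_bits, m)):
                self.tables[i][piece].append(idx)

    def range_query(self, q: int, tau: int) -> list[int]:
        """Return sorted ids of all codes within Hamming distance tau of q."""
        radius = tau // self.m  # pigeonhole bound on the closest piece
        candidates = set()
        for i, piece in enumerate(chunks(q, self.n_bits, self.m)):
            for v in ball(piece, self.width, radius):
                candidates.update(self.tables[i].get(v, ()))
        # Verification step: the filter only guarantees no false negatives.
        return sorted(i for i in candidates if hamming(self.codes[i], q) <= tau)
```

For example, `PigeonholeIndex(codes, 32, 4).range_query(q, 5)` probes each of the four 8-bit tables with radius 1 and then verifies the surviving candidates against the full 32-bit distance.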

To guide index construction and run-time query optimization, we introduce a novel deep learning-based (DL-based) query cardinality estimator named SimCardNet. To further reduce the index space cost, we propose a learned index compression scheme that combines piecewise linear approximation (PLA) and Elias-Fano encoding. In addition, we study the problem of optimizing the execution time of a batch of queries using our index framework. The experimental results on large-scale binary databases reveal that our indexing scheme outperforms the state-of-the-art baselines in terms of both space and time efficiency.
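To make the compression building block concrete, the following is a minimal sketch of plain Elias-Fano encoding for a sorted list of non-negative integers, such as a posting list of record ids. It does not reproduce the learned scheme described above, which combines Elias-Fano with PLA; the function names `ef_encode` and `ef_decode` are hypothetical, and a Python integer stands in for the upper-bits bit vector.

```python
def ef_encode(values):
    """Encode a non-empty, non-decreasing list of non-negative ints.

    Returns (n, l, lows, upper): each value is split into l low bits,
    stored verbatim in `lows`, and a high part, stored in unary inside
    the bit vector `upper` (here a Python int).
    """
    n = len(values)
    if n == 0:
        raise ValueError("empty sequence")
    if values[0] < 0 or any(a > b for a, b in zip(values, values[1:])):
        raise ValueError("values must be non-negative and non-decreasing")
    u = values[-1] + 1                      # universe size
    l = max(0, (u // n).bit_length() - 1)   # floor(log2(u / n)) when u >= n
    mask = (1 << l) - 1
    lows = [v & mask for v in values]
    upper = 0
    for i, v in enumerate(values):
        # The i-th element sets bit (high + i); positions strictly increase.
        upper |= 1 << ((v >> l) + i)
    return n, l, lows, upper


def ef_decode(n, l, lows, upper):
    """Recover the original list by scanning the upper bit vector."""
    out, pos, i = [], 0, 0
    while i < n:
        if (upper >> pos) & 1:
            high = pos - i                  # number of zeros seen before this one
            out.append((high << l) | lows[i])
            i += 1
        pos += 1
    return out
```

A round trip such as `ef_decode(*ef_encode([2, 3, 5, 7, 11, 13, 24]))` returns the original list. In this layout the low parts take `n * l` bits and the upper bit vector takes roughly `2n` bits, which is why the encoding suits dense, sorted id lists.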

Index Terms

HAP: An Efficient Hamming Space Index Based on Augmented Pigeonhole Principle
1. Information systems
  1. Data management systems
    1. Data structures

Published in
SIGMOD '22: Proceedings of the 2022 International Conference on Management of Data
June 2022
2597 pages
ISBN: 9781450392495
DOI: 10.1145/3514221
General Chair: Zachary Ives, University of Pennsylvania (USA)
Program Chairs: Angela Bonifati, Lyon 1 University (France); Amr El Abbadi, University of California, Santa Barbara (USA)
Copyright © 2022 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
hamming space index
pigeonhole principle
similarity search
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%