ABSTRACT
Emerging deep learning techniques often map complex data objects (e.g., images, documents) to compact binary vectors (i.e., hash codes) for efficient similarity search. In this paper, we study the problem of indexing large-scale binary databases to support fast Hamming distance-based similarity queries. Existing Hamming space indices usually divide long binary vectors into short disjoint pieces and apply the Pigeonhole Principle to prune unnecessary candidates. We relax the disjoint partition constraint by allowing dimension redundancy, which yields a tighter pruning bound that we call the Augmented Pigeonhole Principle (APP). Intuitively, APP enables more optimization opportunities by capturing the correlation between database and query workloads. Based on APP, we propose HAP, an efficient Hamming space index framework that supports both Hamming range queries and k-NN queries.
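For readers unfamiliar with the baseline that APP tightens, the following is a minimal sketch of the classical (disjoint-partition) pigeonhole filter for Hamming range queries. It is not the HAP index or APP itself. It relies only on the standard fact that if two codes differ in at most τ bits and are split into m disjoint pieces, at least one pair of corresponding pieces differs in at most ⌊τ/m⌋ bits. All function names are hypothetical, codes are stored as Python integers, and the code width is assumed to be divisible by m.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Hamming distance between two codes stored as integers."""
    return bin(a ^ b).count("1")


def split(code, n_bits, m):
    """Split an n_bits-wide code into m disjoint pieces (assumes m divides n_bits)."""
    width = n_bits // m
    mask = (1 << width) - 1
    return [(code >> (i * width)) & mask for i in range(m)]


def neighbors(value, width, radius):
    """Yield every width-bit value within Hamming distance `radius` of `value`."""
    for r in range(radius + 1):
        for positions in combinations(range(width), r):
            flipped = value
            for p in positions:
                flipped ^= 1 << p
            yield flipped


def build_index(db, n_bits, m):
    """One hash table per piece position: piece value -> list of record ids."""
    index = [defaultdict(list) for _ in range(m)]
    for j, code in enumerate(db):
        for i, piece in enumerate(split(code, n_bits, m)):
            index[i][piece].append(j)
    return index


def range_query(index, db, q, n_bits, m, tau):
    """Return ids of all codes within Hamming distance tau of q."""
    width = n_bits // m
    radius = tau // m  # pigeonhole bound: some piece differs in <= floor(tau / m) bits
    candidates = set()
    for i, piece in enumerate(split(q, n_bits, m)):
        for v in neighbors(piece, width, radius):
            candidates.update(index[i].get(v, ()))
    # Verification step: the filter may admit false positives, never false negatives.
    return sorted(j for j in candidates if hamming(db[j], q) <= tau)


db = [0b00000000, 0b00000011, 0b11110000, 0b11111111]
index = build_index(db, n_bits=8, m=2)
result = range_query(index, db, q=0b00000001, n_bits=8, m=2, tau=2)
print(result)  # [0, 1]
```

In this toy run the filter admits record 2 as a candidate (its low piece is within one bit of the query's low piece), and the verification step discards it. A tighter pruning bound aims to reduce exactly this kind of wasted candidate.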
To guide index construction and run-time query optimization, we introduce a novel deep learning (DL)-based query cardinality estimator named SimCardNet. To further reduce the index space cost, we propose a learned index compression scheme that combines piece-wise linear approximation (PLA) and Elias-Fano encoding. We also study the problem of optimizing the execution time of a batch of queries using our index framework. Experimental results on large-scale binary databases show that our indexing scheme outperforms state-of-the-art baselines in both space and time efficiency.
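As background for the compression scheme mentioned above, here is a minimal sketch of plain Elias-Fano encoding for a sorted list of non-negative integers (for example, a posting list of record ids). It does not reproduce the paper's learned scheme or its combination with PLA. The function names are hypothetical, and the upper-bit vector is held in a single Python integer for brevity rather than in a succinct bit vector with select support.

```python
def ef_encode(values, universe):
    """Encode a non-decreasing list of integers in [0, universe).

    Each value is split into `low_bits` low bits, stored verbatim, and a
    high part, stored in unary: element i with high part h sets bit h + i.
    """
    n = len(values)
    # Standard choice: low_bits = floor(log2(universe / n)), clamped at 0.
    low_bits = max(0, (universe // n).bit_length() - 1) if n else 0
    low_mask = (1 << low_bits) - 1
    lows = [v & low_mask for v in values]
    high = 0
    for i, v in enumerate(values):
        high |= 1 << ((v >> low_bits) + i)
    return low_bits, lows, high, n


def ef_decode(low_bits, lows, high, n):
    """Recover the original list by scanning the unary-coded high parts."""
    out = []
    pos = 0
    i = 0
    while i < n:
        if (high >> pos) & 1:
            # The i-th set bit sits at position h + i, so h = pos - i.
            out.append(((pos - i) << low_bits) | lows[i])
            i += 1
        pos += 1
    return out


ids = [3, 4, 7, 13, 14, 15, 21, 43]
encoded = ef_encode(ids, universe=44)
decoded = ef_decode(*encoded)
```

With n values drawn from a universe of size u, this layout uses roughly n⌈log2(u/n)⌉ + 2n bits, which is why it is a common choice for compressing sorted id lists.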
Index Terms
- HAP: An Efficient Hamming Space Index Based on Augmented Pigeonhole Principle