Abstract
In the era of massive multimedia data, the curse of dimensionality and the I/O performance bottleneck have become two major challenges for disk-based Approximate Nearest Neighbor (ANN) search. Hashing is a popular way to overcome the curse of dimensionality, and one promising hashing technique is Locality Sensitive Hashing (LSH). However, most existing LSH indexes incur significant I/O cost during search because each I/O access yields few NN candidate hits. We propose a novel method, SC-LSH (SortingCodes-LSH), which combines LSH with another hashing technique (namely, discriminative short codes) to raise the hit rate of NN candidates and thereby further boost ANN search performance. First, we build an LSH index and sort all the compound hash keys according to a linear order, so that similar NN candidates are stored close to one another. Then we generate product quantization (PQ) codes and use them as candidates in place of the original data points. These space-efficient short codes let us retrieve significantly more candidates with far fewer I/O operations. Moreover, based on theoretical and empirical studies of a series of space-filling curves, we choose the Gray curve as the linear order because it produces a better local distribution of candidate data. Together, these techniques significantly increase the number of NN hits per I/O, which greatly reduces the number of I/O accesses required. Meanwhile, thanks to their good similarity-preserving ability, PQ codes are precise enough to discriminate NNs and thus guarantee accuracy. An empirical study demonstrates that, compared with four state-of-the-art methods, SC-LSH achieves the best accuracy with significantly lower I/O cost and space consumption. In fact, depending on the dataset, the I/O cost (resp., space consumption) of our scheme is only 5%-20% (resp., 1%-20%) of that of the other methods.
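The idea of sorting compound hash keys along a Gray curve can be illustrated with a minimal Python sketch. It is not the construction used by SC-LSH, whose details the abstract does not give. It assumes that each compound key is a tuple of non-negative integers that fit in `bits` bits, and that the Gray order is obtained by interleaving the component bits and reading the result as a reflected Gray code. All function names are hypothetical.

```python
def interleave_bits(key, bits):
    """Interleave the bits of every key component, most significant bit first."""
    code = 0
    for bit in range(bits - 1, -1, -1):
        for component in key:
            code = (code << 1) | ((component >> bit) & 1)
    return code


def gray_to_rank(code):
    """Inverse reflected Gray code: position of `code` in the Gray sequence."""
    rank = 0
    while code:
        rank ^= code
        code >>= 1
    return rank


def gray_order_rank(key, bits):
    """Rank of a compound key along the (assumed) Gray curve."""
    return gray_to_rank(interleave_bits(key, bits))


def sort_by_gray_order(keys, bits):
    """Sort compound keys so that neighbours on the curve are adjacent."""
    return sorted(keys, key=lambda k: gray_order_rank(k, bits))


if __name__ == "__main__":
    # Toy compound keys with two components of one bit each (illustrative only).
    keys = [(1, 0), (0, 0), (1, 1), (0, 1)]
    print(sort_by_gray_order(keys, bits=1))
```

When every possible key is present, two keys that are adjacent in this order differ in exactly one bit of one component. This is the locality property that makes a Gray-style order attractive for laying out similar keys on neighbouring disk pages.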
Notes
In this work, we assume that each component of a data point is a one-word-long integer or float.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (Nos. 61472298, 61672408, 61702403, U1135002), China 111 Project (No. B16037), China Postdoctoral Science Foundation (No. 2018M633473), Natural Science Basic Research Plan in Shaanxi Province of China (Program No. 2015JQ6227), SRF for ROCS, SEM, the Fundamental Research Funds for the Central Universities (No. JB170308, etc.) and the Innovation Fund of Xidian University.
Cite this article
Xiaokang, F., Jiangtao, C., Hui, L. et al. An efficient LSH indexing on discriminative short codes for high-dimensional nearest neighbors. Multimed Tools Appl 78, 24407–24429 (2019). https://doi.org/10.1007/s11042-018-6987-0
DOI: https://doi.org/10.1007/s11042-018-6987-0