An improved method of locality sensitive hashing for indexing large-scale and high-dimensional features
Highlights
- Analyse the “Non-Uniform” problem of Euclidean LSH formally.
- Propose a pivot-based algorithm to accelerate the query process of Euclidean LSH.
- Provide an effective method to get the optimal pivot point.
- Our method can significantly accelerate the query process of Euclidean LSH.
- Verify the feasibility of accelerating our algorithm through sampling.
Introduction
Nearest Neighbor (NN) search plays an important role in computer vision, machine learning, data mining, and information retrieval. However, for high-dimensional data, all known techniques for the similarity search problem fall prey to the curse of dimensionality [1]. In some cases, such as high-dimensional learning for video annotation, the curse of dimensionality can be mitigated by multimodality learning [2]. In most cases, Approximate Nearest Neighbor (ANN) algorithms have proven effective at drastically improving search speed while maintaining good precision. Locality Sensitive Hashing (LSH) [3], [4], [5], [6] is one of the most popular ANN algorithms. To date, LSH has been successfully used for image retrieval [7], [8] and 3D object indexing [9], [10]. Euclidean LSH [4] is the most successful variation of basic LSH because it uses the popular Euclidean distance as its similarity metric. However, distance metrics beyond the Euclidean metric are often required in practical applications [11], [12], [13], [14], so other variations have been proposed for different distance metrics. Indyk and Thaper [15] proposed embedding the EMD metric into the L2 norm and then using the original LSH scheme to find nearest neighbors in Euclidean space. Gorisse et al. [1] present a new LSH scheme adapted to the chi2 distance for approximate nearest neighbor search in high-dimensional spaces. Although LSH achieves excellent results in theory, it has several drawbacks in practice.
The main limitation of LSH is its large memory consumption: many hash tables are needed to keep both recall and precision high, and in [3], [6] a large number of hash tables are used. When the dataset is large and the number of hash tables is also large, the index structure cannot be loaded into main memory, preventing the best performance. To reduce the storage cost of LSH, several variations [16], [17], [18] based on a multi-probe strategy have been put forward. The first multi-probe strategy was proposed by entropy-based LSH [16]: by randomly generating neighbor points near the query point, additional hash buckets are probed and all probe results are merged. As a result, more points are returned and recall improves, so fewer hash tables are needed. Multi-probe LSH [17] is inspired by and improves upon entropy-based LSH and the query-adaptive method [19]; it proposes a more efficient algorithm for generating an optimal probe sequence of hash buckets that are likely to contain points similar to the query. Unlike multi-probe LSH, which is based on a likelihood criterion, a posteriori multi-probe LSH [18] puts forward a more reliable posterior model by taking into account prior knowledge about the query and the searched points. This prior knowledge enables better quality control and a more accurate selection of the hash buckets to be probed. The multi-probe algorithms in [16], [17], [18] save much storage space for LSH while keeping comparable query precision and recall. In recent years, new hashing-based methods that convert each database item into a compact binary code have been proposed to achieve faster query times with much less storage. Spectral hashing [20] learns data-dependent directions via principal component analysis to generate short binary codes. Wang et al. [21] propose a data-dependent projection learning method in which each hash function is designed to sequentially correct the errors made by the previous one. Motivated by Weiss et al. [20], Liu et al. [22] propose a graph-based hashing method that automatically discovers the neighborhood structure inherent in the data to learn appropriate compact codes in an unsupervised manner. Semi-Supervised Hashing (SSH) [23] learns efficient hash codes that can handle semantic similarity/dissimilarity among data points; it is also much faster than existing supervised hashing methods and scales easily to large datasets.
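Returning to the multi-probe strategy described above, the core idea is to probe not only the query's own bucket but also the neighboring buckets a near neighbor is most likely to fall into. The following is a minimal sketch of that idea (the enumeration order, which perturbs one coordinate of the hash key at a time by ±1, is a simplification of the optimal probe sequences derived in [17]):

```python
import itertools


def probe_sequence(hash_key, max_probes):
    """Enumerate buckets to probe for a query's hash key.

    hash_key: tuple of quantized projections (one per hash function).
    Yields the exact bucket first, then buckets obtained by
    perturbing 1, 2, ... coordinates of the key by +/-1, stopping
    after max_probes buckets in total.
    """
    yield hash_key
    count = 1
    for n in range(1, len(hash_key) + 1):
        for idx in itertools.combinations(range(len(hash_key)), n):
            for signs in itertools.product((-1, 1), repeat=n):
                if count >= max_probes:
                    return
                key = list(hash_key)
                for i, s in zip(idx, signs):
                    key[i] += s
                yield tuple(key)
                count += 1


# e.g. probing 4 buckets for the key (3, 5):
# (3, 5), (2, 5), (4, 5), (3, 4)
```

Because extra buckets are probed per query, each hash table yields more candidates, which is what lets multi-probe schemes cut the number of tables.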
Another limitation of LSH is that query accuracy strongly depends on parameter selection. LSH forest [24] was proposed to eliminate the data-dependent parameters for which LSH must be constantly hand-tuned. LSH forest guarantees LSH's performance for skewed data distributions while retaining the same storage and query overhead, which makes it suitable for large-scale datasets. Notably, LSH forest can be constructed in main memory, on disk, in parallel systems, and in peer-to-peer systems.
LSH is efficient for indexing high-dimensional data, and the variations discussed above allow it to index large-scale datasets. As a result, LSH has become a popular index structure for large-scale and high-dimensional datasets. However, whatever method is used, a final filtering process based on an exact similarity measure is inevitable. When the dataset becomes very large, the number of points that must be filtered becomes large too; in this case, the cost of filtering is the main factor influencing query speed. In particular, the most popular Euclidean LSH [4] uses the quantized projection of a data point onto a randomly selected direction as the hash value, which makes the number of points in some buckets significantly larger than in others. At the same time, a query will also be mapped to these buckets with high probability. This phenomenon, which we call “Non-Uniform”, makes the filtering process significantly more expensive, and the problem worsens as the data scale grows. In this work, we propose a pivot-based algorithm that uses the triangle inequality to accelerate the filtering process of Euclidean LSH. The selection of the pivot point significantly influences the acceleration effectiveness. Some index structures [25], [26], [27], [28] use base points, similar to our pivots, to accelerate the search process, but these base points are all selected randomly, and random selection cannot guarantee the optimal result. In fact, to date no explicit criterion has been put forward for base point selection.
This paper extends our previous work [29] and makes the following contributions: (1) a formal analysis of the “Non-Uniform” problem of Euclidean LSH, which can significantly decrease filtering efficiency; (2) a pivot-based algorithm that uses the triangle inequality to accelerate the filtering process of Euclidean LSH; (3) an effective method to obtain an optimal pivot point, which is superior to other methods. This paper differs from our previous work [29] in the following respects: (1) a more complete analysis of related work; (2) a more rigorous and elaborate theoretical deduction; (3) an analysis of the factors that affect the performance of the proposed algorithm; (4) additional experiments on a new dataset to validate the efficiency of the proposed method; (5) verification of the feasibility of accelerating the proposed algorithm through sampling.
The rest of this paper is organised as follows. Section 2 introduces the background of our research. In Section 3, we propose our pivot-based filtering algorithm and the method to get an optimal pivot. Section 4 describes our experiments and Section 5 concludes this paper.
Section snippets
“Non-Uniform” phenomenon of Euclidean LSH
The basic idea of LSH is to hash similar points into the same bucket with higher probability than dissimilar points. Let $S$ be the domain of data points and $D$ be the distance measure. A function family $H = \{h : S \to U\}$ is called $(r_1, r_2, p_1, p_2)$-sensitive for $D$ if for any $p, q \in S$:

if $D(p, q) \le r_1$ then $\Pr_H[h(p) = h(q)] \ge p_1$;
if $D(p, q) \ge r_2$ then $\Pr_H[h(p) = h(q)] \le p_2$;

where $r_1 < r_2$ and $p_1 > p_2$ to ensure the function family $H$ is useful. The hash function used in Euclidean LSH is defined as $h_{a,b}(v) = \lfloor (a \cdot v + b)/w \rfloor$, where $a$ is a random vector whose entries are drawn independently from a 2-stable (Gaussian) distribution and $b$ is drawn uniformly from $[0, w)$. The hash value of
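The Euclidean LSH hash function above can be sketched as follows (a minimal illustration assuming the standard scheme of [4]; the width `w = 4.0` is an arbitrary choice for the example, not a recommended setting):

```python
import numpy as np

rng = np.random.default_rng(0)


def make_hash(dim, w):
    """Build one Euclidean LSH function h(v) = floor((a.v + b) / w).

    a has i.i.d. Gaussian (2-stable) entries, so a.v preserves
    Euclidean distances in distribution; b is uniform in [0, w).
    """
    a = rng.standard_normal(dim)
    b = rng.uniform(0.0, w)
    return lambda v: int(np.floor((np.dot(a, v) + b) / w))


h = make_hash(dim=128, w=4.0)
v = rng.standard_normal(128)
u = v + 0.01 * rng.standard_normal(128)  # a point very close to v
# Nearby points land in the same (or an adjacent) bucket with high
# probability; distant points rarely collide.
```

In practice, several such functions are concatenated into a compound key per table, and multiple tables are used, which is exactly what drives the memory cost discussed in the introduction.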
Pivot-based filtering algorithm
In this section, we present a method to get an optimal pivot and propose our pivot-based algorithm.
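To illustrate the filtering step, the following is a minimal sketch of triangle-inequality pruning with a single pivot (the function and variable names are ours for illustration; the paper's pivot-selection method is not reproduced here):

```python
import numpy as np


def pivot_filter(query, candidates, pivot, r):
    """Filter LSH bucket candidates within radius r of the query.

    By the triangle inequality, d(q, x) >= |d(q, pivot) - d(x, pivot)|,
    so any candidate whose precomputed pivot distance differs from the
    query's by more than r can be rejected without computing d(q, x).
    """
    dqp = np.linalg.norm(query - pivot)
    results = []
    for x, dxp in candidates:  # dxp is precomputed at indexing time
        if abs(dqp - dxp) > r:  # cheap lower-bound rejection
            continue
        if np.linalg.norm(query - x) <= r:  # exact check on survivors
            results.append(x)
    return results
```

Because `dxp` is computed once when a point is inserted, each rejected candidate costs a single subtraction instead of a full d-dimensional distance computation, which is where the speedup over plain exhaustive filtering comes from.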
Experiment
To evaluate the performance of the proposed method, we conduct experiments on two benchmark datasets: ANN_SIFT1M (1 million points) [31] and NUS-WIDE (270k points) [41]. ANN_SIFT1M contains 1 million local SIFT descriptors extracted from random images. It also provides a query set of 10,000 data points, which we use directly as our query set. As the original SIFT points are not normalized and the norm of each data point is too large (almost 500), we normalize these
Conclusion
LSH is efficient for indexing high-dimensional data and its variations allow it to index large-scale datasets; thus it has become a popular index structure for large-scale and high-dimensional datasets. However, a final filtering process based on an exact similarity measure is inevitable in the query process. In this paper, we analyse the phenomenon we call “Non-Uniform” that dramatically degrades the query performance of the most popular Euclidean LSH. “Non-Uniform” will make a large proportion of queries
Acknowledgements
This work is supported by the National Key Technology Research and Development Program of China (2012BAH39B02) and the Co-building Program of the Beijing Municipal Education Commission.
References (41)
- et al., Semi-supervised kernel density estimation for video annotation, Computer Vision and Image Understanding (2009)
- Satisfying general proximity/similarity queries with metric trees, Information Processing Letters (1991)
- et al., Locality-sensitive hashing for chi2 distance, IEEE Transactions on Pattern Analysis and Machine Intelligence (2012)
- et al., Unified video annotation via multigraph learning, IEEE Transactions on Circuits and Systems for Video Technology (2009)
- A. Gionis, P. Indyk, R. Motwani, Similarity search in high dimensions via hashing, in: Proceedings of 25th...
- M. Datar, N. Immorlica, P. Indyk, V. Mirrokni, Locality-sensitive hashing scheme based on p-stable distributions, in:...
- A. Andoni, P. Indyk, E2LSH: Exact Euclidean Locality Sensitive Hashing, 2004. URL...
- et al., Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, Communications of the ACM (2008)
- B. Kulis, K. Grauman, Kernelized locality-sensitive hashing for scalable image search, in: Proceedings of 12th...
- Y. Ke, R. Sukthankar, L. Huston, Efficient near-duplicate detection and sub-image retrieval, in: ACM Conference on...
- Nearest-Neighbor Methods in Learning and Vision: Theory and Practice
- Rapid object indexing using locality sensitive hashing and joint 3D-signature space estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence
- Beyond distance measurement: constructing neighborhood similarity for video annotation, IEEE Transactions on Multimedia
- Towards a relevant and diverse search of social images, IEEE Transactions on Multimedia