Signal Processing, Volume 93, Issue 8, August 2013, Pages 2244-2255
An improved method of locality sensitive hashing for indexing large-scale and high-dimensional features

https://doi.org/10.1016/j.sigpro.2012.07.014

Abstract

In recent years, locality sensitive hashing (LSH) has been widely used as an effective and efficient index structure for multimedia signals. LSH was originally proposed to solve the high-dimensional approximate similarity search problem, and many variations of LSH have since been proposed for large-scale indexing. Much of this work focuses on improving query accuracy for skewed data distributions and reducing storage space. However, LSH still requires a final filtering step based on an exact similarity measure. When the dataset is large, the number of points to be filtered grows accordingly, so filtering speed becomes the bottleneck of query speed as the data scale increases. Furthermore, we observe a "Non-Uniform" phenomenon in the most popular Euclidean LSH which can degrade filtering speed dramatically. In this paper, a pivot-based algorithm is proposed to improve filtering speed by using the triangle inequality to prune the search process. Furthermore, a novel method to select an optimal pivot for even larger improvement is provided. Experimental results on two open large-scale datasets show that our method can significantly improve the query speed of Euclidean LSH.

Highlights

► Analyse the "Non-Uniform" problem of Euclidean LSH formally.
► Propose a pivot-based algorithm to accelerate the query process of Euclidean LSH.
► Provide an effective method to obtain the optimal pivot point.
► Our method can significantly accelerate the query process of Euclidean LSH.
► Verify the feasibility of accelerating our algorithm through sampling.

Introduction

Nearest Neighbors (NNs) search plays an important role in computer vision, machine learning, data mining, and information retrieval. However, for high-dimensional data, all known exact techniques for the similarity search problem fall prey to the curse of dimensionality [1]. In some cases, such as high-dimensional learning for video annotation, the curse of dimensionality can be mitigated by multimodality learning [2]. In most cases, Approximate Nearest Neighbor (ANN) algorithms have been shown to be effective approaches that drastically improve search speed while maintaining good precision. Locality sensitive hashing [3], [4], [5], [6] is one of the most popular ANN algorithms, and it has been successfully used for image retrieval [7], [8] and 3D object indexing [9], [10]. Euclidean LSH [4] is the most successful variation of basic LSH because it uses the popular Euclidean distance as its similarity metric. However, distance metrics beyond the Euclidean metric must be used in some practical applications [11], [12], [13], [14], so other variations have been proposed for different distance metrics. Indyk and Thaper [15] proposed embedding the EMD metric into the L2 norm and then using the original LSH scheme to find the nearest neighbor in Euclidean space. Gorisse et al. [1] present a new LSH scheme adapted to the χ2 distance for approximate nearest neighbor search in high-dimensional spaces. Although LSH achieves strong guarantees in theory, there are some drawbacks when using it in practice.

The main limitation of LSH is its large memory consumption: many hash tables are needed to keep both recall and precision high. In [3], [6], a large number of hash tables are used. When the dataset is large and the number of hash tables is also large, the index structure cannot be loaded into main memory to obtain the best performance. To reduce the storage space of LSH, several variations [16], [17], [18] based on a multi-probe strategy have been proposed. The first multi-probe strategy was proposed by entropy-based LSH [16]: by randomly generating neighbor points near the query point, additional hash buckets are probed and all probe results are merged. As a result, more points are returned and recall is improved, so fewer hash tables are needed. Multi-probe LSH [17] is inspired by and improves upon entropy-based LSH and the query-adaptive method [19]; it proposes a more efficient algorithm to generate an optimal probe sequence of hash buckets that are likely to contain points similar to the query. Unlike multi-probe LSH, which is based on likelihood criteria, a posteriori multi-probe LSH [18] puts forward a more reliable posterior model by taking into account some prior knowledge about the query and the searched points. This prior knowledge helps perform better quality control and accurately select the hash buckets to be probed. The multi-probe algorithms proposed in [16], [17], [18] save much storage space for LSH while keeping comparable query precision and recall. In recent years, new hashing-based methods which convert each database item into a compact binary code have been proposed to achieve faster query time with much less storage. Spectral hashing [20] learns data-dependent directions via principal component analysis to generate short binary codes. Wang et al. [21] propose a data-dependent projection learning method such that each hash function is designed to sequentially correct the errors made by the previous one. Motivated by Weiss et al. [20], Liu et al. [22] propose a graph-based hashing method which automatically discovers the neighborhood structure inherent in the data to learn appropriate compact codes in an unsupervised manner. Semi-Supervised Hashing (SSH) [23] learns efficient hash codes which can handle semantic similarity/dissimilarity among the data points; it is also much faster than existing supervised hashing methods and can easily be scaled to large datasets.
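The entropy-based multi-probe idea described above can be sketched as follows. This is a minimal illustration, not any author's implementation; the perturbation scale `r`, the number of perturbations, and all function names are assumptions:

```python
import numpy as np

def lsh_hash(point, a, b, w):
    """Euclidean LSH hash: quantized projection of `point` onto direction a."""
    return int(np.floor((np.dot(a, point) + b) / w))

def entropy_multi_probe_buckets(query, a, b, w, r=0.5, n_perturb=10, seed=0):
    """Collect extra buckets to probe by hashing random points near the query.

    Merging the results from all returned buckets raises recall, so fewer
    hash tables are needed overall -- the core idea of entropy-based LSH [16].
    """
    rng = np.random.default_rng(seed)
    buckets = {lsh_hash(query, a, b, w)}          # the query's own bucket
    for _ in range(n_perturb):
        neighbor = query + rng.normal(scale=r, size=query.shape)
        buckets.add(lsh_hash(neighbor, a, b, w))  # a nearby point's bucket
    return buckets
```

In a real index the same perturbation would be applied per hash table, and the candidate sets from all probed buckets merged before filtering.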

Another limitation of LSH is that query accuracy strongly depends on the selection of parameters. LSH forest [24] was proposed to eliminate the data-dependent parameters for which LSH must be constantly hand-tuned. LSH forest guarantees LSH's performance for skewed data distributions while retaining the same storage and query overhead, which makes it suitable for large-scale datasets. In particular, LSH forest can be constructed in main memory, on disk, in parallel systems, and in peer-to-peer systems.

LSH is efficient at indexing high-dimensional data, and the variations discussed above enable it to index large-scale datasets. As a result, LSH has become a popular index structure for large-scale and high-dimensional data. However, whatever method is used, a final filtering process based on an exact similarity measure is inevitable. When the dataset becomes very large, the number of points that must be filtered grows as well, and the cost of filtering becomes the main factor influencing query speed. In particular, the most popular Euclidean LSH [4] uses the quantized projection of a data point on a randomly selected direction as the hash value, which makes the number of points in some buckets significantly larger than in others. At the same time, a query will also be mapped to these buckets with high probability. This phenomenon, which we call "Non-Uniform", makes the cost of the filtering process significantly higher; when the data scale becomes very large, the problem is even worse. In this work, we propose a pivot-based algorithm which uses the triangle inequality to accelerate the filtering process of Euclidean LSH. The selection of the pivot point can significantly influence the effectiveness of this acceleration. Some index structures [25], [26], [27], [28] use base points, similar to our pivots, to accelerate the search process, but these base points are all selected randomly, and random selection cannot guarantee an optimal result. In fact, to date no explicit criterion for base point selection has been put forward.

This paper extends our previous work [29] and makes the following contributions: (1) a formal analysis of the "Non-Uniform" problem of Euclidean LSH, which can significantly decrease filtering efficiency; (2) a pivot-based algorithm using the triangle inequality to accelerate the filtering process of Euclidean LSH; (3) an effective method to obtain an optimal pivot point which is superior to other methods. The differences between this paper and our previous work [29] lie in the following: (1) a more complete analysis of related work; (2) a more rigorous and elaborate theoretical deduction; (3) an analysis of the factors that affect the performance of the proposed algorithm; (4) additional experiments on a new dataset to validate the efficiency of the proposed method; (5) verification of the feasibility of accelerating the proposed algorithm through sampling.

The rest of this paper is organised as follows. Section 2 introduces the background of our research. In Section 3, we propose our pivot-based filtering algorithm and the method to get an optimal pivot. Section 4 describes our experiments and Section 5 concludes this paper.

Section snippets

“Non-Uniform” phenomenon of Euclidean LSH

The basic idea of LSH is to hash similar points to the same bucket with higher probability than dissimilar points. Let S be the domain of data points and D be the distance measure. A function family H = {h: S → U} is called (r1, r2, p1, p2)-sensitive for D if, for any p, q ∈ S:

  if D(p, q) ≤ r1 then Pr[h(p) = h(q)] ≥ p1;
  if D(p, q) ≥ r2 then Pr[h(p) = h(q)] ≤ p2,

where p1 > p2 and r1 < r2 to ensure the function family H is useful. The hash functions used in LSH are defined as G = {g: S → U^k}, where g(p) = (h1(p), h2(p), …, hk(p)) and hi ∈ H. The hash value of
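For concreteness, the hash family of Euclidean LSH [4] can be sketched as below. This is a minimal illustration under the usual E2LSH conventions; the parameter names `w` (bucket width) and `k` (hashes per table) are assumptions here:

```python
import numpy as np

class EuclideanLSH:
    """g(p) = (h1(p), ..., hk(p)) with h_i(p) = floor((a_i . p + b_i) / w).

    Each a_i has i.i.d. Gaussian entries (a 2-stable distribution) and each
    b_i is uniform in [0, w), following the scheme of Datar et al. [4].
    """
    def __init__(self, dim, k, w, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(k, dim))    # k random projection directions
        self.b = rng.uniform(0.0, w, size=k)  # k random offsets
        self.w = w

    def g(self, p):
        # Quantized projections; the k-tuple is the bucket key.
        return tuple(np.floor((self.a @ p + self.b) / self.w).astype(int))
```

Because each h_i quantizes a projection onto a random direction, dense regions of the data project to dense intervals, which is precisely what gives rise to the "Non-Uniform" bucket occupancy analysed above.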

Pivot-based filtering algorithm

In this section, we present a method to get an optimal pivot and propose our pivot-based algorithm.
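The pruning rule underlying such a pivot-based filter can be sketched as follows. By the triangle inequality, |d(q, v) − d(x, v)| ≤ d(q, x) for any pivot v, so a candidate x can be discarded without computing d(q, x) whenever the left-hand side already exceeds the search radius. This is a minimal sketch of that rule only, with an externally supplied pivot; the paper's optimal pivot-selection method is developed separately, and all names here are assumptions:

```python
import numpy as np

def pivot_filter(query, candidates, pivot_dists, pivot, radius):
    """Return the candidates within `radius` of `query`, pruned via a pivot.

    pivot_dists[i] = d(candidates[i], pivot), precomputed at index time.
    Triangle inequality: |d(q, pivot) - d(x, pivot)| <= d(q, x), so when
    that lower bound exceeds `radius`, x cannot be a result and the exact
    distance computation is skipped.
    """
    dq = np.linalg.norm(query - pivot)  # one distance per query
    results = []
    for x, dx in zip(candidates, pivot_dists):
        if abs(dq - dx) > radius:
            continue                    # pruned without an exact distance
        if np.linalg.norm(query - x) <= radius:
            results.append(x)
    return results
```

The filter is exact (no true neighbor is ever pruned), and its benefit grows with how well the pivot spreads the candidates' pivot distances, which is why pivot selection matters.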

Experiment

To evaluate the performance of the proposed method, we conduct experiments on two benchmark datasets: ANN_SIFT1M (1 million points) [31] and NUS-WIDE (270k points) [41]. ANN_SIFT1M contains 1 million local SIFT descriptors extracted from random images, together with a query set of 10,000 data points, which we use directly as our query set. As the original SIFT points are not normalized and the norm of each data point is large (almost 500), we normalize these

Conclusion

LSH is efficient at indexing high-dimensional data, and its variations enable it to index large-scale datasets; it has thus become a popular index structure for large-scale and high-dimensional data. However, a final filtering process based on an exact similarity measure is inevitable in the query process. In this paper, we analyse the phenomenon we call "Non-Uniform", which dramatically degrades the query performance of the most popular Euclidean LSH. "Non-Uniform" will make a large proportion of queries

Acknowledgements

This work is supported by National Key Technology Research and Development Program of China (2012BAH39B02); Co-building Program of Beijing Municipal Education Commission.

References (41)

  • M. Wang et al., Semi-supervised kernel density estimation for video annotation, Computer Vision and Image Understanding (2009)
  • J. Uhlmann, Satisfying general proximity/similarity queries with metric trees, Information Processing Letters (1991)
  • D. Gorisse et al., Locality-sensitive hashing for chi2 distance, IEEE Transactions on Pattern Analysis and Machine Intelligence (2012)
  • M. Wang et al., Unified video annotation via multigraph learning, IEEE Transactions on Circuits and Systems for Video Technology (2009)
  • A. Gionis, P. Indyk, R. Motwani, Similarity search in high dimensions via hashing, in: Proceedings of 25th...
  • M. Datar, N. Immorlica, P. Indyk, V. Mirrokni, Locality-sensitive hashing scheme based on p-stable distributions, in:...
  • A. Andoni, P. Indyk, E2LSH: Exact Euclidean Locality Sensitive Hashing, 2004. URL...
  • A. Andoni et al., Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, Communications of the ACM (2008)
  • B. Kulis, K. Grauman, Kernelized locality-sensitive hashing for scalable image search, in: Proceedings of 12th...
  • Y. Ke, R. Sukthankar, L. Huston, Efficient near-duplicate detection and sub-image retrieval, in: ACM Conference on...
  • G. Shakhnarovich et al., Nearest-Neighbor Methods in Learning and Vision: Theory and Practice (2006)
  • B. Matei et al., Rapid object indexing using locality sensitive hashing and joint 3D-signature space estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence (2006)
  • M. Wang et al., Beyond distance measurement: constructing neighborhood similarity for video annotation, IEEE Transactions on Multimedia (2009)
  • M. Wang et al., Towards a relevant and diverse search of social images, IEEE Transactions on Multimedia (2010)
  • R. Hong, M. Wang, M. Xu, S. Yan, T.-S. Chua, Dynamic captioning: video accessibility enhancement for hearing impairment,...
  • P. Indyk, N. Thaper, Fast image retrieval via embeddings, in: Proceedings of the International Workshop on Statistical...
  • R. Panigrahy, Entropy based nearest neighbor search in high dimensions, in: Proceedings of the Seventeenth Annual...
  • Q. Lv, W. Josephson, Z. Wang, M. Charikar, K. Li, Multi-probe LSH: efficient indexing for high-dimensional similarity...
  • A. Joly, O. Buisson, A posteriori multi-probe locality sensitive hashing, in: Proceedings of the 16th ACM International...
  • H. Jegou, L. Amsaleg, C. Schmid, P. Gros, Query-adaptative locality sensitive hashing, in: International Conference on...