ABSTRACT
Hash-based similarity search reduces a continuous similarity relation to the binary concept "similar or not similar": two feature vectors are considered similar if they are mapped onto the same hash key. In terms of runtime performance this principle is unequaled, while at the same time being unaffected by dimensionality concerns. Similarity hashing is applied with great success to near similarity search in large document collections, and it is considered a key technology for near-duplicate detection and plagiarism analysis. This paper reveals the design principles behind hash-based search methods and presents them in a unified way. We introduce new stress statistics that are suited to analyzing the performance of hash-based search methods, and we explain the rationale of their effectiveness. Based on these insights, we show how optimum hash functions for similarity search can be derived. We also present new results of a comparative study of different hash-based search methods.
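The core principle stated in the abstract, that two feature vectors count as similar exactly when they receive the same hash key, can be illustrated with a minimal sketch. The abstract does not specify a hash function, so the sketch assumes random hyperplane hashing (sign of random projections) as a stand-in; the function names, the bit width, and the toy vectors are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random Gaussian hyperplanes in `dim` dimensions (assumed scheme)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def hash_key(vector, hyperplanes):
    """Map a feature vector to an integer key: one bit per hyperplane side."""
    key = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        key = (key << 1) | (1 if dot >= 0 else 0)
    return key


def build_index(vectors, hyperplanes):
    """Group item names by hash key; each bucket holds the 'similar' items."""
    index = defaultdict(list)
    for name, vec in vectors.items():
        index[hash_key(vec, hyperplanes)].append(name)
    return index


def query(index, vector, hyperplanes):
    """Return all indexed items whose key collides with the query's key."""
    return index.get(hash_key(vector, hyperplanes), [])


if __name__ == "__main__":
    planes = make_hyperplanes(dim=3, bits=8, seed=42)
    docs = {
        "a": [1.0, 2.0, 0.5],
        "b": [2.0, 4.0, 1.0],  # positive multiple of "a", same direction
    }
    index = build_index(docs, planes)
    print(query(index, [0.5, 1.0, 0.25], planes))
```

In this sketch a lookup costs one key computation plus one dictionary access, independent of the collection size, which reflects the runtime argument made above. Vectors "a" and "b" always share a bucket because the sign of each projection is unchanged by positive scaling; for vectors that are merely close, a collision is only probable, not guaranteed.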
Index Terms
- Principles of hash-based text retrieval