Article

Principles of hash-based text retrieval

Author:
Benno Stein

Bauhaus University Weimar, Weimar, Germany


SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, July 2007, Pages 527–534
https://doi.org/10.1145/1277741.1277832

Published: 23 July 2007

ABSTRACT

Hash-based similarity search reduces a continuous similarity relation to the binary concept "similar or not similar": two feature vectors are considered similar if they are mapped onto the same hash key. In terms of runtime performance this principle is unequaled, and at the same time it is unaffected by dimensionality concerns. Similarity hashing has been applied with great success to near similarity search in large document collections, and it is considered a key technology for near-duplicate detection and plagiarism analysis. This paper reveals the design principles behind hash-based search methods and presents them in a unified way. We introduce new stress statistics that are suited to analyzing the performance of hash-based search methods, and we explain the rationale behind their effectiveness. Based on these insights, we show how optimum hash functions for similarity search can be derived. We also present new results of a comparative study of different hash-based search methods.
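
The binary "same hash key" principle described in the abstract can be illustrated with a small sketch. It is not the paper's own construction: it assumes random-hyperplane hashing (one well-known locality-sensitive scheme for cosine similarity), and the names `make_hyperplanes`, `hash_key`, `build_index`, and `query` are hypothetical helpers introduced only for this example.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random hyperplanes (normal vectors) in `dim` dimensions.

    Assumption: Gaussian components, as in random-hyperplane hashing.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def hash_key(vector, hyperplanes):
    """Map a feature vector to an integer key, one bit per hyperplane.

    Each bit records on which side of the hyperplane the vector lies.
    """
    key = 0
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        key = (key << 1) | (1 if dot >= 0 else 0)
    return key


def build_index(vectors, hyperplanes):
    """Group document ids by hash key (the hash table acts as the index)."""
    index = defaultdict(list)
    for doc_id, vec in vectors.items():
        index[hash_key(vec, hyperplanes)].append(doc_id)
    return index


def query(index, vector, hyperplanes):
    """Return ids that collide with the query: "similar" in the binary sense."""
    return index.get(hash_key(vector, hyperplanes), [])


if __name__ == "__main__":
    planes = make_hyperplanes(dim=3, bits=8, seed=1)
    docs = {
        "a": [1.0, 2.0, 3.0],
        "b": [2.0, 4.0, 6.0],     # same direction as "a"
        "c": [-1.0, -2.0, -3.0],  # opposite direction
    }
    index = build_index(docs, planes)
    print(query(index, docs["a"], planes))
```

A lookup costs one key computation plus one table access, regardless of collection size, which is the source of the runtime advantage the abstract refers to. The price is the hard decision boundary: vectors that are similar but land in different buckets are missed, which is why practical systems typically combine several hash functions.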

Index Terms

Information systems → Information retrieval → Document representation

Published in
SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
July 2007, 946 pages
ISBN: 9781595935977
DOI: 10.1145/1277741
General Chairs: Wessel Kraaij (TNO, The Netherlands), Arjen P. de Vries (CWI, The Netherlands)
Program Chairs: Charles L. A. Clarke (University of Waterloo, Canada), Norbert Fuhr (University of Duisburg-Essen, Germany), Noriko Kando (National Institute of Informatics, Japan)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
dimension reduction
hash-based similarity search
locality-sensitive hashing
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%