DOI: 10.1145/1277741.1277832

Principles of hash-based text retrieval

Published: 23 July 2007

ABSTRACT

Hash-based similarity search reduces a continuous similarity relation to the binary concept "similar or not similar": two feature vectors are considered similar if they are mapped onto the same hash key. In terms of runtime performance this principle is unequaled, and at the same time it is unaffected by dimensionality concerns. Similarity hashing is applied with great success for near similarity search in large document collections, and it is considered a key technology for near-duplicate detection and plagiarism analysis. This paper reveals the design principles behind hash-based search methods and presents them in a unified way. We introduce new stress statistics that are suited to analyzing the performance of hash-based search methods, and we explain the rationale for their effectiveness. Based on these insights, we show how optimum hash functions for similarity search can be derived. We also present new results of a comparative study of different hash-based search methods.
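
The binary notion of similarity described in the abstract can be illustrated with a minimal sketch. It uses random-hyperplane hashing, a well-known locality-sensitive scheme chosen here purely for illustration; it is not claimed to be the construction derived in the paper. All function names (`make_hyperplanes`, `hash_key`, `build_index`, `query`) and parameters are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random hyperplane normals in `dim` dimensions (illustrative choice)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def hash_key(vector, hyperplanes):
    """Map a feature vector to an integer key: one bit per hyperplane side."""
    key = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        key = (key << 1) | (1 if dot >= 0 else 0)
    return key


def build_index(vectors, hyperplanes):
    """Group document ids by hash key; `vectors` maps doc id -> feature vector."""
    index = defaultdict(list)
    for doc_id, vec in vectors.items():
        index[hash_key(vec, hyperplanes)].append(doc_id)
    return index


def query(index, vector, hyperplanes):
    """Return the ids whose key collides with the query's key ("similar")."""
    return index.get(hash_key(vector, hyperplanes), [])
```

In this sketch a query costs one key computation plus one dictionary lookup, regardless of collection size. Vectors pointing in the same direction always collide, while vectors at a small angle collide only with some probability, which is why practical systems typically combine several independent hash functions.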

    • Published in

      SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
      July 2007
      946 pages
      ISBN: 9781595935977
      DOI: 10.1145/1277741

      Copyright © 2007 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 792 of 3,983 submissions, 20%