DOI: 10.1145/2484838.2484842
research-article

HmSearch: an efficient hamming distance query processing algorithm

Published: 29 July 2013

ABSTRACT

Hamming distance measures the number of dimensions in which two vectors have different values. In applications such as pattern recognition, information retrieval, and databases, we often need to efficiently process Hamming distance queries, which retrieve the vectors in a database whose Hamming distance to a given query vector is at most k. Existing work on efficient Hamming distance query processing suffers from one or more of the following limitations: it is applicable only to tiny error thresholds, it cannot handle vectors whose value domain is large, or it cannot attain robust performance in the presence of data skew.
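
To make the query definition concrete, here is a minimal brute-force sketch. It is only a reference baseline that scans every vector, not the method proposed in the paper; the function names `hamming_distance` and `hamming_query` and the sample data are illustrative assumptions.

```python
from typing import List, Sequence


def hamming_distance(u: Sequence, v: Sequence) -> int:
    """Count the dimensions in which u and v hold different values."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(a != b for a, b in zip(u, v))


def hamming_query(database: List[Sequence], query: Sequence, k: int) -> List[Sequence]:
    """Return every database vector within Hamming distance k of the query.

    Linear scan: one distance computation per stored vector.
    """
    return [v for v in database if hamming_distance(v, query) <= k]


# Hypothetical sample data: binary vectors written as strings.
database = ["10110", "10011", "00000", "10111"]
result = hamming_query(database, "10111", 1)
```

The scan costs one full comparison per stored vector, which is the cost that index-based methods aim to avoid.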

In this paper, we propose HmSearch, an efficient query processing method for Hamming distance queries that addresses the above-mentioned limitations. Our method is based on improved enumeration-based signatures, enhanced filtering, and hierarchical binary filtering-and-verification. We also design an effective dimension rearrangement method to deal with data skew. Extensive experimental results demonstrate that our methods outperform state-of-the-art methods by up to two orders of magnitude.
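
The abstract does not spell out the signature scheme, so the sketch below illustrates only the general filter-and-verify pattern, using the classic pigeonhole partition filter: if two vectors differ in at most k dimensions and each is split into k + 1 segments, at least one segment must match exactly. This is a generic technique swapped in for illustration, not HmSearch's signatures or its hierarchical filtering; all names (`split`, `build_index`, `search`) and the sample data are assumptions.

```python
from collections import defaultdict


def hamming(u, v):
    """Number of positions where u and v differ (equal lengths assumed)."""
    return sum(a != b for a, b in zip(u, v))


def split(vector, parts):
    """Cut a vector into `parts` contiguous segments of near-equal length."""
    n = len(vector)
    bounds = [i * n // parts for i in range(parts + 1)]
    return [vector[bounds[i]:bounds[i + 1]] for i in range(parts)]


def build_index(database, k):
    """Map (segment position, segment value) to the ids of vectors holding it."""
    index = defaultdict(list)
    for vid, vector in enumerate(database):
        for pos, segment in enumerate(split(vector, k + 1)):
            index[(pos, segment)].append(vid)
    return index


def search(database, index, query, k):
    """Filter by exact segment match, then verify with the true distance."""
    candidates = set()
    for pos, segment in enumerate(split(query, k + 1)):
        candidates.update(index.get((pos, segment), ()))
    return sorted(vid for vid in candidates if hamming(database[vid], query) <= k)


# Hypothetical sample data; the index is built for a fixed threshold k.
database = ["10110100", "10011100", "00000000", "10111100"]
k = 1
index = build_index(database, k)
matches = search(database, index, "10111100", k)
```

Only vectors that share a segment with the query reach the verification step, so the full distance is computed for a subset of the database rather than for every vector.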

  • Published in

    SSDBM '13: Proceedings of the 25th International Conference on Scientific and Statistical Database Management
    July 2013
    401 pages
    ISBN:9781450319218
    DOI:10.1145/2484838

    Copyright © 2013 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Acceptance Rates

    Overall Acceptance Rate: 56 of 146 submissions, 38%
