ABSTRACT
Hamming distance measures the number of dimensions in which two vectors have different values. In applications such as pattern recognition, information retrieval, and databases, we often need to efficiently process Hamming distance queries, which retrieve the vectors in a database whose Hamming distance from a given query vector is no more than k. Existing work on efficient Hamming distance query processing suffers from one or more of the following limitations: it applies only to tiny error threshold values, cannot deal with vectors whose value domain is large, or cannot attain robust performance in the presence of data skew.
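To make the query definition concrete, the following is a minimal brute-force sketch in Python. It is only a baseline illustrating the problem statement, not the method proposed in this paper; the function names `hamming_distance` and `hamming_query` are hypothetical and chosen for illustration.

```python
from typing import List, Sequence


def hamming_distance(u: Sequence, v: Sequence) -> int:
    """Count the dimensions in which u and v hold different values."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(1 for a, b in zip(u, v) if a != b)


def hamming_query(database: List[Sequence], query: Sequence, k: int) -> List[int]:
    """Return the indices of all database vectors within distance k of query.

    This scans every vector, so its cost grows linearly with the database size.
    """
    return [
        i for i, vec in enumerate(database)
        if hamming_distance(vec, query) <= k
    ]


if __name__ == "__main__":
    db = ["10110", "10011", "00000"]
    print(hamming_query(db, "10111", k=1))
```

Because the comparison uses plain inequality on each dimension, the same sketch works for binary strings and for vectors over a larger value domain, such as lists of integers.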
In this paper, we propose HmSearch, an efficient query processing method for Hamming distance queries that addresses the above-mentioned limitations. Our method is based on improved enumeration-based signatures, enhanced filtering, and hierarchical binary filtering-and-verification. We also design an effective dimension rearrangement method to deal with data skew. Extensive experimental results demonstrate that our methods outperform state-of-the-art methods by up to two orders of magnitude.
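The general filter-then-verify pattern can be illustrated with a generic pigeonhole filter: if two vectors differ in at most k dimensions and each is split into k + 1 partitions, at least one partition must match exactly. The sketch below is an assumption-laden illustration of that idea only; it is not HmSearch's signature scheme, and the names `split`, `build_index`, and `search` are hypothetical.

```python
from collections import defaultdict


def split(vec, parts):
    """Cut vec into `parts` contiguous chunks of near-equal length."""
    n = len(vec)
    bounds = [i * n // parts for i in range(parts + 1)]
    return [tuple(vec[bounds[i]:bounds[i + 1]]) for i in range(parts)]


def build_index(database, k):
    """Map (partition position, partition content) to the ids of matching vectors."""
    index = defaultdict(set)
    for vid, vec in enumerate(database):
        for pos, chunk in enumerate(split(vec, k + 1)):
            index[(pos, chunk)].add(vid)
    return index


def search(database, index, query, k):
    """Filter by exact partition match, then verify each candidate in full."""
    candidates = set()
    for pos, chunk in enumerate(split(query, k + 1)):
        candidates |= index.get((pos, chunk), set())
    # Verification removes false positives that share a partition with the
    # query but differ from it in more than k dimensions overall.
    return sorted(
        vid for vid in candidates
        if sum(a != b for a, b in zip(database[vid], query)) <= k
    )


if __name__ == "__main__":
    db = ["10110100", "10011100", "00000000", "10111101"]
    idx = build_index(db, k=2)
    print(search(db, idx, "10111100", k=2))
```

The index here is built for one fixed threshold k, and all vectors are assumed to have the same number of dimensions. A skewed dataset can make some partitions match very many vectors, which is the kind of situation the dimension rearrangement mentioned above is meant to address.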