HmSearch: an efficient hamming distance query processing algorithm

Authors:
Xiaoyang Zhang, University of New South Wales, Australia
Jianbin Qin, University of New South Wales, Australia
Wei Wang, University of New South Wales, Australia
Yifang Sun, University of New South Wales, Australia
Jiaheng Lu, Renmin University of China, China

SSDBM '13: Proceedings of the 25th International Conference on Scientific and Statistical Database Management, July 2013, Article No. 19, Pages 1–12. https://doi.org/10.1145/2484838.2484842

Published: 29 July 2013

ABSTRACT

Hamming distance measures the number of dimensions in which two vectors have different values. In applications such as pattern recognition, information retrieval, and databases, we often need to efficiently process Hamming distance queries, which retrieve the vectors in a database whose Hamming distance from a given query vector is no more than k. Existing work on efficient Hamming distance query processing suffers from one or more of the following limitations: it is applicable only to tiny error thresholds, it cannot deal with vectors whose value domain is large, or it cannot attain robust performance in the presence of data skew.
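To make the query definition concrete, the following minimal Python sketch computes the Hamming distance between two vectors and answers a query by a plain linear scan. The function names `hamming_distance` and `hamming_query` and the sample data are illustrative assumptions; this is the naive baseline that indexing methods aim to beat, not the method proposed in the paper.

```python
from typing import List, Sequence


def hamming_distance(u: Sequence[int], v: Sequence[int]) -> int:
    """Count the dimensions in which u and v hold different values."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(1 for a, b in zip(u, v) if a != b)


def hamming_query(database: Sequence[Sequence[int]],
                  query: Sequence[int], k: int) -> List[int]:
    """Return the indices of all database vectors within distance k of query.

    Naive linear scan (hypothetical helper): every vector is compared
    against the query, so the cost grows with the database size.
    """
    return [i for i, vec in enumerate(database)
            if hamming_distance(vec, query) <= k]


# Made-up sample data; values need not be binary.
database = [
    [0, 1, 1, 0, 2],
    [0, 1, 0, 0, 2],
    [3, 3, 3, 3, 3],
]
query = [0, 1, 1, 0, 0]
print(hamming_query(database, query, k=2))  # prints [0, 1]
```

The first two vectors differ from the query in one and two dimensions respectively, so both qualify for k = 2, while the third differs in all five dimensions.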

In this paper, we propose HmSearch, an efficient query processing method for Hamming distance queries that addresses the above-mentioned limitations. Our method is based on improved enumeration-based signatures, enhanced filtering, and hierarchical binary filtering and verification. We also design an effective dimension rearrangement method to deal with data skew. Extensive experimental results demonstrate that our methods outperform state-of-the-art methods by up to two orders of magnitude.
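The abstract does not spell out HmSearch's signature scheme, so the sketch below illustrates only the general filter-and-verify pattern, using a standard pigeonhole argument rather than the paper's own signatures. If two vectors differ in at most k dimensions and each is cut into k + 1 partitions, at least one partition must match exactly. All names here (`split`, `build_index`, `search`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Key = Tuple[int, Tuple[int, ...]]


def split(vec: Sequence[int], parts: int) -> List[Tuple[int, ...]]:
    """Cut vec into `parts` contiguous chunks of near-equal length."""
    n = len(vec)
    bounds = [i * n // parts for i in range(parts + 1)]
    return [tuple(vec[bounds[i]:bounds[i + 1]]) for i in range(parts)]


def build_index(database: Sequence[Sequence[int]], k: int) -> Dict[Key, List[int]]:
    """Map (partition id, partition content) to the ids of vectors containing it."""
    index: Dict[Key, List[int]] = defaultdict(list)
    for vid, vec in enumerate(database):
        for pid, chunk in enumerate(split(vec, k + 1)):
            index[(pid, chunk)].append(vid)
    return index


def search(database: Sequence[Sequence[int]], index: Dict[Key, List[int]],
           query: Sequence[int], k: int) -> List[int]:
    """Filter by exact partition match, then verify the true distance."""
    candidates = set()
    for pid, chunk in enumerate(split(query, k + 1)):
        candidates.update(index.get((pid, chunk), []))
    # Verification step: candidates may be false positives, never false negatives.
    return sorted(vid for vid in candidates
                  if sum(a != b for a, b in zip(database[vid], query)) <= k)


# Made-up sample data with k = 2, so each vector is cut into 3 partitions.
database = [
    [0, 1, 1, 0, 2, 2],
    [0, 1, 0, 0, 2, 1],
    [3, 3, 3, 3, 3, 3],
]
index = build_index(database, k=2)
print(search(database, index, [0, 1, 1, 0, 0, 2], k=2))  # prints [0]
```

Here the second vector shares its first partition with the query and becomes a candidate, but verification rejects it because its true distance is 3. The index must be built with the same k that is used at query time, since k fixes the number of partitions.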

Published in
SSDBM '13: Proceedings of the 25th International Conference on Scientific and Statistical Database Management
July 2013
401 pages
ISBN: 9781450319218
DOI: 10.1145/2484838
Editors: Alex Szalay, Tamas Budavari, Magdalena Balazinska, Alexandra Meliou, Ahmet Sacan
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 56 of 146 submissions, 38%