Approximative distance computation by random hashing

Mimaroglu, Selim; Yagci, Murat; Simovici, Dan A.

doi:10.1007/s11227-011-0618-0

Published: 26 May 2011

The Journal of Supercomputing, Volume 61, pages 572–589 (2012)

Abstract

We propose a technique for approximating inter-object distances in binary data sets. Our approach is based on locality-sensitive hashing. We randomly select a number of projections of the data set and group objects into buckets based on the hash values of these projections. For each pair of objects, we count how often both fall into the same bucket, and we approximate the exact Hamming distance from the number of co-occurrences across all buckets. We parallelize the computation using two main schemes. The first assigns each random subspace to a processor, which calculates a local co-occurrence matrix; the local matrices are then combined into the final co-occurrence matrix. The second provides the same distance approximation at the cost of longer runtimes by limiting the total message size in a parallel computing environment, which is especially useful for very large data sets that generate immense message traffic. Our methods produce very accurate results, scale well with the number of objects, and tolerate processor failures. Experimental evaluations on supercomputers and multiprocessor workstations demonstrate the usefulness of our methods.
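
The idea in the abstract can be illustrated with a minimal sequential sketch in Python. Everything specific here is an assumption made for illustration and is not taken from the paper: each random projection samples bit positions uniformly with replacement, the bucket key is simply the tuple of projected bits, and the distance is recovered by inverting the collision probability (1 - d/n)^k, where n is the number of dimensions and k the number of sampled bits. The function and parameter names (`local_cooccurrence`, `approximate_hamming`, `num_projections`, `bits_per_projection`) are hypothetical. The loop that adds each local matrix into `total` stands in for the combine step of the first parallel scheme, where each projection would run on its own processor.

```python
import random
from collections import defaultdict
from itertools import combinations


def local_cooccurrence(data, positions):
    """Co-occurrence matrix for one random projection (one random subspace)."""
    n_objects = len(data)
    buckets = defaultdict(list)
    for idx, row in enumerate(data):
        # Assumed hash: the projected bits themselves serve as the bucket key.
        key = tuple(row[p] for p in positions)
        buckets[key].append(idx)
    counts = [[0] * n_objects for _ in range(n_objects)]
    for members in buckets.values():
        for i, j in combinations(members, 2):
            counts[i][j] += 1
            counts[j][i] += 1
    return counts


def approximate_hamming(data, num_projections, bits_per_projection, seed=0):
    """Estimate pairwise Hamming distances from bucket co-occurrence counts."""
    rng = random.Random(seed)
    n_objects, n_dims = len(data), len(data[0])
    total = [[0] * n_objects for _ in range(n_objects)]
    for _ in range(num_projections):
        # Sample bit positions with replacement (an assumption of this sketch).
        positions = [rng.randrange(n_dims) for _ in range(bits_per_projection)]
        local = local_cooccurrence(data, positions)
        # Combine step: add the local matrix into the final matrix.
        for i in range(n_objects):
            for j in range(n_objects):
                total[i][j] += local[i][j]
    estimates = [[0.0] * n_objects for _ in range(n_objects)]
    for i, j in combinations(range(n_objects), 2):
        p_hat = total[i][j] / num_projections  # observed collision rate
        # Invert p = (1 - d/n)^k to get the distance estimate.
        d_hat = n_dims * (1.0 - p_hat ** (1.0 / bits_per_projection))
        estimates[i][j] = estimates[j][i] = d_hat
    return estimates
```

Under these assumptions, two identical objects share a bucket in every projection and receive an estimate of 0, while two complementary objects never share a bucket and receive an estimate of n. For other pairs the estimate fluctuates around the true distance and tightens as `num_projections` grows.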

Author information

Authors and Affiliations

Computer Engineering Department, Bahcesehir University, Istanbul, Turkey
Selim Mimaroglu
Department of Computer Science, University of Massachusetts at Boston, Boston, MA, USA
Dan A. Simovici

Corresponding author

Correspondence to Selim Mimaroglu.

About this article

Cite this article

Mimaroglu, S., Yagci, M. & Simovici, D.A. Approximative distance computation by random hashing. J Supercomput 61, 572–589 (2012). https://doi.org/10.1007/s11227-011-0618-0

Issue Date: September 2012
