Approximative distance computation by random hashing

The Journal of Supercomputing

Abstract

We propose a technique for approximating inter-object distances in binary data sets. Our approach is based on locality-sensitive hashing. We randomly select a number of projections of the data set and group objects into buckets according to the hash values of these projections. For each pair of objects, we count how often the two objects fall into the same bucket, and we approximate the exact Hamming distance from the number of co-occurrences across all buckets. We parallelize the computation using two main schemes. The first assigns each random subspace to a processor, which calculates a local co-occurrence matrix; the local matrices are then combined into the final co-occurrence matrix. The second provides the same distance approximation at the cost of longer runtimes by limiting the total message size in a parallel computing environment, which is especially useful for very large data sets that generate immense message traffic. Our methods produce very accurate results, scale well with the number of objects, and tolerate processor failures. Experimental evaluations on supercomputers and multiprocessor workstations demonstrate the usefulness of our methods.
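The bucket-counting idea in the abstract can be illustrated with a minimal serial sketch. It is not the authors' implementation, and several details are assumptions made for illustration: each projection samples bit positions with replacement, the hash value is simply the tuple of projected bits, and the estimator inverts the collision probability (1 - h/d)^k for Hamming distance h, dimension d, and k sampled bits. The names `local_cooccurrence` and `approx_hamming` are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def local_cooccurrence(data, positions):
    """Count same-bucket pairs for ONE random projection (assumed scheme)."""
    buckets = defaultdict(list)
    for idx, row in enumerate(data):
        # Assumption: the hash value is the tuple of projected bits.
        key = tuple(row[p] for p in positions)
        buckets[key].append(idx)
    counts = defaultdict(int)
    for members in buckets.values():
        # members is in increasing index order, so every pair has i < j.
        for i, j in combinations(members, 2):
            counts[(i, j)] += 1
    return counts


def approx_hamming(data, num_projections, bits_per_projection, seed=0):
    """Estimate pairwise Hamming distances from co-occurrence counts."""
    rng = random.Random(seed)
    dim = len(data[0])
    total = defaultdict(int)
    for _ in range(num_projections):
        # Assumption: bit positions are sampled with replacement, so two
        # objects at distance h collide with probability (1 - h/dim)**k.
        positions = [rng.randrange(dim) for _ in range(bits_per_projection)]
        # Summing per-projection counts mirrors combining local matrices.
        for pair, count in local_cooccurrence(data, positions).items():
            total[pair] += count
    estimates = {}
    for i, j in combinations(range(len(data)), 2):
        frac = total.get((i, j), 0) / num_projections
        # Invert the assumed collision probability to recover h.
        estimates[(i, j)] = dim * (1.0 - frac ** (1.0 / bits_per_projection))
    return estimates
```

Because each call to `local_cooccurrence` depends only on its own projection, the calls could in principle be distributed across processors and their results summed, which is the shape of the first parallel scheme described above. Larger values of `num_projections` reduce the variance of the estimate at the cost of more work.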

Author information

Corresponding author

Correspondence to Selim Mimaroglu.

About this article

Cite this article

Mimaroglu, S., Yagci, M. & Simovici, D.A. Approximative distance computation by random hashing. J Supercomput 61, 572–589 (2012). https://doi.org/10.1007/s11227-011-0618-0

  • DOI: https://doi.org/10.1007/s11227-011-0618-0