Abstract
We propose a technique for approximating inter-object distances in binary data sets. Our approach is based on locality-sensitive hashing. We randomly select a number of projections of the data set and group objects into buckets based on the hash values of these projections. For each pair of objects, we count how often the two objects fall into the same bucket, and we approximate the exact Hamming distance from the number of co-occurrences in all buckets. We parallelize the computation using two main schemes. The first assigns each random subspace to a processor, which computes a local co-occurrence matrix; the local matrices are then combined into the final co-occurrence matrix. The second yields the same distance approximation at the cost of longer runtimes, but limits the total message size in a parallel computing environment, which is especially useful for very large data sets that generate immense message traffic. Our methods produce very accurate results, scale well with the number of objects, and tolerate processor failures. Experimental evaluations on supercomputers and on workstations with several processors demonstrate the usefulness of our methods.
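The bucket-counting idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the choice to sample coordinates with replacement, and the estimator are assumptions made for illustration. Under that sampling assumption, two objects at Hamming distance d in n dimensions agree on a k-coordinate projection with probability (1 - d/n)^k, so a collision fraction p over m projections suggests the estimate d ≈ n(1 - p^(1/k)).

```python
import random
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(data, num_projections, bits_per_projection, seed=0):
    """Count, for every pair of objects, how many projections hash them together.

    data is a list of equal-length 0/1 lists. Each projection is a random
    choice of coordinates (with replacement, an assumption of this sketch),
    and the bucket key is simply the tuple of bits at those coordinates.
    """
    rng = random.Random(seed)
    n_objects, n_bits = len(data), len(data[0])
    counts = [[0] * n_objects for _ in range(n_objects)]
    for _ in range(num_projections):
        coords = [rng.randrange(n_bits) for _ in range(bits_per_projection)]
        buckets = defaultdict(list)
        for idx, row in enumerate(data):
            buckets[tuple(row[c] for c in coords)].append(idx)
        # Every pair sharing a bucket co-occurs once for this projection.
        for members in buckets.values():
            for i, j in combinations(members, 2):
                counts[i][j] += 1
                counts[j][i] += 1
    return counts


def merge_counts(a, b):
    """Combine two local co-occurrence matrices by element-wise addition."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def estimate_hamming(counts, num_projections, bits_per_projection, n_bits):
    """Invert p = (1 - d/n)^k to estimate d from the collision fraction p."""
    n_objects = len(counts)
    est = [[0.0] * n_objects for _ in range(n_objects)]
    for i in range(n_objects):
        for j in range(n_objects):
            if i != j:
                p = counts[i][j] / num_projections
                est[i][j] = n_bits * (1.0 - p ** (1.0 / bits_per_projection))
    return est
```

Because the counts from different projections simply add up, matrices computed independently (for example, one per processor, as in the first parallel scheme) can be summed with `merge_counts` before calling `estimate_hamming` with the total number of projections.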
Cite this article
Mimaroglu, S., Yagci, M. & Simovici, D.A. Approximative distance computation by random hashing. J Supercomput 61, 572–589 (2012). https://doi.org/10.1007/s11227-011-0618-0
DOI: https://doi.org/10.1007/s11227-011-0618-0